1574 Commits

Author SHA1 Message Date
Adrian Chadd
6520495abc Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
  the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
  if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
  policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
  policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
  set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
  issue, but has not been able to verify it (it doesn't look NUMA
  related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
  all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
  NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@.  The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus.  My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
  unbalanced memory setup whilst under high memory pressure, VM page allocation
  may fail leading to a kernel panic.  This was a problem in the past, but it's
  much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
  affect contigmalloc, KVA placement, UMA, etc.  So, driver placement of memory
  isn't really guaranteed in any way.  That's next on my plate.

Sponsored by:	Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
Edward Tomasz Napierala
a238a79872 Fix markup.
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2015-07-07 19:23:59 +00:00
Konstantin Belousov
eb89622653 Grammar and language fixes.
Submitted by:	wblock
Review:	https://reviews.freebsd.org/D2969
MFC after:	12 days
2015-07-03 17:30:31 +00:00
Konstantin Belousov
23e1c1251c Document x86 machine-specific ptrace(2) requests. Provide list of the
ppc requests.

Reviewed by:	brueffer, emaste, gjb (previous version)
Sponsored by:	The FreeBSD Foundation
Review:	https://reviews.freebsd.org/D2962
MFC after:	2 weeks
2015-06-30 18:53:42 +00:00
Jeremie Le Hen
b7c4ed65cc NetBSD commit log:
Use a constant array for the MIB. Newer LLVM decided that mib[] warranted
  stack protections, with the obvious crash after the setup was done.
  As a positive side effect, code size shrinks a bit.

I'm not sure why this hasn't bitten us yes, but it is certainly possible and
there are no real drawbacks to this change anyway.

Submitted by:	pfg
Obtained from:	NetBSD
MFC after:	1 week
2015-06-14 07:47:18 +00:00
John Baldwin
196cd80898 Various updates to the ftruncate(2) documentation:
- Note that ftruncate(2) can operate on shared memory objects and cross
  reference shm_open(2).
- Note that ftruncate(2) does not change the file position pointer (aka
  seek pointer) of the file descriptor.
- ftruncate(2) will fail with EINVAL for all sorts of other fd types than
  just sockets, so instead note that it fails for all but regular files and
  shared memory objects.
- Note that ftruncate(2) also appeared in 4.2BSD along with truncate(2).
  (Or at least the manpage for both appeared in 4.2, I did not check the
  kernel code itself to see if either predated 4.2.)

PR:		199472 (2)
Submitted by:	andrew@ugh.net.au (2)
MFC after:	1 week
2015-05-04 14:47:00 +00:00
John Baldwin
afa94a3f97 Partially revert r255486, the first argument to socketpair() is a socket
domain, not a file descriptor.  Use 'domain' instead of the original 'd'
for this argument to match socket(2).

PR:		199491
Reported by:	sp55aa@qq.com
MFC after:	1 week
2015-05-04 14:23:31 +00:00
Mark Johnston
93c9677b94 fork(2): Add a note to the effect that kqueue descriptors, unlike other
descriptor types, are not inherited from the parent process.

Reported by:	kmacy
MFC after:	1 week
2015-05-02 00:29:27 +00:00
Baptiste Daroussin
18c5321d06 Escape "Ed" 2015-04-26 10:52:37 +00:00
John Baldwin
179fa75e6e Reassign copyright statements on several files from Advanced
Computing Technologies LLC to Hudson River Trading LLC.

Approved by:	Hudson River Trading LLC (who owns ACT LLC)
MFC after:	1 week
2015-04-23 14:22:20 +00:00
Konstantin Belousov
0538aafc41 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
Konstantin Belousov
3d0045bb2b Make wait6(2), waitid(3) and ppoll(2) cancellation points. The
waitid() function is required to be cancellable by the standard.  The
wait6() and ppoll() follow the other syscalls in their groups.

Reviewed by:	jhb, jilles (previous versions)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:35:41 +00:00
Sergey Kandaurov
d359191f7e Remove obsolete bits about maximum number of file systems.
NMOUNT has gone together with static mount table in 4.3BSD-Reno.

MFC after:	1 week
2015-04-12 21:14:58 +00:00
John Baldwin
b871bfa1a2 vfork() first appeared in 3BSD which pre-dates 2.9BSD. Verified via the
copy of 3BSD on disc 1 of "The CSRG Archives".

PR:		198612
MFC after:	1 week
2015-04-06 20:40:01 +00:00
Ed Maste
541236cf60 libc: Eliminate duplicate copies of __vdso_gettc.c
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D2152
2015-04-02 21:18:11 +00:00
Edward Tomasz Napierala
522196b5ed Update open(2) to make it more obvious that O_NOCTTY and O_TTY_INIT
are ignored.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-04-02 11:41:04 +00:00
Konstantin Belousov
1849df3006 Correctly handle __fcntl_compat symbol for the !SYSCALL_COMPAT case.
Both .weak and .alias assembler directives only work when assembling
the file which defines the symbol.

Reported and tested by:	andrew
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-01 16:55:30 +00:00
Konstantin Belousov
b072e86d09 Make kevent(2) a cancellation point.
Note that to cancel blocked kevent(2) call, changelist must be empty,
since we cannot cancel a call which already made changes to the
process state.  And in reverse, call which only makes changes to the
kqueue state, without waiting for an event, is not cancellable.  This
makes a natural usage model to migrate kqueue loop to support
cancellation, where existing single kevent(2) call must be split into
two: first uncancellable update of kqueue, then cancellable wait for
events.

Note that this is ABI-incompatible change, but it is believed that
there is no cancel-safe code that relies on kevent(2) not being a
cancellation point.  Option to preserve the ABI would be to keep
kevent(2) as is, but add new call with flags to specify cancellation
behaviour, which only value seems to add complications.

Suggested and reviewed by:	jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-29 19:14:41 +00:00
John-Mark Gurney
32d52c275d forgot to bump date, and replace contraction (igor)... 2015-03-07 03:48:32 +00:00
John-Mark Gurney
4a46673183 make things a bit more clear.. we worked together on language..
Submitted by:	Justin Cormack
2015-03-06 23:17:18 +00:00
John-Mark Gurney
b759b8aa44 fix spelling, add comma and remove BUGS section.. it provided no useful
information, and is not really bugs, but limitations for other reasons...
2015-02-19 01:51:17 +00:00
Marius Strobl
aed116911d Unbreak sparc64 after r276630 by calling __sparc_sigtramp_setup signal
trampoline as part of the MD __sys_sigaction again.

Submitted by:	kib (initial versions)
MFC after:	3 days
2015-02-16 22:13:03 +00:00
Konstantin Belousov
45468c5356 Properly interpose libc spinlocks, was missed in r276630. In
particular, stdio locking was affected.

Reported and tested by:	"Matthew D. Fuller" <fullermd@over-yonder.net>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-02-14 11:47:40 +00:00
Edward Tomasz Napierala
6c316535e2 Remove useless comment.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-02-07 13:11:45 +00:00
Jilles Tjoelker
2205e0d1bd Add futimens and utimensat system calls.
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision:	https://reviews.freebsd.org/D1426
Submitted by:	pluknet (partially)
Reviewed by:	delphij, pluknet, rwatson
Relnotes:	yes
2015-01-23 21:07:08 +00:00
Konstantin Belousov
677258f7e7 Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process.  Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with:	jilles, rwatson
Man page fixes by:	rwatson
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-18 15:13:11 +00:00
Konstantin Belousov
397d851d66 Reduce the size of the interposing table and amount of
cancellation-handling code in the libthr.  Translate some syscalls
into their more generic counterpart, and remove translated syscalls
from the table.

List of the affected syscalls:
creat, open -> openat
raise -> thr_kill
sleep, usleep -> nanosleep
pause -> sigsuspend
wait, wait3, waitpid -> wait4

Suggested and reviewed by:	jilles (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-11 22:16:31 +00:00
John Baldwin
e275993995 Document CPU_WHICH_DOMAIN and bump Dd for cpuset.1.
Missed in:	r276829
2015-01-08 18:53:11 +00:00
Konstantin Belousov
1a744fefc2 Avoid calling internal libc function through PLT or accessing data
though GOT, by staticizing and hiding.  Add setter for
__error_selector to hide it as well.

Suggested and reviewed by:	jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-05 01:06:54 +00:00
Konstantin Belousov
8495e8b1e9 Fix known issues which blow up the process after dlopen("libthr.so")
(or loading a dso linked to libthr.so into process which was not
linked against threading library).

- Remove libthr interposers of the libc functions, including
  __error(). Instead, functions calls are indirected through the
  interposing table, similar to how pthread stubs in libc are already
  done.  Libc by default points either to syscall trampolines or to
  existing libc implementations.  On libthr load, libthr rewrites the
  pointers to the cancellable implementations already in libthr.  The
  interposition table is separate from pthreads stubs indirection
  table to not pull pthreads stubs into static binaries.

- Postpone the malloc(3) internal mutexes initialization until libthr
  is loaded.  This avoids recursion between calloc(3) and static
  pthread_mutex_t initialization.

- Reinstall signal handlers with wrapper on libthr load.  The
  _rtld_is_dlopened(3) is used to avoid useless calls to sigaction(2)
  when libthr is statically referenced from the main binary.

In the process, fix openat(2), swapcontext(2) and setcontext(2)
interposing.  The libc symbols were exported at different versions
than libthr interposers.  Export both libc and libthr versions from
libc now, with default set to the higher version from libthr.

Remove unused and disconnected swapcontext(3) userspace implementation
from libc/gen.

No objections from:	deischen
Tested by:	pho, antoine (exp-run) (previous versions)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-03 18:38:46 +00:00
Christian Brueffer
0aee91e1fb Various mdoc fixes and a few EOL whitespace removals.
Found with:	mandoc -Tlint
2014-12-21 12:36:36 +00:00
Bryan Drewery
4bb90cbe18 Bump Dd for r275846
MFC after:	3 weeks
2014-12-17 01:36:00 +00:00
Kirk McKusick
27ae6f4af7 Add some additional clarification and fix a few gammer nits.
Reviewed by: kib
MFC after:   3 weeks
2014-12-17 01:32:27 +00:00
Konstantin Belousov
19eaed5353 Markup fixes for kqueue(2), no content changes.
Reviewed by:	brueffer (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-12-15 14:58:10 +00:00
Konstantin Belousov
237623b028 Add a facility for non-init process to declare itself the reaper of
the orphaned descendants.  Base of the API is modelled after the same
feature from the DragonFlyBSD.

Requested by:	bapt
Reviewed by:	jilles (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2014-12-15 12:01:42 +00:00
Ed Maste
294246bb7d Revert r274772: it is not valid on MIPS
Reported by:	sbruno
2014-11-25 03:50:31 +00:00
Baptiste Daroussin
2c900f1907 Ta is only allowed with Bl -column not in Bl -item 2014-11-23 23:35:16 +00:00
Joel Dahl
d4d112e34a Misc mdoc fixes:
- Remove superfluous paragraph macros.
- Remove/fix empty or incorrect macros.
- Sort sections into conventional order.
- Terminate quoted strings properly.
- Remove EOL whitespace.
2014-11-23 21:00:00 +00:00
Ed Maste
688fd61ae8 Use canonical __PIC__ flag
It is automatically set when -fPIC is passed to the compiler.

Reviewed by:	dim, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1179
2014-11-21 02:05:48 +00:00
Dmitry Chagin
186d9c3473 Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision:	https://reviews.freebsd.org/D1133
Reviewed by:	kib, wblock
MFC after:	1 month
2014-11-13 05:26:14 +00:00
Dag-Erling Smørgrav
4b52e0d84f <sys/param.h> is a superset of <sys/types.h> and must always come
first.  Coincidentally, today is the 11th anniversary of this man
page's (and this bug's) first appearance in FreeBSD.

MFC after:	3 days
2014-11-01 09:10:21 +00:00
Gavin Atkinson
ff8d5270bc Slightly improve grammar in EAGAIN description.
PR:		176806
Submitted by:	Jeremy Chadwick
MFC after:	3 days
2014-10-15 23:39:47 +00:00
Xin LI
b888b86e6f accept(2) may and can return EAGAIN, document it.
MFC after:	1 week
2014-10-10 03:05:55 +00:00
Bryan Drewery
c1efb88730 Document [EPERM] for UNIX sockets.
MFC after:	2 weeks
2014-09-30 00:06:53 +00:00
John Baldwin
7a9f047ba7 - Remove mention of MAP_INHERIT. It hasn't been implemented for thirteen
years.
- Remove mention of unimplemented MAP_SWAP.  There are no future plans to
  implement it.

Submitted by:	alc (2)
2014-09-17 19:45:34 +00:00
Enji Cooper
2bcdab32c3 Bump .Dd for the content change done to access(2) in r271655
PR: 181155
Sponsored by: EMC / Isilon Storage Division
2014-09-16 00:59:08 +00:00
Enji Cooper
257597a434 Validate the mode argument in access, eaccess, and faccessat for optional
POSIX compliance and to improve compatibility with Linux and NetBSD

The issue was identified with lib/libc/sys/t_access:access_inval from
NetBSD

Update the manpage accordingly

PR: 181155
Reviewed by: jilles (code), jmmv (code), wblock (manpage), wollman (code)
MFC after: 4 weeks
Phabric: D678 (code), D786 (manpage)
Sponsored by: EMC / Isilon Storage Division
2014-09-16 00:56:47 +00:00
John-Mark Gurney
e2cc4003e2 document mqueuefs is required for mq_open... 2014-09-15 22:32:35 +00:00
John Baldwin
5fd3f8b3b6 Add stricter checking of some mmap() arguments:
- Fail with EINVAL if an invalid protection mask is passed to mmap().
- Fail with EINVAL if an unknown flag is passed to mmap().
- Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap().
- Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous
  mappings.

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D698
2014-09-15 17:20:13 +00:00
Joel Dahl
cfaa96f327 Minor mdoc nit. 2014-09-09 14:34:54 +00:00