Commit Graph

202115 Commits

Author SHA1 Message Date
Mateusz Guzik
8a08cec166 Create a dedicated function for ensuring that cdir and rdir are populated.
Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.

Reviewed by:	kib
2015-07-11 16:22:48 +00:00
Mateusz Guzik
f0725a8e1e Move chdir/chroot-related fdp manipulation to kern_descrip.c
Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by:	kib
2015-07-11 16:19:11 +00:00
Andrew Turner
70915d1289 Always send a SIGSEGV on a map failure. Use the code to tell the reason
for the signal.

Sponsored by:	ABT Systems Ltd
2015-07-11 16:02:06 +00:00
Adrian Chadd
871ef8b0d8 Regenerate syscalls. 2015-07-11 15:22:11 +00:00
Adrian Chadd
6520495abc Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
  the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
  if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
  policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
  policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
  set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
  issue, but has not been able to verify it (it doesn't look NUMA
  related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
  all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
  NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@.  The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus.  My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
  unbalanced memory setup whilst under high memory pressure, VM page allocation
  may fail leading to a kernel panic.  This was a problem in the past, but it's
  much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
  affect contigmalloc, KVA placement, UMA, etc.  So, driver placement of memory
  isn't really guaranteed in any way.  That's next on my plate.

Sponsored by:	Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
Baptiste Daroussin
d7852cbcf2 Since sh(1) now supports mulitbyte (only UTF-8) clarify the related BUGS
section in wordexp(3) manual page

Discussed with:	jilles
2015-07-11 13:07:50 +00:00
Jilles Tjoelker
8956c4ec63 sh(1): libedit has supported multibyte encodings for a while. 2015-07-11 13:07:26 +00:00
Konstantin Belousov
cf88021ab1 Do not allow creation of the dirty buffers for the dead buffer
objects, i.e. for buffer objects which vnode was reclaimed.  Buffer
cache cannot write such buffers.  Return the error and discard the
buffer immediately on write attempt.

BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks.  Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.

Reported and tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-11 11:21:56 +00:00
John-Mark Gurney
f405d8eb61 some additional improvements to the documentation...
Sponsored by:	Netflix, Inc.
2015-07-11 04:20:56 +00:00
John-Mark Gurney
2ff9c4f915 Complete the move that was started w/ r263218.. For some reason I
didn't delete the files, so that means we need to bring the changes in
r282726 to the correct files..

make tinderbox completed with this patch...
2015-07-11 03:12:34 +00:00
Pawel Jakub Dawidek
4273d41299 Spoil even can happen for some time now even on providers opened exclusively
(on the media change event). Update GELI to handle that situation.

PR:		201185
Submitted by:	Matthew D. Fuller
2015-07-10 19:27:19 +00:00
Luigi Rizzo
4af7aed7c6 assorted algorithmic fixes from Paolo Valente (one of my qfq coauthors):
- use 1ULL to avoid shift truncations
- recompute the sum of weight dynamically to provide better fairness
- fix an erroneous constant in the computation of the slot
- preserve timestamp correctness when the old timestamp is stale.
2015-07-10 19:24:36 +00:00
Luigi Rizzo
e38e277fc4 one more warning suppression when compiling the test code in userspace. 2015-07-10 19:18:49 +00:00
Luigi Rizzo
e25716b7cc add code to compute fairness indexes;
cleanups to remove compile warnings.
2015-07-10 18:10:40 +00:00
Luigi Rizzo
8fd44c9395 staticize functions only used in netmap.c
(detected by jenkins run with gcc 4.9)

Update documentation on the use of netmap_priv_d,
rename the refcount and use the same structure in
FreeBSD and linux

No functional changes.
2015-07-10 16:05:24 +00:00
Ed Schouten
ea566832d7 Add missing const keyword to kern_sigaction()'s 'act' parameter.
This structure is not modified by the function. Also add const to
sigact_flag_test(), as it is called by kern_sigaction().
2015-07-10 14:39:46 +00:00
Mateusz Guzik
9a1ad66fb5 fd: further cleanup of kern_dup
- make mode enum start from 0 so that the assertion covers all cases [1]
- rename prefix _CLOEXEC flag with _FLAG
- postpone fhold on the old file descriptor, which eliminates the need to fdrop
  in error cases.
- fixup FDDUP_FCNTL check missed in the previous commit

This removes 'fp == oldfde->fde_file' assertion which had little value. kern_dup
only calls fd-related functions which cannot drop the lock or a whole lot of
races would be introduced.

Noted by: kib [1]
2015-07-10 13:54:03 +00:00
Mateusz Guzik
5fe97c20dc fd: split kern_dup flags argument into actual flags and a mode
Tidy up the code inside to switch on the mode.
2015-07-10 11:01:30 +00:00
Konstantin Belousov
85237e335d Convert between abridged (from FXSAVE) and unabridged (from FSAVE)
versions of the x87 tags.  The conversion is naive, used abridged tag
is converted to valid unabridged, without additional checks for zero
and special values.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-10 09:20:13 +00:00
Konstantin Belousov
bcfc2be186 Duplicate the copyright from the i386/i386/machdep.c into
i386/include/frame.h after a code was moved from machdep.c to frame.h
in r284925.

Use include guards style similar to other guards.

Noted by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-10 09:15:06 +00:00
Konstantin Belousov
e8677f3885 Change the mb() use in the sched_ult tdq_notify() and sched_idletd()
to more C11-ish atomic_thread_fence_seq_cst().

Note that on PowerPC, which currently uses lwsync for mb(), the change
actually fixes the missed store/load barrier, intended by r271604 [*].

Reviewed by:	alc
Noted by:	alc [*]
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-07-10 08:54:12 +00:00
Andrew Turner
249d5c7acc Add support for makecontext. This supports up to 8 arguments as this
simplifies the code, these can be passed in registers.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2015-07-10 08:36:22 +00:00
Luigi Rizzo
0fdeab7bc5 add netmap dependency when compiled as a module 2015-07-10 07:13:14 +00:00
Ed Schouten
47a84387ad Let listen() return EDESTADDRREQ when not bound.
We currently return EINVAL when calling listen() on a UNIX socket that
has not been bound to a pathname. If my interpretation of POSIX is
correct, we should return EDESTADDRREQ: "The socket is not bound to a
local address, and the protocol does not support listening on an unbound
socket."

Return EDESTADDRREQ instead when not bound and not connected.

Differential Revision:	https://reviews.freebsd.org/D3038
Reviewed by:	gnn, network
2015-07-10 06:47:14 +00:00
Luigi Rizzo
847bf38369 Sync netmap sources with the version in our private tree.
This commit contains large contributions from Giuseppe Lettieri and
Stefano Garzarella, is partly supported by grants from Verisign and Cisco,
and brings in the following:

- fix zerocopy monitor ports and introduce copying monitor ports
  (the latter are lower performance but give access to all traffic
  in parallel with the application)

- exclusive open mode, useful to implement solutions that recover
  from crashes of the main netmap client (suggested by Patrick Kelsey)

- revised memory allocator in preparation for the 'passthrough mode'
  (ptnetmap) recently presented at bsdcan. ptnetmap is described in
        S. Garzarella, G. Lettieri, L. Rizzo;
        Virtual device passthrough for high speed VM networking,
        ACM/IEEE ANCS 2015, Oakland (CA) May 2015
        http://info.iet.unipi.it/~luigi/research.html

- fix rx CRC handing on ixl

- add module dependencies for netmap when building drivers as modules

- minor simplifications to device-specific routines (*txsync, *rxsync)

- general code cleanup (remove unused variables, introduce macros
  to access rings and remove duplicate code,

Applications do not need to be recompiled, unless of course
they want to use the new features (monitors and exclusive open).

Those willing to try this code on stable/10 can just update the
sys/dev/netmap/*, sys/net/netmap* with the version in HEAD
and apply the small patches to individual device drivers.

MFC after:	1 month
Sponsored by:	(partly) Verisign, Cisco
2015-07-10 05:51:36 +00:00
Luigi Rizzo
9d73ee0f82 rev.284898 removed _SHLIBDIRPREFIX so we need to reconstruct its value
to properly locate libraries created in the buildworld phase.
2015-07-10 05:07:18 +00:00
George V. Neville-Neil
0b75d21e18 Summary: Fix LINT build. The names of the new AES modes were not
correctly used under the REGRESSION kernel option.
2015-07-10 02:23:50 +00:00
Dimitry Andric
4afafe0c86 Fix swapped copyin(9) arguments in cxgb's iwch_arm_cq() function.
Detected by clang 3.7.0 with the warning:

sys/dev/cxgb/ulp/iw_cxgb/iw_cxgb_provider.c:309:18: error: variable
'rptr' is uninitialized when used here [-Werror,-Wuninitialized]
                chp->cq.rptr = rptr;
                               ^~~~

MFC after:	1 week
2015-07-09 22:13:23 +00:00
Mariusz Zaborski
306a82f8f4 Rename zfs nvpair files to not colidate with our nvlist.
PR:		201356
Approved by:	pjd (mentor)
2015-07-09 21:53:40 +00:00
Andrew Turner
3303004f1a Remove checks for __ARM_EABI__, we only build for EABI now.
Sponsored by:	ABT Systems Ltd
2015-07-09 21:02:40 +00:00
Andrew Turner
6c50960be6 Add support for __aeabi_memclr4, clang 3.7 calls it.
Sponsored by:	ABT Systems Ltd
2015-07-09 20:54:38 +00:00
George V. Neville-Neil
16de9ac1b5 Add support for AES modes to IPSec. These modes work both in software only
mode and with hardware support on systems that have AESNI instructions.

Differential Revision:	D2936
Reviewed by:	jmg, eri, cognet
Sponsored by:	Rubicon Communications (Netgate)
2015-07-09 18:16:35 +00:00
Andrew Turner
bf1717e566 Clear the carry bit on the saved program state register when asked to
clear the return value, it's used to indicate an error.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2015-07-09 17:26:56 +00:00
Glen Barber
c9969eba96 Document r285329, OpenSSL update to 1.0.1p.
Sponsored by:	The FreeBSD Foundation
2015-07-09 17:24:54 +00:00
Mateusz Guzik
318b946321 vfs: cosmetic changes to namei and namei_handle_root
- don't initialize cnp during declaration
- don't test error/!error, compare to 0 instead
2015-07-09 17:17:26 +00:00
Jung-uk Kim
45c1772ea0 Merge OpenSSL 1.0.1p. 2015-07-09 17:07:45 +00:00
Jung-uk Kim
c07d7b3a38 Import OpenSSL 1.0.1p. 2015-07-09 16:41:34 +00:00
Mateusz Guzik
d177f49f6f vfs: simplify error handling in namei
The logic is reorganised so that there is one exit point prior to the
lookup loop. This is an intermediate step to making audit logging
functions use found vnode instead of translating ni_dirfd on their own.

ni_startdir validation is removed. The only in-tree consumer is nfs
which already makes sure it is a directory.

Reviewed by:	kib
2015-07-09 16:32:58 +00:00
Ermal Luçi
56844a6203 Correct issue presented in r285051,
apparently neither clang nor gcc complain about this.
But clang intis the var to NULL correctly while gcc on at least mips does not.
Correct the undefined behavior by initializing the variable properly.

PR:		201371
Differential Revision:	 https://reviews.freebsd.org/D3036
Reviewed by:	gnn
Approved by:	gnn(mentor)
2015-07-09 16:28:36 +00:00
John-Mark Gurney
7b7254e71f increase buffer size to significantly increase performance...
see:
https://docs.freebsd.org/cgi/mid.cgi?20150513080342.GE37063@funkthat.com

for benchmarks...
2015-07-09 16:13:05 +00:00
Ed Schouten
2491302a04 Add implementations for some of the CloudABI file descriptor system calls.
All of the CloudABI system calls that operate on file descriptors of an
arbitrary type are prefixed with fd_. This change adds wrappers for
most of these system calls around their FreeBSD equivalents.

The dup2() system call present on CloudABI deviates from POSIX, in the
sense that it can only be used to replace existing file descriptor. It
cannot be used to create new ones. The reason for this is that this is
inherently thread-unsafe. Furthermore, there is no need on CloudABI to
use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no
special meaning.

This change exposes the kern_dup() through <sys/syscallsubr.h> and puts
the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag,
FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not
allocated.

Differential Revision:	https://reviews.freebsd.org/D3035
Reviewed by:	mjg
2015-07-09 16:07:01 +00:00
Mateusz Guzik
efdc25304c fd: prepare do_dup for being exported
- rename it to kern_dup.
- prefix flags with FD
- assert that correct flags were passed
2015-07-09 15:19:45 +00:00
Mateusz Guzik
d19ba50e12 vfs: avoid spurious vref/vrele for absolute lookups
namei used to vref fd_cdir, which was immediatley vrele'd on entry to
the loop.

Check for absolute lookup and vref the right vnode the first time.

Reviewed by:	kib
2015-07-09 15:06:58 +00:00
Mateusz Guzik
a03f1b2970 vfs: plug a use-after-free of fd_rdir in namei
fd_rdir vnode was stored in ni_rootdir without refing it in any way,
after which the filedsc lock was being dropped.

The vnode could have been freed by mountcheckdirs or another thread doing
chroot.

VREF the vnode while the lock is held.

Reviewed by:	kib
MFC after:	1 week
2015-07-09 15:06:24 +00:00
Baptiste Daroussin
0fc58d1446 Do not try to set password on group if the group is added as a consequence of
of creating a user (regression from r285136)

Reported by:	Fabian Keil <fk@fabiankeil.de>
2015-07-09 14:14:44 +00:00
Andrew Turner
b2b5507779 Add support for SMP. This uses the FDT data to find the CPUs to start on,
and psci to start them. I expect ACPI support to be added later.

This has been tested on qemu with 2 cpus as that is the current value of
MAXCPUS. This is expected to be increased in the future as FreeBSD has
been tested on 48 cores on the Cavium ThunderX hardware.

Partially based on a patch from Robin Randhawa from ARM.

Approved by:	ABT Systems Ltd
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3024
2015-07-09 13:23:29 +00:00
Andrew Turner
3ad7e84ef5 Add logging of synchronous exceptions.
Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2015-07-09 13:07:12 +00:00
Andrew Turner
7df38eabdc Add the definition of the shareable bits in the pagetables
Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2015-07-09 12:56:09 +00:00
Andrew Turner
144aa0b7f5 Clean up the types used in <machine/ucontext.h> on arm64. As some ports
include this file without first including the headers needed for uint32_t
and the like use the __foo type.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2015-07-09 12:51:50 +00:00
Ed Schouten
3a41ec6af7 Don't clobber td->td_retval[0] in proc_reap().
While writing tests for CloudABI, I noticed that close() on process
descriptors returns the process ID of the child process. This is
interesting, as close() is only allowed to return 0 or -1. It turns out
that we clobber td->td_retval[0] in proc_reap(), so that wait*()
properly returns the process ID.

Change proc_reap() to leave td->td_retval[0] alone. Set the return value
in kern_wait6() instead, by keeping track of the PID before we
(potentially) reap the process.

Differential Revision:	https://reviews.freebsd.org/D3032
Reviewed by:	kib
2015-07-09 12:04:45 +00:00