Commit Graph

14383 Commits

Author SHA1 Message Date
Ed Schouten
5a170c1b0e Add an API for easily creating userspace threads in kernelspace.
This change refactors the existing create_thread() function to be more
generic. It replaces almost all of its arguments by a callback that can
be used to extract the thread ID and copy it out to the right place, but
also to perform additional initialization steps, such as setting the
trapframe. This also makes the difference between thr_new() and
thr_create() more clear in my opinion.

This function is going to be used by the CloudABI compatibility layer.

It looks like the OpenSolaris compatibility framework already provides a
function called thread_create(). Rename this function to
do_thread_create() and use a macro to deal with the namespacing
conflict. A similar approach is already used for thread_exit().

MFC after:	1 month
2015-07-20 10:20:04 +00:00
Alexander Motin
d3e2e28e74 Fix typo in comment.
Submitted by:	Masao Uebayashi
2015-07-20 09:37:42 +00:00
Mark Johnston
97cc6870f6 Don't increment the spin count until after the first attempt to acquire a
rwlock read lock. Otherwise the lockstat:::rw-spin probe will fire
spuriously.

MFC after:	1 week
2015-07-19 22:26:02 +00:00
Kirk McKusick
1b79b9498b Restructure code for readability improvement. No functional change.
Reviewed by: kib
2015-07-19 22:25:16 +00:00
Mark Johnston
de2c95cc00 Consistently use a reader/writer flag for lockstat probes in rwlock(9) and
sx(9), rather than using the probe function name to determine whether a
given lock is a read lock or a write lock. Update lockstat(1) accordingly.
2015-07-19 22:24:33 +00:00
Mark Johnston
32cd0147fa Implement the lockstat provider using SDT(9) instead of the custom provider
in lockstat.ko. This means that lockstat probes now have typed arguments and
will utilize SDT probe hot-patching support when it arrives.

Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D2993
2015-07-19 22:14:09 +00:00
Marcelo Araujo
f19e47d691 Add support to the jail framework to be able to mount linsysfs(5) and
linprocfs(5).

Differential Revision:	D2846
Submitted by:		Nikolai Lifanov <lifanov@mail.lifanov.com>
Reviewed by:		jamie
2015-07-19 08:52:35 +00:00
Konstantin Belousov
283dfee925 Further cleanup after r285607.
Remove useless release semantic for some stores to it_need.  For
stores where the release is needed, add a comment explaining why.

Fence after the atomic_cmpset() op on the it_need should be acquire
only, release is not needed (see above).  The combination of
atomic_cmpset() + fence_acq() is better expressed there as
atomic_cmpset_acq().

Use atomic_cmpset() for swi' ih_need read and clear.

Discussed with:	alc, bde
Reviewed by:	bde
Comments wording provided by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-18 19:59:29 +00:00
Konstantin Belousov
b4490c6e93 The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig.  Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig.  p_xexit contains complete status
and copied out into si_status.

Requested by:	Joerg Schilling
Reviewed by:	jilles (previous version), pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-07-18 09:02:50 +00:00
Mark Johnston
c6d48c8752 Fix the !KDTRACE_HOOKS build.
X-MFC-With:	r285664
2015-07-18 04:38:11 +00:00
Mark Johnston
e2b25737ee Pass the lock object to lockstat_nsecs() and return immediately if
LO_NOPROFILE is set. Some timecounter handlers acquire a spin mutex, and
we don't want to recurse if lockstat probes are enabled.

PR:		201642
Reviewed by:	avg
MFC after:	3 days
2015-07-18 00:57:30 +00:00
Mark Johnston
efe8b26b82 Modify lockstat_nsecs() to just return unless lockstat probes are actually
enabled. The cost of a timecounter read can be quite significant, and the
problem became more apparent after r284297, since that change resulted in
a call to lockstat_nsecs() for each acquisition of an rwlock read lock.

PR:		201642
Reviewed by:	avg
Tested by:	Jason Unovitch
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D3073
2015-07-18 00:22:00 +00:00
Ed Schouten
fd054c2df9 Undo r285656.
It turns out that the CDDL sources already introduce a function called
thread_create(). I'll investigate what we can do to make these functions
coexist.

Reported by: Ivan Klymenko
2015-07-17 22:26:45 +00:00
Ed Schouten
82a3d2cbfc Add an API for easily creating userspace threads in kernelspace.
This change refactors the existing create_thread() function to be more
generic. It replaces almost all of its arguments by a callback that can
be used to extract the thread ID and copy it out to the right place, but
also to perform additional initialization steps, such as setting the
trapframe. This also makes the difference between thr_new() and
thr_create() more clear in my opinion.

This function is going to be used by the CloudABI compatibility layer.

Reviewed by:	kib
MFC after:	1 month
2015-07-17 16:34:01 +00:00
Mateusz Guzik
2919a0c5c1 fd: partially deduplicate fdescfree and fdescfree_remapped
This also moves vrele of cdir/rdir/jdir vnodes earlier, which should not
matter.
2015-07-16 15:26:37 +00:00
Mateusz Guzik
cd672ca60f Get rid of lim_update_thread and cred_update_thread.
Their primary use was in thread_cow_update to free up old resources.
Freeing had to be done with proc lock held and _cow_ funcs already knew
how to free old structs.
2015-07-16 14:30:11 +00:00
Mateusz Guzik
752fc07d33 vfs: implement v_holdcnt/v_usecount manipulation using atomic ops
Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by:	kib (earlier version)
Tested by:	pho
2015-07-16 13:57:05 +00:00
Ed Schouten
457f7e23b1 Implement CloudABI's exec() call.
Summary:
In a runtime that is purely based on capability-based security, there is
a strong emphasis on how programs start their execution. We need to make
sure that we execute an new program with an exact set of file
descriptors, ensuring that credentials are not leaked into the process
accidentally.

Providing the right file descriptors is just half the problem. There
also needs to be a framework in place that gives meaning to these file
descriptors. How does a CloudABI mail server know which of the file
descriptors corresponds to the socket that receives incoming emails?
Furthermore, how will this mail server acquire its configuration
parameters, as it cannot open a configuration file from a global path on
disk?

CloudABI solves this problem by replacing traditional string command
line arguments by tree-like data structure consisting of scalars,
sequences and mappings (similar to YAML/JSON). In this structure, file
descriptors are treated as a first-class citizen. When calling exec(),
file descriptors are passed on to the new executable if and only if they
are referenced from this tree structure. See the cloudabi-run(1) man
page for more details and examples (sysutils/cloudabi-utils).

Fortunately, the kernel does not need to care about this tree structure
at all. The C library is responsible for serializing and deserializing,
but also for extracting the list of referenced file descriptors. The
system call only receives a copy of the serialized data and a layout of
what the new file descriptor table should look like:

    int proc_exec(int execfd, const void *data, size_t datalen, const int *fds,
              size_t fdslen);

This change introduces a set of fd*_remapped() functions:

- fdcopy_remapped() pulls a copy of a file descriptor table, remapping
  all of the file descriptors according to the provided mapping table.
- fdinstall_remapped() replaces the file descriptor table of the process
  by the copy created by fdcopy_remapped().
- fdescfree_remapped() frees the table in case we aborted before
  fdinstall_remapped().

We then add a function exec_copyin_data_fds() that builds on top these
functions. It copies in the data and constructs a new remapped file
descriptor. This is used by cloudabi_sys_proc_exec().

Test Plan:
cloudabi-run(1) is capable of spawning processes successfully, providing
it data and file descriptors. procstat -f seems to confirm all is good.
Regular FreeBSD processes also work properly.

Reviewers: kib, mjg

Reviewed By: mjg

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3079
2015-07-16 07:05:42 +00:00
Konstantin Belousov
70a3efc14f Do not use atomic_swap_int(9), it is not available on all
architectures.  Atomic_cmpset_int(9) is a direct replacement, due to
loop.  The change fixes arm, arm64, mips an sparc64, which lack
atomic_swap().

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-15 21:44:16 +00:00
Konstantin Belousov
615b6ea2c8 Reset non-zero it_need indicator to zero atomically with fetching its
current value.  It is believed that the change is the real fix for the
issue which was covered over by the r252683.

With the current code, if the interrupt handler sets it_need between
read and consequent reset, the update could be lost and
ithread_execute_handlers() would not be called in response to the lost
update.

The r252683 could have hide the issue since at the moment of commit,
atomic_load_acq_int() did locked cmpxchg on the variable, which puts
the cache line into the exclusive owned state and clears store
buffers.  Then the immediate store of zero has very high chance of
reusing the exclusive state of the cache line and make the load and
store sequence operate as atomic swap.

For now, add the acq+rel fence immediately after the swap, to not
disturb current (but excessive) ordering.  Acquire is needed for the
ih_need reads after the load, while release does not serve a useful
purpose [*].

Reviewed by:	alc
Noted by:	alc [*]
Discussed with:	bde
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-15 17:36:35 +00:00
Konstantin Belousov
03bbcb2f0c Style. Remove excessive brackets. Compare non-boolean with zero.
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-15 17:14:05 +00:00
Mark Johnston
02d131ad11 Fix some error-handling bugs when core dump compression is enabled:
- Ensure that core dump parameters are initialized in the error path.
- Don't call gzio_fini() on a NULL stream.

Reported by:	rpaulo
2015-07-14 18:24:05 +00:00
Conrad Meyer
0c40f3532d Fix cleanup race between unp_dispose and unp_gc
unp_dispose and unp_gc could race to teardown the same mbuf chains, which
can lead to dereferencing freed filedesc pointers.

This patch adds an IGNORE_RIGHTS flag on unpcbs marking the unpcb's RIGHTS
as invalid/freed. The flag is protected by UNP_LIST_LOCK.

To serialize against unp_gc, unp_dispose needs the socket object. Change the
dom_dispose() KPI to take a socket object instead of an mbuf chain directly.

PR:		194264
Differential Revision:	https://reviews.freebsd.org/D3044
Reviewed by:	mjg (earlier version)
Approved by:	markj (mentor)
Obtained from:	mjg
MFC after:	1 month
Sponsored by:	EMC / Isilon Storage Division
2015-07-14 02:00:50 +00:00
Mateusz Guzik
6161705823 exec: textvp -> oldtextvp; binvp -> newtextvp
This makes it consistent with the rest of the naming in do_execve.

No functional changes.
2015-07-14 01:13:37 +00:00
Mateusz Guzik
853be5ffef exec plug a redundant vref + vrele of the image vnode 2015-07-14 00:43:08 +00:00
Mateusz Guzik
e94e50af1d racct: perform a lockless check for p_throttled
This reduces proc lock contention.

Reviewed by:	trasz
2015-07-13 22:52:11 +00:00
Conrad Meyer
c578e0fb48 pipe_direct_write: Fix mismatched pipelock/unlock
If a signal is caught in pipelock, causing it to fail, pipe_direct_write
should not try to pipeunlock.

Reported by:	pho
Differential Revision:	https://reviews.freebsd.org/D3069
Reviewed by:	kib
Approved by:	markj (mentor)
MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2015-07-13 17:45:22 +00:00
Ian Lepore
969fc29e0b Use the monotonic (uptime) counter rather than time-of-day to measure elapsed
time between ntp_adjtime() clock offset adjustments.  This eliminates spurious
frequency steering after a large clock step (such as a 1970->2015 step on a
system with no battery-backed clock hardware).

This problem was discovered after the import of ntpd 4.2.8, which does things
in a slightly different (but still correct) order than the 4.2.4 we had
previously.  In particular, 4.2.4 would step the clock then immediately after
use ntp_adjtime() to set the frequency and offset to zero, which captured the
post-step time-of-day as a side effect.  In 4.2.8, ntpd sets frequency and
offset to zero before any initial clock step, capturing the time as 1970-ish,
then when it next calls ntp_adjtime() it's with a non-zero offset measurement.
This non-zero value gets multiplied by the apparent 45-year interval, which
blows up into a completely bogus frequency steer.  That gets clamped to
500ppm, but that's still enough to make the clock drift so fast that ntpd has
to keep stepping it every few minutes to compensate.
2015-07-12 18:38:17 +00:00
Bjoern A. Zeeb
97fc027722 Try to unbreak the build after r285390 removing the obsolete static
declaration.
2015-07-12 00:26:22 +00:00
Mateusz Guzik
c634b75204 vfs: always clear VI_OWEINACT in consumers bumping v_usecount
Previously vputx would detect the condition and clear the flag.

With this change it is invalid to have both v_usecount > 0 and the flag
set. Assert the condition is met in all revlevant places.

Reviewed by:	kib
2015-07-11 16:28:55 +00:00
Mateusz Guzik
2d1ca3cdff vfs: move si_usecount manipulation to dedicated functions
Reviewed by:	kib
2015-07-11 16:28:12 +00:00
Mateusz Guzik
8a08cec166 Create a dedicated function for ensuring that cdir and rdir are populated.
Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.

Reviewed by:	kib
2015-07-11 16:22:48 +00:00
Mateusz Guzik
f0725a8e1e Move chdir/chroot-related fdp manipulation to kern_descrip.c
Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by:	kib
2015-07-11 16:19:11 +00:00
Adrian Chadd
871ef8b0d8 Regenerate syscalls. 2015-07-11 15:22:11 +00:00
Adrian Chadd
6520495abc Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
  the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
  if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
  policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
  policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
  set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
  issue, but has not been able to verify it (it doesn't look NUMA
  related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
  all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
  NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@.  The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus.  My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
  unbalanced memory setup whilst under high memory pressure, VM page allocation
  may fail leading to a kernel panic.  This was a problem in the past, but it's
  much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
  affect contigmalloc, KVA placement, UMA, etc.  So, driver placement of memory
  isn't really guaranteed in any way.  That's next on my plate.

Sponsored by:	Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
Konstantin Belousov
cf88021ab1 Do not allow creation of the dirty buffers for the dead buffer
objects, i.e. for buffer objects which vnode was reclaimed.  Buffer
cache cannot write such buffers.  Return the error and discard the
buffer immediately on write attempt.

BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks.  Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.

Reported and tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-11 11:21:56 +00:00
Ed Schouten
ea566832d7 Add missing const keyword to kern_sigaction()'s 'act' parameter.
This structure is not modified by the function. Also add const to
sigact_flag_test(), as it is called by kern_sigaction().
2015-07-10 14:39:46 +00:00
Mateusz Guzik
9a1ad66fb5 fd: further cleanup of kern_dup
- make mode enum start from 0 so that the assertion covers all cases [1]
- rename prefix _CLOEXEC flag with _FLAG
- postpone fhold on the old file descriptor, which eliminates the need to fdrop
  in error cases.
- fixup FDDUP_FCNTL check missed in the previous commit

This removes 'fp == oldfde->fde_file' assertion which had little value. kern_dup
only calls fd-related functions which cannot drop the lock or a whole lot of
races would be introduced.

Noted by: kib [1]
2015-07-10 13:54:03 +00:00
Mateusz Guzik
5fe97c20dc fd: split kern_dup flags argument into actual flags and a mode
Tidy up the code inside to switch on the mode.
2015-07-10 11:01:30 +00:00
Konstantin Belousov
e8677f3885 Change the mb() use in the sched_ult tdq_notify() and sched_idletd()
to more C11-ish atomic_thread_fence_seq_cst().

Note that on PowerPC, which currently uses lwsync for mb(), the change
actually fixes the missed store/load barrier, intended by r271604 [*].

Reviewed by:	alc
Noted by:	alc [*]
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2015-07-10 08:54:12 +00:00
Ed Schouten
47a84387ad Let listen() return EDESTADDRREQ when not bound.
We currently return EINVAL when calling listen() on a UNIX socket that
has not been bound to a pathname. If my interpretation of POSIX is
correct, we should return EDESTADDRREQ: "The socket is not bound to a
local address, and the protocol does not support listening on an unbound
socket."

Return EDESTADDRREQ instead when not bound and not connected.

Differential Revision:	https://reviews.freebsd.org/D3038
Reviewed by:	gnn, network
2015-07-10 06:47:14 +00:00
Mateusz Guzik
318b946321 vfs: cosmetic changes to namei and namei_handle_root
- don't initialize cnp during declaration
- don't test error/!error, compare to 0 instead
2015-07-09 17:17:26 +00:00
Mateusz Guzik
d177f49f6f vfs: simplify error handling in namei
The logic is reorganised so that there is one exit point prior to the
lookup loop. This is an intermediate step to making audit logging
functions use found vnode instead of translating ni_dirfd on their own.

ni_startdir validation is removed. The only in-tree consumer is nfs
which already makes sure it is a directory.

Reviewed by:	kib
2015-07-09 16:32:58 +00:00
Ed Schouten
2491302a04 Add implementations for some of the CloudABI file descriptor system calls.
All of the CloudABI system calls that operate on file descriptors of an
arbitrary type are prefixed with fd_. This change adds wrappers for
most of these system calls around their FreeBSD equivalents.

The dup2() system call present on CloudABI deviates from POSIX, in the
sense that it can only be used to replace existing file descriptor. It
cannot be used to create new ones. The reason for this is that this is
inherently thread-unsafe. Furthermore, there is no need on CloudABI to
use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no
special meaning.

This change exposes the kern_dup() through <sys/syscallsubr.h> and puts
the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag,
FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not
allocated.

Differential Revision:	https://reviews.freebsd.org/D3035
Reviewed by:	mjg
2015-07-09 16:07:01 +00:00
Mateusz Guzik
efdc25304c fd: prepare do_dup for being exported
- rename it to kern_dup.
- prefix flags with FD
- assert that correct flags were passed
2015-07-09 15:19:45 +00:00
Mateusz Guzik
d19ba50e12 vfs: avoid spurious vref/vrele for absolute lookups
namei used to vref fd_cdir, which was immediatley vrele'd on entry to
the loop.

Check for absolute lookup and vref the right vnode the first time.

Reviewed by:	kib
2015-07-09 15:06:58 +00:00
Mateusz Guzik
a03f1b2970 vfs: plug a use-after-free of fd_rdir in namei
fd_rdir vnode was stored in ni_rootdir without refing it in any way,
after which the filedsc lock was being dropped.

The vnode could have been freed by mountcheckdirs or another thread doing
chroot.

VREF the vnode while the lock is held.

Reviewed by:	kib
MFC after:	1 week
2015-07-09 15:06:24 +00:00
Ed Schouten
3a41ec6af7 Don't clobber td->td_retval[0] in proc_reap().
While writing tests for CloudABI, I noticed that close() on process
descriptors returns the process ID of the child process. This is
interesting, as close() is only allowed to return 0 or -1. It turns out
that we clobber td->td_retval[0] in proc_reap(), so that wait*()
properly returns the process ID.

Change proc_reap() to leave td->td_retval[0] alone. Set the return value
in kern_wait6() instead, by keeping track of the PID before we
(potentially) reap the process.

Differential Revision:	https://reviews.freebsd.org/D3032
Reviewed by:	kib
2015-07-09 12:04:45 +00:00
Konstantin Belousov
fcb5b3a419 Cover a race between doselwakeup() and selfdfree(). If doselwakeup()
loop finds the selfd entry and clears its sf_si pointer, which is
handled by selfdfree() in parallel, NULL sf_si makes selfdfree() free
the memory.  The result is the race and accesses to the freed memory.

Refcount the selfd ownership.  One reference is for the sf_link
linkage, which is unconditionally dereferenced by selfdfree().
Another reference is for sf_threads, both selfdfree() and
doselwakeup() race to deref it, the winner unlinks and than frees the
selfd entry.

Reported by:	Larry Rosenman <ler@lerctr.org>
Tested by:	Larry Rosenman <ler@lerctr.org>, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-09 09:22:21 +00:00
Ed Schouten
6d338f9a81 Import the CloudABI datatypes and create a system call table.
CloudABI is a pure capability-based runtime environment for UNIX. It
works similar to Capsicum, except that processes already run in
capabilities mode on startup. All functionality that conflicts with this
model has been omitted, making it a compact binary interface that can be
supported by other operating systems without too much effort.

CloudABI is 'secure by default'; the idea is that it should be safe to
run arbitrary third-party binaries without requiring any explicit
hardware virtualization (Bhyve) or namespace virtualization (Jails). The
rights of an application are purely determined by the set of file
descriptors that you grant it on startup.

The datatypes and constants used by CloudABI's C library (cloudlibc) are
defined in separate files called syscalldefs_mi.h (pointer size
independent) and syscalldefs_md.h (pointer size dependent). We import
these files in sys/contrib/cloudabi and wrap around them in
cloudabi*_syscalldefs.h.

We then add stubs for all of the system calls in sys/compat/cloudabi or
sys/compat/cloudabi64, depending on whether the system call depends on
the pointer size. We only have nine system calls that depend on the
pointer size. If we ever want to support 32-bit binaries, we can simply
add sys/compat/cloudabi32 and implement these nine system calls again.

The next step is to send in code reviews for the individual system call
implementations, but also add a sysentvec, to allow CloudABI executabled
to be started through execve().

More information about CloudABI:
- GitHub: https://github.com/NuxiNL/cloudlibc
- Talk at BSDCan: https://www.youtube.com/watch?v=SVdF84x1EdA

Differential Revision:	https://reviews.freebsd.org/D2848
Reviewed by:	emaste, brooks
Obtained from:	https://github.com/NuxiNL/freebsd
2015-07-09 07:20:15 +00:00