Commit Graph

77 Commits

Author SHA1 Message Date
Mark Johnston
c7902fbeae Improve handling of control message truncation.
If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages.  In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table.  Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient.  To simplify cleanup, control message truncation
is now only performed at control message boundaries.

The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
  while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
  message boundary, e.g., if the caller received two control messages
  and provided only the exact amount of buffer space needed for the
  first.

PR:		131876
Reviewed by:	ed (previous version)
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16561
2018-08-07 16:36:48 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
Eitan Adler
e124e27ba2 sys/cloudabi: Avoid relying on GNU specific extensions
An empty initializer list is not technically valid C grammar.

MFC After:	1 week
2018-03-07 14:47:43 +00:00
Eitan Adler
026227b6ae sys/linux: Fix a few potential infoleaks in cloudabi
Submitted by:	Domagoj Stolfa <domagoj.stolfa@gmail.com>
MFC After:	1 month
Sponsored by:	DARPA/AFRL
2018-03-03 21:50:55 +00:00
Ed Schouten
8e04fe5af3 Allow timed waits with relative timeouts on locks and condvars.
Even though pthreads doesn't support this, there are various alternative
APIs that use this. For example, uv_cond_timedwait() accepts a relative
timeout. So does Rust's std::sync::Condvar::wait_timeout().

Though I personally think that relative timeouts are bad (due to
imprecision for repeated operations), it does seem that people want
this. Extend the existing futex functions to keep track of whether an
absolute timeout is used in a boolean flag.

MFC after:	1 month
2018-01-04 21:57:37 +00:00
Ed Schouten
4e1847781b Import the latest CloudABI definitions, version 0.16.
The most important change in this release is the removal of the
poll_fd() system call; CloudABI's equivalent of kevent(). Though I think
that kqueue is a lot saner than many of its alternatives, our
experience is that emulating this system call on other systems
accurately isn't easy. It has become a complex API, even though I'm not
convinced this complexity is needed. This is why we've decided to take a
different approach, by looking one layer up.

We're currently adding an event loop to CloudABI's C library that is API
compatible with libuv (except when incompatible with Capsicum).
Initially, this event loop will be built on top of plain inefficient
poll() calls. Only after this is finished, we'll work our way backwards
and design a new set of system calls to optimize it.

Interesting challenges will include integrating asynchronous I/O into
such a system call API. libuv currently doesn't aio(4) on Linux/BSD, due
to it being unreliable and having undesired semantics.

Obtained from:	https://github.com/NuxiNL/cloudabi
2017-10-18 19:22:53 +00:00
Ed Schouten
733ba7f881 Merge pipes and socket pairs.
Now that CloudABI's sockets API has been changed to be addressless and
only connected socket instances are used (e.g., socket pairs), they have
become fairly similar to pipes. The only differences on CloudABI is that
socket pairs additionally support shutdown(), send() and recv().

To simplify the ABI, we've therefore decided to remove pipes as a
separate file descriptor type and just let pipe() return a socket pair
of type SOCK_STREAM. S_ISFIFO() and S_ISSOCK() are now defined
identically.
2017-09-05 07:46:45 +00:00
Ed Schouten
b53b978a6c Complete the CloudABI networking refactoring.
Now that all of the packaged software has been adjusted to either use
Flower (https://github.com/NuxiNL/flower) for making incoming/outgoing
network connections or can have connections injected, there is no longer
need to keep accept() around. It is now a lot easier to write networked
services that are address family independent, dual-stack, testable, etc.

Remove all of the bits related to accept(), but also to
getsockopt(SO_ACCEPTCONN).
2017-08-30 07:30:06 +00:00
Ed Schouten
8212ad9a99 Sync CloudABI compatibility against the latest upstream version (v0.13).
With Flower (CloudABI's network connection daemon) becoming more
complete, there is no longer any need for creating any unconnected
sockets. Socket pairs in combination with file descriptor passing is all
that is necessary, as that is what is used by Flower to pass network
connections from the public internet to listening processes.

Remove all of the kernel bits that were used to implement socket(),
listen(), bindat() and connectat(). In principle, accept() and
SO_ACCEPTCONN may also be removed, but there are still some consumers
left.

Obtained from:	https://github.com/NuxiNL/cloudabi
MFC after:	1 month
2017-08-25 11:01:39 +00:00
Ed Schouten
cea9310d4e Upgrade to the latest sources generated from the CloudABI specification.
The CloudABI specification has had some minor changes over the last half
year. No substantial features have been added, but some features that
are deemed unnecessary in retrospect have been removed:

- mlock()/munlock():

  These calls tend to be used for two different purposes: real-time
  support and handling of sensitive (cryptographic) material that
  shouldn't end up in swap. The former use case is out of scope for
  CloudABI. The latter may also be handled by encrypting swap.

  Removing this has the advantage that we no longer need to worry about
  having resource limits put in place.

- SOCK_SEQPACKET:

  Support for SOCK_SEQPACKET is rather inconsistent across various
  operating systems. Some operating systems supported by CloudABI (e.g.,
  macOS) don't support it at all. Considering that they are rarely used,
  remove support for the time being.

- getsockname(), getpeername(), etc.:

  A shortcoming of the sockets API is that it doesn't allow you to
  create socket(pair)s, having fake socket addresses associated with
  them. This makes it harder to test applications or transparently
  forward (proxy) connections to them.

  With CloudABI, we're slowly moving networking connectivity into a
  separate daemon called Flower. In addition to passing around socket
  file descriptors, this daemon provides address information in the form
  of arbitrary string labels. There is thus no longer any need for
  requesting socket address information from the kernel itself.

This change also updates consumers of the generated code accordingly.
Even though system calls end up getting renumbered, this won't cause any
problems in practice. CloudABI programs always call into the kernel
through a kernel-supplied vDSO that has the numbers updated as well.

Obtained from:	https://github.com/NuxiNL/cloudabi
2017-07-26 06:57:15 +00:00
Ed Schouten
eb29911a30 Include <sys/systm.h> to obtain the memcpy() prototype.
I got a report of this source file not building on Raspberry Pi. It's
interesting that this only fails for that target and not for others.
Again, that's no reason not to include the right headers.

PR:		217969
Reported by:	Johannes Jost Meixner
MFC after:	1 week
2017-03-24 07:09:33 +00:00
Ed Schouten
75865d0d75 Make file descriptor passing for CloudABI's recvmsg() work.
Similar to the change for sendmsg(), create a pointer size independent
implementation of recvmsg() and let cloudabi32 and cloudabi64 call into
it. In case userspace requests one or more file descriptors, call
kern_recvit() in such a way that we get the control message headers in
an mbuf. Iterate over all of the headers and copy the file descriptors
to userspace.
2017-03-22 19:20:39 +00:00
Ed Schouten
36cc183884 Make file descriptor passing work for CloudABI's sendmsg().
Reduce the potential amount of code duplication between cloudabi32 and
cloudabi64 by creating a cloudabi_sock_recv() utility function. The
cloudabi32 and cloudabi64 modules will then only contain code to convert
the iovecs to the native pointer size.

In cloudabi_sock_recv(), we can now construct an SCM_RIGHTS cmsghdr in
an mbuf and pass that on to kern_sendit().
2017-03-22 06:43:10 +00:00
Konstantin Belousov
496ab0532d Rework r313352.
Rename kern_vm_* functions to kern_*.  Move the prototypes to
syscallsubr.h.  Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by:	alc, jhb
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D9535
2017-02-13 09:04:38 +00:00
Edward Tomasz Napierala
69cdfcef2e Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise(), and use them in various compats
instead of their sys_*() counterparts.

Reviewed by:	ed, dchagin, kib
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9378
2017-02-06 20:57:12 +00:00
Edward Tomasz Napierala
d293f35c09 Add kern_listen(), kern_shutdown(), and kern_socket(), and use them
instead of their sys_*() counterparts in various compats. The svr4
is left untouched, because there's no point.

Reviewed by:	ed@, kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9367
2017-01-30 12:57:22 +00:00
Edward Tomasz Napierala
f67d6b5f12 Add kern_lseek() and use it instead of sys_lseek() in various compats.
I didn't touch svr4/, there's no point.

Reviewed by:	ed@, kib@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9366
2017-01-30 12:24:47 +00:00
Baptiste Daroussin
b4b4b5304b Revert crap accidentally committed 2017-01-28 16:31:23 +00:00
Baptiste Daroussin
814aaaa7da Revert r312923 a better approach will be taken later 2017-01-28 16:30:14 +00:00
Ed Schouten
4423244072 Catch up with changes to structure member names.
Pointer/length pairs are now always named ${name} and ${name}_len.
2017-01-17 22:05:52 +00:00
Mateusz Guzik
02274c93c3 cloudabi: use fget_cap instead of hand-rolling capability read
This has a side effect of unbreaking the build after r306272.

Discussed with:		ed
2016-09-23 23:08:23 +00:00
Mariusz Zaborski
85b0f9de11 capsicum: propagate rights on accept(2)
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR:		201052
Reviewed by:	emaste, jonathan
Discussed with:	many
Differential Revision:	https://reviews.freebsd.org/D7724
2016-09-22 09:58:46 +00:00
Ed Schouten
4fbc90654c Move the linker script from cloudabi64/ to cloudabi/.
It turns out that it works perfectly fine for generating 32-bits vDSOs
as well. While there, get rid of the extraneous .s file extension.
2016-08-21 15:14:06 +00:00
Ed Schouten
384ef4841a Use memcpy() to copy 64-bit timestamps into the syscall return values.
On 32-bit platforms, our 64-bit timestamps need to be split up across
two registers. A simple assignment to td_retval[0] will cause the top 32
bits to get lost. By using memcpy(), we will automatically either use 1
or 2 registers depending on the size of register_t.
2016-08-21 07:41:11 +00:00
Ed Schouten
a5c3c5b14f Import the new automatically generated system call table for CloudABI.
Now that we've switched over to using the vDSO on CloudABI, it becomes a
lot easier for us to phase out old features. System call numbering is no
longer something that's part of the ABI. It's fully based on names. As
long as the numbering used by the kernel and the vDSO is consistent
(which it always is), it's all right.

Let's put this to the test by removing a system call (thread_tcb_set())
that's already unused for quite some time now, but was only left intact
to serve as a placeholder. Sync in the new system call table that uses
alphabetic sorting of system calls.

Obtained from:	https://github.com/NuxiNL/cloudabi
2016-08-19 17:49:35 +00:00
Ed Schouten
93d9ebd82e Eliminate use of sys_fsync() and sys_fdatasync().
Make the kern_fsync() function public, so that it can be used by other
parts of the kernel. Fix up existing consumers to make use of it.

Requested by:	kib
2016-08-15 20:11:52 +00:00
Ed Schouten
2256eed3eb Let CloudABI use fdatasync() as well.
Now that FreeBSD supports fdatasync() natively, we can tidy up
CloudABI's equivalent system call to use that instead.
2016-08-15 19:42:21 +00:00
Ed Schouten
13b4b4df98 Provide the CloudABI vDSO to its executables.
CloudABI executables already provide support for passing in vDSOs. This
functionality is used by the emulator for OS X to inject system call
handlers. On FreeBSD, we could use it to optimize calls to
gettimeofday(), etc.

Though I don't have any plans to optimize any system calls right now,
let's go ahead and already pass in a vDSO. This will allow us to
simplify the executables, as the traditional "syscall" shims can be
removed entirely. It also means that we gain more flexibility with
regards to adding and removing system calls.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D7438
2016-08-10 21:02:41 +00:00
Konstantin Belousov
2a339d9e3d Add implementation of robust mutexes, hopefully close enough to the
intention of the POSIX IEEE Std 1003.1TM-2008/Cor 1-2013.

A robust mutex is guaranteed to be cleared by the system upon either
thread or process owner termination while the mutex is held.  The next
mutex locker is then notified about inconsistent mutex state and can
execute (or abandon) corrective actions.

The patch mostly consists of small changes here and there, adding
neccessary checks for the inconsistent and abandoned conditions into
existing paths.  Additionally, the thread exit handler was extended to
iterate over the userspace-maintained list of owned robust mutexes,
unlocking and marking as terminated each of them.

The list of owned robust mutexes cannot be maintained atomically
synchronous with the mutex lock state (it is possible in kernel, but
is too expensive).  Instead, for the duration of lock or unlock
operation, the current mutex is remembered in a special slot that is
also checked by the kernel at thread termination.

Kernel must be aware about the per-thread location of the heads of
robust mutex lists and the current active mutex slot.  When a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs about location of lists heads.

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust.

Somewhat unrelated changes in the patch:
1. Style.
2. The fix for proper tdfind() call use in umtxq_sleep_pi() for shared
   pi mutexes.
3. Removal of the userspace struct pthread_mutex m_owner field.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
   the lifetime of the shared mutex associated with a vnode' page.

Reviewed by:	jilles (previous version, supposedly the objection was fixed)
Discussed with:	brooks, Martin Simmons <martin@lispworks.com> (some aspects)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-05-17 09:56:22 +00:00
Ed Schouten
38526a2cf1 Sync in the latest CloudABI system call definitions.
Some time ago I made a change to merge together the memory scope
definitions used by mmap (MAP_{PRIVATE,SHARED}) and lock objects
(PTHREAD_PROCESS_{PRIVATE,SHARED}). Though that sounded pretty smart
back then, it's backfiring. In the case of mmap it's used with other
flags in a bitmask, but for locking it's an enumeration. As our plan is
to automatically generate bindings for other languages, that looks a bit
sloppy.

Change all of the locking functions to use separate flags instead.

Obtained from:	https://github.com/NuxiNL/cloudabi
2016-03-31 18:50:06 +00:00
Ed Schouten
1f3bbfd875 Replace the CloudABI system call table by a machine generated version.
The type definitions and constants that were used by COMPAT_CLOUDABI64
are a literal copy of some headers stored inside of CloudABI's C
library, cloudlibc. What is annoying is that we can't make use of
cloudlibc's system call list, as the format is completely different and
doesn't provide enough information. It had to be synced in manually.

We recently decided to solve this (and some other problems) by moving
the ABI definitions into a separate file:

	https://github.com/NuxiNL/cloudabi/blob/master/cloudabi.txt

This file is processed by a pile of Python scripts to generate the
header files like before, documentation (markdown), but in our case more
importantly: a FreeBSD system call table.

This change discards the old files in sys/contrib/cloudabi and replaces
them by the latest copies, which requires some minor changes here and
there. Because cloudabi.txt also enforces consistent names of the system
call arguments, we have to patch up a small number of system call
implementations to use the new argument names.

The new header files can also be included directly in FreeBSD kernel
space without needing any includes/defines, so we can now remove
cloudabi_syscalldefs.h and cloudabi64_syscalldefs.h. Patch up the
sources to include the definitions directly from sys/contrib/cloudabi
instead.
2016-03-24 21:47:15 +00:00
Ed Schouten
c0af8d16d8 Call cap_rights_init() properly.
Even though or'ing the individual rights works in this specific case, it
may not work in general. Pass them in as varargs.
2016-02-24 10:54:26 +00:00
Ed Schouten
70907712be Make handling of mmap()'s prot argument more strict.
- Make the system call fail if prot contains bits other than read, write
  and exec.
- Similar to OpenBSD's W^X, don't allow write and exec to be set at the
  same time. I'd like to see for now what happens if we enforce this
  policy unconditionally. If it turns out that this is far too strict,
  we'll loosen this requirement.
2016-02-23 09:22:00 +00:00
Mateusz Guzik
813361c140 fork: plug a use after free of the returned process
fork1 required its callers to pass a pointer to struct proc * which would
be set to the new process (if any). procdesc and racct manipulation also
used said pointer.

However, the process could have exited prior to do_fork return and be
automatically reaped, thus making this a use-after-free.

Fix the problem by letting callers indicate whether they want the pid or
the struct proc, return the process in stopped state for the latter case.

Reviewed by:	kib
2016-02-04 04:25:30 +00:00
Mateusz Guzik
33fd9b9a2b fork: pass arguments to fork1 in a dedicated structure
Suggested by:	kib
2016-02-04 04:22:18 +00:00
Ed Schouten
808d980506 Properly format pointer size independent CloudABI system calls.
CloudABI has approximately 50 system calls that do not depend on the
pointer size of the system. As the ABI is pretty compact, it takes
little effort to each truss(8) the formatting rules for these system
calls. Start off by formatting pointer size independent system calls.

Changes:

- Make it possible to include the CloudABI system call definitions in
  FreeBSD userspace builds. Add ${root}/sys to the truss(8) Makefile so
  we can pull in <compat/cloudabi/cloudabi_syscalldefs.h>.
- Refactoring: patch up amd64-cloudabi64.c to use the CLOUDABI_*
  constants instead of rolling our own table.
- Add table entries for all of the system calls.
- Add new generic formatting types (UInt, IntArray) that we'll be using
  to format unsigned integers and arrays of integers.
- Add CloudABI specific formatting types.

Approved by:	jhb
Differential Revision:	https://reviews.freebsd.org/D3836
2015-10-08 05:27:45 +00:00
Ed Schouten
bc1ace0b96 Decompose linkat()/renameat() rights to source and target.
To make it easier to understand how Capsicum interacts with linkat() and
renameat(), rename the rights to CAP_{LINK,RENAME}AT_{SOURCE,TARGET}.

This also addresses a shortcoming in Capsicum, where it isn't possible
to disable linking to files stored in a directory. Creating hardlinks
essentially makes it possible to access files with additional rights.

Reviewed by:	rwatson, wblock
Differential Revision:	https://reviews.freebsd.org/D3411
2015-08-27 15:16:41 +00:00
Ed Schouten
edcf7fbf59 Don't forget to invoke pre_execve() and post_execve().
CloudABI's proc_exec() was implemented before r282708 introduced
pre_execve() and post_execve(). Sync up by adding these missing calls.
2015-08-17 13:07:12 +00:00
Ed Schouten
2c20fbe43a Use CAP_EVENT instead of CAP_PDWAIT.
The cloudlibc pdwait() function ends up using FreeBSD's kqueue() in
combination with EVFILT_PROCDESC. This depends on CAP_EVENT -- not
CAP_PDWAIT.

Obtained from:	https://github.com/NuxiNL/freebsd
2015-08-12 11:07:03 +00:00
Ed Schouten
55a224afa2 Fall back to O_RDONLY -- not O_WRONLY.
If CloudABI processes open files with a set of requested rights that do
not match any of the privileges granted by O_RDONLY, O_WRONLY or O_RDWR,
we'd better fall back to O_RDONLY -- not O_WRONLY.
2015-08-11 14:08:46 +00:00
Ed Schouten
9d9123a80d Properly convert the error number to CloudABI's indexing.
We currently return FreeBSD's errno value directly, which is of course
not correct.
2015-08-11 14:07:04 +00:00
Ed Schouten
65c17fe451 Make cap_rights_limit() work for CloudABI processes.
Call into the recently introduced kern_cap_rights_limit() function to
restrict rights.
2015-08-11 08:44:19 +00:00
Ed Schouten
0f85ff377b Add file_open(): the underlying system call of openat().
CloudABI purely operates on file descriptor rights (CAP_*). File
descriptor access modes (O_ACCMODE) are emulated on top of rights.

Instead of accepting the traditional flags argument, file_open() copies
in an fdstat_t object that contains the initial rights the descriptor
should have, but also file descriptor flags that should persist after
opening (APPEND, NONBLOCK, *SYNC). Only flags that don't persist (EXCL,
TRUNC, CREAT, DIRECTORY) are passed in as an argument.

file_open() first converts the rights, the persistent flags and the
non-persistent flags to fflags. It then calls into vn_open(). If
successful, it installs the file descriptor with the requested
rights, trimming off rights that don't apply to the type of
the file that has been opened.

Unlike kern_openat(), this function does not support /dev/fd/*. I can't
think of a reason why we need to support this for CloudABI.

Obtained from:	https://github.com/NuxiNL/freebsd
Differential Revision:	https://reviews.freebsd.org/D3235
2015-08-06 06:47:28 +00:00
Ed Schouten
aaf53ab2aa Correct the previous commit: remove the DECLARE_MODULE().
It looks like a MODULE_VERSION() can also appear on its own -- there is
no need to use explicitly use DECLARE_MODULE(). Looking at other
modules, this seems common practice.
2015-08-05 16:53:49 +00:00
Ed Schouten
b6efa27589 Add DECLARE_MODULE() to the "cloudabi" kernel module.
This kernel module does not require any explicit initialization, but a
module declaration is needed to let the "cloudabi64" kernel module
automatically pull this in.

Obtained from:	https://github.com/NuxiNL/freebsd
2015-08-05 16:45:47 +00:00
Ed Schouten
36310bcd1d Make fcntl(F_SETFL) work.
The stat_put() system call can be used to modify file descriptor
attributes, such as flags, but also Capsicum permission bits. Support
for changing Capsicum bits will be added as soon as its dependent
changes have been pushed through code review.

Obtained from:	https://github.com/NuxiNL/freebsd
2015-08-05 16:15:43 +00:00
Ed Schouten
db1c8ee585 Add the remaining pointer size independent CloudABI socket system calls.
CloudABI uses a structure called cloudabi_sockstat_t. Think of it as
'struct stat' for sockets. It is used by functions such as
getsockname(), getpeername(), some of the getsockopt() values, etc.

This change implements the sock_stat_get() system call that returns a
copy of this structure. The accept() system call should also return a
full copy of this structure eventually, but for now we're only
interested in the peer address. Add a TODO() to make sure this is
patched up later on.

Differential Revision:	https://reviews.freebsd.org/D3218
2015-08-05 08:18:05 +00:00
Ed Schouten
4958fab8cd Allow the creation of polling descriptors (kqueues) on CloudABI. 2015-08-05 07:37:06 +00:00
Ed Schouten
0c0964844e Let the CloudABI futex code use umtx_keys.
The CloudABI kernel still passes all of the cloudlibc unit tests.

Reviewed by:	vangyzen
Differential Revision:	https://reviews.freebsd.org/D3286
2015-08-04 06:02:03 +00:00
Ed Schouten
f52c3dd415 Allow CloudABI processes to create shared memory objects.
Summary:
Use the newly created `kern_shm_open()` function to create objects with
just the rights that are actually needed.

Reviewers: jhb, kib

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3260
2015-08-01 07:51:48 +00:00