from taskqueue_enqueue() instead of reading "ta_pending" unlocked and
also ensure the callout is stopped before proceeding.
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.
MFC after: 1 week
sysent.
sv_prepsyscall is unused.
sv_sigsize and sv_sigtbl translate signal number from the FreeBSD
namespace into the ABI domain. It is only utilized on i386 for iBCS2
binaries. The issue with this approach is that signals for iBCS2 were
delivered with the FreeBSD signal frame layout, which does not follow
iBCS2. The same note is true for any other potential user if
sv_sigtbl. In other words, if ABI needs signal number translation, it
really needs custom sv_sendsig method instead.
Sponsored by: The FreeBSD Foundation
sysentvec. This allows the timekeep data to be shared between similar
ABIs which cannot share sysentvec.
Make the timekeep_push_vdso() tick callback to the timekeep structures
instead of sysentvecs. If several sysentvec share the vdso_sv_tk
structure, we would update the userspace data several times on each
tick, without the change.
Only allocate vdso_sv_tk in the exec_sysvec_init() sysinit when
sysentvec is marked with the new SV_TIMEKEEP flag. This saves
allocation and update of unneeded vdso_sv_tk for ABIs which do not
provide userspace gettimeofday yet, which are PowerPCs arches right
now.
Make vdso_sv_tk allocator public, namely split out and export
alloc_sv_tk() and alloc_sv_tk_compat32(). ABIs which share timekeep
data now can allocate it manually and share as appropriate.
Requested by: nwhitehorn
Tested by: nwhitehorn, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
- Add some missing I/O functions for non-i386 and amd64 platforms.
- Stub ioremap() to NULL using a macro to ensure non-existing memory
attributes are not referred when they do not exist.
- Add more header files to linux/list.h to resolve driver compilation
issues on Sparc64 and PowerPC platforms.
Sponsored by: Mellanox Technologies
The code compiles fine under Clang, but GCC on PPC is less permissive about
integer and pointer sizes. (An intmax_t is clearly *large enough* to hold a
pointer value.)
Another follow-up to r290475.
Reported by: jhibbits
Sponsored by: EMC / Isilon Storage Division
- Move all files related to the LinuxKPI into sys/compat/linuxkpi and
its subfolders.
- Update sys/conf/files and some Makefiles to use new file locations.
- Added description of COMPAT_LINUXKPI to sys/conf/NOTES which in turn
adds the LinuxKPI to all LINT builds.
- The LinuxKPI can be added to the kernel by setting the
COMPAT_LINUXKPI option. The OFED kernel option no longer builds the
LinuxKPI into the kernel. This was done to keep the build rules for
the LinuxKPI in sys/conf/files simple.
- Extend the LinuxKPI module to include support for USB by moving the
Linux USB compat from usb.ko to linuxkpi.ko.
- Bump the FreeBSD_version.
- A universe kernel build has been done.
Reviewed by: np @ (cxgb and cxgbe related changes only)
Sponsored by: Mellanox Technologies
In order to make it easier to support CloudABI on ARM64, move out all of
the bits from the AMD64 cloudabi_sysvec.c into a new file
cloudabi_module.c that would otherwise remain identical. This reduces
the AMD64 specific code to just ~160 lines.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D3974
CloudABI has approximately 50 system calls that do not depend on the
pointer size of the system. As the ABI is pretty compact, it takes
little effort to each truss(8) the formatting rules for these system
calls. Start off by formatting pointer size independent system calls.
Changes:
- Make it possible to include the CloudABI system call definitions in
FreeBSD userspace builds. Add ${root}/sys to the truss(8) Makefile so
we can pull in <compat/cloudabi/cloudabi_syscalldefs.h>.
- Refactoring: patch up amd64-cloudabi64.c to use the CLOUDABI_*
constants instead of rolling our own table.
- Add table entries for all of the system calls.
- Add new generic formatting types (UInt, IntArray) that we'll be using
to format unsigned integers and arrays of integers.
- Add CloudABI specific formatting types.
Approved by: jhb
Differential Revision: https://reviews.freebsd.org/D3836
r161611 added some of the code from sys_vfork() directly into the Linux
module wrappers since they use RFSTOPPED. In r232240, the RFFPWAIT handling
was moved to syscallret(), thus this code in the Linux module is no longer
needed as it will be called later.
This also allows the Linux wrappers to benefit from the fix in r275616 for
threads not getting suspended if their vforked child is stopped while they
wait on them.
Reviewed by: jhb, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3828
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.
Perhaps SDT_PROBE should be made a private implementation detail.
MFC after: 20 days
To make it easier to understand how Capsicum interacts with linkat() and
renameat(), rename the rights to CAP_{LINK,RENAME}AT_{SOURCE,TARGET}.
This also addresses a shortcoming in Capsicum, where it isn't possible
to disable linking to files stored in a directory. Creating hardlinks
essentially makes it possible to access files with additional rights.
Reviewed by: rwatson, wblock
Differential Revision: https://reviews.freebsd.org/D3411
There is still one TODO item for these calls: add file descriptor
passing. The data structures are already prepared for this. It's just
the translation that's missing.
Obtained from: http://github.com/NuxiNL/freebsd
The cloudlibc pdwait() function ends up using FreeBSD's kqueue() in
combination with EVFILT_PROCDESC. This depends on CAP_EVENT -- not
CAP_PDWAIT.
Obtained from: https://github.com/NuxiNL/freebsd
Blocking on locks and condition variables can be accomplished by polling
and using the special filters CONDVAR, LOCK_RDLOCK and LOCK_WRLOCK.
For now it wouldn't make sense to implement this functionality into
kqueue() itself, for the reason that they are CloudABI specific and
would require us to resize 'struct kevent' to hold all of the parameters
of interest.
Add a bandaid to the CloudABI poll system call to call into the futex
code directly if it detects specific combinations of events that are
used by the C library.
Obtained from: https://github.com/NuxiNL/freebsd
This change implements two functions, cloudabi64_kevent_copyin() and
cloudabi64_kevent_copyout(), that convert CloudABI structures to
FreeBSD's struct kevent. CloudABI uses two structures: subscription_t
and event_t. The former is used for input, whereas the latter is used
for output. Unlike struct kevent, fields aren't overloaded for multiple
purposes or for separate event types.
For poll() we call into the newly introduced kern_kevent_anonymous()
function that allows us to poll without a file descriptor. This function
is not only used by poll(), but also by functions such as
sleep() and clock_nanosleep().
Reviewed by: jmg
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3308
If CloudABI processes open files with a set of requested rights that do
not match any of the privileges granted by O_RDONLY, O_WRONLY or O_RDWR,
we'd better fall back to O_RDONLY -- not O_WRONLY.
CloudABI purely operates on file descriptor rights (CAP_*). File
descriptor access modes (O_ACCMODE) are emulated on top of rights.
Instead of accepting the traditional flags argument, file_open() copies
in an fdstat_t object that contains the initial rights the descriptor
should have, but also file descriptor flags that should persist after
opening (APPEND, NONBLOCK, *SYNC). Only flags that don't persist (EXCL,
TRUNC, CREAT, DIRECTORY) are passed in as an argument.
file_open() first converts the rights, the persistent flags and the
non-persistent flags to fflags. It then calls into vn_open(). If
successful, it installs the file descriptor with the requested
rights, trimming off rights that don't apply to the type of
the file that has been opened.
Unlike kern_openat(), this function does not support /dev/fd/*. I can't
think of a reason why we need to support this for CloudABI.
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3235
It looks like a MODULE_VERSION() can also appear on its own -- there is
no need to use explicitly use DECLARE_MODULE(). Looking at other
modules, this seems common practice.
This kernel module does not require any explicit initialization, but a
module declaration is needed to let the "cloudabi64" kernel module
automatically pull this in.
Obtained from: https://github.com/NuxiNL/freebsd
The stat_put() system call can be used to modify file descriptor
attributes, such as flags, but also Capsicum permission bits. Support
for changing Capsicum bits will be added as soon as its dependent
changes have been pushed through code review.
Obtained from: https://github.com/NuxiNL/freebsd
CloudABI uses a structure called cloudabi_sockstat_t. Think of it as
'struct stat' for sockets. It is used by functions such as
getsockname(), getpeername(), some of the getsockopt() values, etc.
This change implements the sock_stat_get() system call that returns a
copy of this structure. The accept() system call should also return a
full copy of this structure eventually, but for now we're only
interested in the peer address. Add a TODO() to make sure this is
patched up later on.
Differential Revision: https://reviews.freebsd.org/D3218
On CloudABI we want to create file descriptors with just the minimal set
of Capsicum rights in place. The reason for this is that it makes it
easier to obtain uniform behaviour across different operating systems.
By explicitly whitelisting the operations, we can return consistent
error codes, but also prevent applications from depending OS-specific
behaviour.
Extend kern_kqueue() to take an additional struct filecaps that is
passed on to falloc_caps(). Update the existing consumers to pass in
NULL.
Differential Revision: https://reviews.freebsd.org/D3259
Summary:
Use the newly created `kern_shm_open()` function to create objects with
just the rights that are actually needed.
Reviewers: jhb, kib
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3260
On CloudABI, the rights bits returned by cap_rights_get() match up with
the operations that you can actually perform on the file descriptor.
Limiting the rights is good, because it makes it easier to get uniform
behaviour across different operating systems. If process descriptors on
FreeBSD would suddenly gain support for any new file operation, this
wouldn't become exposed to CloudABI processes without first extending
the rights.
Extend fork1() to gain a 'struct filecaps' argument that allows you to
construct process descriptors with custom rights. Use this in
cloudabi_sys_proc_fork() to limit the rights to just fstat() and
pdwait().
Obtained from: https://github.com/NuxiNL/freebsd
Summary:
Pipes in CloudABI are unidirectional. The reason for this is that
CloudABI attempts to provide a uniform runtime environment across
different flavours of UNIX.
Instead of implementing a custom pipe that is unidirectional, we can
simply reuse Capsicum permission bits to support this. This is nice,
because CloudABI already attempts to restrict permission bits to
correspond with the operations that apply to a certain file descriptor.
Replace kern_pipe() and kern_pipe2() by a single kern_pipe() that takes
a pair of filecaps. These filecaps are passed to the newly introduced
falloc_caps() function that creates the descriptors with rights in
place.
Test Plan:
CloudABI pipes seem to be created with proper rights in place:
https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/unistd/pipe_test.c#L44
Reviewers: jilles, mjg
Reviewed By: mjg
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3236
CloudABI's openat() ensures that files are opened with the smallest set
of relevant rights. For example, when opening a FIFO, unrelated rights
like CAP_RECV are automatically removed. To remove unrelated rights, we
can just reuse the code for this that was already present in the rights
conversion function.
Summary:
CloudABI's readdir() system call could be thought of as a mixture
between FreeBSD's getdents(2) and pread(). Instead of using the file
descriptor offset, userspace provides a 64-bit cloudabi_dircookie_t
continue reading at a given point. CLOUDABI_DIRCOOKIE_START, having
value 0, can be used to return entries at the start of the directory.
The file descriptor offset is not used to store the cookie for the
reason that in a file descriptor centric environment, it would make
sense to allow concurrent use of a single file descriptor.
The remaining space returned by the system call should be filled with a
partially truncated copy of the next entry. The advantage of doing this
is that it gracefully deals with long filenames. If the C library
provides a buffer that is too small to hold a single entry, it can still
extract the directory entry header, meaning that it can retry the read
with a larger buffer or skip it using the cookie.
Test Plan:
This implementation passes the cloudlibc unit tests at:
https://github.com/NuxiNL/cloudlibc/tree/master/src/libc/dirent
Reviewers: marcel, kib
Reviewed By: kib
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3226
CloudABI uses a system call interface to modify file attributes that is
more similar to KPI's/FUSE, namely where a stat structure is passed back
to the kernel, together with a bitmask of attributes that should be
changed. This would allow us to update any set of attributes atomically.
That said, I'd rather not go as far as to actually implement it that
way, as it would require us to duplicate more code than strictly needed.
Let's just stick to the combinations that are actually used by
cloudlibc.
Obtained from: https://github.com/NuxiNL/freebsd
The file_create() system call can be used to create files of a given
type. Right now it can only be used to create directories and FIFOs. As
CloudABI does not expose filesystem permissions, this system call lacks
a mode argument. Simply use 0777 or 0666 depending on the file type.
Summary:
CloudABI provides access to two different stat structures:
- fdstat, containing file descriptor level status: oflags, file
descriptor type and Capsicum rights, used by cap_rights_get(),
fcntl(F_GETFL), getsockopt(SO_TYPE).
- filestat, containing your regular file status: timestamps, inode
number, used by fstat().
Unlike FreeBSD's stat::st_mode, CloudABI file descriptor types don't
have overloaded meanings (e.g., returning S_ISCHR() for kqueues). Add a
utility function to extract the type of a file descriptor accurately.
CloudABI does not work with O_ACCMODEs. File descriptors have two sets
of Capsicum-style rights: rights that apply to the file descriptor
itself ('base') and rights that apply to any new file descriptors
yielded through openat() ('inheriting'). Though not perfect, we can
pretty safely decompose Capsicum rights to such a pair. This is done in
convert_capabilities().
Test Plan: Tests for these system calls are fairly extensive in cloudlibc.
Reviewers: jonathan, mjg, #manpages
Reviewed By: mjg
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3171
Summary:
CloudABI provides two different types of futex objects: read-write locks
and condition variables. There is no need to provide separate support
for once objects and thread joining, as these are efficiently simulated
by blocking on a read-write lock. Mutexes simply use read-write locks.
Condition variables always have a lock object associated to them. They
always know to which lock a thread needs to be migrated if woken up.
This allows us to implement requeueing. A broadcast on a condition
variable will never cause multiple threads to be woken up at once. They
will be woken up iteratively.
This implementation still has lots of room for improvement. Locking is
coarse and right now we use linked lists to store all of the locks and
condition variables, instead of using a hash table. The primary goal of
this implementation was to behave correctly. Performance will be
improved as we go.
Test Plan:
This futex implementation has been in use for the last couple of months
and seems to work pretty well. All of the cloudlibc and libc++ unit
tests seem to pass.
Reviewers: dchagin, kib, vangyzen
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3148
Futex object scopes have been renamed from using their own constants to
simply reusing the existing CLOUDABI_MAP_{PRIVATE,SHARED} flags, as they
are more accurate in this context.
Summary:
Unlike FreeBSD, CloudABI does not use null terminated strings for its
pathnames. Introduce a function called copyin_path() that can be used by
all of the filesystem system calls that use pathnames. This change
already implements the system calls that don't depend on any additional
functionality (e.g., conversion of struct stat).
Also implement the socket system calls that operate on pathnames, namely
the ones used by the C library functions bindat() and connectat(). These
don't receive a 'struct sockaddr_un', but just the pathname, meaning
they could be implemented in such a way that they don't depend on the
size of sun_path. For now, just use the existing interfaces.
Add a missing #include to cloudabi_syscalldefs.h to get this code to
build, as one of its macros depends on UINT64_C().
Test Plan:
These implementations have already been tested in the CloudABI branch on
GitHub. They pass all of the tests.
Reviewers: kib, pjd
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3097
Though the standard C library uses a 'struct timespec' using a 64-bit
'time_t', there is no need to use such a type at the system call level.
CloudABI uses a simple 64-bit unsigned timestamp in nanoseconds. This is
sufficient to express any time value from 1970 to 2554.
The CloudABI low-level interface also supports fetching timestamp values
with a lower precision. Instead of overloading the clock ID argument for
this purpose, the system call provides a precision argument that may be
used to specify the maximum slack. The current system call
implementation does not use this information, but it's good to already
have this available.
Expose cloudabi_convert_timespec(), as we're going to need this for
fstat() as well.
Obtained from: https://github.com/NuxiNL/freebsd
Summary:
Remove the stub system call that was put in place during the system call
import and replace it by a target-dependent version stored in sys/amd64.
Initialize the thread in a way similar to cpu_set_upcall_kse(). We
provide the entry point with two arguments: the thread ID and the
argument pointer.
Test Plan:
Thread creation still seems to work, both for FreeBSD and CloudABI
binaries.
Reviewers: dchagin, mjg, kib
Reviewed By: kib
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3110
Just like FreeBSD+Capsicum, CloudABI uses process descriptors. Return
the file descriptor number to the parent process.
To the child process we both return a special value for the file
descriptor number (CLOUDABI_PROCESS_CHILD). We also return the thread ID
of the new thread in the copied process, so the threading library can
reinitialize itself.
Obtained from: https://github.com/NuxiNL/freebsd
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).
Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig. p_xexit contains complete status
and copied out into si_status.
Requested by: Joerg Schilling
Reviewed by: jilles (previous version), pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Add support for the <sys/mman.h> functions by wrapping around our own
implementations. There are no kern_*() variants of these system calls,
but we also don't need them in this case. It is sufficient to just call
into the sys_*() functions.
Differential Revision: https://reviews.freebsd.org/D3033
Reviewed by: brooks
Summary:
For CloudABI we need to put two things on the stack of new processes:
the argument data (a binary blob; not strings) and a startup data
structure. The startup data structure contains interesting things such
as a pointer to the ELF program header, the thread ID of the initial
thread, a stack smashing protection canary, and a pointer to the
argument data.
Fetching system call arguments and setting the return value is similar
to FreeBSD. The only differences are that system call 0 does not exist
and that we call into cloudabi_convert_errno() to convert the error
code. We also need this function in a couple of other places, so we'd
better reuse it here.
Reviewers: dchagin, kib
Reviewed By: kib
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3098
Summary:
In a runtime that is purely based on capability-based security, there is
a strong emphasis on how programs start their execution. We need to make
sure that we execute an new program with an exact set of file
descriptors, ensuring that credentials are not leaked into the process
accidentally.
Providing the right file descriptors is just half the problem. There
also needs to be a framework in place that gives meaning to these file
descriptors. How does a CloudABI mail server know which of the file
descriptors corresponds to the socket that receives incoming emails?
Furthermore, how will this mail server acquire its configuration
parameters, as it cannot open a configuration file from a global path on
disk?
CloudABI solves this problem by replacing traditional string command
line arguments by tree-like data structure consisting of scalars,
sequences and mappings (similar to YAML/JSON). In this structure, file
descriptors are treated as a first-class citizen. When calling exec(),
file descriptors are passed on to the new executable if and only if they
are referenced from this tree structure. See the cloudabi-run(1) man
page for more details and examples (sysutils/cloudabi-utils).
Fortunately, the kernel does not need to care about this tree structure
at all. The C library is responsible for serializing and deserializing,
but also for extracting the list of referenced file descriptors. The
system call only receives a copy of the serialized data and a layout of
what the new file descriptor table should look like:
int proc_exec(int execfd, const void *data, size_t datalen, const int *fds,
size_t fdslen);
This change introduces a set of fd*_remapped() functions:
- fdcopy_remapped() pulls a copy of a file descriptor table, remapping
all of the file descriptors according to the provided mapping table.
- fdinstall_remapped() replaces the file descriptor table of the process
by the copy created by fdcopy_remapped().
- fdescfree_remapped() frees the table in case we aborted before
fdinstall_remapped().
We then add a function exec_copyin_data_fds() that builds on top these
functions. It copies in the data and constructs a new remapped file
descriptor. This is used by cloudabi_sys_proc_exec().
Test Plan:
cloudabi-run(1) is capable of spawning processes successfully, providing
it data and file descriptors. procstat -f seems to confirm all is good.
Regular FreeBSD processes also work properly.
Reviewers: kib, mjg
Reviewed By: mjg
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3079
We can map these system calls directly to the FreeBSD counterparts. The
other filesystem related system calls will be sent out for review
separately, as they are a bit more complex to get right.
The random_get() system call works similar to getentropy()/getrandom()
on OpenBSD/Linux. It fills a buffer with random data.
This change introduces a new function, read_random_uio(), that is used
to implement read() on the random devices. We can call into this
function from within the CloudABI compatibility layer.
Approved by: secteam
Reviewed by: jmg, markm, wblock
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3053
The first system call is used to set the user TLS address. Right now
this system call is invoked by the C library for both the initial thread
and additional threads unconditionally, but in the future we'll only
call this if the architecture does not support this. On recent x86-64
CPUs we could use the WRFSBASE instruction.
This system call was erroneously placed in sys/compat/cloudabi64, even
though it does not depend on any pointer size dependent datastructure.
Move it to the right place.
Obtained from: https://github.com/NuxiNL/freebsd
Add a routine similar to copyinuio() and freebsd32_copyinuio() that
copies in CloudABI's struct iovecs. These are then translated into
FreeBSD format and placed in a 'struct uio', so we can call into the
kern_*() functions.
Obtained from: https://github.com/NuxiNL/freebsd
Summary:
As discussed with kib@ in response to r285404, don't call into
kern_sigaction() within proc_raise() to reset the signal to the default
action before delivery. We'd better do that during image execution.
Change the code to simply use pksignal(), so we don't waste cycles on
functions like pfind() to look up the currently running process itself.
Test Plan:
This change has also been pushed into the cloudabi branch on GitHub. The
raise() tests still seem to pass.
Reviewers: kib
Reviewed By: kib
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3076
CloudABI does not provide an explicit kill() system call, for the reason
that there is no access to the global process namespace. Instead, it
offers a raise() system call that can at least be used to terminate the
process abnormally.
CloudABI does not support installing signal handlers. CloudABI's raise()
system call should behave as if the default policy is set up. Call into
kern_sigaction(SIG_DFL) before calling sys_kill() to force this.
Obtained from: https://github.com/NuxiNL/freebsd
Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.
Reviewed by: kib
All of the CloudABI system calls that operate on file descriptors of an
arbitrary type are prefixed with fd_. This change adds wrappers for
most of these system calls around their FreeBSD equivalents.
The dup2() system call present on CloudABI deviates from POSIX, in the
sense that it can only be used to replace existing file descriptor. It
cannot be used to create new ones. The reason for this is that this is
inherently thread-unsafe. Furthermore, there is no need on CloudABI to
use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no
special meaning.
This change exposes the kern_dup() through <sys/syscallsubr.h> and puts
the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag,
FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not
allocated.
Differential Revision: https://reviews.freebsd.org/D3035
Reviewed by: mjg
CloudABI is a pure capability-based runtime environment for UNIX. It
works similar to Capsicum, except that processes already run in
capabilities mode on startup. All functionality that conflicts with this
model has been omitted, making it a compact binary interface that can be
supported by other operating systems without too much effort.
CloudABI is 'secure by default'; the idea is that it should be safe to
run arbitrary third-party binaries without requiring any explicit
hardware virtualization (Bhyve) or namespace virtualization (Jails). The
rights of an application are purely determined by the set of file
descriptors that you grant it on startup.
The datatypes and constants used by CloudABI's C library (cloudlibc) are
defined in separate files called syscalldefs_mi.h (pointer size
independent) and syscalldefs_md.h (pointer size dependent). We import
these files in sys/contrib/cloudabi and wrap around them in
cloudabi*_syscalldefs.h.
We then add stubs for all of the system calls in sys/compat/cloudabi or
sys/compat/cloudabi64, depending on whether the system call depends on
the pointer size. We only have nine system calls that depend on the
pointer size. If we ever want to support 32-bit binaries, we can simply
add sys/compat/cloudabi32 and implement these nine system calls again.
The next step is to send in code reviews for the individual system call
implementations, but also add a sysentvec, to allow CloudABI executabled
to be started through execve().
More information about CloudABI:
- GitHub: https://github.com/NuxiNL/cloudlibc
- Talk at BSDCan: https://www.youtube.com/watch?v=SVdF84x1EdA
Differential Revision: https://reviews.freebsd.org/D2848
Reviewed by: emaste, brooks
Obtained from: https://github.com/NuxiNL/freebsd
Use the same scheme implemented to manage credentials.
Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.
Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
When providing memory map information to userland, populate the vnode pointer
for tmpfs files. Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.
This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).
Submitted by: Eric Badger <eric@badgerio.us> (initial revision)
Obtained from: Dell Inc.
PR: 198431
MFC after: 2 weeks
Reviewed by: jhb
Approved by: kib (mentor)
writes the remaining time into the structure pointed to by rmtp
unless rmtp is NULL. The value of *rmtp can then be used to call
nanosleep() again and complete the specified pause if the previous
call was interrupted.
Note. clock_nanosleep() with an absolute time value does not write
the remaining time.
While here fix whitespaces and typo in SDT_PROBE.
implemented via ioctl interface. First of all return ENOTSUP for this
operation as a cp fallback to usual method in that case. Secondly, do
not print out the message about unimplemented operation.
1. Linux sigset always 64 bit on all platforms. In order to move Linux
sigset code to the linux_common module define it as 64 bit int. Move
Linux sigset manipulation routines to the MI path.
2. Move Linux signal number definitions to the MI path. In general, they
are the same on all platforms except for a few signals.
3. Map Linux RT signals to the FreeBSD RT signals and hide signal conversion
tables to avoid conversion errors.
4. Emulate Linux SIGPWR signal via FreeBSD SIGRTMIN signal which is outside
of allowed on Linux signal numbers.
PR: 197216
argument is not a null pointer, and the ss_flags member pointed to by ss
contains flags other than SS_DISABLE. However, in fact, Linux also
allows SS_ONSTACK flag which is simply ignored.
For buggy apps (at least mono) ignore other than SS_DISABLE
flags as a Linux do.
While here move MI part of sigaltstack code to the appropriate place.
Reported by: abi at abinet dot ru
"highly experimental" remove /dev/shm magic commited
in r218497 and convert tmpfs type to an expected magic number.
Differential Revision: https://reviews.freebsd.org/D1497
Reviewed by: emaste, trasz
as it already zeroed by malloc with M_ZERO flag
and move zeroing to the proper place in exec path.
Differential Revision: https://reviews.freebsd.org/D1462
Reviewed by: trasz
around kqueue() to implement epoll subset of functionality.
The kqueue user data are 32bit on i386 which is not enough for
epoll user data, so we keep user data in the proc emuldata.
Initial patch developed by rdivacky@ in 2007, then extended
by Yuri Victorovich @ r255672 and finished by me
in collaboration with mjg@ and jillies@.
Differential Revision: https://reviews.freebsd.org/D1092
Check wait options as a Linux do.
Linux always set WEXITED option not a WUNTRACED|WNOHANG
which is a strange bug.
Differential Revision: https://reviews.freebsd.org/D1085
Reviewed by: trasz
The AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are actually implemented
within the glibc wrapper function for faccessat(). If either of these
flags are specified, then the wrapper function employs fstatat() to
determine access permissions.
Differential Revision: https://reviews.freebsd.org/D1078
Reviewed by: trasz
thread emuldata to proc emuldata as it was originally intended.
As we can have both 64 & 32 bit Linuxulator running any eventhandler
can be called twice for us. To prevent this move eventhandlers code
from linux_emul.c to the linux_common.ko module.
Differential Revision: https://reviews.freebsd.org/D1073
following primary purposes:
1. Remove the dependency of linsysfs and linprocfs modules from linux.ko,
which will be architecture specific on amd64.
2. Incorporate into linux_common.ko general code for platforms on which
we'll support two Linuxulator modules (for both instruction set - 32 & 64 bit).
3. Move malloc(9) declaration to linux_common.ko, to enable getting memory
usage statistics properly.
Currently linux_common.ko incorporates a code from linux_mib.c and linux_util.c
and linprocfs, linsysfs and linux kernel modules depend on linux_common.ko.
Temporarily remove dtrace garbage from linux_mib.c and linux_util.c
Differential Revision: https://reviews.freebsd.org/D1072
In collaboration with: Vassilis Laganakos.
Reviewed by: trasz
Move struct ipc_perm definition to the MD path as it differs for 64 and
32 bit platform.
Differential Revision: https://reviews.freebsd.org/D1068
Reviewed by: trasz
All fields of type l_int in struct statfs are defined
as l_long on i386 and amd64.
Differential Revision: https://reviews.freebsd.org/D1064
Reviewed by: trasz
exposes functions from kernel with proper DWARF CFI information so that
it becomes easier to unwind through them.
Using vdso is a mandatory for a thread cancelation && cleanup
on a modern glibc.
Differential Revision: https://reviews.freebsd.org/D1060
Use it in linux_wait4() system call and move linux_wait4() to the MI path.
While here add a prototype for the static bsd_to_linux_rusage().
Differential Revision: https://reviews.freebsd.org/D2138
Reviewed by: trasz
of harcoded pr_osrelease, pr_osrel values. This will be used later in
the VDSO.
Differential Revision: https://reviews.freebsd.org/D1042
Reviewed by: trasz
of multiple simultaneous calls to pthread_join() specifying the same
target thread are undefined wake up the one thread.
Differential Revision: https://reviews.freebsd.org/D1040
The reasons:
1. Get rid of the stubs/quirks with process dethreading,
process reparent when the process group leader exits and close
to this problems on wait(), waitpid(), etc.
2. Reuse our kernel code instead of writing excessive thread
managment routines in Linuxulator.
Implementation details:
1. The thread is created via kern_thr_new() in the clone() call with
the CLONE_THREAD parameter. Thus, everything else is a process.
2. The test that the process has a threads is done via P_HADTHREADS
bit p_flag of struct proc.
3. Per thread emulator state data structure is now located in the
struct thread and freed in the thread_dtor() hook.
Mandatory holdig of the p_mtx required when referencing emuldata
from the other threads.
4. PID mangling has changed. Now Linux pid is the native tid
and Linux tgid is the native pid, with the exception of the first
thread in the process where tid and pid are one and the same.
Ugliness:
In case when the Linux thread is the initial thread in the thread
group thread id is equal to the process id. Glibc depends on this
magic (assert in pthread_getattr_np.c). So for system calls that
take thread id as a parameter we should use the special method
to reference struct thread.
Differential Revision: https://reviews.freebsd.org/D1039
threads refactor kern_sched_rr_get_interval() and sys_sched_rr_get_interval().
Add a kern_sched_rr_get_interval() counterpart which takes a targettd
parameter to allow specify target thread directly by callee (new Linuxulator).
Linuxulator temporarily uses first thread in proc.
Move linux_sched_rr_get_interval() to the MI part.
Differential Revision: https://reviews.freebsd.org/D1032
Reviewed by: trasz
threads introduce linux_exit() stub instead of sys_exit() call
(which terminates process).
In the new linuxulator exit() system call terminates the calling
thread (not a whole process).
Differential Revision: https://reviews.freebsd.org/D1027
Reviewed by: trasz
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
allocated from exec_map. If many threads try to perform execve(2) in
parallel, the exec map is exhausted and some threads sleep
uninterruptible waiting for the map space. Then, the thread which won
the race for the space allocation, cannot single-thread the process,
causing deadlock.
Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
rework. The number of entries was supposed to be returned to the user,
not used as a scratch variable.
This broke RELENG_4 jails starting up on current systems.
argument. This will be used for the Linux emulation layer - for Linux,
PATH_MAX is 4096 and not 1024.
Differential Revision: https://reviews.freebsd.org/D2335
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
and export them to userland.
- Define __HAVE_REG32 on platforms that define a reg32 structure and check
for this in <sys/procfs.h> to control when to export prstatus32, etc.
- Add prstatus32_t and prpsinfo32_t typedefs for the 32-bit structures.
libbfd looks for these types, and having them fixes 'gcore' in gdb of a
32-bit process on a 64-bit platform.
- Use the structure definitions from <sys/procfs.h> in gcore's elf32 core
dump code instead of duplicating the definitions.
Differential Revision: https://reviews.freebsd.org/D2142
Reviewed by: kib, nathanw (powerpc bits)
MFC after: 1 week
The goal here is to provide one place altering process credentials.
This eases debugging and opens up posibilities to do additional work when such
an action is performed.
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.
A new UTIME_* constant might allow setting birthtimes in future.
Differential Revision: https://reviews.freebsd.org/D1426
Submitted by: pluknet (partially)
Reviewed by: delphij, pluknet, rwatson
Relnotes: yes
attachment to the process. Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.
Discussed with: jilles, rwatson
Man page fixes by: rwatson
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.
These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.
While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.
Pointed out by: bde
MFC after: 1 week
the orphaned descendants. Base of the API is modelled after the same
feature from the DragonFlyBSD.
Requested by: bapt
Reviewed by: jilles (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
- Threads lifetime cycle, in particular, counting of the threads in
the process, and interlocking with process mutex and thread lock.
The main reason of this is that turnstile locks are after thread
locks, so you e.g. cannot unlock blockable mutex (think process
mutex) while owning thread lock.
- Virtual and profiling itimers, since the timers activation is done
from the clock interrupt context. Replace the p_slock by p_itimmtx
and PROC_ITIMLOCK().
- Profiling code (profil(2)), for similar reason. Replace the p_slock
by p_profmtx and PROC_PROFLOCK().
- Resource usage accounting. Need for the spinlock there is subtle,
my understanding is that spinlock blocks context switching for the
current thread, which prevents td_runtime and similar fields from
changing (updates are done at the mi_switch()). Replace the p_slock
by p_statmtx and PROC_STATLOCK().
The split is done mostly for code clarity, and should not affect
scalability.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
to match what Linux does in that 1) it dumps the entire XSAVE area
including the fxsave state, and 2) it stashes a copy of the current
xsave mask in the unused padding between the fxsave state and the
xstate header at the same location used by Linux.
- Teach readelf() to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
only the extra portion. This avoids having to always make two
ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
and the current XSAVE mask (%xcr0).
Differential Revision: https://reviews.freebsd.org/D1193
Reviewed by: kib
MFC after: 2 weeks
have both kern_open() and kern_openat(); change the callers to use
kern_openat().
This removes one (sometimes two) levels of indirection and
consolidates arguments checks.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
ever used. It didn't go into stable/10, neither was documented.
It might be useful, but we collectively decided to remove it, rather
leave it abandoned and unmaintained. It is removed in one single
commit, so restoring it should be easy, if anyone wants to reopen
this idea.
Sponsored by: Netflix
The kernel tracks syscall users so that modules can safely unregister them.
But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.
Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.
Reviewed by: kib (previous version)
MFC after: 2 weeks
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
rather than u_char.
To try and play nice with the ABI, the u_char CPU ID values are clamped
at 254. The new fields now contain the full CPU ID, or -1 for no cpu.
Differential Revision: D955
Reviewed by: jhb, kib
Sponsored by: Norse Corp, Inc.
Move the SCTP syscalls to netinet with the rest of the SCTP code.
Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.
The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
sys_sctp_peeloff
sys_sctp_generic_sendmsg
sys_sctp_generic_sendmsg_iov
sys_sctp_generic_recvmsg
The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.
As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file. A proper
prototype has been added to the sys/socketvar.h header file.
API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)
Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.
struct flock are done in the sys_fcntl(), which mean that compat32 used
direct access to userland pointers.
Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which
performs neccessary userland memory accesses, and use it from both
native and compat32 fcntl syscalls.
Reported by: jhibbits
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Update linux compat minimum revision to match linux-c6 now in ports. This
is a candidate for 10.1 R as it matches the current state of supported
linux compat packages in the ports tree.
PR: 187786
Reviewed by: xmj
MFC after: 2 days
Relnotes: yes
for amd64/linux32. Fix the entirely bogus (untested) version from
r161310 for i386/linux using the same shared code in compat/linux.
It is unclear to me if we could support more clock mappings but
the current set allows me to successfully run commercial
32bit linux software under linuxolator on amd64.
Reviewed by: jhb
Differential Revision: D784
MFC after: 3 days
Sponsored by: DARPA, AFRL
Add a separate field which exports tracer pid and add a new keyword
("tracer") for ps to display it.
This is a follow up to r270444.
Reviewed by: kib
MFC after: 1 week
Relnotes: yes
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
Make it really work for native FreeBSD programs. Before this it was broken
for years due to different number of pointer dereferences in Linux and
FreeBSD IOCTL paths, permanently returning errors to FreeBSD programs.
This change breaks the driver FreeBSD IOCTL ABI, making it more strict,
but since it was not working any way -- who bother.
Add shims for 32-bit programs on 64-bit host, translating the argument
of the SG_IO IOCTL for both FreeBSD and Linux ABIs.
With this change I was able to run 32-bit Linux sg3_utils tools and simple
32 and 64-bit FreeBSD test tools on both 32 and 64-bit FreeBSD systems.
MFC after: 1 month
flag has been added instead of FUTEX_WAIT to replace the FUTEX_WAIT
logic which needs to do gettimeofday() calls before the futex syscall
to convert the absolute timeout to a relative timeout.
Before this the CLOCK_MONOTONIC used by the FUTEX_WAIT_BITSET op.
When the FUTEX_CLOCK_REALTIME is specified the timeout is an absolute
time, not a relative time. Rework futex_wait to handle this.
On the side fix the futex leak in error case and remove useless
parentheses.
Properly calculate the timeout for the CLOCK_MONOTONIC case.
MFC after: 3 days
Some Linux futex ops atomically verifies that the futex address uaddr
(uval) contains the value val. Comparing signed uval and unsigned val
may lead to an unexpected result, mostly to a deadlock.
So copyin uaddr to an unsigned int to compare the parameters correctly.
While here change ktr records to print parameters in more readable format.
Tested by eadler@
MFC after: 3 days
call to freebsd32_convert_msg_in() with freebsd32_copyin_control() to
readin and convert in a single step. This makes it simpler to put all
the control messages in a single mbuf or mbuf cluster as per the
limitations imposed upon us by ip6_setpktopts().
The logic is as follows:
1. Go over the array of control messages to determine overall size
and include extra padding for proper alignment as we go.
2. Get a mbuf or mbuf cluster as needed or fail if the overall
(adjusted) size is larger than a cluster.
3. Go over the array of control messages again, but now copy them
into kernel space and into aligned offsets.
4. Update the length of the control message to take padding between
the header and the data into account (but not for padding added
between one control message and the next).
Obtained from: Juniper Networks, Inc.
MFC after: 1 week
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
Also, remove the expression which calculated the location of the
strings for a new image and grown over the time to be
non-comprehensible. Instead, calculate the offsets by steps, which
also makes fixing the alignments much cleaner.
Reported and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- Retire long time unused (basically always unused) sys__umtx_lock()
and sys__umtx_unlock() syscalls
- struct umtx and their supporting definitions
- UMUTEX_ERROR_CHECK flag
- Retire UMTX_OP_LOCK/UMTX_OP_UNLOCK from _umtx_op() syscall
__FreeBSD_version is not bumped yet because it is expected that further
breakages to the umtx interface will follow up in the next days.
However there will be a final bump when necessary.
Sponsored by: EMC / Isilon storage division
Reviewed by: jhb
The NetBSD Foundation states "Third parties are encouraged to change the
license on any files which have a 4-clause license contributed to the
NetBSD Foundation to a 2-clause license."
This change removes clauses 3 and 4 from copyright / license blocks that
list The NetBSD Foundation as the only copyright holder.
Sponsored by: The FreeBSD Foundation
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
change... This eliminates a cast, and also forces td_retval
(often 2 32-bit registers) to be aligned so that off_t's can be
stored there on arches with strict alignment requirements like
armeb (AVILA)... On i386, this doesn't change alignment, and on
amd64 it doesn't either, as register_t is already 64bits...
This will also prevent future breakage due to people adding additional
fields to the struct...
This gets AVILA booting a bit farther...
Reviewed by: bde
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.
o Remove the if_baudrate_pf crutch.
o Make all fields of struct if_data fixed machine independent size. The
notion of data (packet counters, etc) are by no means MD. And it is a
bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
which at modern speeds overflow within a second.
This also removes quite a lot of COMPAT_FREEBSD32 code.
o Give 16 bit for the ifi_datalen field. This field was provided to
make future changes to if_data less ABI breaking. Unfortunately the
8 bit size of it had effectively limited sizeof if_data to 256 bytes.
o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.
__FreeBSD_version bumped.
Discussed with: emax
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
This fires off a kqueue note (of type sendfile) to the configured kqfd
when the sendfile transaction has completed and the relevant memory
backing the transaction is no longer in use by this transaction.
This is analogous to SF_SYNC waiting for the mbufs to complete -
except now you don't have to wait.
Both SF_SYNC and SF_KQUEUE should work together, even if it
doesn't necessarily make any practical sense.
This is designed for use by applications which use backing cache/store
files (eg Varnish) or POSIX shared memory (not sure anything is using
it yet!) to know when a region of memory is free for re-use. Note
it doesn't mark the region as free overall - only free from this
transaction. The application developer still needs to track which
ranges are in the process of being recycled and wait until all
pending transactions are completed.
TODO:
* documentation, as always
Sponsored by: Netflix, Inc.
for extending and reusing it.
The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper,
used to indicate that the backing store for a group of mbufs has completed.
It's only being used by sendfile for now and it's only implementing a
sleep/wakeup rendezvous. However, there are other potential signaling
paths (kqueue) and other potential uses (socket zero-copy write) where the
same mechanism would also be useful.
So, with that in mind:
* extract the sendfile_sync code out into sf_sync_*() methods
* teach the sf_sync_alloc method about the current config flag -
it will eventually know about kqueue.
* move the sendfile_sync code out of do_sendfile() - the only thing
it now knows about is the sfs pointer. The guts of the sync
rendezvous (setup, rendezvous/wait, free) is now done in the
syscall wrapper.
* .. and teach the 32-bit compat sendfile call the same.
This should be a no-op. It's primarily preparation work for teaching
the sendfile_sync about kqueue notification.
Tested:
* Peter Holm's sendfile stress / regression scripts
Sponsored by: Netflix, Inc.
in one of the many layers of indirection and shims through stable/7
in jail_handle_ips(). When it was cleaned up and unified through
kern_jail() for 8.x, the byte order swap was lost.
This only matters for ancient binaries that call jail(2) themselves
internally.
given process.
Note that the correctness of the trampoline length returned for ABIs
which do not use shared page depends on the correctness of the struct
sysvec sv_szsigcodebase member, which will be fixed on as-need basis.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
Reviewed by: markj
MFC after: 3 weeks