Commit Graph

196 Commits

Author SHA1 Message Date
kib
e75ba1d5c4 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
vangyzen
d6de25428d Add clock_nanosleep()
Add a clock_nanosleep() syscall, as specified by POSIX.
Make nanosleep() a wrapper around it.

Attach the clock_nanosleep test from NetBSD. Adjust it for the
FreeBSD behavior of updating rmtp only when interrupted by a signal.
I believe this to be POSIX-compliant, since POSIX mentions the rmtp
parameter only in the paragraph about EINTR. This is also what
Linux does. (NetBSD updates rmtp unconditionally.)

Copy the whole nanosleep.2 man page from NetBSD because it is complete
and closely resembles the POSIX description. Edit, polish, and reword it
a bit, being sure to keep any relevant text from the FreeBSD page.

Reviewed by:	kib, ngie, jilles
MFC after:	3 weeks
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D10020
2017-03-19 00:51:12 +00:00
ngie
32a5b43489 Replace dot-dot relative pathing with SRCTOP-relative paths where possible
This reduces build output, need for recalculating paths, and makes it clearer
which paths are relative to what areas in the source tree. The change in
performance over a locally mounted UFS filesystem was negligible in my testing,
but this may more positively impact other filesystems like NFS.

LIBC_SRCTOP was left alone so Juniper (and other users) can continue to
manipulate lib/libc/Makefile (and other Makefile.inc's under lib/libc) as
include Makefiles with custom options.

Discussed with:	marcel, sjg
MFC after:	1 week
Reviewed by:	emaste
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D9207
2017-01-20 03:23:24 +00:00
kib
e4920be6f2 Document thr_suspend(2) and thr_wake(2).
Reviewed by:	bjk, jilles
Discussed with:	emaste, wblock
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8016
2016-09-26 08:18:34 +00:00
badger
ec84c9826b Add manpage for rctl_* system calls
Reviewed by:	trasz, wblock
Approved by:	kib (mentor)
MFC after:	3 days
Sponsored by:	Dell Technologies
Differential Revision:	https://reviews.freebsd.org/D7877
2016-09-19 02:25:30 +00:00
brooks
5072762802 Fix spelling in comment.
Submitted by:	brueffer
2016-09-09 16:18:44 +00:00
brooks
dc591d96b2 Reduce duplicate NOASM and PSEUDO definitions
The initial value of NOASM is nearly the same in all cases and the
initial value of PSEUDO is the same in all cases so reduce duplication
(and hopefully, future merge conflicts) by machine independent defaults.

Also document the PSEUDO variable.

Reviewed by:	jhb, kib
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D7820
2016-09-08 22:38:20 +00:00
kib
fffac29068 Rewrite ptrace(2) wrappers in C.
Besides removing hand-translation to assembler, this also adds missing
wrappers for arm64 and risc-v.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7694
2016-08-29 18:47:51 +00:00
kib
21a721373f Add fdatasync(2) man page, combined with fsync(2).
Reviewed by:	emaste, rpokala, wblock
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7522
2016-08-17 10:16:42 +00:00
kib
e04600d300 The fdatasync(2) call must be cancellation point.
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2016-08-16 08:27:03 +00:00
brooks
ecead64b41 Replace use of the pipe(2) system call with pipe2(2) with a zero flags
value.

This eliminates the need for machine dependant assembly wrappers for
pipe(2).

It also make passing an invalid address to pipe(2) return EFAULT rather
than triggering a segfault.  Document this behavior (which was already
true for pipe2(2), but undocumented).

Reviewed by:	andrew
Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D6815
2016-06-22 21:11:27 +00:00
kib
b8e76aa55c Add thr*.2 and _umtx_op.2 manpages to the build.
Sponsored by:	The FreeBSD Foundation
2016-05-14 09:43:28 +00:00
kib
2b6ac44d5d Annotate arm userspace assembler sources stating their tolerance to
the non-executable stack.

Reviewed by:	andrew
Sponsored by:	The FreeBSD Foundation
2015-09-29 16:09:58 +00:00
bdrewery
3faab6f866 Avoid adding duplicates into OBJS. bsd.lib.mk already handles adding
entries to OBJS based on SRCS.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 04:55:28 +00:00
pfg
9b366d35ee Move the stack protector to a new "secure" directory
As part of the code refactoring to support FORTIFY_SOURCE we want
a new subdirectory "secure" to keep the files related to security.
Move the stack protector functions to this new directory.

No functional change.

Differential Review:	https://reviews.freebsd.org/D3333
2015-08-14 03:03:13 +00:00
adrian
41db4b88e0 Add an initial NUMA affinity/policy configuration for threads and processes.
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
  the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
  if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
  policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
  policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
  set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
  issue, but has not been able to verify it (it doesn't look NUMA
  related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
  all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
  NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@.  The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus.  My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
  unbalanced memory setup whilst under high memory pressure, VM page allocation
  may fail leading to a kernel panic.  This was a problem in the past, but it's
  much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
  affect contigmalloc, KVA placement, UMA, etc.  So, driver placement of memory
  isn't really guaranteed in any way.  That's next on my plate.

Sponsored by:	Norse Corp, Inc.; Dell
2015-07-11 15:21:37 +00:00
kib
2254748ed0 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
kib
9a774084c8 Make wait6(2), waitid(3) and ppoll(2) cancellation points. The
waitid() function is required to be cancellable by the standard.  The
wait6() and ppoll() follow the other syscalls in their groups.

Reviewed by:	jhb, jilles (previous versions)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:35:41 +00:00
kib
6531ee3ae5 Make kevent(2) a cancellation point.
Note that to cancel blocked kevent(2) call, changelist must be empty,
since we cannot cancel a call which already made changes to the
process state.  And in reverse, call which only makes changes to the
kqueue state, without waiting for an event, is not cancellable.  This
makes a natural usage model to migrate kqueue loop to support
cancellation, where existing single kevent(2) call must be split into
two: first uncancellable update of kqueue, then cancellable wait for
events.

Note that this is ABI-incompatible change, but it is believed that
there is no cancel-safe code that relies on kevent(2) not being a
cancellation point.  Option to preserve the ABI would be to keep
kevent(2) as is, but add new call with flags to specify cancellation
behaviour, which only value seems to add complications.

Suggested and reviewed by:	jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-03-29 19:14:41 +00:00
marius
c837ced420 Unbreak sparc64 after r276630 by calling __sparc_sigtramp_setup signal
trampoline as part of the MD __sys_sigaction again.

Submitted by:	kib (initial versions)
MFC after:	3 days
2015-02-16 22:13:03 +00:00
jilles
67db24d0f2 Add futimens and utimensat system calls.
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision:	https://reviews.freebsd.org/D1426
Submitted by:	pluknet (partially)
Reviewed by:	delphij, pluknet, rwatson
Relnotes:	yes
2015-01-23 21:07:08 +00:00
kib
8d1dfb4106 Fix known issues which blow up the process after dlopen("libthr.so")
(or loading a dso linked to libthr.so into process which was not
linked against threading library).

- Remove libthr interposers of the libc functions, including
  __error(). Instead, functions calls are indirected through the
  interposing table, similar to how pthread stubs in libc are already
  done.  Libc by default points either to syscall trampolines or to
  existing libc implementations.  On libthr load, libthr rewrites the
  pointers to the cancellable implementations already in libthr.  The
  interposition table is separate from pthreads stubs indirection
  table to not pull pthreads stubs into static binaries.

- Postpone the malloc(3) internal mutexes initialization until libthr
  is loaded.  This avoids recursion between calloc(3) and static
  pthread_mutex_t initialization.

- Reinstall signal handlers with wrapper on libthr load.  The
  _rtld_is_dlopened(3) is used to avoid useless calls to sigaction(2)
  when libthr is statically referenced from the main binary.

In the process, fix openat(2), swapcontext(2) and setcontext(2)
interposing.  The libc symbols were exported at different versions
than libthr interposers.  Export both libc and libthr versions from
libc now, with default set to the higher version from libthr.

Remove unused and disconnected swapcontext(3) userspace implementation
from libc/gen.

No objections from:	deischen
Tested by:	pho, antoine (exp-run) (previous versions)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-03 18:38:46 +00:00
dchagin
162012051b Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision:	https://reviews.freebsd.org/D1133
Reviewed by:	kib, wblock
MFC after:	1 month
2014-11-13 05:26:14 +00:00
imp
9878392e1a Convert from WITHOUT_SYSCALL_COMPAT to MK_SYSCALL_COMPAT. 2014-04-05 17:54:43 +00:00
marcel
99c9726a00 Replace use of ${.CURDIR} by ${LIBC_SRCTOP} and define ${LIBC_SRCTOP}
if not already defined. This allows building libc from outside of
lib/libc using a reach-over makefile.

A typical use-case is to build a standard ILP32 version and a COMPAT32
version in a single iteration by building the COMPAT32 version using a
reach-over makefile.

Obtained from:	Juniper Networks, Inc.
2014-03-04 02:19:39 +00:00
pluknet
903839e958 Provide the manual page for aio_fsync(2).
Reviewed by:	davidxu
MFC after:	1 week
2013-12-26 19:16:30 +00:00
pluknet
71c1c69583 Fix extattr(2) MLINKS.
MFC after:	1 week
2013-11-09 00:36:09 +00:00
jhb
d3ef75b6c7 Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes.  MADV_PROTECT still works for backwards
  compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
glebius
9a02f3097d Add new system call - aio_mlock(). The name speaks for itself. It allows
to perform the mlock(2) operation, which can consume a lot of time, under
control of aio(4).

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:27:57 +00:00
jilles
16772c421d Add pipe2() system call.
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.

Reviewed by:	kan, kib
2013-05-01 22:42:42 +00:00
jilles
299afd25fd Add accept4() system call.
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
2013-05-01 20:10:21 +00:00
pjd
01401cc9bc Document chflagsat(2).
Obtained from:	jilles
2013-03-21 23:05:44 +00:00
pjd
702516e70b - Implement two new system calls:
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
	int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

  which allow to bind and connect respectively to a UNIX domain socket with a
  path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
  the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
  in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by:	The FreeBSD Foundation
Discussed with:	rwatson, jilles, kib, des
2013-03-02 21:11:30 +00:00
pjd
f07ebb8888 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
pjd
cc57b32cb6 Put one file per line so it is easier to read diffs against those files. 2013-02-16 22:21:46 +00:00
kib
39bfcc5eed Document wait6() and waitid().
PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:56:42 +00:00
kib
13a9f42818 Use struct vdso_timehands data to implement fast gettimeofday(2) and
clock_gettime(2) functions if supported. The speedup seen in
microbenchmarks is in range 4x-7x depending on the hardware.

Only amd64 and i386 architectures are supported. Libc uses rdtsc and
kernel data to calculate current time, if enabled by kernel.

Hopefully, this code is going to migrate into vdso in some future.

Discussed with:	bde
Reviewed by:	jhb
Tested by:	flo
MFC after:	1 month
2012-06-22 07:13:30 +00:00
pjd
e3b0ea6c34 Link EV_SET(3) to kqueue(2).
MFC after:	3 days
2012-03-05 20:59:34 +00:00
delphij
d6760e1d0d Update rtprio(2) manual page to reflect the latest changes in -CURRENT as
well as provide documentation for rtprio_thread(2) system call.

MFC after:	1 month
X-MFC-after:	r228470
2011-12-27 10:34:00 +00:00
lstewart
cca3084242 - Add the ffclock_getcounter(), ffclock_getestimate() and ffclock_setestimate()
system calls to provide feed-forward clock management capabilities to
  userspace processes. ffclock_getcounter() returns the current value of the
  kernel's feed-forward clock counter. ffclock_getestimate() returns the current
  feed-forward clock parameter estimates and ffclock_setestimate() updates the
  feed-forward clock parameter estimates.

- Document the syscalls in the ffclock.2 man page.

- Regenerate the script-derived syscall related files.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-21 01:26:10 +00:00
jhb
78c075174e Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00
jonathan
5ecd1c9d40 Add experimental support for process descriptors
A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-18 22:51:30 +00:00
jonathan
f77ff62fcd Add cap_new(2) and cap_getrights(2) symbols to libc.
These system calls have already been implemented in the kernel; now we
hook up libc symbols so userspace can drive them.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-07-20 13:29:39 +00:00
mdf
9c9a32d97b Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by:	-arch (previous version)
2011-04-18 16:32:22 +00:00
marcel
70f17b3a42 When building libc with the syscall compatibility, don't also generate the
syscall assembly files. This results in conflicting dependencies and can
cause unexpected results for parallel builds. This is because the .c file
and the .S file both generate the same .o file.

Submitted by:	Simon Gerraty <sjg@juniper.net>
Sponsored by:	Juniper Networks
2011-03-17 04:40:37 +00:00
trasz
d9bca0ab24 Add manual page for getloginclass(2) and setloginclass(2). 2011-03-06 08:35:50 +00:00
rwatson
817c148c9f Make cap_new(2) and cap_getmode(2) symbols from libc public so applications
can link against them.  Add man pages for the new system calls, with one
errant forward reference to changes not yet present in FreeBSD, but soon
will be.

Reviewed by:	anderson
Obtained from:	Capsicum Project
Sponsored by:	Google, Inc.
Discussed with:	benl, kris, pjd
MFC after:	3 months
2011-03-03 11:31:08 +00:00
kib
736c72ab1a Emit .note.GNU-stack for the syscall stubs generated by libc only on
architectures that support this .note. In particular, do not unneccessary
emit the notes on ia64 and sparc64, which ABI require non-executable stacks.

Tested by:	marcel
2011-01-25 21:06:49 +00:00
kib
a7db50394e Emit .note.GNU-stack for the syscall stubs generated by libc. 2011-01-07 14:28:54 +00:00
davidxu
e129c18a83 Because POSIX does not allow EINTR to be returned from sigwait(),
add a wrapper for it in libc and rework the code in libthr, the
system call still can return EINTR, we keep this feature.

Discussed on: thread
Reviewed by:  jilles
2010-09-10 01:47:37 +00:00