Commit Graph

297 Commits

Author SHA1 Message Date
John Baldwin
bb430bc740 Fully handle size_t lengths in AIO requests.
First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.

POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write().  aio_waitcomplete() should use
ssize_t for the same reason.

aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated.  aio_waitcomplete() has always
returned int.

Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.

Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.

aio_read/write should now honor the same length limits as normal read/write.

Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.

Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.
2016-03-21 21:37:33 +00:00
John Baldwin
399e8c1773 Simplify AIO initialization now that it is standard.
- Mark AIO system calls as STD and remove the helpers to dynamically
  register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
  an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
  the pathconf() system call.  fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5589
2016-03-09 19:05:11 +00:00
Adrian Chadd
871ef8b0d8 Regenerate syscalls. 2015-07-11 15:22:11 +00:00
Konstantin Belousov
0538aafc41 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
Jilles Tjoelker
2205e0d1bd Add futimens and utimensat system calls.
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision:	https://reviews.freebsd.org/D1426
Submitted by:	pluknet (partially)
Reviewed by:	delphij, pluknet, rwatson
Relnotes:	yes
2015-01-23 21:07:08 +00:00
Dmitry Chagin
9f7a06f27e Indeed, instead of hiding the kern___getcwd() bug by bogus cast
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.

These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.

While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.

Pointed out by:	bde

MFC after:	1 week
2015-01-04 10:34:02 +00:00
Dmitry Chagin
186d9c3473 Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision:	https://reviews.freebsd.org/D1133
Reviewed by:	kib, wblock
MFC after:	1 month
2014-11-13 05:26:14 +00:00
Marcel Moolenaar
80b47aefa1 Move the SCTP syscalls to netinet with the rest of the SCTP code. The
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.

The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
  sys_sctp_peeloff
  sys_sctp_generic_sendmsg
  sys_sctp_generic_sendmsg_iov
  sys_sctp_generic_recvmsg

The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.

As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file.  A proper
prototype has been added to the sys/socketvar.h header file.

API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)

Submitted by:	Steve Kiernan <stevek@juniper.net>
Reviewed by:	tuexen, rrs
Obtained from:	Juniper Networks, Inc.
2014-10-09 15:16:52 +00:00
Attilio Rao
ce42e79310 Remove dead code from umtx support:
- Retire long time unused (basically always unused) sys__umtx_lock()
  and sys__umtx_unlock() syscalls
- struct umtx and their supporting definitions
- UMUTEX_ERROR_CHECK flag
- Retire UMTX_OP_LOCK/UMTX_OP_UNLOCK from _umtx_op() syscall

__FreeBSD_version is not bumped yet because it is expected that further
breakages to the umtx interface will follow up in the next days.
However there will be a final bump when necessary.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jhb
2014-03-18 21:32:03 +00:00
John Baldwin
55648840de Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes.  MADV_PROTECT still works for backwards
  compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
John Baldwin
84c21af119 Fix the type of the idtype argument to wait6() in syscalls.master.
Approved by:	re (kib)
MFC after:	1 week
2013-09-12 17:52:18 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Gleb Smirnoff
6160e12c10 Add new system call - aio_mlock(). The name speaks for itself. It allows
to perform the mlock(2) operation, which can consume a lot of time, under
control of aio(4).

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:27:57 +00:00
Konstantin Belousov
48947eccee Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master.  Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by:	jhb
Pointy hat to:	kib
MFC after:	1 week
2013-05-21 11:40:16 +00:00
Jilles Tjoelker
dc570d5e56 Add pipe2() system call.
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.

Reviewed by:	kan, kib
2013-05-01 22:42:42 +00:00
Jilles Tjoelker
da7d2afb6d Add accept4() system call.
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
2013-05-01 20:10:21 +00:00
Matthew D Fleming
e324bf91e8 Fix return type of extattr_set_* and fix rmextattr(8) utility.
extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*.  Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.

MFC after:	1 week
2013-04-02 05:30:41 +00:00
Pawel Jakub Dawidek
e948704e4b Implement chflagsat(2) system call, similar to fchmodat(2), but operates on
file flags.

Reviewed by:	kib, jilles
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:59:01 +00:00
Pawel Jakub Dawidek
b4b2596b97 - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
  in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
  for lchflags(2)) stated that it was u_long. Now some related functions
  use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on:	arch
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:44:33 +00:00
Pawel Jakub Dawidek
7493f24ee6 - Implement two new system calls:
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
	int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

  which allow to bind and connect respectively to a UNIX domain socket with a
  path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
  the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
  in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by:	The FreeBSD Foundation
Discussed with:	rwatson, jilles, kib, des
2013-03-02 21:11:30 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Konstantin Belousov
f13b5a0f01 Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
David Xu
d65f1abca7 Implement syscall clock_getcpuclockid2, so we can get a clock id
for process, thread or others we want to support.
Use the syscall to implement POSIX API clock_getcpuclock and
pthread_getcpuclockid.

PR:	168417
2012-08-17 02:26:31 +00:00
Ed Schouten
520b6a84f6 Remove use of non-ISO-C integer types from system call tables.
These files already use ISO-C-style integer types, so make them less
inconsistent by preferring the standard types.
2012-05-25 21:50:48 +00:00
Lawrence Stewart
cf13a58510 - Add the ffclock_getcounter(), ffclock_getestimate() and ffclock_setestimate()
system calls to provide feed-forward clock management capabilities to
  userspace processes. ffclock_getcounter() returns the current value of the
  kernel's feed-forward clock counter. ffclock_getestimate() returns the current
  feed-forward clock parameter estimates and ffclock_setestimate() updates the
  feed-forward clock parameter estimates.

- Document the syscalls in the ffclock.2 man page.

- Regenerate the script-derived syscall related files.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-21 01:26:10 +00:00
Ed Schouten
d3a993d46b Improve *access*() parameter name consistency.
The current code mixes the use of `flags' and `mode'. This is a bit
confusing, since the faccessat() function as a `flag' parameter to store
the AT_ flag.

Make this less confusing by using the same name as used in the POSIX
specification -- `amode'.
2011-11-19 06:35:15 +00:00
John Baldwin
936c09ac0f Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00
Jonathan Anderson
cfb5f76865 Add experimental support for process descriptors
A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-18 22:51:30 +00:00
Jonathan Anderson
cfb9df5541 Add cap_new() and cap_getrights() system calls.
Implement two previously-reserved Capsicum system calls:
- cap_new() creates a capability to wrap an existing file descriptor
- cap_getrights() queries the rights mask of a capability.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
2011-07-15 18:26:19 +00:00
Matthew D Fleming
d91f88f7f3 Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by:	-arch (previous version)
2011-04-18 16:32:22 +00:00
Edward Tomasz Napierala
ec125fbbc5 Add rctl. It's used by racct to take user-configurable actions based
on the set of rules it maintains and the current resource usage.  It also
privides userland API to manage that ruleset.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-03-30 17:48:15 +00:00
Edward Tomasz Napierala
2bfc50bc4f Add two new system calls, setloginclass(2) and getloginclass(2). This makes
it possible for the kernel to track login class the process is assigned to,
which is required for RCTL.  This change also make setusercontext(3) call
setloginclass(2) and makes it possible to retrieve current login class using
id(1).

Reviewed by:	kib (as part of a larger patch)
2011-03-05 12:40:35 +00:00
Robert Watson
96fcc75fdf Add initial support for Capsicum's Capability Mode to the FreeBSD kernel,
compiled conditionally on options CAPABILITIES:

Add a new credential flag, CRED_FLAG_CAPMODE, which indicates that a
subject (typically a process) is in capability mode.

Add two new system calls, cap_enter(2) and cap_getmode(2), which allow
setting and querying (but never clearing) the flag.

Export the capability mode flag via process information sysctls.

Sponsored by:	Google, Inc.
Reviewed by:	anderson
Discussed with:	benl, kris, pjd
Obtained from:	Capsicum Project
MFC after:	3 months
2011-03-01 13:23:37 +00:00
Konstantin Belousov
8d19559bde Make the syscalls reserved for AFS usable by OpenAFS port.
Submitted by:	Benjamin Kaduk <kaduk mit edu>
MFC after:	2 weeks
2010-08-30 14:24:44 +00:00
Konstantin Belousov
13561ed4ed Fix typo.
Submitted by:	Ben Kaduk <minimarmot gmail com>
2010-08-26 11:20:57 +00:00
Konstantin Belousov
153ac44cf6 Count number of threads that enter and leave dynamically registered
syscalls. On the dynamic syscall deregistration, wait until all
threads leave the syscall code. This somewhat increases the safety
of the loadable modules unloading.

Reviewed by:	jhb
Tested by:	pho
MFC after:	1 month
2010-06-28 18:06:46 +00:00
Ed Schouten
790f66db55 Remove unused LIBCOMPAT keyword from syscalls.master. 2010-02-08 10:02:01 +00:00
Konstantin Belousov
066d836b02 Current pselect(3) is implemented in usermode and thus vulnerable to
well-known race condition, which elimination was the reason for the
function appearance in first place. If sigmask supplied as argument to
pselect() enables a signal, the signal might be delivered before thread
called select(2), causing lost wakeup. Reimplement pselect() in kernel,
making change of sigmask and sleep atomic.

Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not gelivered during
syscall execution.

Reviewed by:	davidxu
Tested by:	pho
MFC after:	1 month
2009-10-27 10:55:34 +00:00
Robert Watson
72c35fc69c Reserve system call numbers for Capsicum security framework capabilities,
capability mode, and process descriptors: cap_new, cap_getrights, cap_enter,
cap_getmode, pdfork, pdkill, pdgetpid, and pdwait.

Obtained from:	TrustedBSD Project
Sponsored by:	Google
MFC after:	3 weeks
2009-09-30 08:46:01 +00:00
Edward Tomasz Napierala
c38898116a There is an optimization in chmod(1), that makes it not to call chmod(2)
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.

This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.

Reviewed by:	rwatson (earlier version)
Approved by:	re (kib)
2009-07-08 15:23:18 +00:00
John Baldwin
b648d4806b Change the ABI of some of the structures used by the SYSV IPC API:
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
  short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
  short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
  (this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
  removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
  int.  This removes the need for the shm_bsegsz member in struct
  shmid_kernel and should allow for complete support of SYSV SHM regions
  >= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
  short.
- The shm_internal member of struct shmid_ds is now gone.  The internal
  VM object pointer for SHM regions has been moved into struct
  shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
  now marked COMPAT7 and new versions of those system calls which support
  the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc.  The
  FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
  symbol versions has been added to libc.  Version tags are added to
  system calls by adding an appropriate __sym_compat() entry to
  src/lib/libc/incldue/compat.h. [1]

PR:		kern/16195 kern/113218 bin/129855
Reviewed by:	arch@, rwatson
Discussed with:	kan, kib [1]
2009-06-24 21:10:52 +00:00
John Baldwin
45f48220bb Deprecate the msgsys(), semsys(), and shmsys() system calls by moving
them under COMPAT_FREEBSD[4567].  Starting with FreeBSD 5.0 the SYSV IPC
API was implemented via direct system calls (e.g. msgctl(), msgget(), etc.)
rather than indirecting through the var-args *sys() system calls.  The
shmsys() system call was already effectively deprecated for all but
COMPAT_FREEBSD4 already as its implementation for the !COMPAT_FREEBSD4 case
was to simply invoke nosys().
2009-06-24 20:01:13 +00:00
John Baldwin
3c366f1f14 Add a new COMPAT7 flag for FreeBSD 7.x compatibility system calls. 2009-06-24 13:36:37 +00:00
John Baldwin
0b0fe06a5e Fix a typo in a comment. 2009-06-22 20:12:40 +00:00
John Baldwin
21def99b51 - Add the ability to mix multiple flags seperated by pipe ('|') characters
in the type field of system call tables.  Specifically, one can now use
  the 'NO*' types as flags in addition to the 'COMPAT*' types.  For example,
  to tag 'COMPAT*' system calls as living in a KLD via NOSTD.  The COMPAT*
  type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
  makesyscalls.sh that return true if a requested flag is found in the
  type field ($3).  The flag() function checks all of the flags in the
  field, but type() only checks the first flag.  type() is meant to be
  used in the top-level "switch" statement and flag() should be used
  otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
  the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
  say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.
2009-06-17 19:50:38 +00:00
John Baldwin
0ec0b41cf6 Remove the now-unused NOIMPL flag. It serves no useful purpose given the
existing UNIMPL and NOSTD types.
2009-06-17 18:46:14 +00:00
John Baldwin
f425839118 - NOSTD results in lkmressys being used instead of lkmssys.
- Mark nfsclnt as UNIMPL.  It should have been NOSTD instead of NOIMPL back
  when it lived in nfsclient.ko, but it was removed from that a long time
  ago.
2009-06-17 18:44:15 +00:00
John Baldwin
c4f16b69e1 Add a new 'void closefrom(int lowfd)' system call. When called, it closes
any open file descriptors >= 'lowfd'.  It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD.  One difference from other *BSD is that this closefrom() does not
fail with any errors.  In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR.  DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.

Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads.  As such, it is not
multithread safe.

Submitted by:	rwatson (initial version)
Reviewed by:	rwatson
MFC after:	2 weeks
2009-06-15 20:38:55 +00:00
Jamie Gritton
b38ff370e4 Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2).  Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
  This replaces jail(2).
* jail_get, to read the parameters of existing jails.  This replaces the
  security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes.  The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by:	bz (mentor)
2009-04-29 21:14:15 +00:00
Ed Schouten
a1b5a8955e Mark uname(), getdomainname() and setdomainname() with COMPAT_FREEBSD4.
Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.

Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().

I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.

Reviewed by:	rdivacky, kib
2008-11-09 10:45:13 +00:00