Commit Graph

457 Commits

Author SHA1 Message Date
Konstantin Belousov
49d39308ba The posix_madvise(3) and posix_fadvise(2) should return error on
failure, same as posix_fallocate(2).

Noted by:	Bob Bishop <rb@gid.co.uk>
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-30 18:04:39 +00:00
Konstantin Belousov
2852de0489 The posix_fallocate(2) syscall should return error number on error,
without modifying errno.

Reported and tested by:	Gennady Proskurin <gpr@mail.ru>
Reviewed by:	mdf
PR:	standards/186028
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-23 17:24:26 +00:00
Adrian Chadd
0cfea1c8fc Implement a kqueue notification path for sendfile.
This fires off a kqueue note (of type sendfile) to the configured kqfd
when the sendfile transaction has completed and the relevant memory
backing the transaction is no longer in use by this transaction.
This is analogous to SF_SYNC waiting for the mbufs to complete -
except now you don't have to wait.

Both SF_SYNC and SF_KQUEUE should work together, even if it
doesn't necessarily make any practical sense.

This is designed for use by applications which use backing cache/store
files (eg Varnish) or POSIX shared memory (not sure anything is using
it yet!) to know when a region of memory is free for re-use.  Note
it doesn't mark the region as free overall - only free from this
transaction.  The application developer still needs to track which
ranges are in the process of being recycled and wait until all
pending transactions are completed.

TODO:

* documentation, as always

Sponsored by:	Netflix, Inc.
2014-01-17 05:26:55 +00:00
Adrian Chadd
a43caef195 Refactor out the common sendfile code from the do_sendfile() and the
compat32 sendfile syscall.

Sponsored by:	Netflix, Inc.
2014-01-09 00:11:14 +00:00
Adrian Chadd
79750e3b36 Migrate the sendfile_sync structure into a public(ish) API in preparation
for extending and reusing it.

The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper,
used to indicate that the backing store for a group of mbufs has completed.
It's only being used by sendfile for now and it's only implementing a
sleep/wakeup rendezvous.  However, there are other potential signaling
paths (kqueue) and other potential uses (socket zero-copy write) where the
same mechanism would also be useful.

So, with that in mind:

* extract the sendfile_sync code out into sf_sync_*() methods
* teach the sf_sync_alloc method about the current config flag -
  it will eventually know about kqueue.
* move the sendfile_sync code out of do_sendfile() - the only thing
  it now knows about is the sfs pointer.  The guts of the sync
  rendezvous (setup, rendezvous/wait, free) is now done in the
  syscall wrapper.
* .. and teach the 32-bit compat sendfile call the same.

This should be a no-op.  It's primarily preparation work for teaching
the sendfile_sync about kqueue notification.

Tested:

* Peter Holm's sendfile stress / regression scripts

Sponsored by:	Netflix, Inc.
2013-12-01 03:53:21 +00:00
Peter Wemm
b5019bc45b jail_v0.ip_number was always in host byte order. This was handled
in one of the many layers of indirection and shims through stable/7
in jail_handle_ips().  When it was cleaned up and unified through
kern_jail() for 8.x, the byte order swap was lost.

This only matters for ancient binaries that call jail(2) themselves
internally.
2013-11-28 19:40:33 +00:00
Konstantin Belousov
80c3af4e80 Add an kinfo sysctl to retrieve signal trampoline location for the
given process.

Note that the correctness of the trampoline length returned for ABIs
which do not use shared page depends on the correctness of the struct
sysvec sv_szsigcodebase member, which will be fixed on as-need basis.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-11-26 19:47:09 +00:00
Adrian Chadd
7689abaedc Fix the compat32 sendfile() to be in line with my recent changes.
Reminded by:	kib
2013-11-26 08:32:37 +00:00
Konstantin Belousov
3a2092bad0 Add padding to match the compat32 struct stat32 definition to the real
struct stat on 32bit architectures.

Debugged and tested by:	bsam
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-10-04 22:05:23 +00:00
Mark Johnston
8d305ba0dc Regenerate syscall argument strings after r255777.
Approved by:	re (gjb)
MFC after:	1 week
2013-09-21 23:06:36 +00:00
John Baldwin
a566e8e3c5 Regen.
Approved by:	re (delphij)
2013-09-19 18:56:00 +00:00
John Baldwin
55648840de Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes.  MADV_PROTECT still works for backwards
  compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
Jilles Tjoelker
9fdb497cd0 Regenerate for freebsd32_cap_enter().
Approved by:	re (hrs)
2013-09-17 20:49:05 +00:00
Jilles Tjoelker
529411c369 Disallow cap_enter() in freebsd32 compatibility mode.
The freebsd32 compatibility mode (for running 32-bit binaries on 64-bit
kernels) does not currently allow any system calls in capability mode, but
still permits cap_enter(). As a result, 32-bit binaries on 64-bit kernels
that use capability mode do not work (they crash after being disallowed to
call sys_exit()). Affected binaries include dhclient and uniq. The latter's
crashes cause obscure build failures.

This commit makes freebsd32 cap_enter() fail with [ENOSYS], as if capability
mode was not compiled in. Applications deal with this by doing their work
without capability mode.

This commit does not fix the uncommon situation where a 64-bit process
enters capability mode and then executes a 32-bit binary using fexecve().

This commit should be reverted when allowing the necessary freebsd32 system
calls in capability mode.

Reviewed by:	pjd
Approved by:	re (hrs)
2013-09-17 20:48:19 +00:00
John Baldwin
edb572a38c Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
Pawel Jakub Dawidek
00a7f703b3 Regenerate after r255219.
Sponsored by:	The FreeBSD Foundation
2013-09-05 00:11:59 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Will Andrews
5e9ccc8797 Add the ability to display the default FIB number for a process to the
ps(1) utility, e.g. "ps -O fib".

bin/ps/keyword.c:
	Add the "fib" keyword and default its column name to "FIB".

bin/ps/ps.1:
	Add "fib" as a supported keyword.

sys/compat/freebsd32/freebsd32.h:
sys/kern/kern_proc.c:
sys/sys/user.h:
	Add the default fib number for a process (p->p_fibnum)
	to the user land accessible process data of struct kinfo_proc.

Submitted by:	Oliver Fromme <olli@fromme.com>, gibbs
2013-08-26 23:48:21 +00:00
Pawel Jakub Dawidek
f69e5b30c9 Regenerate after r254491. 2013-08-18 13:38:39 +00:00
Pawel Jakub Dawidek
32536142f6 The cap_rights_limit(2) system calls needs a wrapper for 32bit binaries
running under 64bit kernels as the 'rights' argument has to be split
into two registers or the half of the rights will disappear.

Reported by:	jilles
Sponsored by:	The FreeBSD Foundation
2013-08-18 13:37:54 +00:00
Pawel Jakub Dawidek
d6f6b87647 Move the PAIR32TO64() macro and the RETVAL_HI/RETVAL_LO defines to a
header file for use by other .c files.

Sponsored by:	The FreeBSD Foundation
2013-08-18 13:34:11 +00:00
Pawel Jakub Dawidek
e11dc435ba Regenerate after r254481. 2013-08-18 10:31:30 +00:00
Pawel Jakub Dawidek
0dac22d8ea Implement 32bit versions of the cap_ioctls_limit(2) and cap_ioctls_get(2)
system calls as unsigned longs have different size on i386 and amd64.

Reported by:	jilles
Sponsored by:	The FreeBSD Foundation
2013-08-18 10:30:41 +00:00
Pawel Jakub Dawidek
9a57f6e88c Regenerate after r254447.
Sponsored by:	The FreeBSD Foundation
2013-08-17 14:18:41 +00:00
Pawel Jakub Dawidek
b49f2e4b48 Make pdfork(2), pdkill(2) and pdgetpid(2) syscalls available for 32bit
binaries running under 64bit kernel.

Sponsored by:	The FreeBSD Foundation
2013-08-17 14:17:13 +00:00
Gleb Smirnoff
ca04d21d5f Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-08-15 07:54:31 +00:00
Konstantin Belousov
68044954d2 Regenerate. 2013-07-21 19:44:53 +00:00
Konstantin Belousov
643ee87175 Implement compat32 wrappers for the ktimer_* syscalls.
Reported, reviewed and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:43:52 +00:00
Konstantin Belousov
058262c69a Wrap kmq_notify(2) for compat32 to properly consume struct sigevent32
argument.

Reviewed and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:40:30 +00:00
Konstantin Belousov
ab751a525b The freebsd32_lio_listio() compat syscall takes the struct sigevent32.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:36:53 +00:00
Konstantin Belousov
9731998997 Move the convert_sigevent32() utility function into freebsd32_misc.c
for consumption outside the vfs_aio.c.

For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods,
also copy in the sigev_value, since librt event pumping loop compares
note generation number with the value passed through sigev_value.

Tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:33:48 +00:00
Konstantin Belousov
df4d0ed1cd Cosmetic change, use the same union name on the left and right sides
of the conversion.

Tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:17:46 +00:00
Konstantin Belousov
c5d0363709 Regenerate 2013-07-20 13:40:03 +00:00
Konstantin Belousov
d31e4b3a58 id_t is 64bit, provide the compat32 wrapper for clock_getcpuclockid2(2).
Reported and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
PR:	threads/180652
Sponsored by:	The FreeBSD Foundation
2013-07-20 13:39:41 +00:00
Gleb Smirnoff
8d1aa3c6b4 aio_mlock() added:
- Regen for r251526.
  - Bump __FreeBSD_version.
2013-06-08 13:30:13 +00:00
Gleb Smirnoff
6160e12c10 Add new system call - aio_mlock(). The name speaks for itself. It allows
to perform the mlock(2) operation, which can consume a lot of time, under
control of aio(4).

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:27:57 +00:00
David E. O'Brien
1f4e9654bd Add a "kern.features" MIB for 32bit support under a 64bit kernel. 2013-05-31 21:43:17 +00:00
Konstantin Belousov
f85769eb75 Regenerate. 2013-05-21 11:41:08 +00:00
Konstantin Belousov
48947eccee Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master.  Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by:	jhb
Pointy hat to:	kib
MFC after:	1 week
2013-05-21 11:40:16 +00:00
Jilles Tjoelker
b201f4a0dc Regenerate files for pipe2(). 2013-05-01 22:45:04 +00:00
Jilles Tjoelker
dc570d5e56 Add pipe2() system call.
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.

Reviewed by:	kan, kib
2013-05-01 22:42:42 +00:00
Jilles Tjoelker
1bf6b724f1 Regenerate files for accept4(). 2013-05-01 20:12:58 +00:00
Jilles Tjoelker
da7d2afb6d Add accept4() system call.
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
2013-05-01 20:10:21 +00:00
Matthew D Fleming
b3e6bbc676 Regen.
MFC after:	1 week
2013-04-02 05:30:52 +00:00
Matthew D Fleming
e324bf91e8 Fix return type of extattr_set_* and fix rmextattr(8) utility.
extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*.  Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.

MFC after:	1 week
2013-04-02 05:30:41 +00:00
Pawel Jakub Dawidek
5d46382415 Regenerate after r248599.
Sponsored by:	The FreeBSD Foundation
2013-03-21 23:02:19 +00:00
Pawel Jakub Dawidek
e948704e4b Implement chflagsat(2) system call, similar to fchmodat(2), but operates on
file flags.

Reviewed by:	kib, jilles
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:59:01 +00:00
Pawel Jakub Dawidek
14cd1ffdf8 Regenerate after r248597.
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:47:03 +00:00
Pawel Jakub Dawidek
b4b2596b97 - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
  in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
  for lchflags(2)) stated that it was u_long. Now some related functions
  use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on:	arch
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:44:33 +00:00
Pawel Jakub Dawidek
378a73d1bd Regen after r247667. 2013-03-02 21:12:54 +00:00