Commit Graph

291 Commits

Author SHA1 Message Date
John Baldwin
005ce8e4e6 Fix locking issues with aio_fsync().
- Use correct lock in aio_cancel_sync when dequeueing job.
- Add _locked variants of aio_set/clear_cancel_function and use those
  to avoid lock recursion when adding and removing fsync jobs to the
  per-process sync queue.
- While here, add a basic test for aio_fsync().

PR:		211390
Reported by:	Randy Westlund <rwestlun@gmail.com>
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7339
2016-07-29 18:26:15 +00:00
John Baldwin
b9a53e161b Adjust tests in fsync job scheduling loop to reduce indentation. 2016-07-27 19:31:25 +00:00
John Baldwin
9c20dc9963 Add more documentation regarding unsafe AIO requests.
The asynchronous I/O changes made previously result in different
behavior out of the box. Previously all AIO requests failed with
ENOSYS / SIGSYS unless aio.ko was explicitly loaded. Now, some AIO
requests complete and others ("unsafe" requests) fail with EOPNOTSUPP.

Reword the introductory paragraph in aio(4) to add a general
description of AIO before describing the vfs.aio.enable_unsafe sysctl.

Remove the ENOSYS error description from aio_fsync(2), aio_read(2),
and aio_write(2) and replace it with a description of EOPNOTSUPP.

Remove the ENOSYS error description from aio_mlock(2).

Log a message to the system log the first time a process requests an
"unsafe" AIO request that fails with EOPNOTSUPP. This is modeled on
the log message used for processes using the legacy pty devices.

Reviewed by:	kib (earlier version)
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D7151
2016-07-21 22:49:47 +00:00
Konstantin Belousov
9fe297bbdc Declare aio requests on files from local filesystems safe.
Two notes:
- I allow AIO on reclaimed vnodes, since it is deterministically terminated
  fast.
- devfs mounts are marked as MNT_LOCAL, but device vnodes have type
  VCHR, so the slow device io is not allowed.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D7273
2016-07-21 17:07:06 +00:00
John Baldwin
b1012d8036 Account for AIO socket operations in thread/process resource usage.
File and disk-backed I/O requests store counts of read/written disk
blocks in each AIO job so that they can be charged to the thread that
completes an AIO request via aio_return() or aio_waitcomplete().  This
change extends AIO jobs to store counts of received/sent messages and
updates socket backends to set these counts accordingly.  Note that
the socket backends are careful to only charge a single messages for
each AIO request even though a single request on a blocking socket might
invoke sosend or soreceive multiple times.  This is to mimic the
resource accounting of synchronous read/write.

Adjust the UNIX socketpair AIO test to verify that the message resource
usage counts update accordingly for aio_read and aio_write.

Approved by:	re (hrs)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D6911
2016-06-21 22:19:06 +00:00
John Baldwin
fe0bdd1d2c Move backend-specific fields of kaiocb into a union.
This reduces the size of kaiocb slightly. I've also added some generic
fields that other backends can use in place of the BIO-specific fields.

Change the socket and Chelsio DDP backends to use 'backend3' instead of
abusing _aiocb_private.status directly. This confines the use of
_aiocb_private to the AIO internals in vfs_aio.c.

Reviewed by:	kib (earlier version)
Approved by:	re (gjb)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D6547
2016-06-15 20:56:45 +00:00
John Baldwin
f0ec174043 Consistently set status to -1 when completing an AIO request with an error.
Sponsored by:	Chelsio Communications
2016-05-20 19:46:25 +00:00
John Baldwin
4d805eacfa Tidy up the unmapped I/O code in qphysio.
- Move some blocks around to reduce the number of 'if (unmap)' checks.
- Use 'pbuf == NULL' instead of 'unmap'.
- Use nitems.
- Pull an assignment out of an if expression.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
2016-03-31 17:27:30 +00:00
John Baldwin
bb430bc740 Fully handle size_t lengths in AIO requests.
First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.

POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write().  aio_waitcomplete() should use
ssize_t for the same reason.

aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated.  aio_waitcomplete() has always
returned int.

Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.

Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.

aio_read/write should now honor the same length limits as normal read/write.

Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.

Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.
2016-03-21 21:37:33 +00:00
Pedro F. Giffuni
5166fdde7f aio_qphysio(): Avoid uninitialized pointer read on error.
For the !unmap case it may happen that pbuf gets called unreferenced
when vm_fault_quick_hold_pages() fails.
Initialize it so it doesn't cause trouble.

CID:		1352776
Reviewed by:	jhb
MFC after:	1 week
2016-03-18 19:04:01 +00:00
John Baldwin
399e8c1773 Simplify AIO initialization now that it is standard.
- Mark AIO system calls as STD and remove the helpers to dynamically
  register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
  an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
  the pathconf() system call.  fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5589
2016-03-09 19:05:11 +00:00
John Baldwin
f3215338ef Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order.  Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels.  The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon.  This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system.  To protect against this, AIO
requests are only permitted for known "safe" files by default.  AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value.  The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default.  aio_mlock() is also enabled by default.

Reviewed by:	cem, jilles
Discussed with:	kib (earlier version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5289
2016-03-01 18:12:14 +00:00
John Baldwin
5652770d8f Rename aiocblist to kaiocb and use consistent variable names.
Typically <foo>list is used for a structure that holds a list head in
FreeBSD, not for members of a list.  As such, rename 'struct aiocblist'
to 'struct kaiocb' (the kernel version of 'struct aiocb').

While here, use more consistent variable names for AIO control blocks:

- Use 'job' instead of 'aiocbe', 'cb', 'cbe', or 'iocb' for kernel job
  objects.
- Use 'jobn' instead of 'cbn' for use with TAILQ_FOREACH_SAFE().
- Use 'sjob' and 'sjobn' instead of 'scb' and 'scbn' for fsync jobs.
- Use 'ujob' instead of 'aiocbp', 'job', 'uaiocb', or 'uuaiocb' to hold
  a user pointer to a 'struct aiocb'.
- Use 'ujobp' instead of 'aiocbp' for a user pointer to a 'struct aiocb *'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5125
2016-02-05 20:38:09 +00:00
John Baldwin
0dd6c0352b Various style fixes.
- Wrap long lines.
- Fix indentation.
- Remove excessive parens.
- Whitespace fixes in struct definitions.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5025
2016-01-26 21:24:49 +00:00
John Baldwin
39314b7d99 AIO daemons have always been kernel processes to facilitate switching to
user VM spaces while servicing jobs.  Update various comments and data
structures that refer to AIO daemons as threads to refer to processes
instead.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4999
2016-01-21 02:20:38 +00:00
John Baldwin
4429f0e2e3 Remove unused variables for socket AIO.
In r55943, a per-process queue of pending socket AIO requests (requests
waiting for the socket to become ready) was added so that they could be
cancelled during process rundown.  In r154765, the rundown code was
changed to handle jobs in this state (JOBST_JOBQSOCK) directly removing
the need for the extra queue.  However, the per-process queue head and
global lock were never removed.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4997
2016-01-21 01:28:31 +00:00
John Baldwin
8a4dc40ff4 Various cleanups to the main function for AIO kernel processes:
- Pull the vmspace logic out into helper functions and reduce duplication.
  Operations on the vmspace are all isolated to vm_map.c, but it now exports
  a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
  perform cleanup after the loop end.  This reduces a lot of indentation and
  allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4990
2016-01-19 21:37:51 +00:00
John Baldwin
f2e7f06a0d Don't create a dedicated session for each AIO kernel process.
This code dates back to the initial AIO support and the commit log does
not explain why it is needed.  However, I cannot find anything in the
AIO code or the various file methods (fo_read/fo_write) that would change
behavior due to using a private session instead of proc0's session.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4988
2016-01-19 20:46:30 +00:00
John Baldwin
6c8fd02283 Remove aiod_timeout.
It hasn't been used since the AIO code was made MPSAFE 10 years ago.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4946
2016-01-14 21:28:56 +00:00
John Baldwin
c85650cacc Rename aiod_bio taskqueue to aiod_kick.
This taskqueue is not used to handle bio requests.  It is only used to
run aio_kick_nowait() to spin up new aio daemon processes.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D4904
2016-01-14 20:51:48 +00:00
Pawel Jakub Dawidek
38d68e2d42 The aio_waitcomplete(2) syscall should not sleep when the given timeout
is 0. Without this change it was sleeping for one tick. Maybe not a big
deal, but it makes share/dtrace/blocking script to report that.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D3814
Sponsored by:	Wheel Systems, http://wheelsystems.com
2015-10-25 18:48:09 +00:00
Konstantin Belousov
9889bbac23 Mutex memory is not zeroed, add MTX_NEW.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-06 14:09:00 +00:00
Mateusz Guzik
f131759f54 fd: make 'rights' a manadatory argument to fget* functions 2015-07-05 19:05:16 +00:00
Alexander Motin
f743d981f3 Make AIO to not allocate pbufs for unmapped I/O like r281825.
While there, make few more performance optimizations.

On 40-core system doing many 512-byte AIO reads from array of raw SSDs
this change removes lock congestions inside pbuf allocator and devfs,
and bottleneck on single AIO completion taskqueue thread.  It improves
peak AIO performance from ~600K to ~1.3M IOPS.

MFC after:	2 weeks
2015-04-22 18:11:34 +00:00
Mateusz Guzik
e015b1ab0a Avoid dynamic syscall overhead for statically compiled modules.
The kernel tracks syscall users so that modules can safely unregister them.

But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.

Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-26 19:42:44 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
Pawel Jakub Dawidek
96a62209fb The fget() function now takes pointer to cap_rights_t, so change 0 to NULL. 2013-09-05 11:59:23 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Kenneth D. Merry
ce625ec719 Change the way that unmapped I/O capability is advertised.
The previous method was to set the D_UNMAPPED_IO flag in the cdevsw
for the driver.  The problem with this is that in many cases (e.g.
sa(4)) there may be some instances of the driver that can handle
unmapped I/O and some that can't.  The isp(4) driver can handle
unmapped I/O, but the esp(4) driver currently cannot.  The cdevsw
is shared among all driver instances.

So instead of setting a flag on the cdevsw, set a flag on the cdev.
This allows drivers to indicate support for unmapped I/O on a
per-instance basis.

sys/conf.h:	Remove the D_UNMAPPED_IO cdevsw flag and replace it
		with an SI_UNMAPPED cdev flag.

kern_physio.c:	Look at the cdev SI_UNMAPPED flag to determine
		whether or not a particular driver can handle
		unmapped I/O.

geom_dev.c:	Set the SI_UNMAPPED flag for all GEOM cdevs.
		Since GEOM will create a temporary mapping when
		needed, setting SI_UNMAPPED unconditionally will
		work.

		Remove the D_UNMAPPED_IO flag.

nvme_ns.c:	Set the SI_UNMAPPED flag on cdevs created here
		if NVME_UNMAPPED_BIO_SUPPORT is enabled.

vfs_aio.c:	In aio_qphysio(), check the SI_UNMAPPED flag on a
		cdev instead of the D_UNMAPPED_IO flag on the cdevsw.

sys/param.h:	Bump __FreeBSD_version to 1000045 for the switch from
		setting the D_UNMAPPED_IO flag in the cdevsw to setting
		SI_UNMAPPED in the cdev.

Reviewed by:	kib, jimharris
MFC after:	1 week
Sponsored by:	Spectra Logic
2013-08-15 22:52:39 +00:00
Gleb Smirnoff
977c7043eb Remove extra zeroing after M_ZERO allocation. 2013-08-02 13:06:49 +00:00
Konstantin Belousov
9731998997 Move the convert_sigevent32() utility function into freebsd32_misc.c
for consumption outside the vfs_aio.c.

For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods,
also copy in the sigev_value, since librt event pumping loop compares
note generation number with the value passed through sigev_value.

Tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:33:48 +00:00
Gleb Smirnoff
6160e12c10 Add new system call - aio_mlock(). The name speaks for itself. It allows
to perform the mlock(2) operation, which can consume a lot of time, under
control of aio(4).

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:27:57 +00:00
Gleb Smirnoff
f95c13db04 Separate LIO_SYNC processing into a separate function aio_process_sync(),
and rename aio_process() into aio_process_rw().

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
2013-06-08 13:02:43 +00:00
Konstantin Belousov
f3215a60fd Fix a race with the vnode reclamation in the aio_qphysio(). Obtain
the thread reference on the vp->v_rdev and use the returned struct
cdev *dev instead of using vp->v_rdev.  Call dev_strategy_csw()
instead of dev_strategy(), since we now own the reference.

Since the csw was already calculated, test d_flags to avoid mapping
the buffer if the driver supports unmapped requests [*].

Suggested by:	kan [*]
Reviewed by:	kan (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-27 11:47:52 +00:00
Konstantin Belousov
e81ff91e62 Do not remap usermode pages into KVA for physio.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:43:57 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
David Xu
d56e058a79 Add 32-bit compat code for AIO kevent flags introduced in revision 230857. 2012-02-05 04:49:31 +00:00
David Xu
fde809356a If multiple threads call kevent() to get AIO events on same kqueue fd,
it is possible that a single AIO event will be reported to multiple
threads, it is not threading friendly, and the existing API can not
control this behavior.
Allocate a kevent flags field sigev_notify_kevent_flags for AIO event
notification in sigevent, and allow user to pass EV_CLEAR, EV_DISPATCH
or EV_ONESHOT to AIO kernel code, user can control whether the event
should be cleared once it is retrieved by a thread. This change should
be comptaible with existing application, because the field should have
already been zero-filled, and no additional action will be taken by
kernel.

PR:	kern/156567
2012-02-01 02:53:06 +00:00
Doug Ambrisko
8e9fc27818 When detaching an AIO or LIO requests grab the lock and tell knlist_remove
that we have the lock now.  This cleans up a locking panic ASSERT when
knlist_empty is called without a lock when INVARIANTS etc. are turned.

Reviewed by:	kib jhb
MFC after:	1 week
2012-01-30 19:19:22 +00:00
Gleb Smirnoff
94fce84763 Fix size check, that prevents getting negative after casting
to a signed type

Reviewed by:	bde
2012-01-27 08:58:58 +00:00
Gleb Smirnoff
434ea137cc Although aio_nbytes is size_t, later is is signed to
casted types: to ssize_t in filesystem code and to
int in buf code, thus supplying a negative argument
leads to kernel panic later. To fix that check user
supplied argument in the beginning of syscall.

Submitted by:	Maxim Dounin <mdounin mdounin.ru>, maxim@
2012-01-26 11:59:48 +00:00
Kip Macy
8451d0dd78 In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
Robert Watson
a9d2f8d84f Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
David Xu
cf7d9a8ca8 Create a global thread hash table to speed up thread lookup, use
rwlock to protect the table. In old code, thread lookup is done with
process lock held, to find a thread, kernel has to iterate through
process and thread list, this is quite inefficient.
With this change, test shows in extreme case performance is
dramatically improved.

Earlier patch was reviewed by: jhb, julian
2010-10-09 02:50:23 +00:00
Konstantin Belousov
723d37c0ac Convert aio syscall registration to SYSCALL_INIT_HELPER.
Reviewed by:	jhb
MFC after:	2 weeks
2010-03-19 11:11:34 +00:00
Nathan Whitehorn
841c0c7ec7 Provide groundwork for 32-bit binary compatibility on non-x86 platforms,
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.

Reviewed by:	kib, jhb
2010-03-11 14:49:06 +00:00
Robert Watson
e76d823b81 Use C99 initialization for struct filterops.
Obtained from:	Mac OS X
Sponsored by:	Apple Inc.
MFC after:	3 weeks
2009-09-12 20:03:45 +00:00
Konstantin Belousov
d8b0556c6d Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by:	jhb [1]
Noted by:	jhb [2]
Reviewed by:	jhb
Tested by:	rnoland
2009-06-10 20:59:32 +00:00