Commit Graph

551 Commits

Author SHA1 Message Date
pjd
c0b1af2f13 - Require CAP_FSYNC capability right when opening a file with O_SYNC or O_FSYNC
flags.
- While here simplify check for locking flags.

Sponsored by:	The FreeBSD Foundation
2013-02-17 11:53:51 +00:00
kib
92d95b8406 Stop translating the ERESTART error from the open(2) into EINTR.
Posix requires that open(2) is restartable for SA_RESTART.

For non-posix objects, in particular, devfs nodes, still disable
automatic restart of the opens. The open call to a driver could have
significant side effects for the hardware.

Noted and reviewed by:	jilles
Discussed with:	bde
MFC after:	2 weeks
2013-02-07 14:53:33 +00:00
pjd
2163564eab Now that MPSAFE flag is gone, we can arrange code a bit better. 2013-01-31 22:20:05 +00:00
pjd
8a682d18ff Remove leftover label after Giant removal from VFS. 2013-01-31 22:15:41 +00:00
ed
8467024240 Remove unused `vfslocked' variable.
I have no idea what this `vfslocked' thing means. I wonder how it ended
up here.
2012-10-22 21:14:26 +00:00
kib
560aa751e0 Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
kib
867cb9c7c5 Acquire the rangelock for truncate(2) as well.
Reported and reviewed by:	avg
Tested by:	pho
MFC after:	1 week
2012-10-15 18:15:18 +00:00
pjd
ef5782071f - Enforce CAP_MKFIFO on mkfifoat(2), not on mknodat(2). Without this change
mkfifoat(2) was not restricted.
- Introduce CAP_MKNOD and enforce it on mknodat(2).

Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-10-01 05:43:24 +00:00
pjd
1d5d62ac36 Require CAP_DELETE on directory descriptor for unlinkat(2).
Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 21:00:36 +00:00
pjd
4816885ff1 Require CAP_CREATE on directory descriptor for symlinkat(2).
Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 20:59:12 +00:00
pjd
5ca6c6bd61 Require CAP_CREATE on directory descriptor for linkat(2).
Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 20:58:15 +00:00
pjd
76c124139f O_EXEC flag is not part of the O_ACCMODE mask, check it separately.
If O_EXEC is provided don't require CAP_READ/CAP_WRITE, as O_EXEC
is mutually exclusive to O_RDONLY/O_WRONLY/O_RDWR.

Without this change CAP_FEXECVE capability right is not enforced.

Sponsored by:	FreeBSD Foundation
MFC after:	3 days
2012-09-25 20:48:49 +00:00
jhb
cdbfd348a5 Reorder the managament of advisory locks on open files so that the advisory
lock is obtained before the write count is increased during open() and the
lock is released after the write count is decreased during close().

The first change closes a race where an open() that will block with O_SHLOCK
or O_EXLOCK can increase the write count while it waits.  If the process
holding the current lock on the file then tries to call exec() on the file
it has locked, it can fail with ETXTBUSY even though the advisory lock is
preventing other threads from succesfully completeing a writable open().

The second change closes a race where a read-only open() with O_SHLOCK or
O_EXLOCK may return successfully while the write count is non-zero due to
another descriptor that had the advisory lock and was blocking the open()
still being in the process of closing.  If the process that completed the
open() then attempts to call exec() on the file it locked, it can fail with
ETXTBUSY even though the other process that held a write lock has closed
the file and released the lock.

Reviewed by:	kib
MFC after:	1 month
2012-07-31 18:25:00 +00:00
kib
53224f018a Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by:	pho
No objections from:	jhb
MFC after:    3 weeks
2012-07-02 21:01:03 +00:00
jhb
571562fffb Further refine the implementation of POSIX_FADV_NOREUSE.
First, extend the changes in r230782 to better handle the common case
of using NOREUSE with sequential reads.  A NOREUSE file descriptor
will now track the last implicit DONTNEED request it made as a result
of a NOREUSE read.  If a subsequent NOREUSE read is adjacent to the
previous range, it will apply the DONTNEED request to the entire range
of both the previous read and the current read.  The effect is that
each read of a file accessed sequentially will apply the DONTNEED
request to the entire range that has been read.  This allows NOREUSE
to properly handle misaligned reads by flushing each buffer to cache
once it has been completely read.

Second, apply the same changes made to read(2) by r230782 and this
change to writes.  This provides much better performance in the
sequential write case as it allows writes to still be clustered.  It
also provides much better performance for misaligned writes.  It does
mean that NOREUSE will be generally ineffective for non-sequential
writes as the current implementation relies on a future NOREUSE
write's implicit DONTNEED request to flush the dirty buffer from the
current write.

MFC after:	2 weeks
2012-06-19 18:42:24 +00:00
pjd
0123f7ed5a Now that dupfdopen() doesn't depend on finstall() being called earlier,
indx will never be -1 on error, as none of dupfdopen(), finstall() and
kern_capwrap() modifies it on error, but what is more important none of
those functions install and leave file at indx descriptor on error.

Leave an assert to prove my words.

MFC after:	1 month
2012-06-13 21:38:07 +00:00
pjd
f695b590b4 Allocate descriptor number in dupfdopen() itself instead of depending on
the caller using finstall().
This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes
a race between finstall() and dupfdopen().

MFC after:	1 month
2012-06-13 21:32:35 +00:00
pjd
f7e18321ef - Remove nfp variable that is not really needed.
- Update comment.
- Style nits.

MFC after:	1 month
2012-06-13 21:22:35 +00:00
pjd
219cd5caaa Remove duplicated code.
MFC after:	1 month
2012-06-13 21:15:01 +00:00
pjd
5d3532ce69 Add missing {.
MFC after:	1 month
2012-06-13 21:13:18 +00:00
pjd
c745de62f2 Style.
MFC after:	1 month
2012-06-13 21:11:58 +00:00
pjd
54a86dc320 There is no need to set td->td_retval[0] to -1 on error.
Confirmed by:	jhb
MFC after:	1 month
2012-06-13 21:10:00 +00:00
pjd
859bb04daa Style fixes and simplifications.
MFC after:	1 month
2012-06-11 16:08:03 +00:00
jhb
176ddf31c3 Split the second half of vn_open_cred() (after a vnode has been found via
a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function
and use this function in fhopen() instead of duplicating code from
vn_open_cred() directly.

Tested by:	pho
Reviewed by:	kib
MFC after:	2 weeks
2012-06-08 18:32:09 +00:00
gleb
3c7243df78 Add kern_fhstat(), adjust sys_fhstat() to use it.
Extend kern_getdirentries() to accept uio segflag and optionally return
buffer residue.

Sponsored by:	Google Summer of Code 2011
2012-05-24 08:00:26 +00:00
jh
433fc8eeff The value of flags matching VNOVAL can't be supported. Return EOPNOTSUPP
from setfflags() in this case. This fixes the return value of
chflags(path, -1).

Discussed with:	bde
MFC after:	2 weeks
2012-04-20 10:08:30 +00:00
pho
c84e05a07c Perform the parameter validation before assigning it to a signed int
variable. This fixes the problem seen with readdir(3) fuzzing.

Submitted by:	bde
MFC after:	1 week
2012-03-09 21:31:12 +00:00
pho
81cae127b0 Free up allocated memory used by posix_fadvise(2). 2012-03-08 20:34:13 +00:00
jhb
19feaba08b Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
kib
80ae8fe82c Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with:	bde, das (previous versions)
MFC after:	1 month
2012-02-21 01:05:12 +00:00
kib
52c17430bc Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return.  Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by:	mckusick
Tested by:	scottl
MFC after:	2 weeks
2012-02-06 11:04:36 +00:00
kib
5ccad4b353 Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon().
The vfs_busy() is after covered vnode lock in the global lock order, but
since quotaon() does recursive VFS call to open quota file, we usually
end up locking covered vnode after mp is busied in sys_quotactl().

Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied
by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon,
esp. if vfs_busy cannot succeed due to unmount being performed.

Reported and tested by:	pho
MFC after:	1 week
2012-01-08 23:06:53 +00:00
jhb
fa881e0cf4 Fire a kevent if necessary after seeking on a regular file. This fixes a
case where a kevent would not fire on a regular file if an application read
to EOF and then seeked backwards into the file.

Reviewed by:	kib
MFC after:	2 weeks
2011-12-16 20:10:00 +00:00
eadler
3072a90209 Document a large number of currently undocumented sysctls. While here
fix some style(9) issues and reduce redundancy.

PR:		kern/155491
PR:		kern/155490
PR:		kern/155489
Submitted by:	Galimov Albert <wtfcrap@mail.ru>
Approved by:	bde
Reviewed by:	jhb
MFC after:	1 week
2011-12-13 00:38:50 +00:00
kib
6ecd4a2bb2 Fix a race between getvnode() dereferencing half-constructed file
and dupfdopen().

Reported and tested by:	pho
MFC after:	3 days
2011-11-24 20:34:06 +00:00
ed
9cedd4d52c Improve *access*() parameter name consistency.
The current code mixes the use of `flags' and `mode'. This is a bit
confusing, since the faccessat() function as a `flag' parameter to store
the AT_ flag.

Make this less confusing by using the same name as used in the POSIX
specification -- `amode'.
2011-11-19 06:35:15 +00:00
jhb
9315d9d42c - Split out a kern_posix_fadvise() from the posix_fadvise() system call so
it can be used by in-kernel consumers.
- Make kern_posix_fallocate() public.
- Use kern_posix_fadvise() and kern_posix_fallocate() to implement the
  freebsd32 wrappers for the two system calls.
2011-11-14 18:00:15 +00:00
jhb
78c075174e Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00
kmacy
99851f359e In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
kib
011f42054d Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by:	glebius
Reviewed by:	rwatson
Approved by:	re (bz)
2011-08-16 20:07:47 +00:00
rwatson
11c783ae3f When falloc() was broken into separate falloc_noinstall() and finstall(),
a bug was introduced in kern_openat() such that the error from the vnode
open operation was overwritten before it was passed as an argument to
dupfdopen().  This broke operations on /dev/{stdin,stdout,stderr}.  Fix
by preserving the original error number across finstall() so that it is
still available.

Approved by:	re (kib)
Reported by:	cognet
2011-08-13 16:03:40 +00:00
jonathan
f63d2e9205 Allow Capsicum capabilities to delegate constrained
access to file system subtrees to sandboxed processes.

- Use of absolute paths and '..' are limited in capability mode.
- Use of absolute paths and '..' are limited when looking up relative
  to a capability.
- When a name lookup is performed, identify what operation is to be
  performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP.

With these constraints, openat() and friends are now safe in capability
mode, and can then be used by code such as the capability-mode runtime
linker.

Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc
2011-08-13 09:21:16 +00:00
jonathan
8cb2c0bf4d Only call fdclose() on successfully-opened FDs.
Since kern_openat() now uses falloc_noinstall() and finstall() separately,
there are cases where we could get to cleanup code without ever creating
a file descriptor. In those cases, we should not call fdclose() on FD -1.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-11 13:29:59 +00:00
rwatson
4af919b491 Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
rmacklem
fbb8a5e8ec Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by:	kib
2011-05-22 01:07:54 +00:00
mdf
597ae9f19b Allow VOP_ALLOCATE to be iterative, and have kern_posix_fallocate(9)
drive looping and potentially yielding.

Requested by:	kib
2011-04-19 16:36:24 +00:00
mdf
45c5f27863 Fix a copy/paste whitespace error. 2011-04-18 16:40:47 +00:00
mdf
9c9a32d97b Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by:	-arch (previous version)
2011-04-18 16:32:22 +00:00
kib
eb730d92e4 After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9)
and remove the falloc() version that lacks flag argument. This is done
to reduce the KPI bloat.

Requested by:	jhb
X-MFC-note:	do not
2011-04-01 13:28:34 +00:00
kib
7c2eaa21fe Add support for executing the FreeBSD 1/i386 a.out binaries on amd64.
In particular:
- implement compat shims for old stat(2) variants and ogetdirentries(2);
- implement delivery of signals with ancient stack frame layout and
  corresponding sigreturn(2);
- implement old getpagesize(2);
- provide a user-mode trampoline and LDT call gate for lcall $7,$0;
- port a.out image activator and connect it to the build as a module
  on amd64.

The changes are hidden under COMPAT_43.

MFC after:   1 month
2011-04-01 11:16:29 +00:00