Commit Graph

623 Commits

Author SHA1 Message Date
kib
5def9fa2c2 Do not allocate struct statfs on kernel stack.
Right now the size of the structure is 472 bytes on amd64, which is
already large, and stack allocations are undesirable.  With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.

Extracted from:	ino64 work by gleb
Discussed with:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:19:26 +00:00
kib
6d69bbcc31 Some style fixes for getfstat(2)-related code.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:03:35 +00:00
kib
e1aff3b457 The callers of kern_getfsstat(UIO_SYSSPACE) expect that *buf always
returns memory which must be freed, regardless of the error.  Assign
NULL to *buf in case we are not going to allocate any memory due to
invalid mode.

Reported and tested by:	pho
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks (together with r310638)
Differential revision:	https://reviews.freebsd.org/D9042
2017-01-04 16:09:45 +00:00
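
The contract described in e1aff3b457 (an out-pointer that the caller may always free, even on error) can be illustrated in userland. This is only a sketch under stated assumptions: `sk_get_list()`, its mode values 1 and 2, and the fixed element count are hypothetical and are not the kernel's kern_getfsstat() interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: on every return path *buf is either NULL or
 * memory from calloc(), so the caller can free(*buf) unconditionally.
 */
int
sk_get_list(int mode, int **buf, size_t *countp)
{
	assert(buf != NULL && countp != NULL);

	if (mode != 1 && mode != 2) {
		/* Invalid mode: nothing is allocated, so say so explicitly. */
		*buf = NULL;
		*countp = 0;
		return (EINVAL);
	}
	*buf = calloc(4, sizeof(**buf));
	if (*buf == NULL) {
		*countp = 0;
		return (ENOMEM);
	}
	*countp = 4;
	return (0);
}
```

A caller can then write `error = sk_get_list(mode, &p, &n); ...; free(p);` without checking which error path was taken, because `free(NULL)` is a no-op.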
kib
778442cef7 There is no need to use temporary statfs buffer for fsid obliteration
and prison enforcement.  Do it on the caller buffer directly.

Besides eliminating memory copies, this change also removes large
structure from the kernel stack.

Extracted from:	ino64 work by gleb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:59:23 +00:00
kib
9a7682de00 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:49:48 +00:00
kib
577892f66a Move common code from kern_statfs() and kern_fstatfs() into a new helper.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-02 18:20:22 +00:00
jhb
9105298a66 Rename the 'flags' argument to getfsstat() to 'mode' and validate it.
This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed.  Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.

This is a followup to r308088.

Reviewed by:	kib
MFC after:	1 month
2016-12-27 20:21:11 +00:00
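
The difference between treating the argument as a bitmask and as a single value can be sketched as follows. The constants `SK_MNT_WAIT` and `SK_MNT_NOWAIT` and the function `sk_check_mode()` are hypothetical stand-ins chosen for illustration; the real values are defined in `<sys/mount.h>` and the real check lives in the kernel.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the two accepted mode values. */
enum { SK_MNT_WAIT = 1, SK_MNT_NOWAIT = 2 };

static_assert(SK_MNT_WAIT != SK_MNT_NOWAIT, "modes must be distinct");

/*
 * Accept exactly one of the known values.  A bitmask-style test such as
 * (mode & SK_MNT_WAIT) would wrongly accept SK_MNT_WAIT | SK_MNT_NOWAIT.
 */
int
sk_check_mode(int mode)
{
	switch (mode) {
	case SK_MNT_WAIT:
	case SK_MNT_NOWAIT:
		return (0);
	default:
		return (EINVAL);
	}
}
```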
mjg
aa6fdf05e8 vfs: use vrefact in getcwd and fchdir
2016-12-12 19:16:35 +00:00
kib
a41f4cc9a5 Allow some dotdot lookups in capability mode.
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal.  Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.

Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.

Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.

Idea by:	rwatson
Discussed with:	emaste, jonathan, rwatson
Reviewed by:	mjg (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8110
2016-11-02 12:43:15 +00:00
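
The tracking idea in a41f4cc9a5 (record the directories traversed, then accept a dotdot step only if its result is in the recorded list) can be sketched in userland. Everything here is an assumption for illustration: directories are represented by plain integer ids rather than vnodes, and `struct sk_tracker`, `sk_track()`, and `sk_dotdot_allowed()` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_TRACKED 32

/* Directories visited so far, starting with the lookup root. */
struct sk_tracker {
	int	dirs[SK_MAX_TRACKED];
	size_t	n;
};

/* Record a directory; returns false when the list is full. */
bool
sk_track(struct sk_tracker *t, int dir)
{
	assert(t != NULL);
	if (t->n == SK_MAX_TRACKED)
		return (false);
	t->dirs[t->n++] = dir;
	return (true);
}

/*
 * A dotdot step is allowed only if the directory it lands in was already
 * traversed, which means it is at or below the lookup root.
 */
bool
sk_dotdot_allowed(const struct sk_tracker *t, int parent)
{
	assert(t != NULL);
	for (size_t i = 0; i < t->n; i++) {
		if (t->dirs[i] == parent)
			return (true);
	}
	return (false);
}
```

A linear scan is used because the list is short in this sketch; it says nothing about how the kernel stores or searches its list.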
trasz
2c0de38912 Fix getfsstat(2) with MNT_WAIT to not skip filesystems that are in the
process of being unmounted.  Previously it would skip them, even if the
unmount eventually failed, e.g. due to the filesystem being busy.

This behaviour broke autounmountd(8) - if you tried to manually unmount
a mounted filesystem, using 'automount -u', and the autounmountd attempted
to refresh the filesystem list in that very moment, it would conclude that
the filesystem got unmounted and not try to unmount it afterwards.

Reviewed by:	kib@
Tested by:	pho@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8030
2016-11-02 09:43:19 +00:00
trasz
5ea37b9562 Fix getfsstat(2) handling of flags. The 'flags' argument is an enum,
not a bitfield. For the intended usage - being passed either MNT_WAIT,
or MNT_NOWAIT - this shouldn't introduce any changes in behaviour.

Reviewed by:	jhb@
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8373
2016-10-29 12:38:30 +00:00
emaste
00b67b15b9 Renumber license clauses in sys/kern to avoid skipping #3
2016-09-15 13:16:20 +00:00
ed
b877ee1f8a Eliminate use of sys_fsync() and sys_fdatasync().
Make the kern_fsync() function public, so that it can be used by other
parts of the kernel. Fix up existing consumers to make use of it.

Requested by:	kib
2016-08-15 20:11:52 +00:00
kib
aa6b4fc56a Add an implementation of fdatasync(2).
The syscall is a trivial wrapper around new VOP_FDATASYNC(), sharing
code with fsync(2).  For all filesystems, this commit provides the
implementation which delegates the work of VOP_FDATASYNC() to
VOP_FSYNC().  This is functionally correct but not efficient.

This is not yet a POSIX-compliant implementation, because it does not
ensure that queued AIO requests are completed before returning.

Reviewed by:	mckusick
Discussed with:	avg (ZFS), jhb (AIO part)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7471
2016-08-15 19:08:51 +00:00
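
The delegation described in aa6b4fc56a (a data-sync operation that falls back to a full sync when nothing more specific exists) can be sketched with a table of function pointers. This is a userland analogy under stated assumptions: `struct sk_file_ops`, `sk_fdatasync()`, and `sk_count_fsync()` are hypothetical and do not mirror the VOP interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-filesystem operations table. */
struct sk_file_ops {
	int	(*fsync)(void *obj);		/* required */
	int	(*fdatasync)(void *obj);	/* optional, may be NULL */
};

/*
 * Use the specialised operation when one is provided; otherwise fall back
 * to the full sync, which is correct but does more work than necessary.
 */
int
sk_fdatasync(const struct sk_file_ops *ops, void *obj)
{
	assert(ops != NULL && ops->fsync != NULL);
	if (ops->fdatasync != NULL)
		return (ops->fdatasync(obj));
	return (ops->fsync(obj));
}

/* Example fsync implementation: counts calls through an int pointer. */
int
sk_count_fsync(void *obj)
{
	++*(int *)obj;
	return (0);
}
```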
kib
53c82a1389 Do not allow creation of char or block special nodes with VNOVAL dev_t.
As was reported on http://seclists.org/oss-sec/2016/q3/68, tmpfs code
contains an assertion that rdev != VNOVAL.  On FreeBSD, there are no other
consequences except triggering the assert.  To be compatible with
systems where device nodes have some significance, reject mknod(2)
call with dev == VNOVAL at the syscall level.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-07-15 09:23:18 +00:00
rwatson
f003d8da48 Audit the file-descriptor number argument for openat(2). Remove a comment
about the desirability of auditing the number, as it was in fact in the
wrong place (in the common path for open(2) and openat(2), and only the
latter accepts a file-descriptor argument).  Where other ABIs support
openat(2), it may be necessary to do additional argument auditing as it is
not performed in kern_openat(9).

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-10 09:50:21 +00:00
glebius
a0c5a05de8 Fix kernel stack disclosures in the Linux and 4.3BSD compat layers.
Submitted by:	CTurt
Security:	SA-16:20
Security:	SA-16:21
2016-05-31 16:56:30 +00:00
jhb
1b87e4306e Simplify AIO initialization now that it is standard.
- Mark AIO system calls as STD and remove the helpers to dynamically
  register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
  an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
  the pathconf() system call.  fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5589
2016-03-09 19:05:11 +00:00
markj
9abb1836d9 Improve error handling for posix_fallocate(2) and posix_fadvise(2).
- Set td_errno so that ktrace and dtrace can obtain the syscall error
  number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
  not intended to be returned to userland.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425
2016-02-25 19:58:23 +00:00
mckusick
3ddfa78dde Clarify a comment in kern_openat() about the use of falloc_noinstall().
Suggested by: Steve Jacobson
2016-02-07 01:04:47 +00:00
trasz
a6632d64f5 The freebsd4_getfsstat() was broken in r281551 to always return 0 on success.
All versions of getfsstat(2) are supposed to return the number of [o]statfs
structs in the array that was copied out.

Also fix missing bounds checking and signed comparison of unsigned types.

Submitted by:	bde@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-11-20 14:08:12 +00:00
markj
d6b5b38ff0 Revert r288628 and instead fix a discrepancy between the posix_fadvise(2)
man page and POSIX: posix_fadvise(2) returns an error number on failure.

Reported by:	jilles
MFC after:	1 week
2015-10-03 22:27:14 +00:00
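
From the caller's side, the convention in d6b5b38ff0 means the result of posix_fadvise(2) is compared against error numbers directly instead of being tested for -1 and then reading `errno`. A minimal sketch, assuming a platform that provides `posix_fadvise()` in `<fcntl.h>`; the wrapper names `sk_fadvise_bad_fd()` and `sk_fadvise_demo()` are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Call posix_fadvise() on a descriptor that cannot be valid and hand back
 * its return value, which is the error number itself.
 */
int
sk_fadvise_bad_fd(void)
{
	return (posix_fadvise(-1, 0, 0, POSIX_FADV_NORMAL));
}

/* Usage: the failure is reported as a positive errno value, never as -1. */
void
sk_fadvise_demo(void)
{
	int error = sk_fadvise_bad_fd();

	assert(error != -1);
	assert(error != 0);
}
```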
markj
f73def85e6 The return value of posix_fadvise(2) is just an error status, so
sys_posix_fadvise() should simply return the errno (or 0) to syscallenter()
rather than setting a return value.

MFC after:	1 week
2015-10-03 19:37:41 +00:00
markj
6348241c12 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
avg
425c0bb088 save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
ed
066f63003b Decompose linkat()/renameat() rights to source and target.
To make it easier to understand how Capsicum interacts with linkat() and
renameat(), rename the rights to CAP_{LINK,RENAME}AT_{SOURCE,TARGET}.

This also addresses a shortcoming in Capsicum, where it isn't possible
to disable linking to files stored in a directory. Creating hardlinks
essentially makes it possible to access files with additional rights.

Reviewed by:	rwatson, wblock
Differential Revision:	https://reviews.freebsd.org/D3411
2015-08-27 15:16:41 +00:00
bz
782cc51566 Try to unbreak the build after r285390 removing the obsolete static
declaration.
2015-07-12 00:26:22 +00:00
mjg
c71e9ab863 Move chdir/chroot-related fdp manipulation to kern_descrip.c
Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by:	kib
2015-07-11 16:19:11 +00:00
mjg
1a3e7a935e Replace struct filedesc argument in getvnode with struct thread
This is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
mjg
82d355a2e3 Tidy up sys_umask a little bit
Consistently use saved fdp pointer as it cannot change. If it could change the
code would be already incorrect.

No functional changes.
2015-05-18 13:43:33 +00:00
kib
2254748ed0 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove WITHOUT_SYSCALL_COMPAT build option, whose only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
trasz
009c656eaf Rewrite linprocfs_domtab() as a wrapper around kern_getfsstat(). This
adds missing jail and MAC checks.

Differential Revision:	https://reviews.freebsd.org/D2193
Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-04-15 09:13:11 +00:00
jilles
308eebbd52 utimensat: Correct Capsicum required capability rights.
2015-04-04 21:47:54 +00:00
mjg
0a219ba739 filedesc: simplify fget_unlocked & friends
Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.

Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.

Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
obtained from stable & current state.

Reviewed by:	silence on -hackers
2015-02-17 23:54:06 +00:00
jilles
67db24d0f2 Add futimens and utimensat system calls.
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision:	https://reviews.freebsd.org/D1426
Submitted by:	pluknet (partially)
Reviewed by:	delphij, pluknet, rwatson
Relnotes:	yes
2015-01-23 21:07:08 +00:00
kib
77c9d3f4e8 The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache thrashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitly knows about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter().  Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantics is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by:	Chris Torek <torek@pi-coral.com>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-18 10:01:12 +00:00
jkim
66032933f9 Correct a typo to fix chown(2). It was broken since r274476.
Pointy hat to:	kib
X-MFC-With:	r274476
2014-11-13 23:51:13 +00:00
kib
b4ef709604 Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-13 18:01:51 +00:00
kib
e257542e11 Remove fossil. It has been present in 4.4Lite2, but its use was
removed for some time.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-13 17:43:37 +00:00
kib
ff19294d91 For posix_fallocate(2) and posix_fadvise(2), return ESPIPE when
underlying file does not have DFLAG_SEEKABLE set [1].

For posix_fallocate(2), simplify error handling logic.  Do return when
fp is not yet referenced.

Noted by:	bde [1]
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-12 17:31:38 +00:00
mjg
b43c178ac3 Reduce nesting in vn_access.
No functional changes.
2014-10-22 01:53:00 +00:00
mjg
a5dd454a60 Avoid crdup when possible in kern_accessat.
While here tidy up a little.
2014-10-22 01:09:07 +00:00
kib
0529718f1d In kern_linkat() and kern_renameat(), do not call namei(9) while
holding a write reference on the filesystem.  Try to get the write
reference in a non-blocking way after all vnodes are resolved; on failure,
drop all locks and retry after waiting for the suspension to end.

The VFS_UNMOUNT() methods for UFS and tmpfs try to establish
suspension on unmount, while covered vnode is locked by VFS, which
prevents namei() from stepping over the mount point.  The thread doing
namei() sleeps on the covered vnode lock, owning the write ref.

Reported by:	bdrewery
Tested by:	bdrewery (previous version), pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-09-25 20:42:25 +00:00
ngie
356c289c25 Validate the mode argument in access, eaccess, and faccessat for optional
POSIX compliance and to improve compatibility with Linux and NetBSD

The issue was identified with lib/libc/sys/t_access:access_inval from
NetBSD

Update the manpage accordingly

PR: 181155
Reviewed by: jilles (code), jmmv (code), wblock (manpage), wollman (code)
MFC after: 4 weeks
Phabric: D678 (code), D786 (manpage)
Sponsored by: EMC / Isilon Storage Division
2014-09-16 00:56:47 +00:00
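
The kind of validation described in 356c289c25 can be sketched as a small predicate: the mode is either `F_OK` or a combination of `R_OK`, `W_OK`, and `X_OK`, and anything else is rejected. The function name `sk_check_amode()` is hypothetical, and this is a userland illustration rather than a copy of the kernel check.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* F_OK must not overlap the permission bits for the test below to work. */
static_assert((F_OK & (R_OK | W_OK | X_OK)) == 0, "F_OK overlaps access bits");

/*
 * Return 0 for F_OK or any OR-combination of R_OK, W_OK and X_OK;
 * return EINVAL when any other bit is set.
 */
int
sk_check_amode(int amode)
{
	if (amode != F_OK && (amode & ~(R_OK | W_OK | X_OK)) != 0)
		return (EINVAL);
	return (0);
}
```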
kib
32a7383c85 Check for the cross-device cross-link attempt in the VFS, instead of
forcing filesystem VOP_LINK() methods to repeat the code.  In
tmpfs_link(), remove redundant check for the type of the source,
already done by VFS.

Note that NFS server already performs this check before calling
VOP_LINK().

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-16 14:04:46 +00:00
kib
7b68dd9333 In kern_linkat(), avoid passing doomed vnode to the VOP.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 08:41:13 +00:00
mjg
d74326bc91 fd: replace fd_nfiles with fd_lastfile where appropriate
fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.

A few places are left unpatched for now.

MFC after:	1 week
2014-06-22 01:31:55 +00:00
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
kib
f7d0f51921 The auio structure is only initialized when the vnode is symlink,
avoid reading from it otherwise.

Submitted by:	Conrad Meyer <cemeyer@uw.edu>
MFC after:	1 week
2014-03-12 10:23:51 +00:00
kib
f0cb8e7d88 The posix_madvise(3) and posix_fadvise(2) should return error on
failure, same as posix_fallocate(2).

Noted by:	Bob Bishop <rb@gid.co.uk>
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-01-30 18:04:39 +00:00