Commit Graph

207 Commits

Author SHA1 Message Date
Mateusz Guzik
721a81c369 vfs: stop duplicating vnode work in audit during path lookup
Duplicating the work was putting an avoidable requirement that the filedesc
lock is held across the entire operation (otherwise by the time audit reads
vnode pointers another thread in the same process can chdir somewhere else,
making audit log things using different vnode than the one which will be
used for actual lookup).

Do the obvious thing and pass down vnodes which will be used.
2020-02-21 01:44:31 +00:00
Mateusz Guzik
e126c5a3e8 vfs: use new capsicum helpers 2020-02-15 01:28:42 +00:00
Mateusz Guzik
6ebab6bad2 vfs: use mac fastpath for lookup, open, read, write, mmap 2020-02-13 22:22:55 +00:00
Kyle Evans
3d62f685d5 namei: preserve errors from fget_cap_locked
Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.

PR:		243839
Reviewed by:	kib (committed form recommended by)
Tested by:	lwhsu
Differential Revision:	https://reviews.freebsd.org/D23479
2020-02-03 18:59:07 +00:00
Kyle Evans
6a5abb1ee5 Provide O_SEARCH
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23247
2020-02-02 16:34:57 +00:00
Mateusz Guzik
90f4ec3328 vfs: save on atomics on the root vnode for absolute lookups
There are 2 back-to-back atomics on the vnode, but we can check upfront if one
is sufficient. Similarly we can handle relative lookups where current working
directory == root directory.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23427
2020-02-01 06:40:35 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Mateusz Guzik
6fa079fc3f vfs: flatten vop vectors
This eliminates the following loop from all VOP calls:

while(vop != NULL && \
    vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
        vop = vop->vop_default;

Reviewed by:	jeff
Tesetd by:	pho
Differential Revision:	https://reviews.freebsd.org/D22738
2019-12-16 00:06:22 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Konstantin Belousov
e28fa55a08 NDFREE(): Fix unlocking for LOCKPARENT|LOCKLEAF and ndp->ni_dvp == ndp->ni_vp.
NDFREE() calculates unlock_dvp after ndp->ni_vp is unlocked and zeroed
out. This makes the comparision of ni_dvp with ni_vp always fail.
Move the calculation of unlock_dvp right after unlock_vp, so that the
code sees correct ni_vp value.

Reproduced by
	   chdir("/usr");
	   open("/..", O_BENEATH | O_RDONLY);

Reported by:	syzkaller
Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20304
2019-05-21 15:12:13 +00:00
Konstantin Belousov
7cdb0b9d82 Fix renameat(2) for CAPABILITIES kernels.
When renameat(2) is used with:
- absolute path for to;
- tofd not set to AT_FDCWD;
- the target exists
kern_renameat() requires CAP_UNLINK capability on tofd, but
corresponding namei ni_filecap is not initialized at all because the
lookup is absolute.  As result, the check was done against empty filecap
and syscall fails erronously.

Fix it by creating a return flags namei member and reporting if the
lookup was absolute, then do not touch to.ni_filecaps at all.

PR:	222258
Reviewed by:	jilles, ngie
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-MFC-note:	KBI breakage
Differential revision:	https://reviews.freebsd.org/D19096
2019-02-08 04:18:17 +00:00
Mateusz Guzik
24d64be4c5 vfs: mostly depessimize NDINIT_ALL
1) filecaps_init was unnecesarily a function call
2) an asignment at the end was preventing tail calling of cap_rights_init

Sponsored by:	The FreeBSD Foundation
2018-12-14 03:55:08 +00:00
Konstantin Belousov
7d2b0bd7d7 If BENEATH is specified, always latch the topping directory vnode.
It is possible that we started with a relative path but during the
lookup, found an absolute symlink.  In this case, BENEATH handling
code needs the latch, but it is too late to calculate it.

While there, somewhat improve the assertions.  Clear the NI_LCF_LATCH
flag when the latch vnode is released, so that asserts know the state.
Assert that there is a latch if we entered beneath+abs path mode,
after the starting point is processed.

Reported by:	wulf
With more input from:	pho
Sponsored by:	The FreeBSD Foundation
2018-11-29 19:13:10 +00:00
Konstantin Belousov
ade85c5eec Allow absolute paths for O_BENEATH.
The path must have a tail which does not escape starting/topping
directory.  The documentation will come shortly, see the man pages
commit message for the reason of separate commit.

Reviewed by:	jilles (previous version)
Discussed with:	emaste
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17714
2018-11-11 00:04:36 +00:00
Konstantin Belousov
4f77f48884 Implement O_BENEATH and AT_BENEATH.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory.  Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by:	jilles (code, previous version of manpages), 0mp (manpages)
Discussed with:	allanjude, emaste, jonathan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17547
2018-10-25 22:16:34 +00:00
Mateusz Guzik
c396945b74 vfs: remove lookup_shared tunable
Reviewed by:	kib, jhb
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17253
2018-09-20 18:25:26 +00:00
Matt Macy
84482abd21 vfs: annotate variables only used by debug builds as __unused 2018-05-19 04:59:39 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Conrad Meyer
38d84d683e vfs_lookup: Allow PATH_MAX-1 symlinks
Previously, symlinks in FreeBSD were artificially limited to PATH_MAX-2.

Add a short test case to verify the change.

Submitted by:	Gaurav Gangalwar <ggangalwar AT isilon.com>
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12589
2017-11-17 19:25:39 +00:00
Tijl Coosemans
11ce4d9f39 When a Linux program tries to access a /path the kernel tries
/compat/linux/path before /path.  Stop following symbolic links when
looking up /compat/linux/path so dead symbolic links aren't ignored.
This allows syscalls like readlink(2) and lstat(2) to work on such links.
And open(2) will return an error now instead of trying /path.
2017-10-15 18:53:21 +00:00
John Baldwin
03f7f17878 Use UMA_ALIGN_PTR instead of sizeof(void *) for zone alignment.
uma_zcreate()'s alignment argument is supposed to be sizeof(foo) - 1,
and uma.h provides a set of helper macros for common types.  Passing
sizeof(void *) results in all of the members being misaligned triggering
unaligned access faults on certain architectures (notably MIPS).

Reported by:	brooks
Obtained from:	CheriBSD
MFC after:	3 days
Sponsored by:	DARPA / AFRL
2017-03-15 18:23:32 +00:00
Konstantin Belousov
aec8391d46 Provide fallback VOP methods for crossmp vnode.
In particular, crossmp vnode might leak into rename code.

PR:	216380
Reported by:	fnacl@protonmail.com
Sponsored by:	The FreeBSD Foundation
X-MFC with:	r309425
2017-01-22 19:36:02 +00:00
Edward Tomasz Napierala
5ec7cde488 Fix bug that would result in a kernel crash in some cases involving
a symlink and an autofs mount request.  The crash was caused by namei()
calling bcopy() with a negative length, caused by numeric underflow:
in lookup(), in the relookup path, the ni_pathlen was decremented too
many times.  The bug was introduced in r296715.

Big thanks to Alex Deiter for his help with debugging this.

Reviewed by:	kib@
Tested by:	Alex Deiter <alex.deiter at gmail.com>
MFC after:	1 month
2017-01-04 14:43:57 +00:00
Mateusz Guzik
5afb134c32 vfs: add vrefact, to be used when the vnode has to be already active
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by:	kib (previous version)
2016-12-12 15:37:11 +00:00
Konstantin Belousov
778aa66a68 Enable lookup_cap_dotdot and lookup_cap_dotdot_nonlocal.
Requested and reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8746
2016-12-12 11:12:04 +00:00
Mateusz Guzik
a2d3554542 vfs: provide fake locking primitives for the crossmp vnode
Since the vnode is only expected to be shared locked, we can save a
little overhead by only pretending we are locking in the first place.

Reviewed by:	kib
Tested by:	pho
2016-12-02 18:03:15 +00:00
Mateusz Guzik
a4ce25b5b0 vfs: fix a whitespace nit in r309307 2016-11-30 02:17:03 +00:00
Mateusz Guzik
1babea0341 vfs: avoid VOP_ISLOCKED in the common case in lookup 2016-11-30 02:14:53 +00:00
Konstantin Belousov
7359fdcf5f Allow some dotdot lookups in capability mode.
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal.  Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.

Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.

Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.

Idea by:	rwatson
Discussed with:	emaste, jonathan, rwatson
Reviewed by:	mjg (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D8110
2016-11-02 12:43:15 +00:00
Konstantin Belousov
1bf6a0900d Remove tautological casts.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-02 12:10:39 +00:00
Konstantin Belousov
ec84693535 Style fixes.
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-02 12:02:31 +00:00
Ed Maste
69a2875821 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
Mateusz Guzik
11d3ad2eab vfs: provide a common exit point in namei for error cases
This shortens the function, adds the SDT_PROBE use for error cases and
consistenly unrefs rootdir last.

Reviewed by:	kib
MFC after:	2 weeks
2016-08-27 22:43:41 +00:00
Edward Tomasz Napierala
411455a8fb Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
Pedro F. Giffuni
e3043798aa sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
Pedro F. Giffuni
b85f65af68 kern: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.
2016-04-15 16:10:11 +00:00
Edward Tomasz Napierala
2a5a08cb38 Refactor the way we restore cn_lkflags; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-12 09:05:43 +00:00
Edward Tomasz Napierala
f69db55151 Remove cn_consume from 'struct componentname'. It was never set to anything
other than 0.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5611
2016-03-12 08:50:38 +00:00
Edward Tomasz Napierala
213ed83855 Fix autofs triggering problem. Assume you have an NFS server,
192.168.1.1, with share "share". This commit fixes a problem
where "mkdir /net/192.168.1.1/share/meh" would return spurious
error instead of creating the directory if the target filesystem
wasn't mounted yet; subsequent attempts would work correctly.

The failure scenario is kind of complicated to explain, but it all
boils down to calling VOP_MKDIR() for the target filesystem (NFS)
with wrong dvp - the autofs vnode instead of the filesystem root
mounted over it.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5442
2016-03-12 07:54:42 +00:00
Andriy Gapon
2f2f522b5d save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
Ed Schouten
8c43e4ccfa Properly return ENOTDIR when calling *at() on a non-vnode.
We already properly return ENOTDIR when calling *at() on a non-directory
vnode, but it turns out that if you call it on a socket, we see EINVAL.
Patch up namei to properly translate this to ENOTDIR.
2015-08-12 16:17:00 +00:00
Mateusz Guzik
318b946321 vfs: cosmetic changes to namei and namei_handle_root
- don't initialize cnp during declaration
- don't test error/!error, compare to 0 instead
2015-07-09 17:17:26 +00:00
Mateusz Guzik
d177f49f6f vfs: simplify error handling in namei
The logic is reorganised so that there is one exit point prior to the
lookup loop. This is an intermediate step to making audit logging
functions use found vnode instead of translating ni_dirfd on their own.

ni_startdir validation is removed. The only in-tree consumer is nfs
which already makes sure it is a directory.

Reviewed by:	kib
2015-07-09 16:32:58 +00:00
Mateusz Guzik
d19ba50e12 vfs: avoid spurious vref/vrele for absolute lookups
namei used to vref fd_cdir, which was immediatley vrele'd on entry to
the loop.

Check for absolute lookup and vref the right vnode the first time.

Reviewed by:	kib
2015-07-09 15:06:58 +00:00
Mateusz Guzik
a03f1b2970 vfs: plug a use-after-free of fd_rdir in namei
fd_rdir vnode was stored in ni_rootdir without refing it in any way,
after which the filedsc lock was being dropped.

The vnode could have been freed by mountcheckdirs or another thread doing
chroot.

VREF the vnode while the lock is held.

Reviewed by:	kib
MFC after:	1 week
2015-07-09 15:06:24 +00:00
Mark Johnston
947401dd50 Move the comment describing namei(9) back to namei()'s definition.
MFC after:	3 days
2015-07-05 22:56:41 +00:00
Konstantin Belousov
72ba3c0822 Fix two issues with lockmgr(9) LK_CAN_SHARE() test, which determines
whether the shared request for already shared-locked lock could be
granted.  Both problems result in the exclusive locker starvation.

The concurrent exclusive request is indicated by either
LK_EXCLUSIVE_WAITERS or LK_EXCLUSIVE_SPINNERS flags.  The reverse
condition, i.e. no exclusive waiters, must check that both flags are
cleared.

Add a flag LK_NODDLKTREAT for shared lock request to indicate that
current thread guarantees that it does not own the lock in shared
mode.  This turns back the exclusive lock starvation avoidance code;
see man page update for detailed description.

Use LK_NODDLKTREAT when doing lookup(9).

Reported and tested by:	pho
No objections from:	attilio
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-11-02 13:10:31 +00:00
Mateusz Guzik
8cc11167fb Plug a memory leak in case of failed lookups in capability mode.
Put common cnp cleanup into one function and use it for this purpose.

MFC after:	1 week
2014-08-24 12:51:12 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00