Commit Graph

201 Commits

Author SHA1 Message Date
mjg
f121d45000 vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
mjg
048a894ebc vfs: flatten vop vectors
This eliminates the following loop from all VOP calls:

while(vop != NULL && \
    vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
        vop = vop->vop_default;

Reviewed by:	jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22738
2019-12-16 00:06:22 +00:00
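A minimal userland sketch of what "flattening" means here, assuming a hypothetical two-slot vector (struct opvec, ov_default, ov_open and ov_close are illustrative names, not the kernel's). Missing slots are filled once from the chain of default vectors, so a later call can jump through the slot directly instead of walking the chain each time:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-slot operation vector; real VOP vectors have many more. */
struct opvec {
	struct opvec *ov_default;	/* fallback vector, modeled on vop_default */
	int (*ov_open)(void);
	int (*ov_close)(void);
};

/* Fill every empty slot from the nearest ancestor that provides it. */
static void
opvec_flatten(struct opvec *v)
{
	const struct opvec *p;

	for (p = v->ov_default; p != NULL; p = p->ov_default) {
		if (v->ov_open == NULL)
			v->ov_open = p->ov_open;
		if (v->ov_close == NULL)
			v->ov_close = p->ov_close;
	}
}

static int base_open(void) { return (1); }
static int base_close(void) { return (2); }
static int fs_open(void) { return (10); }

/* Usage: the filesystem overrides open and inherits close. */
static void
opvec_selfcheck(void)
{
	struct opvec base = { NULL, base_open, base_close };
	struct opvec fs = { &base, fs_open, NULL };

	opvec_flatten(&fs);
	assert(fs.ov_open() == 10);	/* own method kept */
	assert(fs.ov_close() == 2);	/* inherited once, no loop per call */
}
```

After flattening, every call site is a single indirect call; the cost of resolving defaults is paid once at setup time.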
mjg
bcfa67ab8b vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
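The size arithmetic above can be checked with a small model, assuming a common ABI where an enum occupies 4 bytes. The struct and macro names below are hypothetical stand-ins, and the flag bit value is chosen for illustration only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum vtype_model { MV_NON, MV_REG, MV_DIR, MV_LNK };

#define	MVIRF_DOOMED	0x0001	/* hypothetical bit value */

/* Before: the type alone takes a full enum (4 bytes on common ABIs). */
struct hdr_wide {
	enum vtype_model v_type;
};

/* After: a 1-byte type plus a 2-byte flag field fit in the same space. */
struct hdr_packed {
	uint8_t v_type;
	uint16_t v_irflag;
};

static int
hdr_is_doomed(const struct hdr_packed *h)
{
	return ((h->v_irflag & MVIRF_DOOMED) != 0);
}

static void
hdr_selfcheck(void)
{
	struct hdr_packed h = { MV_DIR, 0 };

	/* Holds when enums are int-sized; not guaranteed by the C standard. */
	assert(sizeof(struct hdr_packed) <= sizeof(struct hdr_wide));
	assert(!hdr_is_doomed(&h));
	h.v_irflag |= MVIRF_DOOMED;
	assert(hdr_is_doomed(&h));
	assert(h.v_type == MV_DIR);
}
```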
kib
c84facbfbd NDFREE(): Fix unlocking for LOCKPARENT|LOCKLEAF and ndp->ni_dvp == ndp->ni_vp.
NDFREE() calculates unlock_dvp after ndp->ni_vp is unlocked and zeroed
out. This makes the comparison of ni_dvp with ni_vp always fail.
Move the calculation of unlock_dvp right after unlock_vp, so that the
code sees correct ni_vp value.

Reproduced by
	   chdir("/usr");
	   open("/..", O_BENEATH | O_RDONLY);

Reported by:	syzkaller
Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D20304
2019-05-21 15:12:13 +00:00
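The ordering problem can be modeled in a few lines. This is a simplified sketch, not the NDFREE() source: struct nd_model and both functions are hypothetical, and each function returns whether the directory vnode would be unlocked separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two nameidata fields involved. */
struct nd_model {
	void *ni_dvp;
	void *ni_vp;
};

/* Buggy order: ni_vp is already zeroed when the comparison runs. */
static int
release_buggy(struct nd_model *nd)
{
	int unlock_dvp;

	nd->ni_vp = NULL;				/* vp released and cleared */
	unlock_dvp = (nd->ni_dvp != nd->ni_vp);		/* always true here */
	return (unlock_dvp);
}

/* Fixed order: decide about dvp while ni_vp still holds its value. */
static int
release_fixed(struct nd_model *nd)
{
	int unlock_dvp;

	unlock_dvp = (nd->ni_dvp != nd->ni_vp);
	nd->ni_vp = NULL;
	return (unlock_dvp);
}

static void
ndfree_selfcheck(void)
{
	int node;
	struct nd_model a = { &node, &node };	/* dvp == vp, as for "/.." */
	struct nd_model b = { &node, &node };

	assert(release_buggy(&a) == 1);	/* would unlock the same vnode twice */
	assert(release_fixed(&b) == 0);	/* recognizes dvp == vp */
}
```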
kib
fe9ffad6ed Fix renameat(2) for CAPABILITIES kernels.
When renameat(2) is used with:
- absolute path for to;
- tofd not set to AT_FDCWD;
- the target exists
kern_renameat() requires CAP_UNLINK capability on tofd, but
corresponding namei ni_filecap is not initialized at all because the
lookup is absolute.  As a result, the check was done against an empty
filecap and the syscall fails erroneously.

Fix it by creating a return flags namei member and reporting if the
lookup was absolute, then do not touch to.ni_filecaps at all.

PR:	222258
Reviewed by:	jilles, ngie
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-MFC-note:	KBI breakage
Differential revision:	https://reviews.freebsd.org/D19096
2019-02-08 04:18:17 +00:00
mjg
8598ea893e vfs: mostly depessimize NDINIT_ALL
1) filecaps_init was unnecessarily a function call
2) an assignment at the end was preventing tail calling of cap_rights_init

Sponsored by:	The FreeBSD Foundation
2018-12-14 03:55:08 +00:00
kib
1d9125b01c If BENEATH is specified, always latch the topping directory vnode.
It is possible that we started with a relative path but during the
lookup, found an absolute symlink.  In this case, BENEATH handling
code needs the latch, but it is too late to calculate it.

While there, somewhat improve the assertions.  Clear the NI_LCF_LATCH
flag when the latch vnode is released, so that asserts know the state.
Assert that there is a latch if we entered beneath+abs path mode,
after the starting point is processed.

Reported by:	wulf
With more input from:	pho
Sponsored by:	The FreeBSD Foundation
2018-11-29 19:13:10 +00:00
kib
b609d75f3a Allow absolute paths for O_BENEATH.
The path must have a tail which does not escape starting/topping
directory.  The documentation will come shortly, see the man pages
commit message for the reason of separate commit.

Reviewed by:	jilles (previous version)
Discussed with:	emaste
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17714
2018-11-11 00:04:36 +00:00
kib
de9d57cf38 Implement O_BENEATH and AT_BENEATH.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory.  Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by:	jilles (code, previous version of manpages), 0mp (manpages)
Discussed with:	allanjude, emaste, jonathan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17547
2018-10-25 22:16:34 +00:00
mjg
846a8dd029 vfs: remove lookup_shared tunable
Reviewed by:	kib, jhb
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17253
2018-09-20 18:25:26 +00:00
mmacy
4eacc08586 vfs: annotate variables only used by debug builds as __unused
2018-05-19 04:59:39 +00:00
pfg
4736ccfd9c sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
cem
30b90e55c4 vfs_lookup: Allow PATH_MAX-1 symlinks
Previously, symlinks in FreeBSD were artificially limited to PATH_MAX-2.

Add a short test case to verify the change.

Submitted by:	Gaurav Gangalwar <ggangalwar AT isilon.com>
Reviewed by:	kib
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12589
2017-11-17 19:25:39 +00:00
tijl
5b4d8cee9d When a Linux program tries to access a /path the kernel tries
/compat/linux/path before /path.  Stop following symbolic links when
looking up /compat/linux/path so dead symbolic links aren't ignored.
This allows syscalls like readlink(2) and lstat(2) to work on such links.
And open(2) will return an error now instead of trying /path.
2017-10-15 18:53:21 +00:00
jhb
fd0aeed164 Use UMA_ALIGN_PTR instead of sizeof(void *) for zone alignment.
uma_zcreate()'s alignment argument is supposed to be sizeof(foo) - 1,
and uma.h provides a set of helper macros for common types.  Passing
sizeof(void *) results in all of the members being misaligned triggering
unaligned access faults on certain architectures (notably MIPS).

Reported by:	brooks
Obtained from:	CheriBSD
MFC after:	3 days
Sponsored by:	DARPA / AFRL
2017-03-15 18:23:32 +00:00
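Why the off-by-one matters can be seen with a generic mask-based round-up. This is only an illustration of the "alignment is passed as a mask" convention described above; align_up is a hypothetical helper, not UMA's internal code:

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up using an alignment mask (alignment minus one). */
static uintptr_t
align_up(uintptr_t addr, uintptr_t mask)
{
	return ((addr + mask) & ~mask);
}

static void
align_selfcheck(void)
{
	uintptr_t good_mask = sizeof(void *) - 1;	/* what the API expects */
	uintptr_t bad_mask = sizeof(void *);		/* the mistake being fixed */

	/* A proper mask yields pointer-aligned results. */
	assert(align_up(13, good_mask) % sizeof(void *) == 0);
	/* The size itself is not a valid mask; the result stays misaligned. */
	assert(align_up(13, bad_mask) % sizeof(void *) != 0);
}
```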
kib
eacfb4abea Provide fallback VOP methods for crossmp vnode.
In particular, crossmp vnode might leak into rename code.

PR:	216380
Reported by:	fnacl@protonmail.com
Sponsored by:	The FreeBSD Foundation
X-MFC with:	r309425
2017-01-22 19:36:02 +00:00
trasz
5166e57c9a Fix bug that would result in a kernel crash in some cases involving
a symlink and an autofs mount request.  The crash was caused by namei()
calling bcopy() with a negative length, caused by numeric underflow:
in lookup(), in the relookup path, the ni_pathlen was decremented too
many times.  The bug was introduced in r296715.

Big thanks to Alex Deiter for his help with debugging this.

Reviewed by:	kib@
Tested by:	Alex Deiter <alex.deiter at gmail.com>
MFC after:	1 month
2017-01-04 14:43:57 +00:00
mjg
f03b37f3e8 vfs: add vrefact, to be used when the vnode has to be already active
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by:	kib (previous version)
2016-12-12 15:37:11 +00:00
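The difference between the two acquisition styles can be sketched with C11 atomics. The function names are hypothetical and this is a userland model of the idea, not the kernel's vrefact() implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Inc-not-zero: must loop on compare-and-swap, retrying under contention. */
static bool
ref_acquire_if_not_zero(atomic_uint *cnt)
{
	unsigned int old = atomic_load(cnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(cnt, &old, old + 1))
			return (true);
	}
	return (false);
}

/* Blind increment: valid only when the caller knows the count is nonzero. */
static void
ref_acquire_active(atomic_uint *cnt)
{
	atomic_fetch_add(cnt, 1);
}

static void
ref_selfcheck(void)
{
	atomic_uint live, dead;

	atomic_init(&live, 1);
	atomic_init(&dead, 0);

	assert(ref_acquire_if_not_zero(&live));
	assert(!ref_acquire_if_not_zero(&dead));
	ref_acquire_active(&live);
	assert(atomic_load(&live) == 3);
	assert(atomic_load(&dead) == 0);
}
```

The blind variant is a single atomic add with no retry path, which is where the savings under contention come from.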
kib
127162a8d9 Enable lookup_cap_dotdot and lookup_cap_dotdot_nonlocal.
Requested and reviewed by:	cem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D8746
2016-12-12 11:12:04 +00:00
mjg
168e218f92 vfs: provide fake locking primitives for the crossmp vnode
Since the vnode is only expected to be shared locked, we can save a
little overhead by only pretending we are locking in the first place.

Reviewed by:	kib
Tested by:	pho
2016-12-02 18:03:15 +00:00
mjg
18de920423 vfs: fix a whitespace nit in r309307
2016-11-30 02:17:03 +00:00
mjg
35980d306d vfs: avoid VOP_ISLOCKED in the common case in lookup
2016-11-30 02:14:53 +00:00
kib
a41f4cc9a5 Allow some dotdot lookups in capability mode.
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal.  Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.

Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.

Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.

Idea by:	rwatson
Discussed with:	emaste, jonathan, rwatson
Reviewed by:	mjg (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D8110
2016-11-02 12:43:15 +00:00
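The tracking scheme described above can be sketched as follows, assuming integer ids as stand-ins for directory vnodes. struct dir_tracker and its helpers are hypothetical, and the fixed-size array is a simplification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	TRACK_MAX	32	/* arbitrary limit for this sketch */

/* Directories visited so far, starting with the lookup root. */
struct dir_tracker {
	int ids[TRACK_MAX];
	size_t n;
};

static bool
tracker_add(struct dir_tracker *t, int id)
{
	if (t->n == TRACK_MAX)
		return (false);
	t->ids[t->n++] = id;
	return (true);
}

/* A dotdot result is acceptable only if it was already traversed. */
static bool
tracker_allows(const struct dir_tracker *t, int id)
{
	size_t i;

	for (i = 0; i < t->n; i++) {
		if (t->ids[i] == id)
			return (true);
	}
	return (false);
}

static void
tracker_selfcheck(void)
{
	struct dir_tracker t = { { 0 }, 0 };

	assert(tracker_add(&t, 1));	/* lookup root */
	assert(tracker_add(&t, 2));	/* root/a */
	assert(tracker_add(&t, 3));	/* root/a/b */

	assert(tracker_allows(&t, 2));	/* ".." from b stays inside */
	assert(!tracker_allows(&t, 99));	/* ".." above the root is refused */
}
```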
kib
b9d3dfb1e0 Remove tautological casts.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-02 12:10:39 +00:00
kib
1323c841ed Style fixes.
Discussed with:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-11-02 12:02:31 +00:00
emaste
00b67b15b9 Renumber license clauses in sys/kern to avoid skipping #3
2016-09-15 13:16:20 +00:00
mjg
7b64fd1139 vfs: provide a common exit point in namei for error cases
This shortens the function, adds the SDT_PROBE use for error cases and
consistently unrefs rootdir last.

Reviewed by:	kib
MFC after:	2 weeks
2016-08-27 22:43:41 +00:00
trasz
255ed885fa Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
pfg
28823d0656 sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
pfg
5b3421712d kern: for pointers replace 0 with NULL.
These are mostly cosmetic, no functional change.

Found with devel/coccinelle.
2016-04-15 16:10:11 +00:00
trasz
8804916675 Refactor the way we restore cn_lkflags; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-12 09:05:43 +00:00
trasz
beb648d9cc Remove cn_consume from 'struct componentname'. It was never set to anything
other than 0.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5611
2016-03-12 08:50:38 +00:00
trasz
faec271eeb Fix autofs triggering problem. Assume you have an NFS server,
192.168.1.1, with share "share". This commit fixes a problem
where "mkdir /net/192.168.1.1/share/meh" would return spurious
error instead of creating the directory if the target filesystem
wasn't mounted yet; subsequent attempts would work correctly.

The failure scenario is kind of complicated to explain, but it all
boils down to calling VOP_MKDIR() for the target filesystem (NFS)
with wrong dvp - the autofs vnode instead of the filesystem root
mounted over it.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5442
2016-03-12 07:54:42 +00:00
avg
425c0bb088 save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
ed
e8ba6b4817 Properly return ENOTDIR when calling *at() on a non-vnode.
We already properly return ENOTDIR when calling *at() on a non-directory
vnode, but it turns out that if you call it on a socket, we see EINVAL.
Patch up namei to properly translate this to ENOTDIR.
2015-08-12 16:17:00 +00:00
mjg
03b75d4638 vfs: cosmetic changes to namei and namei_handle_root
- don't initialize cnp during declaration
- don't test error/!error, compare to 0 instead
2015-07-09 17:17:26 +00:00
mjg
2c26f2224b vfs: simplify error handling in namei
The logic is reorganised so that there is one exit point prior to the
lookup loop. This is an intermediate step to making audit logging
functions use found vnode instead of translating ni_dirfd on their own.

ni_startdir validation is removed. The only in-tree consumer is nfs
which already makes sure it is a directory.

Reviewed by:	kib
2015-07-09 16:32:58 +00:00
mjg
5c01a53df7 vfs: avoid spurious vref/vrele for absolute lookups
namei used to vref fd_cdir, which was immediately vrele'd on entry to
the loop.

Check for absolute lookup and vref the right vnode the first time.

Reviewed by:	kib
2015-07-09 15:06:58 +00:00
mjg
5bbdbadc3d vfs: plug a use-after-free of fd_rdir in namei
fd_rdir vnode was stored in ni_rootdir without refing it in any way,
after which the filedsc lock was being dropped.

The vnode could have been freed by mountcheckdirs or another thread doing
chroot.

VREF the vnode while the lock is held.

Reviewed by:	kib
MFC after:	1 week
2015-07-09 15:06:24 +00:00
markj
7cfe35b5a5 Move the comment describing namei(9) back to namei()'s definition.
MFC after:	3 days
2015-07-05 22:56:41 +00:00
kib
cf11d25e18 Fix two issues with lockmgr(9) LK_CAN_SHARE() test, which determines
whether the shared request for already shared-locked lock could be
granted.  Both problems result in the exclusive locker starvation.

The concurrent exclusive request is indicated by either
LK_EXCLUSIVE_WAITERS or LK_EXCLUSIVE_SPINNERS flags.  The reverse
condition, i.e. no exclusive waiters, must check that both flags are
cleared.

Add a flag LK_NODDLKTREAT for shared lock request to indicate that
current thread guarantees that it does not own the lock in shared
mode.  This turns back the exclusive lock starvation avoidance code;
see man page update for detailed description.

Use LK_NODDLKTREAT when doing lookup(9).

Reported and tested by:	pho
No objections from:	attilio
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-11-02 13:10:31 +00:00
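The "both flags must be cleared" condition is easy to get wrong when negating. A small sketch, assuming hypothetical flag values; the second function is an illustrative wrong negation and is not claimed to be the code that was replaced:

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this example. */
#define	M_EXCL_WAITERS	0x1u
#define	M_EXCL_SPINNERS	0x2u
#define	M_EXCL_ANY	(M_EXCL_WAITERS | M_EXCL_SPINNERS)

/* Correct: no exclusive request is pending only if both bits are clear. */
static int
no_exclusive_pending(unsigned int state)
{
	return ((state & M_EXCL_ANY) == 0);
}

/* Wrong negation: true whenever at least one of the bits is clear. */
static int
no_exclusive_pending_wrong(unsigned int state)
{
	return ((state & M_EXCL_ANY) != M_EXCL_ANY);
}

static void
lkshare_selfcheck(void)
{
	assert(no_exclusive_pending(0));
	assert(!no_exclusive_pending(M_EXCL_WAITERS));
	assert(!no_exclusive_pending(M_EXCL_SPINNERS));

	/* The wrong form lets a shared request past a lone waiter. */
	assert(no_exclusive_pending_wrong(M_EXCL_WAITERS));
}
```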
mjg
b110d1e264 Plug a memory leak in case of failed lookups in capability mode.
Put common cnp cleanup into one function and use it for this purpose.

MFC after:	1 week
2014-08-24 12:51:12 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow.
2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating children of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
avg
71889a5eff dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
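The naming convention can be illustrated with a small decoder that turns each pair of underscores into a dash. probe_name_decode is a hypothetical helper written for this note; it is not the function the SDT provider uses:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src to dst, replacing every "__" with '-'; always NUL-terminates. */
static void
probe_name_decode(const char *src, char *dst, size_t dstlen)
{
	size_t o = 0;

	if (dstlen == 0)
		return;
	while (*src != '\0' && o + 1 < dstlen) {
		if (src[0] == '_' && src[1] == '_') {
			dst[o++] = '-';
			src += 2;
		} else {
			dst[o++] = *src++;
		}
	}
	dst[o] = '\0';
}

static void
probe_selfcheck(void)
{
	char buf[32];

	probe_name_decode("entry__done", buf, sizeof(buf));
	assert(strcmp(buf, "entry-done") == 0);

	/* A single underscore is left alone. */
	probe_name_decode("not_found", buf, sizeof(buf));
	assert(strcmp(buf, "not_found") == 0);
}
```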
attilio
7ee4e910ce - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invocation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
kib
f9f4aa68f7 Both vn_close() and VFS_PROLOGUE() evaluate vp->v_mount twice, without
holding the vnode lock; vp->v_mount is checked first for NULL
equality, and then dereferenced if not NULL.  If the vnode is reclaimed
in the meantime, the second evaluation can yield NULL.  Change
VFS_PROLOGUE() to evaluate the mp once, convert MNTK_SHARED_WRITES and
MNTK_EXTENDED_SHARED tests into inline functions.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-11-09 20:30:13 +00:00
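The "evaluate once" pattern can be sketched in userland with C11 atomics, which the kernel code does not necessarily use. All names below (mount_model, vnode_model, M_SHARED_WRITES, vn_shared_writes) are hypothetical; the point is that the pointer is read a single time into a local and only that local is tested and dereferenced:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define	M_SHARED_WRITES	0x1	/* hypothetical flag value */

struct mount_model {
	int flags;
};

struct vnode_model {
	struct mount_model *_Atomic v_mount;	/* may be cleared concurrently */
};

/* Read v_mount once; the NULL test and the dereference use the same value. */
static int
vn_shared_writes(struct vnode_model *vp)
{
	struct mount_model *mp = atomic_load(&vp->v_mount);

	return (mp != NULL && (mp->flags & M_SHARED_WRITES) != 0);
}

static void
vmount_selfcheck(void)
{
	struct mount_model mp = { M_SHARED_WRITES };
	struct vnode_model vn;

	atomic_init(&vn.v_mount, &mp);
	assert(vn_shared_writes(&vn) == 1);

	/* Model a reclaimed vnode: the helper must cope with NULL. */
	atomic_store(&vn.v_mount, (struct mount_model *)NULL);
	assert(vn_shared_writes(&vn) == 0);
}
```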
pjd
667d7255be Fix panic in ktrcapfail() when no capability rights are passed.
While here, correct all consumers to pass NULL instead of 0 as we pass
capability rights as pointers now, not uint64_t.

Reported by:	Daniel Peyrolon
Tested by:	Daniel Peyrolon
Approved by:	re (marius)
2013-09-18 19:26:08 +00:00