Commit Graph

982 Commits

Author SHA1 Message Date
Konstantin Belousov
78022527bb Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
Kirk McKusick
c11cbfd957 Update the main loop in the flushbuflist() routine to properly select
buffers for flushing when requested to flush both normal and extended
attributes buffers.

Sponsored by: Netflix
2019-03-11 22:42:33 +00:00
Michael Tuexen
735835ed5c Avoid overfow in vtruncbuf()
Using daddr_t instead of int avoids trunclbn to become negative when it
shouldn't.
This isssue was found by running syzkaller.

Reviewed by:		mckusick, kib, markj
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18763
2019-01-08 09:04:27 +00:00
Kirk McKusick
c0029546f8 When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by:  Christopher Krah <krah@protonmail.com>
Reported as:  FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by:  kib
MFC after:    1 week
Sponsored by: Netflix

M    sys/fs/ext2fs/ext2_vnops.c
M    sys/kern/vfs_subr.c
M    sys/ufs/ffs/ffs_snapshot.c
M    sys/ufs/ufs/ufs_vnops.c
2018-12-27 07:18:53 +00:00
Konstantin Belousov
6c59824b31 Properly test for vmio buffer in bnoreuselist().
The presence of allocated v_object does not imply that the buffer is
necessary VMIO kind.  Buffer might has been allocated before the
object created, then the buffer is malloced.  Although we try to avoid
such situation, it seems to be still legitimate.

Reported and tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-23 18:52:02 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mark Johnston
1436ff1ebb Typo.
X-MFC with:	r337974
2018-08-17 16:07:06 +00:00
Mark Johnston
3ccbdc8254 Add INVARIANTS-only fences around lockless vnode refcount updates.
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update.  The fences are needed for
the assertions to be correct in the face of store reordering.

Reported and tested by:	jhibbits
Reviewed by:	kib, mjg
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16756
2018-08-17 15:41:01 +00:00
Justin Hibbits
3f9e1fc8ee Revert r334708
This is the wrong place to put the barrier.
Requested by:	kib,mjg
2018-06-06 15:12:19 +00:00
Justin Hibbits
32c369f40c Add a memory barrier after taking a reference on the vnode holdcnt in _vhold
This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.

On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.

It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.
2018-06-06 12:57:11 +00:00
Matt Macy
84482abd21 vfs: annotate variables only used by debug builds as __unused 2018-05-19 04:59:39 +00:00
Jamie Gritton
0e5c6bd436 Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of.  This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.

Reviewed by:	kib
Differential Revision:	D14681
2018-05-04 20:54:27 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Andriy Gapon
f4043145f2 ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code.  Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.

Also, the change required making vfs_refcount_release_if_not_last()
public.  I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now.  While making the move I've dropped the
vfs_ prefix.

Reviewed by:	mjg
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D14869
2018-03-28 08:55:31 +00:00
Jeff Roberson
06220fa737 Further parallelize the buffer cache.
Provide multiple clean queues partitioned into 'domains'.  Each domain manages
its own bufspace and has its own bufspace daemon.  Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.

Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.

Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.

Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14274
2018-02-20 00:06:07 +00:00
Kirk McKusick
5a35a04255 One of the vnode fields listed by vn_printf is the union of pointers
whose type depends on the type of vnode. Correct vn_printf so that
it correctly identifies the name of the pointer that it is printing.

Submitted by: Andreas Longwitz <longwitz at incore.de>
MFC after: 1 week
2018-01-31 22:49:50 +00:00
Mateusz Guzik
31c2c6e95e vfs: tidy up vdrop
Skip vfs_refcount_release_if_not_last if the interlock is held and just
go straight to refcount_release.

While here do cosmetic rearrangement of _vhold to better show it contains
equivalent behaviour.
2018-01-12 13:39:02 +00:00
Eitan Adler
caa7e52f3f kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
Alexander Kabaev
151ba7933a Do pass removing some write-only variables from the kernel.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
2017-12-25 04:48:39 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Mark Johnston
a3e8a25a52 Avoid the nbp lookup in the final loop iteration in flushbuflist().
The end of the loop must re-lookup the next buf since the bufobj lock
is dropped in the loop body. If the lookup fails, the loop is restarted.
This mechanism non-obviously also terminates the loop when the end of
the buf list is reached. Split up the two loops termination cases to
make the code a bit less fragile. No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12730
2017-10-20 14:56:13 +00:00
Mark Johnston
fa00affd18 Fix a racy VI_DOOMED check in MNT_VNODE_FOREACH_ALL().
MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes,
but the VI_DOOMED check it used was done without the vnode interlock
held, so it could race with a concurrent vgone().

Submitted by:	Don Morris <don.morris@isilon.com>
Reviewed by:	kib, mckusick
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12704
2017-10-17 19:41:45 +00:00
Konstantin Belousov
5bf949377e For unlinked files, do not msync(2) or sync on the vnode deactivation.
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.

Reported and tested by:	tjil
PR:	222356
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12411
2017-09-19 16:46:37 +00:00
Bryan Drewery
8359a6b7b3 Allow vdrop() of a vnode not yet on the per-mount list after r306512.
The old code allowed calling vdrop() before insmntque() to place the vnode back
onto the freelist for later recycling.  Some downstream consumers may rely on
this support.  Normally insmntque() failing is fine since is uses vgone() and
immediately frees the vnode rather than attempting to add it to the freelist if
vdrop() were used instead.

Also assert that vhold() cannot be used on such a vnode.

Reviewed by:	kib, cem, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12126
2017-08-28 19:29:51 +00:00
Konstantin Belousov
b59ea73029 Allow vinvalbuf() to operate with the shared vnode lock.
This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use.  We only need
that all buffers existed at the time of the function start were
flushed.  In fact, only one assert has to be relaxed.

In collaboration with:	pho
Reviewed by:	rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
X-Differential revision:	https://reviews.freebsd.org/D12083
2017-08-20 10:07:45 +00:00
Gleb Smirnoff
0c3c207ffd For UNIX sockets make vnode point not to the socket, but to the UNIX PCB,
since the latter is the thing that links together VFS and sockets.

While here, make the union in the struct vnode anonymous.
2017-06-02 17:31:25 +00:00
Konstantin Belousov
391aba32e6 mnt_vnode_next_active: use conventional lock order when trylock fails.
Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.

Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.

Submitted by:	Tom Rix <trix@juniper.net>
Tested by:	pho
Differential revision:	https://reviews.freebsd.org/D10692
2017-05-15 10:02:45 +00:00
Konstantin Belousov
0226f65940 Add V_VMIO flag for vinvalbuf(9) to indicate that the flush request
was issued during VM-initiated i/o (pageout), so that the function
does not try to flush or remove pages or wait for the vm object
paging-in-progress counter.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:57:53 +00:00
Brooks Davis
db3625531e Correct a kernel stack leak in 32-bit compat when vfc_name is short.
Don't zero unused pointer members again.

Per discussion with secteam we are not issuing an advisory for this
issue as we have no current evidence it leaks exploitable information.

Reviewed by:	rwatson, glebius, delphij
MFC after:	1 day
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D10227
2017-04-04 17:32:08 +00:00
Ian Lepore
e3f87f6c70 Change 'Hz' back to 'HZ'... it's referring to the kernel config option
named HZ, not being used as an abbreviation of the unit of measure.
2017-03-12 18:07:03 +00:00
Ian Lepore
8a3966405e Correct the abbreviations for microseconds (us, not ms), and for Hz (not HZ). 2017-03-12 17:43:45 +00:00
Mateusz Guzik
2d78a5531e vfs: use atomic_fcmpset in vfs_refcount_* 2017-02-05 03:23:16 +00:00
Edward Tomasz Napierala
8acac5a9f5 Improve debugging printf. 2017-01-22 15:27:14 +00:00
Mateusz Guzik
067115e050 vfs: hide the getvnode NULL mp message behind DIAGNOSTIC
Since crossmp vnode changes the message was being printed on each boot.

Reported by:	trasz
Discussed with:	kib
2017-01-21 16:59:50 +00:00
Mateusz Guzik
41b0046a4d vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9)
Reviewed by:	kib
2016-12-31 19:59:31 +00:00
Mateusz Guzik
5afb134c32 vfs: add vrefact, to be used when the vnode has to be already active
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by:	kib (previous version)
2016-12-12 15:37:11 +00:00
Mark Johnston
64910ddbff Launder VPO_NOSYNC pages upon vnode deactivation.
As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:

1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
   reclaimed, and this work may be unfairly deferred to the vnlru process
   or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
   the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
   the laundry thread needs to launder pages from an unreferenced such
   vnode, it will reactivate and deactivate the vnode with each laundering,
   potentially resulting in a large number of expensive scans.

Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are
released.

Reviewed by:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8641
2016-11-26 21:00:27 +00:00
Mateusz Guzik
c6c44ff7eb vfs: clear the tmp free list flag before taking the free vnode list lock
Safe access is already guaranteed because of the mnt_listmx lock.
2016-10-08 13:36:59 +00:00
Bryan Drewery
32641585a9 vrefl: Assert that the interlock is held.
Sponsored by:	Dell EMC Isilon
MFC after:	2 weeks
2016-10-06 18:10:19 +00:00
Bryan Drewery
5a22c9582c Add vrecyclel() to vrecycle() a vnode with the interlock already held.
Obtained from:	OneFS
Sponsored by:	Dell EMC Isilon
MFC after:	2 weeks
2016-10-06 18:09:22 +00:00
Bryan Drewery
0617f64ec6 Correct some comments after r294299.
Sponsored by:	Dell EMC Isilon
2016-10-04 21:44:20 +00:00
Mateusz Guzik
5bb81f9b2d vfs: batch free vnodes in per-mnt lists
Previously free vnodes would always by directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected
first.

syncer runs always return the batch regardless of its size.

While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.

Reviewed by:	kib
Tested by:	pho
2016-09-30 17:27:17 +00:00
Mateusz Guzik
8660b707ff vfs: remove the __bo_vnode field from struct vnode
The pointer can be obtained using __containerof instead.

Reviewed by:	kib
2016-09-30 17:11:03 +00:00
Ed Maste
69a2875821 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
Edward Tomasz Napierala
f83cc0aae4 Print vnode details when vnode locking assertion gets triggered.
MFC after:	1 month
2016-08-12 22:20:52 +00:00
Edward Tomasz Napierala
411455a8fb Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
Edward Tomasz Napierala
7b255097eb Remove unused - never actually implemented - vnode lock types
from vnode_if.src.

MFC after:	1 month
2016-08-04 13:45:18 +00:00
Konstantin Belousov
725496ced5 Fix grammar.
Submitted by:	alc
MFC after:	2 weeks
2016-07-11 17:04:22 +00:00
Konstantin Belousov
19efd8a5a8 In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO.  For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers.  BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone).  Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by:	David Cross <dcrosstech@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-11 14:19:09 +00:00
Konstantin Belousov
d12d6a8476 Remove racy assert. The thread which changes vnode usecount from 0 to 1
does it under the vnode interlock, but the interlock is not owned by the
asserting thread.  As result, we might read increased use counter but also
still see VI_OWEINACT.

In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by:	The FreeBSD Foundation (kib)
Approved by:	re (gjb)
2016-07-03 01:56:48 +00:00