freebsd-dev

Author	SHA1	Message	Date
Thomas Munro	a5e284038e	open(2): Add O_DSYNC flag. POSIX O_DSYNC means that writes include an implicit fdatasync(2), just as O_SYNC implies fsync(2). VOP_WRITE() functions that understand the new IO_DATASYNC flag can act accordingly, but we'll still pass down IO_SYNC so that file systems that don't understand it will continue to provide the stronger O_SYNC behaviour. Flag also applies to fcntl(2). Reviewed by: kib, delphij Differential Revision: https://reviews.freebsd.org/D25090	2021-01-08 13:15:56 +13:00
Chuck Silvers	11403bdeb4	vfs: fix rangelock range in vn_rdwr() for IO_APPEND vn_rdwr() must lock the entire file range for IO_APPEND just like vn_io_fault() does for O_APPEND. Reviewed by: kib, imp, mckusick Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D28008	2021-01-07 13:37:35 -08:00
Mateusz Guzik	3e506a67bb	vfs: add v_irflag accessors Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27793	2021-01-03 06:50:06 +00:00
Konstantin Belousov	99c66d3acf	vn_read_from_obj(): fix handling of doomed vnodes. There is no reason why vp->v_object cannot be NULL. If it is, it's fine, handle it by delegating to VOP_READ(). Tested by: pho Reviewed by: markj, mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27327	2020-11-26 18:13:33 +00:00
Konstantin Belousov	3b1f974bfb	Make max ticks for pause in vn_lock_pair() adjustable at runtime. Reduce default value from hz / 10 to hz / 100. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation	2020-11-26 18:00:26 +00:00
Kirk McKusick	e75f0f2b48	Only attempt a VOP_UNLOCK() when the vn_lock() has been successful. No MFC as this code is not present in 12-stable. Reported by: Peter Holm Reviewed by: Mateusz Guzik Tested by: Peter Holm Sponsored by: Netflix	2020-11-20 20:22:01 +00:00
Konstantin Belousov	7cde2ec4fd	Implement vn_lock_pair(). In collaboration with: pho Reviewed by: mckusick (previous version), markj (previous version) Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136	2020-11-13 09:31:57 +00:00
Mateusz Guzik	f6dd1aefb7	vfs: group mount per-cpu vars into one struct While here move frequently read stuff into the same cacheline. This shrinks struct mount by 64 bytes. Tested by: pho	2020-11-09 23:02:13 +00:00
Mateusz Guzik	eebc2e450f	vfs: add NDREINIT to facilitate repeated namei calls struct nameidata mixes caller arguments, internal state and output, which can be quite error prone. Recent addition of valdiating ni_resflags uncovered a caller which could repeatedly call namei, effectively operating on partially populated state. Add bare minimium validation this does not happen. The real fix would decouple aforementioned state. Reported by: pho Tested by: pho (different variant)	2020-10-29 12:56:02 +00:00
Mateusz Guzik	deb1339f3f	vfs: fix a panic when truncating comming from copy_file_range Truncating requires an exclusive lock, but it was not taken if the filesystem indicates support for shared writes. This only concerns ZFS. In particular fixes cp of files which have trailing holes. Reported by: bdrewery	2020-10-09 20:31:42 +00:00
Rick Macklem	19fe23fa2b	Make vn_generic_copy_file_range() interruptible via a signal. Without this patch, when vn_generic_copy_file_range() is doing a large copy, it will remain in the function for a considerable amount of time, delaying handling of any outstanding signals until the copy completes. This patch adds checks for signals that need to be processed after each successful data copy cycle. When sig_intr() returns non-zero, vn_generic_copy_file_range() will return. The check "if (len < savlen)" ensures that some data has been copied, so that progress will be made. Note that, since copy_file_range(2) is allowed to return fewer bytes copied than requested, it will never return EINTR/ERESTART when sig_intr() returns non-zero. Reviewed by: kib, asomers Differential Revision: https://reviews.freebsd.org/D26620	2020-10-09 01:04:28 +00:00
Rick Macklem	961afe3c99	Clip the "len" argument to vn_generic_copy_file_range() at a hole size boundary. By clipping the len argument of vn_generic_copy_file_range() to end at an exact multiple of hole size, holes are more likely to be maintained during the copy. A hole can still straddle the boundary at the end of the copy range, resulting in a block being allocated in the output file as it is being grown in size, but this will reduce the likelyhood of this happening. While here, also modify setting of blksize to better handle the case where _PC_MIN_HOLE_SIZE is returned as 1. Reviewed by: asomers Differential Revision: https://reviews.freebsd.org/D26570	2020-10-01 00:33:44 +00:00
Rick Macklem	164aa1e941	Make copy_file_range(2) Linux compatible for overflow of offset + len. Without this patch, if a call to copy_file_range(2) specifies an input file offset + len that would wrap around, EINVAL is returned. I thought that was the Linux behaviour, but recent testing showed that Linux accepts this case and does the copy_file_range() to EOF. This patch changes the FreeBSD code to exhibit the same behaviour as Linux for this case. Reviewed by: asomers, kib Differential Revision: https://reviews.freebsd.org/D26569	2020-09-30 02:18:09 +00:00
Konstantin Belousov	1317da4349	Add O_RESOLVE_BENEATH and AT_RESOLVE_BENEATH to mimic Linux' RESOLVE_BENEATH. It is like O_BENEATH, but disables to walk out of the subtree rooted in the starting directory. O_BENEATH does not care if path walks out if it returned. Requested by: Dan Gohman <dev@sunfishcode.online> PR: 248335 Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:48:12 +00:00
Konstantin Belousov	4a0b316d2a	Add open2nameif() the helper to calculate namei flags both for open(2) and creat(2). Suggested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:23:58 +00:00
Eric van Gyzen	f9cc8410e1	vm_ooffset_t is now unsigned vm_ooffset_t is now unsigned. Remove some tests for negative values, or make other adjustments accordingly. Reported by: Coverity Reviewed by: kib markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26214	2020-09-18 16:48:08 +00:00
Konstantin Belousov	3c484f325e	Convert page cache read to VOP. There are several negative side-effects of not calling into VOP layer at all for page cache reads. The biggest is the missed activation of EVFILT_READ knotes. Also, it allows filesystem to make more fine grained decision to refuse read from page cache. Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for asserts. Reviewed by: markj Tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 22:06:36 +00:00
Konstantin Belousov	888636655d	vfs_subr.c: export io_hold_cnt and vn_read_from_obj(). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 22:00:58 +00:00
Mateusz Guzik	feabaaf995	cache: drop the always curthread argument from reverse lookup routines Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs. Tested by: pho	2020-08-24 08:57:02 +00:00
Konstantin Belousov	beb27033aa	Fix powerpc build. Sponsored by: The FreeBSD Foundation	2020-08-16 22:50:59 +00:00
Konstantin Belousov	fbca789fc3	VMIO read If possible, i.e. if the requested range is resident valid in the vm object queue, and some secondary conditions hold, copy data for read(2) directly from the valid cached pages, avoiding vnode lock and instantiating buffers. I intentionally do not start read-ahead, nor handle the advises on the cached range. Filesystems indicate support for VMIO reads by setting VIRF_PGREAD flag, which must not be cleared until vnode reclamation. Currently only filesystems that use vnode pager for v_objects can enable it, due to reliance on vnp_size. There is a WIP to handle it for tmpfs. Reviewed by: markj Discussed with: jeff Tested by: pho Benchmarked by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968	2020-08-16 21:02:45 +00:00
Mateusz Guzik	51ea7bea91	vfs: add VOP_STAT The current scheme of calling VOP_GETATTR adds avoidable overhead. An example with tmpfs doing fstat (ops/s): before: 7488958 after: 7913833 Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D25910	2020-08-07 23:06:40 +00:00
Mateusz Guzik	3ea3fbe685	vfs: fix vn_poll performance with either MAC or AUDIT The code would unconditionally lock the vnode to audit or call the mac hoook, even if neither want to do anything. Pre-check the state to avoid locking in the common case of nothing to do. Note this code should not be normally executed anyway as vnodes are always return ready. However, poll1/2 from will-it-scale use regular files for benchmarking, presumably to focus on the interface itself as the vnode handler is not supposed to do almost anything. This in particular fixes poll2 which passes 128 fds. $ ./poll2_processes -s 10 before: 134411 after: 271572	2020-07-16 14:09:18 +00:00
Mateusz Guzik	ab06a30517	vfs: fix MAC/AUDIT mismatch in vn_poll Auditing would not be performed without MAC compiled in.	2020-07-16 14:04:28 +00:00
Mateusz Guzik	422f38d8ea	vfs: fix trivial whitespace issues which don't interefere with blame .. even without the -w switch	2020-07-10 09:01:36 +00:00
Konstantin Belousov	4543c1c329	Fix typo. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2020-07-05 20:54:01 +00:00
Thomas Munro	f270658873	vfs: track sequential reads and writes separately For software like PostgreSQL and SQLite that sometimes reads sequentially while also writing sequentially some distance behind with interleaved syscalls on the same fd, performance is better on UFS if we do sequential access heuristics separately for reads and writes. Patch originally by Andrew Gierth in 2008, updated and proposed by me with his permission. Reviewed by: mjg, kib, tmunro Approved by: mjg (mentor) Obtained from: Andrew Gierth <andrew@tao11.riddles.org.uk> Differential Revision: https://reviews.freebsd.org/D25024	2020-06-21 08:51:24 +00:00
Kyle Evans	63619b6dba	vfs: add restrictions to read(2) of a directory [2/2] This commit adds the priv(9) that waters down the sysctl to make it only allow read(2) of a dirfd by the system root. Jailed root is not allowed, but jail policy and superuser policy will abstain from allowing/denying it so that a MAC module can fully control the policy. Such a MAC module has been written, and can be found at: https://people.freebsd.org/~kevans/mac_read_dir-0.1.0.tar.gz It is expected that the MAC module won't be needed by many, as most only need to do such diagnostics that require this behavior as system root anyways. Interested parties are welcome to grab the MAC module above and create a port or locally integrate it, and with enough support it could see introduction to base. As noted in mac_read_dir.c, it is released under the BSD 2 clause license and allows the restrictions to be lifted for only jailed root or for all unprivileged users. PR: 246412 Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous) Reviewed by: rgrimes (latest version) Differential Revision: https://reviews.freebsd.org/D24596	2020-06-04 18:17:25 +00:00
Kyle Evans	dcef4f65ae	vfs: add restrictions to read(2) of a directory [1/2] Historically, we've allowed read() of a directory and some filesystems will accommodate (e.g. ufs/ffs, msdosfs). From the history department staffed by Warner: <<EOF pdp-7 unix seemed to allow reading directories, but they were weird, special things there so I'm unsure (my pdp-7 assembler sucks). 1st Edition's sources are lost, mostly. The kernel allows it. The reconstructed sources from 2nd or 3rd edition read it though. V6 to V7 changed the filesystem format, and should have been a warning, but reading directories weren't materially changed. 4.1b BSD introduced readdir because of UFS. UFS broke all directory reading programs in 1983. ls, du, find, etc all had to be rewritten. readdir() and friends were introduced here. SysVr3 picked up readdir() in 1987 for the AT&T fork of Unix. SysVr4 updated all the directory reading programs in 1988 because different filesystem types were introduced. In the 90s, these interfaces became completely ubiquitous as PDP-11s running V7 faded from view and all the folks that initially started on V7 upgraded to SysV. Linux never supported this (though I've not done the software archeology to check) because it has always had a pathological diversity of filesystems. EOF Disallowing read(2) on a directory has the side-effect of masking application bugs from relying on other implementation's behavior (e.g. Linux) of rejecting these with EISDIR across the board, but allowing it has been a vector for at least one stack disclosure bug in the past[0]. By POSIX, this is implementation-defined whether read() handles directories or not. Popular implementations have chosen to reject them, and this seems sensible: the data you're reading from a directory is not structured in some unified way across filesystem implementations like with readdir(2), so it is impossible for applications to portably rely on this. With this patch, we will reject most read(2) of a dirfd with EISDIR. Users that know what they're doing can conscientiously set bsd.security.allow_read_dir=1 to allow read(2) of directories, as it has proven useful for debugging or recovery. A future commit will further limit the sysctl to allow only the system root to read(2) directories, to make it at least relatively safe to leave on for longer periods of time. While we're adding logic pertaining to directory vnodes to vn_io_fault, an additional assertion has also been added to ensure that we're not reaching vn_io_fault with any write request on a directory vnode. Such request would be a logical error in the kernel, and must be debugged rather than allowing it to potentially silently error out. Commented out shell aliases have been placed in root's chsrc/shrc to promote awareness that grep may become noisy after this change, depending on your usage. A tentative MFC plan has been put together to try and make it as trivial as possible to identify issues and collect reports; note that this will be strongly re-evaluated. Tentatively, I will MFC this knob with the default as it is in HEAD to improve our odds of actually getting reports. The future priv(9) to further restrict the sysctl WILL NOT BE MERGED BACK, so the knob will be a faithful reversion on stable/12. We will go into the merge acknowledging that the sysctl default may be flipped back to restore historical behavior at any point if it's warranted. [0] https://www.freebsd.org/security/advisories/FreeBSD-SA-19:10.ufs.asc PR: 246412 Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous) Reviewed by: rgrimes (latest version) MFC after: 1 month (note the MFC plan mentioned above) Relnotes: absolutely, but will amend previous RELNOTES entry Differential Revision: https://reviews.freebsd.org/D24596	2020-06-04 18:09:55 +00:00
Mateusz Guzik	e3d16bb6a8	vfs: use atomic_{store,load}_long to manage f_offset ... instead of depending on the compiler not to mess them up	2020-05-25 04:57:57 +00:00
Mateusz Guzik	442e617fd2	vfs: restore mtx-protected foffset locking for 32 bit platforms They depend on it to accurately read the offset. The new code is not used as it would add an interrupt enable/disable trip on top of the atomic. This also fixes a bug where 32-bit nolock request would still lock the offset. No changes for 64-bit. Reported by: emaste	2020-05-25 04:56:41 +00:00
Mateusz Guzik	3fc40153b2	vfs: scale foffset_lock by using atomics instead of serializing on mtx pool Contending cases still serialize on sleepq (which would be taken anyway). Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21626	2020-05-24 03:50:49 +00:00
Konstantin Belousov	90f29198c3	Remove extra call to vfs_op_exit() from vfs_write_suspend() when VFS_SYNC() fails. The vfs_write_resume() handler already does vfs_op_exit() for us. Reported by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation	2020-04-09 18:38:00 +00:00
Ryan Libby	2782c00c04	vfs: quiet -Wwrite-strings Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23797	2020-02-23 03:32:11 +00:00
Mateusz Guzik	074ad60a4c	vfs: make write suspension mandatory At the time opt-in was introduced adding yourself as a writer was esrializing across the mount point. Nowadays it is fully per-cpu, the only impact being a small single-threaded hit on top of what's there right now. Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which has is done regardless. Should someone want to microoptimize this single-threaded they can coalesce looking the mount up with adding a write to it.	2020-02-15 13:00:39 +00:00
Mateusz Guzik	7b2ff0dcb2	Partially decompose priv_check by adding priv_check_cred_vfs_generation During buildkernel there are very frequent calls to priv_check and they all are for PRIV_VFS_GENERATION (coming from stat/fstat). This results in branching on several potential privileges checking if perhaps that's the one which has to be evaluated. Instead of the kitchen-sink approach provide a way to have commonly used privs directly evaluated.	2020-02-13 22:22:15 +00:00
Mateusz Guzik	2f7f11b7de	vfs: tidy up vget_finish and vn_lock - remove assertion which duplicates vn_lock - use VNPASS instead of retyping the failure - report what flags were passed if panicking on them	2020-02-08 15:52:20 +00:00
Mateusz Guzik	3ff65f71cb	Remove duplicated empty lines from kern/*.c No functional changes.	2020-01-30 20:05:05 +00:00
Kyle Evans	2856d85ecb	posix_fallocate: push vnop implementation into the fileop layer This opens the door for other descriptor types to implement posix_fallocate(2) as needed. Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D23042	2020-01-08 19:05:32 +00:00
Mateusz Guzik	7e2ea5772b	vfs: factor out avoidable branches in _vn_lock	2020-01-05 01:00:11 +00:00
Mateusz Guzik	b249ce48ea	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427	2020-01-03 22:29:58 +00:00
Mateusz Guzik	abd80ddb94	vfs: introduce v_irflag and make v_type smaller The current vnode layout is not smp-friendly by having frequently read data avoidably sharing cachelines with very frequently modified fields. In particular v_iflag inspected for VI_DOOMED can be found in the same line with v_usecount. Instead make it available in the same cacheline as the v_op, v_data and v_type which all get read all the time. v_type is avoidably 4 bytes while the necessary data will easily fit in 1. Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new flag field with a new value: VIRF_DOOMED. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D22715	2019-12-08 21:30:04 +00:00
Konstantin Belousov	fdc6b10d44	Add a VN_OPEN_INVFS flag. vn_open_cred() assumes that it is called from the top-level of a VFS syscall. Writers must call bwillwrite() before locking any VFS resource to wait for cleanup of dirty buffers. ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which results in wait for unrelated buffers while owning ZFS vnode lock (and ZFS does not use buffer cache). VN_OPEN_INVFS allows caller to skip bwillwrite. Note that ZFS is still incorrect there, because it starts write on an mp and locks a vnode while holding another vnode lock. Reported by: Willem Jan Withagen <wjw@digiware.nl> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-11-29 14:02:32 +00:00
Rick Macklem	48e4857859	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The first of these semantic changes was changed to be compatible with linux5.n by r354564. For the second semantic change, the old linux man page stated that, if infd and outfd referred to the same file, EBADF should be returned. Now, the semantics is to allow infd and outfd to refer to the same file so long as the byte ranges defined by the input file offset, output file offset and len does not overlap. If the byte ranges do overlap, EINVAL should be returned. This patch modifies copy_file_range(2) to be linux5.n compatible for this semantic change.	2019-11-10 01:08:14 +00:00
Rick Macklem	15930ae180	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The old linux man page stated that, if the offset + len exceeded file_size for the input file, EINVAL should be returned. Now, the semantics is to copy up to at most file_size bytes and return that number of bytes copied. If the offset is at or beyond file_size, a return of 0 bytes is done. This patch modifies copy_file_range(2) to be linux compatible for this semantic change. A separate patch will change copy_file_range(2) for the other semantic change, which allows the infd and outfd to refer to the same file, so long as the byte ranges do not overlap.	2019-11-08 23:39:17 +00:00
Andrew Turner	9bb37c03fb	Stop leaking information from the kernel through timespec The timespec struct holds a seconds value in a time_t and a nanoseconds value in a long. On most architectures these are the same size, however on 32-bit architectures other than i386 time_t is 8 bytes and long is 4 bytes. Most ABIs will then pad a struct holding an 8 byte and 4 byte value to 16 bytes with 4 bytes of padding. When copying one of these structs the compiler is free to copy the padding if it wishes. In this case the padding may contain kernel data that is then leaked to userspace. Fix this by copying the timespec elements rather than the entire struct. This doesn't affect Tier-1 architectures so no SA is expected. admbugs: 651 MFC after: 1 week Sponsored by: DARPA, AFRL	2019-10-16 13:21:01 +00:00
Konstantin Belousov	55894117b1	Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY. Reviewed by: bcr (man page), emaste (previous version) PR: 240452 Sponsored by: The FreeBSD Foundation MFC after: 1 week DIfferential revision: https://reviews.freebsd.org/D21634	2019-09-17 18:32:18 +00:00
Mateusz Guzik	4cace859c2	vfs: convert struct mount counters to per-cpu There are 3 counters modified all the time in this structure - one for keeping the structure alive, one for preventing unmount and one for tracking active writers. Exact values of these counters are very rarely needed, which makes them a prime candidate for conversion to a per-cpu scheme, resulting in much better performance. Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on a 104-way 2 socket Skylake system: before: 852393 ops/s after: 76682077 ops/s Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21637	2019-09-16 21:37:47 +00:00
Mateusz Guzik	e87f3f72f1	vfs: manage mnt_writeopcount with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21575	2019-09-16 21:33:16 +00:00
Kyle Evans	fe7bcbaf50	vm pager: writemapping accounting for OBJT_SWAP Currently writemapping accounting is only done for vnode_pager which does some accounting on the underlying vnode. Extend this to allow accounting to be possible for any of the pager types. New pageops are added to update/release writecount that need to be implemented for any pager wishing to do said accounting, and we implement these methods now for both vnode_pager (unchanged) and swap_pager. The primary motivation for this is to allow other systems with OBJT_SWAP objects to check if their objects have any write mappings and reject operations with EBUSY if so. posixshm will be the first to do so in order to reject adding write seals to the shmfd if any writable mappings exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456	2019-09-03 20:31:48 +00:00

1 2 3 4 5 ...

464 Commits