freebsd-dev

Author	SHA1	Message	Date
Mateusz Guzik	eebc2e450f	vfs: add NDREINIT to facilitate repeated namei calls struct nameidata mixes caller arguments, internal state and output, which can be quite error prone. Recent addition of validating ni_resflags uncovered a caller which could repeatedly call namei, effectively operating on partially populated state. Add bare minimum validation that this does not happen. The real fix would decouple aforementioned state. Reported by: pho Tested by: pho (different variant)	2020-10-29 12:56:02 +00:00
Mateusz Guzik	deb1339f3f	vfs: fix a panic when truncating coming from copy_file_range Truncating requires an exclusive lock, but it was not taken if the filesystem indicates support for shared writes. This only concerns ZFS. In particular fixes cp of files which have trailing holes. Reported by: bdrewery	2020-10-09 20:31:42 +00:00
Rick Macklem	19fe23fa2b	Make vn_generic_copy_file_range() interruptible via a signal. Without this patch, when vn_generic_copy_file_range() is doing a large copy, it will remain in the function for a considerable amount of time, delaying handling of any outstanding signals until the copy completes. This patch adds checks for signals that need to be processed after each successful data copy cycle. When sig_intr() returns non-zero, vn_generic_copy_file_range() will return. The check "if (len < savlen)" ensures that some data has been copied, so that progress will be made. Note that, since copy_file_range(2) is allowed to return fewer bytes copied than requested, it will never return EINTR/ERESTART when sig_intr() returns non-zero. Reviewed by: kib, asomers Differential Revision: https://reviews.freebsd.org/D26620	2020-10-09 01:04:28 +00:00
Rick Macklem	961afe3c99	Clip the "len" argument to vn_generic_copy_file_range() at a hole size boundary. By clipping the len argument of vn_generic_copy_file_range() to end at an exact multiple of hole size, holes are more likely to be maintained during the copy. A hole can still straddle the boundary at the end of the copy range, resulting in a block being allocated in the output file as it is being grown in size, but this will reduce the likelihood of this happening. While here, also modify setting of blksize to better handle the case where _PC_MIN_HOLE_SIZE is returned as 1. Reviewed by: asomers Differential Revision: https://reviews.freebsd.org/D26570	2020-10-01 00:33:44 +00:00
Rick Macklem	164aa1e941	Make copy_file_range(2) Linux compatible for overflow of offset + len. Without this patch, if a call to copy_file_range(2) specifies an input file offset + len that would wrap around, EINVAL is returned. I thought that was the Linux behaviour, but recent testing showed that Linux accepts this case and does the copy_file_range() to EOF. This patch changes the FreeBSD code to exhibit the same behaviour as Linux for this case. Reviewed by: asomers, kib Differential Revision: https://reviews.freebsd.org/D26569	2020-09-30 02:18:09 +00:00
Konstantin Belousov	1317da4349	Add O_RESOLVE_BENEATH and AT_RESOLVE_BENEATH to mimic Linux' RESOLVE_BENEATH. It is like O_BENEATH, but disallows walking out of the subtree rooted in the starting directory. O_BENEATH does not care if the path walks out, as long as it returns. Requested by: Dan Gohman <dev@sunfishcode.online> PR: 248335 Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:48:12 +00:00
Konstantin Belousov	4a0b316d2a	Add open2nameif() the helper to calculate namei flags both for open(2) and creat(2). Suggested and reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D25886	2020-09-22 22:23:58 +00:00
Eric van Gyzen	f9cc8410e1	vm_ooffset_t is now unsigned vm_ooffset_t is now unsigned. Remove some tests for negative values, or make other adjustments accordingly. Reported by: Coverity Reviewed by: kib markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D26214	2020-09-18 16:48:08 +00:00
Konstantin Belousov	3c484f325e	Convert page cache read to VOP. There are several negative side-effects of not calling into VOP layer at all for page cache reads. The biggest is the missed activation of EVFILT_READ knotes. Also, it allows filesystem to make more fine grained decision to refuse read from page cache. Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for asserts. Reviewed by: markj Tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 22:06:36 +00:00
Konstantin Belousov	888636655d	vfs_subr.c: export io_hold_cnt and vn_read_from_obj(). Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346	2020-09-15 22:00:58 +00:00
Mateusz Guzik	feabaaf995	cache: drop the always curthread argument from reverse lookup routines Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs. Tested by: pho	2020-08-24 08:57:02 +00:00
Konstantin Belousov	beb27033aa	Fix powerpc build. Sponsored by: The FreeBSD Foundation	2020-08-16 22:50:59 +00:00
Konstantin Belousov	fbca789fc3	VMIO read If possible, i.e. if the requested range is resident valid in the vm object queue, and some secondary conditions hold, copy data for read(2) directly from the valid cached pages, avoiding vnode lock and instantiating buffers. I intentionally do not start read-ahead, nor handle the advises on the cached range. Filesystems indicate support for VMIO reads by setting VIRF_PGREAD flag, which must not be cleared until vnode reclamation. Currently only filesystems that use vnode pager for v_objects can enable it, due to reliance on vnp_size. There is a WIP to handle it for tmpfs. Reviewed by: markj Discussed with: jeff Tested by: pho Benchmarked by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968	2020-08-16 21:02:45 +00:00
Mateusz Guzik	51ea7bea91	vfs: add VOP_STAT The current scheme of calling VOP_GETATTR adds avoidable overhead. An example with tmpfs doing fstat (ops/s): before: 7488958 after: 7913833 Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D25910	2020-08-07 23:06:40 +00:00
Mateusz Guzik	3ea3fbe685	vfs: fix vn_poll performance with either MAC or AUDIT The code would unconditionally lock the vnode to audit or call the mac hook, even if neither want to do anything. Pre-check the state to avoid locking in the common case of nothing to do. Note this code should not be normally executed anyway as vnodes always return ready. However, poll1/2 from will-it-scale use regular files for benchmarking, presumably to focus on the interface itself as the vnode handler is not supposed to do almost anything. This in particular fixes poll2 which passes 128 fds. $ ./poll2_processes -s 10 before: 134411 after: 271572	2020-07-16 14:09:18 +00:00
Mateusz Guzik	ab06a30517	vfs: fix MAC/AUDIT mismatch in vn_poll Auditing would not be performed without MAC compiled in.	2020-07-16 14:04:28 +00:00
Mateusz Guzik	422f38d8ea	vfs: fix trivial whitespace issues which don't interfere with blame .. even without the -w switch	2020-07-10 09:01:36 +00:00
Konstantin Belousov	4543c1c329	Fix typo. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2020-07-05 20:54:01 +00:00
Thomas Munro	f270658873	vfs: track sequential reads and writes separately For software like PostgreSQL and SQLite that sometimes reads sequentially while also writing sequentially some distance behind with interleaved syscalls on the same fd, performance is better on UFS if we do sequential access heuristics separately for reads and writes. Patch originally by Andrew Gierth in 2008, updated and proposed by me with his permission. Reviewed by: mjg, kib, tmunro Approved by: mjg (mentor) Obtained from: Andrew Gierth <andrew@tao11.riddles.org.uk> Differential Revision: https://reviews.freebsd.org/D25024	2020-06-21 08:51:24 +00:00
Kyle Evans	63619b6dba	vfs: add restrictions to read(2) of a directory [2/2] This commit adds the priv(9) that waters down the sysctl to make it only allow read(2) of a dirfd by the system root. Jailed root is not allowed, but jail policy and superuser policy will abstain from allowing/denying it so that a MAC module can fully control the policy. Such a MAC module has been written, and can be found at: https://people.freebsd.org/~kevans/mac_read_dir-0.1.0.tar.gz It is expected that the MAC module won't be needed by many, as most only need to do such diagnostics that require this behavior as system root anyways. Interested parties are welcome to grab the MAC module above and create a port or locally integrate it, and with enough support it could see introduction to base. As noted in mac_read_dir.c, it is released under the BSD 2 clause license and allows the restrictions to be lifted for only jailed root or for all unprivileged users. PR: 246412 Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous) Reviewed by: rgrimes (latest version) Differential Revision: https://reviews.freebsd.org/D24596	2020-06-04 18:17:25 +00:00
Kyle Evans	dcef4f65ae	vfs: add restrictions to read(2) of a directory [1/2] Historically, we've allowed read() of a directory and some filesystems will accommodate (e.g. ufs/ffs, msdosfs). From the history department staffed by Warner: <<EOF pdp-7 unix seemed to allow reading directories, but they were weird, special things there so I'm unsure (my pdp-7 assembler sucks). 1st Edition's sources are lost, mostly. The kernel allows it. The reconstructed sources from 2nd or 3rd edition read it though. V6 to V7 changed the filesystem format, and should have been a warning, but reading directories weren't materially changed. 4.1b BSD introduced readdir because of UFS. UFS broke all directory reading programs in 1983. ls, du, find, etc all had to be rewritten. readdir() and friends were introduced here. SysVr3 picked up readdir() in 1987 for the AT&T fork of Unix. SysVr4 updated all the directory reading programs in 1988 because different filesystem types were introduced. In the 90s, these interfaces became completely ubiquitous as PDP-11s running V7 faded from view and all the folks that initially started on V7 upgraded to SysV. Linux never supported this (though I've not done the software archeology to check) because it has always had a pathological diversity of filesystems. EOF Disallowing read(2) on a directory has the side-effect of masking application bugs from relying on other implementation's behavior (e.g. Linux) of rejecting these with EISDIR across the board, but allowing it has been a vector for at least one stack disclosure bug in the past[0]. By POSIX, this is implementation-defined whether read() handles directories or not. Popular implementations have chosen to reject them, and this seems sensible: the data you're reading from a directory is not structured in some unified way across filesystem implementations like with readdir(2), so it is impossible for applications to portably rely on this. 
With this patch, we will reject most read(2) of a dirfd with EISDIR. Users that know what they're doing can conscientiously set bsd.security.allow_read_dir=1 to allow read(2) of directories, as it has proven useful for debugging or recovery. A future commit will further limit the sysctl to allow only the system root to read(2) directories, to make it at least relatively safe to leave on for longer periods of time. While we're adding logic pertaining to directory vnodes to vn_io_fault, an additional assertion has also been added to ensure that we're not reaching vn_io_fault with any write request on a directory vnode. Such request would be a logical error in the kernel, and must be debugged rather than allowing it to potentially silently error out. Commented out shell aliases have been placed in root's cshrc/shrc to promote awareness that grep may become noisy after this change, depending on your usage. A tentative MFC plan has been put together to try and make it as trivial as possible to identify issues and collect reports; note that this will be strongly re-evaluated. Tentatively, I will MFC this knob with the default as it is in HEAD to improve our odds of actually getting reports. The future priv(9) to further restrict the sysctl WILL NOT BE MERGED BACK, so the knob will be a faithful reversion on stable/12. We will go into the merge acknowledging that the sysctl default may be flipped back to restore historical behavior at any point if it's warranted. [0] https://www.freebsd.org/security/advisories/FreeBSD-SA-19:10.ufs.asc PR: 246412 Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous) Reviewed by: rgrimes (latest version) MFC after: 1 month (note the MFC plan mentioned above) Relnotes: absolutely, but will amend previous RELNOTES entry Differential Revision: https://reviews.freebsd.org/D24596	2020-06-04 18:09:55 +00:00
Mateusz Guzik	e3d16bb6a8	vfs: use atomic_{store,load}_long to manage f_offset ... instead of depending on the compiler not to mess them up	2020-05-25 04:57:57 +00:00
Mateusz Guzik	442e617fd2	vfs: restore mtx-protected foffset locking for 32 bit platforms They depend on it to accurately read the offset. The new code is not used as it would add an interrupt enable/disable trip on top of the atomic. This also fixes a bug where 32-bit nolock request would still lock the offset. No changes for 64-bit. Reported by: emaste	2020-05-25 04:56:41 +00:00
Mateusz Guzik	3fc40153b2	vfs: scale foffset_lock by using atomics instead of serializing on mtx pool Contending cases still serialize on sleepq (which would be taken anyway). Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21626	2020-05-24 03:50:49 +00:00
Konstantin Belousov	90f29198c3	Remove extra call to vfs_op_exit() from vfs_write_suspend() when VFS_SYNC() fails. The vfs_write_resume() handler already does vfs_op_exit() for us. Reported by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation	2020-04-09 18:38:00 +00:00
Ryan Libby	2782c00c04	vfs: quiet -Wwrite-strings Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23797	2020-02-23 03:32:11 +00:00
Mateusz Guzik	074ad60a4c	vfs: make write suspension mandatory At the time opt-in was introduced adding yourself as a writer was serializing across the mount point. Nowadays it is fully per-cpu, the only impact being a small single-threaded hit on top of what's there right now. Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which is done regardless. Should someone want to microoptimize this single-threaded they can coalesce looking the mount up with adding a write to it.	2020-02-15 13:00:39 +00:00
Mateusz Guzik	7b2ff0dcb2	Partially decompose priv_check by adding priv_check_cred_vfs_generation During buildkernel there are very frequent calls to priv_check and they all are for PRIV_VFS_GENERATION (coming from stat/fstat). This results in branching on several potential privileges checking if perhaps that's the one which has to be evaluated. Instead of the kitchen-sink approach provide a way to have commonly used privs directly evaluated.	2020-02-13 22:22:15 +00:00
Mateusz Guzik	2f7f11b7de	vfs: tidy up vget_finish and vn_lock - remove assertion which duplicates vn_lock - use VNPASS instead of retyping the failure - report what flags were passed if panicking on them	2020-02-08 15:52:20 +00:00
Mateusz Guzik	3ff65f71cb	Remove duplicated empty lines from kern/*.c No functional changes.	2020-01-30 20:05:05 +00:00
Kyle Evans	2856d85ecb	posix_fallocate: push vnop implementation into the fileop layer This opens the door for other descriptor types to implement posix_fallocate(2) as needed. Reviewed by: kib, bcr (manpages) Differential Revision: https://reviews.freebsd.org/D23042	2020-01-08 19:05:32 +00:00
Mateusz Guzik	7e2ea5772b	vfs: factor out avoidable branches in _vn_lock	2020-01-05 01:00:11 +00:00
Mateusz Guzik	b249ce48ea	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427	2020-01-03 22:29:58 +00:00
Mateusz Guzik	abd80ddb94	vfs: introduce v_irflag and make v_type smaller The current vnode layout is not smp-friendly by having frequently read data avoidably sharing cachelines with very frequently modified fields. In particular v_iflag inspected for VI_DOOMED can be found in the same line with v_usecount. Instead make it available in the same cacheline as the v_op, v_data and v_type which all get read all the time. v_type is avoidably 4 bytes while the necessary data will easily fit in 1. Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new flag field with a new value: VIRF_DOOMED. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D22715	2019-12-08 21:30:04 +00:00
Konstantin Belousov	fdc6b10d44	Add a VN_OPEN_INVFS flag. vn_open_cred() assumes that it is called from the top-level of a VFS syscall. Writers must call bwillwrite() before locking any VFS resource to wait for cleanup of dirty buffers. ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which results in wait for unrelated buffers while owning ZFS vnode lock (and ZFS does not use buffer cache). VN_OPEN_INVFS allows caller to skip bwillwrite. Note that ZFS is still incorrect there, because it starts write on an mp and locks a vnode while holding another vnode lock. Reported by: Willem Jan Withagen <wjw@digiware.nl> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-11-29 14:02:32 +00:00
Rick Macklem	48e4857859	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The first of these semantic changes was changed to be compatible with linux5.n by r354564. For the second semantic change, the old linux man page stated that, if infd and outfd referred to the same file, EBADF should be returned. Now, the semantics is to allow infd and outfd to refer to the same file so long as the byte ranges defined by the input file offset, output file offset and len does not overlap. If the byte ranges do overlap, EINVAL should be returned. This patch modifies copy_file_range(2) to be linux5.n compatible for this semantic change.	2019-11-10 01:08:14 +00:00
Rick Macklem	15930ae180	Update copy_file_range(2) to be Linux5 compatible. The current linux man page and testing done on a fairly recent linux5.n kernel have identified two changes to the semantics of the linux copy_file_range system call. Since the copy_file_range(2) system call is intended to be linux compatible and is only currently in head/current and not used by any commands, it seems appropriate to update the system call to be compatible with the current linux one. The old linux man page stated that, if the offset + len exceeded file_size for the input file, EINVAL should be returned. Now, the semantics is to copy up to at most file_size bytes and return that number of bytes copied. If the offset is at or beyond file_size, a return of 0 bytes is done. This patch modifies copy_file_range(2) to be linux compatible for this semantic change. A separate patch will change copy_file_range(2) for the other semantic change, which allows the infd and outfd to refer to the same file, so long as the byte ranges do not overlap.	2019-11-08 23:39:17 +00:00
Andrew Turner	9bb37c03fb	Stop leaking information from the kernel through timespec The timespec struct holds a seconds value in a time_t and a nanoseconds value in a long. On most architectures these are the same size, however on 32-bit architectures other than i386 time_t is 8 bytes and long is 4 bytes. Most ABIs will then pad a struct holding an 8 byte and 4 byte value to 16 bytes with 4 bytes of padding. When copying one of these structs the compiler is free to copy the padding if it wishes. In this case the padding may contain kernel data that is then leaked to userspace. Fix this by copying the timespec elements rather than the entire struct. This doesn't affect Tier-1 architectures so no SA is expected. admbugs: 651 MFC after: 1 week Sponsored by: DARPA, AFRL	2019-10-16 13:21:01 +00:00
Konstantin Belousov	55894117b1	Return EISDIR when directory is opened with O_CREAT without O_DIRECTORY. Reviewed by: bcr (man page), emaste (previous version) PR: 240452 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D21634	2019-09-17 18:32:18 +00:00
Mateusz Guzik	4cace859c2	vfs: convert struct mount counters to per-cpu There are 3 counters modified all the time in this structure - one for keeping the structure alive, one for preventing unmount and one for tracking active writers. Exact values of these counters are very rarely needed, which makes them a prime candidate for conversion to a per-cpu scheme, resulting in much better performance. Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on a 104-way 2 socket Skylake system: before: 852393 ops/s after: 76682077 ops/s Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21637	2019-09-16 21:37:47 +00:00
Mateusz Guzik	e87f3f72f1	vfs: manage mnt_writeopcount with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21575	2019-09-16 21:33:16 +00:00
Kyle Evans	fe7bcbaf50	vm pager: writemapping accounting for OBJT_SWAP Currently writemapping accounting is only done for vnode_pager which does some accounting on the underlying vnode. Extend this to allow accounting to be possible for any of the pager types. New pageops are added to update/release writecount that need to be implemented for any pager wishing to do said accounting, and we implement these methods now for both vnode_pager (unchanged) and swap_pager. The primary motivation for this is to allow other systems with OBJT_SWAP objects to check if their objects have any write mappings and reject operations with EBUSY if so. posixshm will be the first to do so in order to reject adding write seals to the shmfd if any writable mappings exist. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D21456	2019-09-03 20:31:48 +00:00
Konstantin Belousov	95acb40caa	vn_vget_ino_gen(): relock the lower vnode on error. The function's interface assumes that the lower vnode is passed and returned locked always. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-08-27 08:28:38 +00:00
Rick Macklem	df9bc7df42	Map ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE). Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE) on a file in a file system that does not have its own VOP_IOCTL(), the lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since ENOTTY is not listed as an error return by either the lseek(2) man page or the POSIX draft for lseek(2). This was discussed on freebsd-current@ here: http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA This trivial patch maps ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE). Reviewed by: markj Relnotes: yes Differential Revision: https://reviews.freebsd.org/D21300	2019-08-22 01:15:06 +00:00
Rick Macklem	c61b14315f	Fix copy_file_range(2) so that unneeded blocks are not allocated to the output file. When the byte range for copy_file_range(2) doesn't go to EOF on the output file and there is a hole in the input file, a hole must be "punched" in the output file. This is done by writing a block of bytes all set to 0. Without this patch, the write is done unconditionally which means that, if the output file already has a hole in that byte range, an unneeded data block of all 0 bytes would be allocated. This patch adds code to check for a hole in the output file, so that it can skip doing the write if there is already a hole in that byte range of the output file. This avoids unnecessary allocation of blocks to the output file. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D21155	2019-08-15 23:21:41 +00:00
Rick Macklem	6b1bc6f7dd	Remove some harmless cruft from vn_generic_copy_file_range(). An earlier version of the patch had code that set "error" between line#s 2797-2799. When that code was moved, the second check for "error != 0" could never be true and the check became harmless cruft. This patch removes the cruft, mainly to make Coverity happy. Reported by: asomers, cem	2019-08-08 20:07:38 +00:00
Rick Macklem	614633146f	Fix copy_file_range(2) for an unlikely race during hole finding. Since the VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE) calls are done with the vnode unlocked, it is possible for another thread to do: - truncate(), lseek(), write() between the two calls and create a hole where FIOSEEKDATA returned the start of data. For this case, VOP_IOCTL(FIOSEEKHOLE) will return the same offset for the hole location. This could result in an infinite loop in the copy code, since copylen is set to 0 and the copy doesn't advance. Usually, this race is avoided because of the use of rangelocks, but the NFS server does not do range locking and could do a sequence like the above to create the hole. This patch checks for this case and makes the hole search fail, to avoid the infinite loop. At this time, it is an open question as to whether or not the NFS server should do range locking to avoid this race.	2019-08-08 19:53:07 +00:00
Rick Macklem	bbbbeca3e9	Add kernel support for a Linux compatible copy_file_range(2) syscall. This patch adds support to the kernel for a Linux compatible copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9). This syscall/VOP can be used by the NFSv4.2 client to implement the Copy operation against an NFSv4.2 server to do file copies locally on the server. The vn_generic_copy_file_range() function in this patch can be used by the NFSv4.2 server to implement the Copy operation. Fuse may also be able to use the VOP_COPY_FILE_RANGE() method. vn_generic_copy_file_range() attempts to maintain holes in the output file in the range to be copied, but may fail to do so if the input and output files are on different file systems with different _PC_MIN_HOLE_SIZE values. Separate commits will be done for the generated syscall files and userland changes. A commit for a compat32 syscall will be done later. Reviewed by: kib, asomers (plus comments by brooks, jilles) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20584	2019-07-25 05:46:16 +00:00
Alan Somers	0122532ee0	F_READAHEAD: Fix r349248's overflow protection, broken by r349391 I accidentally broke the main point of r349248 when making stylistic changes in r349391. Restore the original behavior, and also fix an additional overflow that was possible when uio->uio_resid was nearly SSIZE_MAX. Reported by: cem Reviewed by: bde MFC after: 2 weeks MFC-With: 349248 Sponsored by: The FreeBSD Foundation	2019-07-17 17:01:07 +00:00
Rick Macklem	555d8f2859	Factor out the code that does a VOP_SETATTR(size) from vn_truncate(). This patch factors the code in vn_truncate() that does the actual VOP_SETATTR() of size into a separate function called vn_truncate_locked(). This will allow the NFS server and the patch that adds a copy_file_range(2) syscall to call this function instead of duplicating the code and carrying over changes, such as the recent r347151. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D20808	2019-07-01 20:41:43 +00:00
Alan Somers	0cfc1ef38d	FIOBMAP2: inline vn_ioc_bmap2 Reported by: kib Reviewed by: kib MFC after: 2 weeks MFC-With: 349238 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20783	2019-06-27 23:39:06 +00:00
Alan Somers	4f53d57e8c	fcntl: style changes to r349248 Reported by: bde MFC after: 2 weeks MFC-With: 349248 Sponsored by: The FreeBSD Foundation	2019-06-25 19:44:22 +00:00
Alan Somers	38b06f8ac4	fcntl: fix overflow when setting F_READAHEAD VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field. However, fcntl allows you to set the seqcount in bytes to any nonnegative 31-bit value. The result can be a 16-bit overflow, which will be sign-extended in functions like ffs_read. Fix this by sanitizing the argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather than INT16_MAX. Also, fifos have overloaded the f_seqcount field for a completely different purpose ever since r238936. Formalize that by using a union type. Reviewed by: cem MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20710	2019-06-20 23:07:20 +00:00
Alan Somers	d49b446bfb	Add FIOBMAP2 ioctl This ioctl exposes VOP_BMAP information to userland. It can be used by programs like fragmentation analyzers and optimized cp implementations. But I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD and Linux. FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block number instead of 32-bit, and it also returns runp and runb. Reviewed by: mckusick MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20705	2019-06-20 14:13:10 +00:00
Conrad Meyer	daec92844e	Include ktr.h in more compilation units Similar to r348026, exhaustive search for uses of CTRn() and cross reference ktr.h includes. Where it was obvious that an OS compat header of some kind included ktr.h indirectly, .c files were left alone. Some of these files clearly got ktr.h via header pollution in some scenarios, or tinderbox would not be passing prior to this revision, but go ahead and explicitly include it in files using it anyway. Like r348026, these CUs did not show up in tinderbox as missing the include. Reported by: peterj (arm64/mp_machdep.c) X-MFC-With: r347984 Sponsored by: Dell EMC Isilon	2019-05-21 20:38:48 +00:00
Konstantin Belousov	78022527bb	Switch to use shared vnode locks for text files during image activation. kern_execve() locks text vnode exclusive to be able to set and clear VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0 condition. The change removes VV_TEXT, replacing it with the condition v_writecount <= -1, and puts v_writecount under the vnode interlock. Each text reference decrements v_writecount. To clear the text reference when the segment is unmapped, it is recorded in the vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and v_writecount is incremented on the map entry removal. The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that v_writecount does not contradict the desired change. vn_writecheck() is now racy and its use was eliminated everywhere except access. Atomic check for writeability and increment of v_writecount is performed by the VOP. vn_truncate() now increments v_writecount around VOP_SETATTR() call, lack of which is arguably a bug on its own. nullfs bypasses v_writecount to the lower vnode always, so nullfs vnode has its own v_writecount correct, and lower vnode gets all references, since object->handle is always lower vnode. On the text vnode's vm object dealloc, the v_writecount value is reset to zero, and deadfs vop_unset_text short-circuit the operation. Reclamation of lowervp always reclaims all nullfs vnodes referencing lowervp first, so no stray references are left. Reviewed by: markj, trasz Tested by: mjg, pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D19923	2019-05-05 11:20:43 +00:00
Konstantin Belousov	ae90941431	Add vn_fsync_buf(). Provide a convenience function to avoid the hack with filling fake struct vop_fsync_args and then calling vop_stdfsync(). Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-04-09 20:20:04 +00:00
Konstantin Belousov	4ae3f5a7fd	vn_vmap_seekhole(): align running offset to the block boundary. Otherwise we might miss the last iteration where EOF appears below unaligned noff. Reported and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D19811	2019-04-05 16:14:16 +00:00
Konstantin Belousov	4f77f48884	Implement O_BENEATH and AT_BENEATH. Flags prevent open(2) and *at(2) vfs syscalls name lookup from escaping the starting directory. Supposedly the interface is similar to the same proposed Linux flags. Reviewed by: jilles (code, previous version of manpages), 0mp (manpages) Discussed with: allanjude, emaste, jonathan Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17547	2018-10-25 22:16:34 +00:00
Gordon Tetlow	c9e562b188	Correct ELF header parsing code to prevent invalid ELF sections from disclosing memory. Submitted by: markj Reported by: Thomas Barabosch, Fraunhofer FKIE Approved by: re (implicit) Approved by: so Security: FreeBSD-SA-18:12.elf Security: CVE-2018-6924 Sponsored by: The FreeBSD Foundation	2018-09-12 04:57:34 +00:00
Ed Maste	b8d908b71e	ANSIfy sys/kern	2018-06-01 13:26:45 +00:00
Matt Macy	b99aa0fbb2	hwpmc: don't enter epoch section across mmap hook	2018-05-29 18:03:48 +00:00
Konstantin Belousov	161bf65f8a	In vn_io_fault1(), reduce the scope where pagefaults are disabled. Most important for the future use, do not call vm_fault_quick_hold_pages() with disabled pagefaults. Reported and tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:13:52 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Konstantin Belousov	03311f117b	Use whole mnt_stat.f_fsid bits for st_dev. Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this method of reporting st_dev by stat(2). Provide a new helper vn_fsid() to avoid duplicating code to copy f_fsid to va_fsid. Note that the change is mostly cosmetic. Its motivation is to avoid sign-extension of f_fsid[0] into 64bit dev_t value which happens after dev_t becomes 64bit. Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version) Sponsored by: The FreeBSD Foundation	2017-05-27 17:00:30 +00:00
Konstantin Belousov	6992112349	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). Essential and all-embracing testing was done by Peter Holm (pho). 
The heavy lifting of coordinating all these efforts and bringing the project to completion were done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439	2017-05-23 09:29:05 +00:00
Ed Maste	3e85b721d6	Remove register keyword from sys/ and ANSIfy prototypes A long long time ago the register keyword told the compiler to store the corresponding variable in a CPU register, but it is not relevant for any compiler used in the FreeBSD world today. ANSIfy related prototypes while here. Reviewed by: cem, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10193	2017-05-17 00:34:34 +00:00
Konstantin Belousov	ecc6c515ab	Apply noexec mount option for mmap(PROT_EXEC). Right now the noexec mount option disallows image activators to try execve the files on the mount point. Also, after r127187, noexec also limits max_prot map entries permissions for mappings of files from such mounts, but not the actual mapping permissions. As result, the API behaviour is inconsistent. The files from noexec mount can be mapped with PROT_EXEC, but if mprotect(2) drops execution permission, it cannot be re-enabled later. Make this consistent logically and aligned with behaviour of other systems, by disallowing PROT_EXEC for mmap(2). Note that this change only ensures aligned results from mmap(2) and mprotect(2), it does not prevent actual code execution from files coming from noexec mount. Such files can always be read into anonymous executable memory and executed from there. Reported by: shamaz.mazum@gmail.com PR: 217062 Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-02-19 20:51:04 +00:00
Konstantin Belousov	987ff18184	Consistently handle negative or wrapping offsets in the mmap(2) syscalls. For regular files and posix shared memory, POSIX requires that [offset, offset + size) range is legitimate. At the mapping time, check that offset is not negative. Allowing negative offsets might expose the data that filesystem put into vm_object for internal use, esp. due to OFF_TO_IDX() signedness treatment. Fault handler verifies that the mapped range is valid, assuming that mmap(2) checked that arithmetic gives no undefined results. For device mappings, leave the semantic of negative offsets to the driver. Correct object page index calculation to not erroneously propagate sign. In either case, disallow overflow of offset + size. Update mmap(2) man page to explain the requirement of the range validity, and behaviour when the range becomes invalid after mapping. Reported and tested by: royger (previous version) Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-12 21:05:44 +00:00
Konstantin Belousov	e83a71c656	Fix r313495. The file type DTYPE_VNODE can be assigned as a fallback if VOP_OPEN() did not initialize file type. This is a typical code path used by normal file systems. Also, change error returned for inappropriate file type used for O_EXLOCK to EOPNOTSUPP, as declared in the open(2) man page. Reported by: cy, dhw, Iblis Lin <iblis@hs.ntnu.edu.tw> Tested by: dhw Sponsored by: The FreeBSD Foundation MFC after: 13 days	2017-02-10 14:49:04 +00:00
Konstantin Belousov	e628e1b919	Increase a chance of devfs_close() calling d_close cdevsw method. If a file opened over a vnode has an advisory lock set at close, vn_closefile() acquires additional vnode use reference to prevent freeing the vnode in vn_close(). Side effect is that for device vnodes, devfs_close() sees that vnode reference count is greater than one and refuses to call d_close(). Create internal version of vn_close() which can avoid dropping the vnode reference if needed, and use this to execute VOP_CLOSE() without acquiring a new reference. Note that any parallel reference to the vnode would still prevent d_close call, if the reference is not from an opened file, e.g. due to stat(2). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-09 23:36:50 +00:00
Konstantin Belousov	7903b00087	Do not establish advisory locks when doing open(O_EXLOCK) or open(O_SHLOCK) for files which do not have DTYPE_VNODE type. Both flock(2) and fcntl(2) syscalls refuse to acquire advisory lock on a file which type is not DTYPE_VNODE. Do the same when lock is requested from open(2). Restructure the block in vn_open_vnode() which handles O_EXLOCK and O_SHLOCK open flags to make it easier to quit its execution earlier with an error. Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-02-09 23:35:57 +00:00
Mateusz Guzik	f1f7f1cb29	hwpmc: partially depessimize mmap handling if the module is not loaded In particular this means the pmc sx lock is no longer taken when an executable mapping succeeds. MFC after: 1 week	2017-01-27 22:13:15 +00:00
Konstantin Belousov	25c6816845	More style cleanup. Use ANSI C definition for vn_closefile(). Switch to VNASSERT in _vn_lock(), simplify messages. Sponsored by: The FreeBSD Foundation X-MFC with: r312600, r312601, r312602, r312606	2017-01-22 19:38:45 +00:00
Mateusz Guzik	eaf0969bda	vfs: fix LK_RETRY logic braino in r312600	2017-01-21 20:34:20 +00:00
Mateusz Guzik	829857c893	vfs: __predict_false the need to handle F_HASLOCK Also reorder the check with DTYPE_VNODE. Passed files are vnodes vast majority of the time, so it is typically true.	2017-01-21 19:01:42 +00:00
Mateusz Guzik	abbc538d9a	vfs: fix whitespace damage in r312600 While here wrap the previously overly long line so that it fits 80 chars.	2017-01-21 18:56:58 +00:00
Mateusz Guzik	1091fb52c1	vfs: refactor _vn_lock Stop testing for LK_RETRY and error multiple times. Also postpone the VI_DOOMED until after LK_RETRY was seen as it reads from the vnode. No functional changes.	2017-01-21 18:38:16 +00:00
Ed Maste	69a2875821	Renumber license clauses in sys/kern to avoid skipping #3	2016-09-15 13:16:20 +00:00
Robert Watson	c3c0088bb0	Audit additional vnode information in the implementation of the ftruncate(2) system call. This was not required by the Common Criteria, which needed only open-time audit. MFC after: 2 weeks Sponsored by: DARPA, AFRL	2016-08-20 18:51:48 +00:00
Conrad Meyer	af326ace9d	devfs: Move most ioctl logic down to vnode layer Devfs' file layer ioctl is now just a thin shim around the vnode layer. Reviewed by: kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D7286	2016-07-25 16:28:02 +00:00
Robert Watson	971711fb7c	Call audit hooks to capture vnode attributes for three file-descriptor method implementations: fstat(2), close(2), and poll(2). This change synchronises auditing here with similar auditing for VFS-specific system calls such as stat(2) that audit more complete vnode information. Sponsored by: DARPA, AFRL Approved by: re (kib) MFC after: 1 week	2016-07-05 16:37:01 +00:00
Konstantin Belousov	3f7ca894de	Ensure that ftruncate(2) is performed synchronously when file is opened in O_SYNC mode, at least for UFS. This also handles truncation, done due to the O_SYNC | O_TRUNC flags combination to open(2), in synchronous way. Noted by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-05-18 12:03:57 +00:00
Pedro F. Giffuni	31b6732008	sys/kern: spelling fixes. Mostly on comments but affects some debug messages. MFC after: 2 weeks	2016-04-29 21:54:28 +00:00
Pedro F. Giffuni	74b8d63dcc	Cleanup unnecessary semicolons from the kernel. Found with devel/coccinelle.	2016-04-10 23:07:00 +00:00
Konstantin Belousov	6adf19481c	The struct file f_advice member is overlaid with the devfs f_cdevpriv data. If vnode bypass for devfs file failed, vn_read/vn_write are called and might try to dereference f_advice. Limit the accesses to f_advice to VREG vnodes only, which is the type ensured by posix_fadvise(). The f_advice for regular files is protected by mtxpool lock. Recheck that f_advice is not NULL after lock is taken. Reported and tested by: bde Sponsored by: The FreeBSD Foundation MFC after: 3 weeks	2016-01-22 20:35:20 +00:00
Konstantin Belousov	ce958bdefe	When cleaning up from failed adv locking and checking for write, do not call VOP_CLOSE() manually. Instead, delegate the close to fo_close() performed as part of the fdrop() on the file failed to open. For this, finish constructing file on error, in particular, set f_vnode and f_ops. Forcibly resetting f_ops to badfileops disabled additional cleanups performed by fo_close() for some file types, in this case it was noted that cdevpriv data was corrupted. Since fo_close() call must be enabled for some file types, it makes more sense to enable it for all files opened through vn_open_cred(). In collaboration with: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2016-01-17 08:40:51 +00:00
Fabien Thomas	78e79434d2	Fix r283998 that broke mapin events for hwpmc. Reviewed by: jhb Sponsored by: Stormshield	2015-10-08 09:54:33 +00:00
Mark Johnston	3138cd3670	As a step towards the elimination of PG_CACHED pages, rework the handling of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726	2015-09-30 23:06:29 +00:00
Konstantin Belousov	9e18c9eb27	For open("name", O_DIRECTORY | O_CREAT), do not try to create the named node, open(2) cannot create directories. But do allow the flag combination to succeed if the directory already exists. Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always invalid for the same reason, since open(2) cannot create directory. Note that there is an argument that O_DIRECTORY | O_CREAT should be invalid always, regardless of the target directory existence or O_EXCL. The current fix is conservative and allows the call to succeed in the situation where it succeeded before the patch. Reported by: Tom Ridge <freebsd@tom-ridge.com> Reviewed by: rwatson PR: 202892 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-09 19:31:08 +00:00
Conrad Meyer	14bdbaf2e4	Detect badly behaved coredump note helpers Coredump notes depend on being able to invoke dump routines twice; once in a dry-run mode to get the size of the note, and another to actually emit the note to the corefile. When a note helper emits a different length section the second time around than the length it requested the first time, the kernel produces a corrupt coredump. NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' fd table via vn_fullpath. As vnodes may move around during dump, this is racy. So: - Detect badly behaved notes in putnote() and pad underfilled notes. - Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to exercise the NT_PROCSTAT_FILES corruption. It simply picks random lengths to expand or truncate paths to in fo_fill_kinfo_vnode(). - Add a sysctl, kern.coredump_pack_fileinfo, to allow users to disable kinfo packing for PROCSTAT_FILES notes. This should avoid both FILES note corruption and truncation, even if filenames change, at the cost of about 1 kiB in padding bloat per open fd. Document the new sysctl in core.5. - Fix note_procstat_files to self-limit in the 2nd pass. Since sometimes this will result in a short write, pad up to our advertised size. This addresses note corruption, at the risk of sometimes truncating the last several fd info entries. - Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the zero padding. With suggestions from: bjk, jhb, kib, wblock Approved by: markj (mentor) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3548	2015-09-03 20:32:10 +00:00
Konstantin Belousov	8917728875	vn_io_fault() handling of the LOR for i/o into the file-backed buffers has observable overhead when the buffer pages are not resident or not mapped. The overhead comes at least from two factors, one is the additional work needed to detect the situation, prepare and execute the rollbacks. Another is the consequence of the i/o splitting into the batches of the held pages, causing filesystems see series of the smaller i/o requests instead of the single large request. Note that expected case of the resident i/o buffer does not expose these issues. Provide a prefaulting for the userspace i/o buffers, disabled by default. I am careful of not enabling prefaulting by default for now, since it would be detrimental for the applications which speculatively pass extra-large buffers of anonymous memory to not deal with buffer sizing (if such apps exist). Found and tested by: bde, emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-31 04:12:51 +00:00
Mark Johnston	5f34e93c58	Check suspendability on the mountpoint returned by VOP_GETWRITEMOUNT. This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough filesystems like nullfs and unionfs no longer need to inherit this information from their lower layer(s). This change also restores the pre-r273336 behaviour of using the presence of a susp_clean VFS method to request suspension support. Reviewed by: kib, mjg Differential Revision: https://reviews.freebsd.org/D2937	2015-07-05 22:37:33 +00:00
Mateusz Guzik	f6f6d24062	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thread's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.	2015-06-10 10:48:12 +00:00
John Baldwin	7077c42623	Add a new file operations hook for mmap operations. File type-specific logic is now placed in the mmap hook implementation rather than requiring it to be placed in sys/vm/vm_mmap.c. This hook allows new file types to support mmap() as well as potentially allowing mmap() for existing file types that do not currently support any mapping. The vm_mmap() function is now split up into two functions. A new vm_mmap_object() function handles the "back half" of vm_mmap() and accepts a referenced VM object to map rather than a (handle, handle_type) tuple. vm_mmap() is now reduced to converting a (handle, handle_type) tuple to a VM object and then calling vm_mmap_object() to handle the actual mapping. The vm_mmap() function remains for use by other parts of the kernel (e.g. device drivers and exec) but now only supports mapping vnodes, character devices, and anonymous memory. The mmap() system call invokes vm_mmap_object() directly with a NULL object for anonymous mappings. For mappings using a file descriptor, the descriptor's fo_mmap() hook is invoked instead. The fo_mmap() hook is responsible for performing type-specific checks and adjustments to arguments as well as possibly modifying mapping parameters such as flags or the object offset. The fo_mmap() hook routines then call vm_mmap_object() to handle the actual mapping. The fo_mmap() hook is optional. If it is not set, then fo_mmap() will fail with ENODEV. A fo_mmap() hook is implemented for regular files, character devices, and shared memory objects (created via shm_open()). While here, consistently use the VM_PROT_* constants for the vm_prot_t type for the 'prot' variable passed to vm_mmap() and vm_mmap_object() as well as the vm_mmap_vnode() and vm_mmap_cdev() helper routines. Previously some places were using the mmap()-specific PROT_* constants instead. While this happens to work because PROT_xx == VM_PROT_xx, using VM_PROT_* is more correct. 
Differential Revision: https://reviews.freebsd.org/D2658 Reviewed by: alc (glanced over), kib MFC after: 1 month Sponsored by: Chelsio	2015-06-04 19:41:15 +00:00
Konstantin Belousov	2db0e1f50d	Add V_MNTREF flag to the vn_start_write(9) and vn_start_secondary_write(9) functions. The flag indicates that the caller already owns a reference on the mount point, and the functions can consume it. The reference is released by vn_finished_write(9) and vn_finished_secondary_write(9) in due course. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:21:47 +00:00
Craig Rodrigues	d5fec48956	Support file verification in MAC. * Add VCREAT flag to indicate when a new file is being created * Add VVERIFY to indicate verification is required * Both VCREAT and VVERIFY are only passed on the MAC method vnode_check_open and are removed from the accmode after * Add O_VERIFY flag to rtld open of objects * Add 'v' flag to __sflags to set O_VERIFY flag. Submitted by: Steve Kiernan <stevek@juniper.net> Obtained from: Juniper Networks, Inc. GitHub Pull Request: https://github.com/freebsd/freebsd/pull/27 Relnotes: yes	2015-04-22 01:54:25 +00:00
Konstantin Belousov	8ee9765a9d	Add VN_OPEN_NAMECACHE flag for vn_open_cred(9), which requests that the created file name be cached. Use the flag for core dumps. Requested by: rpaulo Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-21 13:32:07 +00:00
Konstantin Belousov	6c21f6edb8	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache trashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitly knows about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-18 10:01:12 +00:00
Konstantin Belousov	0061ddb3ed	Only sleep interruptible while waiting for suspension end when filesystem specified VFCF_SBDRY flag, i.e. for NFS. There are two issues with the sleeps. First, applications may get unexpected EINTR from the disk i/o syscalls. Second, interruptible sleep allows the stop of the process, and since mount point is referenced while thread sleeps, unmount cannot free mount point structure's memory, blocking unmount indefinitely. Even for NFS, it is probably only reasonable to enable PCATCH for intr mounts, but this information is currently not available at VFS level. Reported and tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-13 16:07:01 +00:00


506 Commits