The code would unconditionally lock the vnode to audit or call the
mac hoook, even if neither want to do anything. Pre-check the state
to avoid locking in the common case of nothing to do.
Note this code should not be normally executed anyway as vnodes are
always return ready. However, poll1/2 from will-it-scale use regular
files for benchmarking, presumably to focus on the interface itself
as the vnode handler is not supposed to do almost anything.
This in particular fixes poll2 which passes 128 fds.
$ ./poll2_processes -s 10
before: 134411
after: 271572
For software like PostgreSQL and SQLite that sometimes reads sequentially
while also writing sequentially some distance behind with interleaved
syscalls on the same fd, performance is better on UFS if we do
sequential access heuristics separately for reads and writes.
Patch originally by Andrew Gierth in 2008, updated and proposed by me with
his permission.
Reviewed by: mjg, kib, tmunro
Approved by: mjg (mentor)
Obtained from: Andrew Gierth <andrew@tao11.riddles.org.uk>
Differential Revision: https://reviews.freebsd.org/D25024
This commit adds the priv(9) that waters down the sysctl to make it only
allow read(2) of a dirfd by the system root. Jailed root is not allowed, but
jail policy and superuser policy will abstain from allowing/denying it so
that a MAC module can fully control the policy.
Such a MAC module has been written, and can be found at:
https://people.freebsd.org/~kevans/mac_read_dir-0.1.0.tar.gz
It is expected that the MAC module won't be needed by many, as most only
need to do such diagnostics that require this behavior as system root
anyways. Interested parties are welcome to grab the MAC module above and
create a port or locally integrate it, and with enough support it could see
introduction to base. As noted in mac_read_dir.c, it is released under the
BSD 2 clause license and allows the restrictions to be lifted for only
jailed root or for all unprivileged users.
PR: 246412
Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous)
Reviewed by: rgrimes (latest version)
Differential Revision: https://reviews.freebsd.org/D24596
Historically, we've allowed read() of a directory and some filesystems will
accommodate (e.g. ufs/ffs, msdosfs). From the history department staffed by
Warner: <<EOF
pdp-7 unix seemed to allow reading directories, but they were weird, special
things there so I'm unsure (my pdp-7 assembler sucks).
1st Edition's sources are lost, mostly. The kernel allows it. The
reconstructed sources from 2nd or 3rd edition read it though.
V6 to V7 changed the filesystem format, and should have been a warning, but
reading directories weren't materially changed.
4.1b BSD introduced readdir because of UFS. UFS broke all directory reading
programs in 1983. ls, du, find, etc all had to be rewritten. readdir() and
friends were introduced here.
SysVr3 picked up readdir() in 1987 for the AT&T fork of Unix. SysVr4 updated
all the directory reading programs in 1988 because different filesystem
types were introduced.
In the 90s, these interfaces became completely ubiquitous as PDP-11s running
V7 faded from view and all the folks that initially started on V7 upgraded
to SysV. Linux never supported this (though I've not done the software
archeology to check) because it has always had a pathological diversity of
filesystems.
EOF
Disallowing read(2) on a directory has the side-effect of masking
application bugs from relying on other implementation's behavior
(e.g. Linux) of rejecting these with EISDIR across the board, but allowing
it has been a vector for at least one stack disclosure bug in the past[0].
By POSIX, this is implementation-defined whether read() handles directories
or not. Popular implementations have chosen to reject them, and this seems
sensible: the data you're reading from a directory is not structured in some
unified way across filesystem implementations like with readdir(2), so it is
impossible for applications to portably rely on this.
With this patch, we will reject most read(2) of a dirfd with EISDIR. Users
that know what they're doing can conscientiously set
bsd.security.allow_read_dir=1 to allow read(2) of directories, as it has
proven useful for debugging or recovery. A future commit will further limit
the sysctl to allow only the system root to read(2) directories, to make it
at least relatively safe to leave on for longer periods of time.
While we're adding logic pertaining to directory vnodes to vn_io_fault, an
additional assertion has also been added to ensure that we're not reaching
vn_io_fault with any write request on a directory vnode. Such request would
be a logical error in the kernel, and must be debugged rather than allowing
it to potentially silently error out.
Commented out shell aliases have been placed in root's chsrc/shrc to promote
awareness that grep may become noisy after this change, depending on your
usage.
A tentative MFC plan has been put together to try and make it as trivial as
possible to identify issues and collect reports; note that this will be
strongly re-evaluated. Tentatively, I will MFC this knob with the default as
it is in HEAD to improve our odds of actually getting reports. The future
priv(9) to further restrict the sysctl WILL NOT BE MERGED BACK, so the knob
will be a faithful reversion on stable/12. We will go into the merge
acknowledging that the sysctl default may be flipped back to restore
historical behavior at *any* point if it's warranted.
[0] https://www.freebsd.org/security/advisories/FreeBSD-SA-19:10.ufs.asc
PR: 246412
Reviewed by: mckusick, kib, emaste, jilles, cy, phk, imp (all previous)
Reviewed by: rgrimes (latest version)
MFC after: 1 month (note the MFC plan mentioned above)
Relnotes: absolutely, but will amend previous RELNOTES entry
Differential Revision: https://reviews.freebsd.org/D24596
They depend on it to accurately read the offset.
The new code is not used as it would add an interrupt enable/disable
trip on top of the atomic.
This also fixes a bug where 32-bit nolock request would still lock the offset.
No changes for 64-bit.
Reported by: emaste
Contending cases still serialize on sleepq (which would be taken anyway).
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21626
At the time opt-in was introduced adding yourself as a writer was esrializing
across the mount point. Nowadays it is fully per-cpu, the only impact being
a small single-threaded hit on top of what's there right now.
Vast majority of the overhead stems from the call to VOP_GETWRITEMOUNT which
has is done regardless.
Should someone want to microoptimize this single-threaded they can coalesce
looking the mount up with adding a write to it.
During buildkernel there are very frequent calls to priv_check and they
all are for PRIV_VFS_GENERATION (coming from stat/fstat).
This results in branching on several potential privileges checking if
perhaps that's the one which has to be evaluated.
Instead of the kitchen-sink approach provide a way to have commonly used
privs directly evaluated.
This opens the door for other descriptor types to implement
posix_fallocate(2) as needed.
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D23042
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
vn_open_cred() assumes that it is called from the top-level of a VFS
syscall. Writers must call bwillwrite() before locking any VFS
resource to wait for cleanup of dirty buffers.
ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which
results in wait for unrelated buffers while owning ZFS vnode lock (and
ZFS does not use buffer cache). VN_OPEN_INVFS allows caller to skip
bwillwrite.
Note that ZFS is still incorrect there, because it starts write on an
mp and locks a vnode while holding another vnode lock.
Reported by: Willem Jan Withagen <wjw@digiware.nl>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The first of these semantic changes was changed to be compatible with
linux5.n by r354564.
For the second semantic change, the old linux man page stated that, if
infd and outfd referred to the same file, EBADF should be returned.
Now, the semantics is to allow infd and outfd to refer to the same file
so long as the byte ranges defined by the input file offset, output file offset
and len does not overlap. If the byte ranges do overlap, EINVAL should be
returned.
This patch modifies copy_file_range(2) to be linux5.n compatible for this
semantic change.
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The old linux man page stated that, if the
offset + len exceeded file_size for the input file, EINVAL should be returned.
Now, the semantics is to copy up to at most file_size bytes and return that
number of bytes copied. If the offset is at or beyond file_size, a return
of 0 bytes is done.
This patch modifies copy_file_range(2) to be linux compatible for this
semantic change.
A separate patch will change copy_file_range(2) for the other semantic
change, which allows the infd and outfd to refer to the same file, so
long as the byte ranges do not overlap.
The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.
Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.
In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.
This doesn't affect Tier-1 architectures so no SA is expected.
admbugs: 651
MFC after: 1 week
Sponsored by: DARPA, AFRL
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.
Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before: 852393 ops/s
after: 76682077 ops/s
Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21637
Currently writemapping accounting is only done for vnode_pager which does
some accounting on the underlying vnode.
Extend this to allow accounting to be possible for any of the pager types.
New pageops are added to update/release writecount that need to be
implemented for any pager wishing to do said accounting, and we implement
these methods now for both vnode_pager (unchanged) and swap_pager.
The primary motivation for this is to allow other systems with OBJT_SWAP
objects to check if their objects have any write mappings and reject
operations with EBUSY if so. posixshm will be the first to do so in order to
reject adding write seals to the shmfd if any writable mappings exist.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21456
The function' interface assumes that the lower vnode is passed and
returned locked always.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Without this patch, when an application performed lseek(SEEK_DATA/SEEK_HOLE)
on a file in a file system that does not have its own VOP_IOCTL(), the
lseek(2) fails with errno ENOTTY. This didn't seem appropriate, since
ENOTTY is not listed as an error return by either the lseek(2) man page
nor the POSIX draft for lseek(2).
This was discussed on freebsd-current@ here:
http://docs.FreeBSD.org/cgi/mid.cgi?CAOtMX2iiQdv1+15e1N_r7V6aCx_VqAJCTP1AW+qs3Yg7sPg9wA
This trivial patch maps ENOTTY to EINVAL for lseek(SEEK_DATA/SEEK_HOLE).
Reviewed by: markj
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D21300
When the byte range for copy_file_range(2) doesn't go to EOF on the
output file and there is a hole in the input file, a hole must be
"punched" in the output file. This is done by writing a block of bytes
all set to 0.
Without this patch, the write is done unconditionally which means that,
if the output file already has a hole in that byte range, a unneeded data block
of all 0 bytes would be allocated.
This patch adds code to check for a hole in the output file, so that it can
skip doing the write if there is already a hole in that byte range of
the output file. This avoids unnecessary allocation of blocks to the
output file.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D21155
An earlier version of the patch had code that set "error" between
line#s 2797-2799. When that code was moved, the second check for "error != 0"
could never be true and the check became harmless cruft.
This patch removes the cruft, mainly to make Coverity happy.
Reported by: asomers, cem
Since the VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE) calls are done with the
vnode unlocked, it is possible for another thread to do:
- truncate(), lseek(), write()
between the two calls and create a hole where FIOSEEKDATA returned the
start of data.
For this case, VOP_IOCTL(FIOSEEKHOLE) will return the same offset for
the hole location. This could result in an infinite loop in the copy
code, since copylen is set to 0 and the copy doesn't advance.
Usually, this race is avoided because of the use of rangelocks, but the
NFS server does not do range locking and could do a sequence like the
above to create the hole.
This patch checks for this case and makes the hole search fail, to avoid
the infinite loop.
At this time, it is an open question as to whether or not the NFS server
should do range locking to avoid this race.
This patch adds support to the kernel for a Linux compatible
copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9).
This syscall/VOP can be used by the NFSv4.2 client to implement the
Copy operation against an NFSv4.2 server to do file copies locally on
the server.
The vn_generic_copy_file_range() function in this patch can be used
by the NFSv4.2 server to implement the Copy operation.
Fuse may also me able to use the VOP_COPY_FILE_RANGE() method.
vn_generic_copy_file_range() attempts to maintain holes in the output
file in the range to be copied, but may fail to do so if the input and
output files are on different file systems with different _PC_MIN_HOLE_SIZE
values.
Separate commits will be done for the generated syscall files and userland
changes. A commit for a compat32 syscall will be done later.
Reviewed by: kib, asomers (plus comments by brooks, jilles)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20584
I accidentally broke the main point of r349248 when making stylistic changes
in r349391. Restore the original behavior, and also fix an additional
overflow that was possible when uio->uio_resid was nearly SSIZE_MAX.
Reported by: cem
Reviewed by: bde
MFC after: 2 weeks
MFC-With: 349248
Sponsored by: The FreeBSD Foundation
This patch factors the code in vn_truncate() that does the actual
VOP_SETATTR() of size into a separate function called vn_truncate_locked().
This will allow the NFS server and the patch that adds a
copy_file_range(2) syscall to call this function instead of duplicating
the code and carrying over changes, such as the recent r347151.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D20808
VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.
Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936. Formalize that by using a union type.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20710
This ioctl exposes VOP_BMAP information to userland. It can be used by
programs like fragmentation analyzers and optimized cp implementations. But
I'm using it to test fusefs's VOP_BMAP implementation. The "2" in the name
distinguishes it from the similar but incompatible FIBMAP ioctls in NetBSD
and Linux. FIOBMAP2 differs from FIBMAP in that it uses a 64-bit block
number instead of 32-bit, and it also returns runp and runb.
Reviewed by: mckusick
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20705
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
Provide a convenience function to avoid the hack with filling fake
struct vop_fsync_args and then calling vop_stdfsync().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Otherwise we might miss the last iteration where EOF appears below
unaligned noff.
Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19811
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory. Supposedly the interface is similar
to the same proposed Linux flags.
Reviewed by: jilles (code, previous version of manpages), 0mp (manpages)
Discussed with: allanjude, emaste, jonathan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17547
Most important for the future use, do not call
vm_fault_quick_hold_pages() with disabled pagefaults.
Reported and tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.