On dotdot lookup and fhtovp operations, it is possible for the file
represented by tmpfs node to be removed after the thread calculated
the pointer. In this case, tmpfs_alloc_vp() accesses freed memory.
Introduce the reference count on the nodes. The allnodes list from
tmpfs mount owns 1 reference, and threads performing unlocked
operations on the node, add one transient reference. Similarly, since
struct tmpfs_mount maintains the list where nodes are enlisted,
refcount it by one reference from struct mount and one reference from
each node on the list. Both nodes and tmpfs_mounts are removed when
refcount goes to zero.
Note that this means that nodes and tmpfs_mounts might survive some
time after the node is deleted or tmpfs_unmount() finished. The
tmpfs_alloc_vp() in these cases returns error either due to node
removal (tn_nlinks == 0) or because of insmntque1(9) error.
Tested by: pho (as part of larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Remove TMPFS_ASSERT_ELOCKED(). Its claims are already stated by other
asserts nearby and by VFS guarantees.
Change TMPFS_ASSERT_LOCKED() and one inlined place to use
ASSERT_VOP_(E)LOCKED() instead of hand-rolled imprecise asserts.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Edit comments which explain no longer relevant details, and add
locking annotations to the struct tmpfs_node members.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
If some process' nodes were accessed using procfs and the process
cannot exit properly at the time modunload event is reported to the
pseudofs-backed filesystem, the assertion in pfs_vncache_unload() is
triggered. Assertion is correct, the cache should be cleaned.
Approved by: des (pseudofs maintainer)
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Inums in cd9660 refer to byte offsets on the media. DVD and BD media
can have entries above 4GB, especially with multi-session images.
PR: 190655
Reported by: Thomas Schmitt <scdbackup at gmx.net>
If tmpfs vnode is only shared locked, tn_status field still needs
updates to note the access time modification. Use the same locking
scheme as for UFS, protect tn_status with the node interlock + shared
vnode lock.
Fix nearby style.
Noted and reviewed by: mjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Right now size of the structure is 472 bytes on amd64, which is
already large and stack allocations are indesirable. With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.
Extracted from: ino64 work by gleb
Discussed with: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
For most NFSv4.1 servers, a NFS4ERR_BAD_SESSION error is a rare failure
that indicates that the server has lost session/open/lock state.
However, recent testing by cperciva@ against the AmazonEFS server found
several problems with client recovery from this due to it generating this
failure frequently.
Briefly, the problems fixed are:
- If all session slots were in use at the time of the failure, some processes
would continue to loop waiting for a slot on the old session forever.
- If an RPC that doesn't use open/lock state failed with NFS4ERR_BAD_SESSION,
it would fail the RPC/syscall instead of initiating recovery and then
looping to retry the RPC.
- If a successful reply to an RPC for an old session wasn't processed
until after a new session was created for a NFS4ERR_BAD_SESSION error,
it would erroneously update the new session and corrupt it.
- The use of the first element of the session list in the nfs mount
structure (which is always the current metadata session) was slightly
racey. With changes for the above problems it became more racey, so all
uses of this head pointer was wrapped with a NFSLOCKMNT()/NFSUNLOCKMNT().
- Although the kernel malloc() usually allocates more bytes than requested
and, as such, this wouldn't have caused problems, the allocation of a
session structure was 1 byte smaller than it should have been.
(Null termination byte for the string not included in byte count.)
There are probably still problems with a pNFS data server that fails
with NFS4ERR_BAD_SESSION, but I have no server that does this to test
against (the AmazonEFS server doesn't do pNFS), so I can't fix these yet.
Although this patch is fairly large, it should only affect the handling
of NFS4ERR_BAD_SESSION error replies from an NFSv4.1 server.
Thanks go to cperciva@ for the extension testing he did to help isolate/fix
these problems.
Reported by: cperciva
Tested by: cperciva
MFC after: 3 months
Differential Revision: https://reviews.freebsd.org/D8745
truncation, immediately queue the page for asynchronous laundering rather
than making the page pass through inactive queue first.
Reviewed by: kib, markj
The NFSv4.1 server failed to update the nfs-stablerestart file for
a client when the client was issued its first Open. As such, recovery
of Opens after a server reboot failed with NFSERR_NOGRACE.
This patch fixes this.
It also changes the code so that it malloc()'s the 1024 byte array
instead of allocating it on the kernel stack for both NFSv4.0 and NFSv4.1.
Note that this bug only affected NFSv4.1 and only when clients attempted
to reclaim Opens after a server reboot.
MFC after: 2 weeks
the vnode is inactivated. This contradicts with the nullfs caching
which keeps upper vnode around, as consequence keeping the use
reference to lower vnode.
Add a filesystem flag to request nullfs to not cache when mounted over
that filesystem, and set the flag for nfs v4 mounts.
Reported by: asomers
Reviewed by: rmacklem
Tested by: asomers, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The "-z" option on nfsstats was erroneously zeroing out the counts
of NFSv4 state structures. These counts will normally go back down
to zero as state is released. When zeroed out by "-z", these counts
can go negative. This patch fixes this problem.
MFC after: 2 weeks
The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.
Bump __FreeBSD_version to mark this change.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583
See r294954 for the bread(9) change and r297401 for similar cd9660 fix.
Reported and tested by: Joshua Kinard <kumba@gentoo.org>
PR: 214705
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The pager, due to its construction, implements clustering for the
page-ins. In particular, buildworld load demonstrates reduction of
the READ RPCs from 39k down to 24k. No change in real or CPU time was
observed.
Discussed with, and measured by: bde
No objections from: rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Rather than printing a warning for every time we receive a fileid > 2^32
from the NFS server, count warnings and print at most one of each warning
type per minute, e.g.,
Nov 15 05:17:34 ip-172-30-1-221 kernel: NFSv4 fileid > 32bits (24730 occurrences)
Nov 15 05:17:56 ip-172-30-1-221 kernel: NFSv4 mounted on fileid > 32bits (178 occurrences)
Nov 15 05:18:53 ip-172-30-1-221 kernel: NFSv4 fileid > 32bits (7582 occurrences)
Nov 15 05:18:58 ip-172-30-1-221 kernel: NFSv4 mounted on fileid > 32bits (23 occurrences)
A buildworld with an NFS mounted /usr/obj can otherwise result in
hundreds of thousands of lines being printed, which seems unnecessarily
verbose.
When ino_t becomes a 64-bit type, these printfs will no longer be needed
(and the problems associated with truncating 64-bit fileids to generate
32-bit inode numbers will also go away).
Reviewed by: rmacklem
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8523
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal. Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.
Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.
Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.
Idea by: rwatson
Discussed with: emaste, jonathan, rwatson
Reviewed by: mjg (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
Differential revision: https://reviews.freebsd.org/D8110
volume limits. In particular:
- Assert that usemap_alloc() and usemap_free() cluster number argument
is valid.
- In chainlength(), return 0 if cluster start is after the max cluster.
- In chainlength(), cut the calculated cluster chain length at the max
cluster.
- For true paranoia, after the pm_inusemap is calculated in
fillinusemap(), reset all bits in the array for clusters after the
max cluster, as in-use.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
delegations enabled and the Linux NFSv4.1 client was reported in
reviews.freebsd.org/D7891.
I believe that the FreeBSD server behaviour conforms to the RFC and that
the Linux client has a bug. Therefore, I do not think the proposed patch
is appropriate. When nfsrv_writedelegifpos is non-zero, the FreeBSD
server will issue a write delegation for a read open if possible.
The Linux client then erroneously assumes that the credentials used for
the read open can write the file.
This patch reverses the default value for nfsrv_writedelegifpos to 0 so
that the default behaviour is Linux compatible and adds a sysctl that can
be used to set nfsrv_writedelegifpos.
This change should only affect users that are mounting a FreeBSD server
with delegations enabled (they are not enabled by default) with a Linux
NFSv4.1 client mount.
Reported by: fatih.acar@gandi.net
Tested by: fatih.acar@gandi.net
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D7891
The old behavior depended on the FAT version and on what files were in the
root directory. "mount_msdosfs -o shortnames" is still supported.
Reviewed by: wblock, cem
Discussed with: trasz, adrian, imp
MFC after: 4 weeks
X-MFC-Notes: Don't MFC the removal of findwin95
Differential Revision: https://reviews.freebsd.org/D8018
The lower vnode is already referenced and nodeget is supposed to consume
the reference. Thus the extra vref call was causing a leak.
Reported by: pho
Reviewed by: kib
MFC after: 1 week
The previous code was forcing an expensive walk in vop_stdvptocnp,
which was causing performance issues on highly contended zfs.
No objections: kib
MFC after: 2 weeks
Standard VOP_FSYNC() implementation just syncs data buffers, and due
to this, is the correct and efficient implementation for msdosfs or
any other filesystem which uses bufer cache trivially. Provide
globally visible wrapper vop_stdfdatasync_buf() for future consumption
by other filesystems.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7471
the patch in D1626 plus changes so that it includes counts for
NFSv4.1 (and the draft of NFSv4.2).
Also, make all the counts uint64_t and add a vers field at the
beginning, so that future revisions can easily be implemented.
There is code in place to handle the old vesion of the nfsstats
structure for backwards binary compatibility.
Subsequent commits will update nfsstat(8) to use the new fields.
Submitted by: will (earlier version)
Reviewed by: ken
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D1626
The offset of the directory file, passed to getdirentries(2) syscall,
is user-controllable. The value of the offset must not be asserted,
instead the invalid value should be checked and rejected if invalid.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
These are currently unused in our implementation and some even appear to
have not been implemented yet on linux but it is good to keep them for
reference.
Obtained from: NetBSD (CVS Rev. 1.41)
MFC after: 1 month
Owning Giant in the init/uninit is accidental due to the moment where
VFS modules initialization is performed, and is not enforced by the
VFS interface. The Giant lock does not prevent a parallel execution
of the code, it is VFS which implements the proper protocol.
Approved by: des (pseudofs maintainer)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI. The variables were renamed to avoid shadowing issues with local
variables of the same name.
Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure. As a
preparation, this commit only introduces the interface.
Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance. Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.
Tested by: pho (as part of the bigger patch)
Reviewed by: jhb (same)
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7302
Devfs' file layer ioctl is now just a thin shim around the vnode layer.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7286
framework allowing to set the suspension policy for the dynamic block.
Extend the currently possible policies of stopping on interruptible
sleeps and ignoring such sleeps by two more: do not suspend at
interruptible sleeps, but interrupt them with either EINTR or ERESTART.
Reviewed by: jilles
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
executed by inactive methods, must be repeated on reclaim. In
particular, unlink and free sillyrenamed vnode both on inactivation
and reclaim.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
If strlen(hostp) was zero, the stack array 'nam' would never be initialized
before being strdup()ed. Fix this by initializing it to the empty string.
It's possible some external condition makes this case impossible, in which
case, an assertion instead of this workaround is appropriate.
Introduced in r299848.
Reported by: Coverity
CID: 1355336
Sponsored by: EMC / Isilon Storage Division
Ext2/3/4 manages generation numbers differently than UFS so adopt
some rules that should work well. When allocating a new inode,
make sure we generate a "good" random value specifically avoiding
zero.
Don't interfere with the numbers that are already generated in
the filesystem: ext2fs doesn't have the backwards compatibility
issues where there were no generation numbers.
Reviewed by: kevlo
MFC after: 1 week
on a fuse mounted file system, it will crash. Although it may be
possible to make this work correctly, this patch avoids the crash
in the meantime.
I removed the MPASS(), since panicing for the FIFO case didn't make
a lot of sense when it returns an error for the others.
PR: 195000
Submitted by: henry.hu.sh@gmail.com (earlier version)
MFC after: 2 weeks
When "cp" of a file with read-only (mode 0444) to a fuse mounted
file system was attempted it would fail with EACCES. This was because
fuse would attempt to open the file WRONLY and the open would fail.
This patch changes the fuse_vnop_open() to test for an extant read-write
open and use that, if it is available.
This makes the "cp" of a read-only file to the fuse mounted file system
work ok.
There are simpler ways to fix this than adding the fuse_filehandle_validrw()
function, but this function is useful for future patches related to
exporting a fuse filesystem via NFS.
MFC after: 2 weeks
eg an NFSv4 root over WiFi: boot from md_root (small rootfs image
preloaded by loader(8)), setup WiFi, and then reroot into the actual
root, over NFS.
Note that it's currently limited to NFSv4, and due to problems with
nfsuserd(8) it requres a workaround on the server side: one needs
to set the vfs.nfsd.enable_stringtouid=1 sysctl and not run nfsuserd(8)
on either the server or the client side.
Reviewed by: rmacklem@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6347
When I/O on a file under fuse is switched from buffered to DIRECT_IO,
it was possible to read stale (before a recent modification) data from
the buffer cache. This patch invalidates the buffer cache for the
file to fix this.
PR: 194293
MFC after: 2 weeks
When a file is opened write-only and a partial block was written,
buffered I/O would try and read the whole block in. This would
result in a hung thread, since there was no open (fuse filehandle)
that allowed reading. This patch avoids the problem by forcing
DIRECT_IO for this case.
It also sets DIRECT_IO when the file system specifies the FN_DIRECTIO
flag in its reply to the open.
Tested by: nishida@asusa.net, freebsd@moosefs.com
PR: 194293, 206238
MFC after: 2 weeks
Trivial use-after-free where stp was freed too soon in the non-error path.
To fix, simply move its release to the end of the routine.
Reported by: Coverity
CID: 1006105
Sponsored by: EMC / Isilon Storage Division
consequence, the nfs client override of VOP_LOCK1() is no longer
needed.
Reviewed and tested by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When support for NFSv4.1 was added to the NFS server, it broke
the server rpc count stats, since newnfsstats.srvrpccnt[] doesn't
have entries for the new NFSv4.1 operations.
Without this patch, the code was incrementing bogus entries in
newnfsstats for the new NFSv4.1 operations.
This patch is an interim fix. The nfsstats structure needs to be
updated and that will come in a future commit.
Reported by: cem
MFC after: 2 weeks
It was reported via email that under certain heavy RPC loads
long delays before the exports would be updated was observed
when using "mountd -S". This patch reverses the priority between
the exclusive lock request to suspend the nfsd threads and the
shared lock request for performing RPCs.
As such, when mountd attempts to suspend the nfsd threads, it
gets priority over outstanding RPC requests to do this.
I suspect that the case reported was an artificial test load,
but this patch did fix the problem for the reporter.
Reported and Tested by: josephlai@qnap.com
MFC after: 2 weeks
This is only allowed by root and only used by the nfs daemon, which
should not provide an incorrect value. However, it's still good
practice to validate data provided by userland.
PR: 206626
Reported by: CTurt <cturt@hardenedbsd.org>
Reviewed by: rmacklem
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D6201
In win2unixfn() we expand Windows 95 style long names. In some cases that
requires moving the data in the nbp->nb_buf buffer backwards to make room. That
code failed to check for overflows, leading to a stack overflow in win2unixfn().
We now check for this event, and mark the entire conversion as failed in that
case. This means we present the 8 character, dos style, name instead.
PR: 204643
Differential Revision: https://reviews.freebsd.org/D6015
It was reported via email that a Linux client couldn't do a Kerberized
NFS mount when only "sec=krb5" was specified for the exports. The Linux
client attempted a mount via krb5i and the server replied NFSERR_SERVERFAULT.
Although NFSERR_WRONGSEC isn't listed as an error for SetClientID, I
think it is the correct reply, so this patch enables that.
I do not know if this fixes the mount attempt, but adding "krb5i" to the
list of allowed security flavours does allow the mount to work.
Reported by: joef@spectralogic.com
MFC after: 2 weeks
h_levels_num, as most data structs in ext2fs, is unsigned so
the index that addresses it has to be unsigned as well.
To get to overflow here we would probably be considering a
degenerate case though.
MFC after: 5 days
The ordering of acquisition of the state and session mutexes was
reversed in two cases executed when an NFSv4.1 client created/freed
a session. Since clients will typically do this only when mounting
and dismounting, the likelyhood of causing a deadlock was low but possible.
This can only occur for NFSv4.1 mounts, since the others do not
use sessions.
This was detected while testing the pNFS server/client where the
client crashed during dismounting.
The patch also reorders the unlocks, although that isn't necessary
for correct operation.
MFC after: 2 weeks
rounddown2 tends to produce longer lines than the original code
and when the code has a high indentation level it was not really
advantageous to do the replacement.
This tries to strike a balance between readability using the macros
and flexibility of having the expressions, so not everything is
converted.
the NFS server would leave the newly created vnode locked. This could
result in a file system that would not unmount and processes wedged,
waiting for the file to be unlocked.
Since this VOP_SETATTR() never fails for most file systems, this bug
doesn't normally manifest itself. I found it during testing of an
exported GlusterFS file system, which can fail.
This patch adds the vput() and changes the error to the correct NFS one.
MFC after: 2 weeks
the old and new NFS clients. He did a good job of isolating the problem
which was caused by the new NFS client not setting the post write mtime
correctly. The new NFS client code was cloned from the old client, but
was incorrect, because the mtime in the nfs vnode's cache wasn't yet
updated. This patch fixes this problem. The patch also adds missing mutex
locking.
Reported and tested by: bde
MFC after: 2 weeks
for limiting disk (actually filesystem) IO.
Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.
Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080
the functions free the buffer and set the pointer to NULL. Also
remove useless call to brelse(9) on the error path.
PR: 208275
Submitted by: Fabian Keil <fk@fabiankeil.de>
MFC after: 2 weeks
strlens is somewhat suboptimal, but it's a temporary measure that will
be replaced with red-black trees later on.
PR: 204417
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5266
192.168.1.1, with share "share". This commit fixes a problem
where "mkdir /net/192.168.1.1/share/meh" would return spurious
error instead of creating the directory if the target filesystem
wasn't mounted yet; subsequent attempts would work correctly.
The failure scenario is kind of complicated to explain, but it all
boils down to calling VOP_MKDIR() for the target filesystem (NFS)
with wrong dvp - the autofs vnode instead of the filesystem root
mounted over it.
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5442
pfs_visible(). The recursion does not cause deadlock because the sx
implementation does not prefer exclusive waiters over the shared, but
this is an implementation detail.
Reported by: pho, Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: jhb
Tested by: pho
Approved by: des (pseudofs maintainer)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
filesystem to the nullfs mount.
MNTK_NO_IOPF must be present on the nullfs struct mount so that struct
file fo_read and fo_write fops operate in the mode requested by the
lower mount.
MNTK_UNMAPPED_BUFS allows VOP_GETPAGES() to use unmapped buffers. It
does not matter for VOP_GETPAGES() calls from vm_fault() since handle
of the vm_object always points to the lower vnode. But it may be
useful for other situations where VOP_GETPAGES() is used.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This adopts the same change as r291936 for UFS.
Directly clear IN_ACCESS or IN_UPDATE when user supplied the time, and
copy the value into the inode.
This keeps the behaviour cleaner and is consistent with UFS.
Reviewed by: bde
MFC after: 1 month (only 10)
unlinked. Otherwise the vnode stays cached, causing leak. This is
similar to r292961 for regular files.
Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Sync with r84642 from UFS:
The panics are inappropriate because the IN_RENAME flag only fixes a
few of the huge number of race conditions that can result in the
source path becoming invalid even prior to the VOP_RENAME() call.
Found accidentally while checking an issue from PVS Static Analysis.
MFC after: 3 days
Cleanup some checks for NULL. Most of these were always unnecessary and
starting with r294954 brelse() doesn't need any NULL checks at all.
For now keep the checks somewhat consistent with NetBSD in case we want to
merge the cleanups to older versions.
It is otherwise left dangling, and callers that request cookies always free
the cookie buffer, even when VOP_READDIR(9) returns an error. This results
in a double free if tmpfs_readdir() returns an error to the NFS server or
the Linux getdents(2) emulation code.
Reported by: pho
MFC after: 1 week
Security: double free of malloc(9)-backed memory
Sponsored by: EMC / Isilon Storage Division
This is ongoing work from Damjan Jovanovic to improve ext4 read support
with sparse files:
Keep track of the first and last block in each extent as it descends down
the extent tree, thus being able to work out that some blocks are sparse
earlier. This solves an issue on r293680.
In ext4_bmapext() start supporting the runb parameter, which appears to be
the number of adjacent blocks prior to the block being converted in the
same way that runp is the number of blocks after, speding up random access
to mmaped files.
PR: 206652
CID 1018688 is a false positive.
The initialization is done by calling vn_start_write(... &mp, flags).
mp is only an output parameter unless (flags & V_MNTREF), and fdesc
doesn't put V_MNTREF in flags.
Pointed out by: bde
ext2fs: passthrough any extra timestamps to the dinode struct.
While it passed the classic testing, the change appears to have
caused some regression and still requires some more precautions.
PR: 206820
MFC after: 3 days
In general we don't trust any of the extended timestamps unless the
EXT2F_ROCOMPAT_EXTRA_ISIZE feature is set. However, in the case where
we freshly allocated a new inode the information is valid and it is
better to pass it along instead of leaving the value undefined.
This should have no practical effect but should reduce the amount of
garbage if EXT2F_ROCOMPAT_EXTRA_ISIZE is set, like in cases where the
filesystem is converted from ext3 to ext4.
MFC after: 4 days
Directory index was introduced in ext3. We don't always use the
prefix to denote the ext2 variant they belong to but when we
do we should try to be accurate.
We use i_flag to carry some flags like IN_E4INDEX which newer
ext2fs variants uses internally.
fsck.ext3 rightfully complains after our implementation tags
non-directory inodes with INDEX_FL.
Initializing i_flag during allocation removes the noise factor
and quiets down fsck.
Patch from: Damjan Jovanovic
PR: 206530
The htree dir_index is perhaps one of the most characteristic
features of the linux ext3 implementation. It was removed
in r281670, due to repeated bug reports.
Damjan Jovanic detected and fixed three bugs and did some
stress testing by building Apache OpenOffice on top of it
so it is now in good shape to bring back.
Differential Revision: https://reviews.freebsd.org/D5007
Submitted by: Damjan Jovanovic
Reviewed by: pfg
Tested by: pho
Relnotes: Yes
MFC after: 2 months (only 10.x)
Add locking around access to bv_cnt which is currently being done unlocked
PR: 206224
Reviewed by: imp
Approved by: jhb
MFC after: 1 week
Sponsored by: Panasas, Inc.
Differential Revision: https://reviews.freebsd.org/D4931
* Use standard IPv6 SAS instead of rt->rt_ifa address.
* Make address lookup work for IPv6 LLA.
* Save address into buffer provided by caller instead of using static vars.
Discussed with: rmacklem
Add support for sparse files in ext4. Also implement read-ahead, which
greatly increases the performance when transferring files from ext4.
Both features implemented by Damjan Jovanovic.
PR: 205816
MFC after: 1 week
from int to int64.
MSDN says that SMB_SET_FILE_END_OF_FILE_INFO uses signed 64-bit integer
to specify offset, but since smbfs_smb_setfsize() has used plain int,
a value was truncated in case when offset was larger than 2G.
https://msdn.microsoft.com/en-us/library/ff469975.aspx
In particular, now `truncate -s 10G` will work correctly on the mounted
SMB share.
Reported and tested by: Eugene Grosbein <eugen at grosbein dot net>
MFC after: 1 week
unmount of devfs mounts, by restarting the failed syscall.
When restarted, failing syscalls eventually either stop finding the
node and returning ENOENT, or the vnode op vectors finally transition
to the deadfs vop. The later return EIO or other error, more
appropriate for the operation.
Submitted by: bde
Tested by: pho
MFC after: 3 weeks
lower vnode. Otherwise, reference to the lower vnode from the upper
one prevents final unlink.
PR: 178238
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
close and close due to revoke(2)-like operation.
A new FLASTCLOSE flag indicates that this is last close. FREVOKE is
set for revokes, and FNONBLOCK is also set, same as is already done
for VOP_CLOSE() call from vgonel().
The flags reuse user open(2) flags which are never stored in f_flag,
to not consume bit space in the ABI visible way. Assert this with the
static check.
Requested and reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
With the new VOP_GETPAGES() KPI the "count" argument counts pages already,
and doesn't need to be translated from bytes to pages.
While here make it consistent that *rbehind and *rahead are updated only
if we doesn't return error.
Pointy hat to: glebius
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
like the various d_*_t typedefs since it declared a function pointer rather
than a function. Add a new d_priv_dtor_t typedef that declares the function
and can be used as a function prototype. The previous typedef wasn't
useful outside of the cdevpriv implementation, so retire it.
The name d_priv_dtor_t was chosen to be more consistent with cdev methods
since it is commonly used in place of d_close_t even though it is not a
direct pointer in struct cdevsw.
Reviewed by: kib, imp
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D4340
This leak was introduced by r291527.
Since the nfscommon.ko module is rarely unloaded, this leak would not
have been much of an issue.
MFC after: 2 weeks
option that will be added to the nfsuserd daemon in a future
commit. It modifies the cache used by NFSv4 for name<-->id
translation (both username/uid and group/gid) to support this.
When "-manage-gids" is set, the server looks up each uid
for the RPC and uses the list of groups cached in the server
instead of the list of groups provided in the RPC request.
The cached group list is acquired for the cache by the nfsuserd
daemon via getgrouplist(3).
This avoids the 16 groups limit for the list in the RPC request.
Since the cache is now used for every RPC when "-manage-gids"
is enabled, the code also modifies the cache to use a separate
mutex for each hash list instead of a single global mutex.
Suggested by: jpaetzel
Tested by: jpaetzel
MFC after: 2 weeks
the name of a filesystem when setting it as the first parameter to the
getnewvnode() function. Most filesystems call getnewvnode from just one
place so can use a literal string as the first parameter. However, NFS
calls getnewvnode from two places, so we create a global constant string
that can be used by the two instances. This change also collapses two
instances of getnewvnode() in the UFS filesystem to a single call.
Reviewed by: kib
Tested by: Peter Holm
(opens, locks, etc) is retained, which I believe is correct behaviour.
However, for NFSv4.1, the server also retained a reference to the xprt
(RPC transport socket structure) for the backchannel. This caused
svcpool_destroy() to not call SVC_DESTROY() for the xprt and allowed
a socket upcall to occur after the mutexes in the svcpool were destroyed,
causing a crash.
This patch fixes the code so that the backchannel xprt structure is
dereferenced just before svcpool_destroy() is called, so the code
does do an SVC_DESTROY() on the xprt, which shuts down the socket upcall.
Tested by: g_amanakis@yahoo.com
PR: 204340
MFC after: 2 weeks
At this time I cannot see a way to fix directory caching when it
has partial blocks in the buffer cache, due to the fact that the
syscall's uio_offset won't stay the same as the lblkno * NFS_DIRBLKSIZ
offset.
Reported by: bde
MFC after: 2 weeks
the largest size of buffer cache block or the mapping of the buffer
is bogus. When a mount with rsize=4096,wsize=4096 was done, f_iosize
would be set to 4096. This resulted in corrupted directory data, since
the buffer cache block size for directories is NFS_DIRBLKSIZ (8192).
This patch fixes the code so that it always sets f_iosize to at least
NFS_DIRBLKSIZ.
Tested by: krichy@cflinux.hu
PR: 177971
MFC after: 2 weeks
is non-zero.
- Include the process address in the PROC_ASSERT_HELD() and
PROC_ASSERT_NOT_HELD() assertion messages so that the corresponding
process can be found easily when debugging.
MFC after: 1 week
file descriptor opened for complimentary access exists as well.
The implementation of the guarantee is done by counting the
generations of readers and writers opens. We return success and not
EINTR or ERESTART error, when the sleep for complimentary opening is
interrupted, but the generation was changed during the sleep.
Longer explanation: assume there are two threads, A doing open("fifo",
O_RDONLY) and B doing open("fifo", O_WRONLY), and no other threads
either trying to open the fifo, nor there are any file descriptors
referencing the fifo. Before the change, it was possible e.g. for for
thread A to return a valid file descriptor, while thread B returned
EINTR if a signal to B was delivered simultaneously with the wakeup
from A. After the change, in this situation both A::open() and
B::open() succeed and the signal is made "as if" it was noticed
slightly later. Note that the signal actual delivery is not changed,
it is done by ast on syscall return path, so signal handler is still
executed before first instruction after syscall.
See PR for the code demonstrating the issue.
PR: 203162
Reported by: Victor Stinner victor.stinner@gmail.com
Reviewed by: jilles
Tested by: bapt, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
sign on every directory exported via NFSv4 with NFSv4 ACLs enabled.
Reviewed by: rmacklem@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3502
that already has a confirmed ClientID, the nfsrv_setclient() function would
not fill in the clientidp being returned. As such, the value of ClientID
returned would be whatever garbage was on the stack.
An NFSv4.1 client would not normally do this, but it appears that it can
happen for certain Linux clients. When it happens, the client persistently
retries the ExchangeID and Create_session after Create_session fails when
it uses the bogus clientid. With this patch, the correct clientid is replied.
This problem was identified in a packet trace supplied by
Ahmed Kamal via email.
Reported by: email.ahmedkamal@googlemail.com
MFC after: 2 weeks
mappings as if MAP_SHARED was always present since in general MAP_PRIVATE
is not permitted for character devices. However, there is one exception
in that MAP_PRIVATE mappings are permitted for /dev/zero.
Only require a writable file descriptor (FWRITE) for shared, writable
mappings of character devices. vm_mmap_cdev() will reject any private
mappings for other devices.
Reviewed by: kib
Reported by: sbruno (broke qemu cross-builds), peter
Differential Revision: https://reviews.freebsd.org/D3316
BROKEN NFS SERVER OR MIDDLEWARE: Certain WAN "accelerators" attempt to cache
NFS GETATTR traffic, but actually corrupt it (e.g., responding to requests
with attributes for totally different files).
Warn very verbosely when this is detected. Linux' NFS client has a similar
warning.
Adds a sysctl/tunable (vfs.nfs.fileid_maxwarnings) to configure the quantity
of warnings; default to 10. (Zero disables; -1 is unlimited.)
Adds a failpoint to aid in validating the warning / behavior with a
non-broken server. Use something like:
sysctl 'debug.fail_point.nfscl_force_fileid_warning=10%return(1)'
Reviewed by: rmacklem
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3304
unconfirmed clientid structure for the same client on the last hash list,
this old entry would not be removed/deleted. I do not think this bug would have
caused serious problems, since the new entry would have been before the old one
on the list. This old entry would have eventually been scavenged/removed.
Detected while reading the code looking for another bug.
MFC after: 3 days
determining whether a node changed.
Other filesystems, e.g., UFS, only check on seconds, when determining
whether something changed.
This also corrects the birthtime case, where we checked tv_nsec
twice, instead of tv_sec and tv_nsec (PR).
PR: 201284
Submitted by: David Binderman
Patch suggested by: kib
Reviewed by: kib
MFC after: 2 weeks
Committed from: Essen FreeBSD Hackathon
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).
Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig. p_xexit contains complete status
and copied out into si_status.
Requested by: Joerg Schilling
Reviewed by: jilles (previous version), pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
This obviates the need for a MNTK_SUSPENDABLE flag, since passthrough
filesystems like nullfs and unionfs no longer need to inherit this
information from their lower layer(s). This change also restores the
pre-r273336 behaviour of using the presence of a susp_clean VFS method to
request suspension support.
Reviewed by: kib, mjg
Differential Revision: https://reviews.freebsd.org/D2937