memory mapped pages being written back on an NFS mount.
Since any thread can call VOP_PUTPAGES() to write back a
dirty page, the credentials of that thread may not have
write access to the file on an NFS server. (Often the uid
is 0, which may be mapped to "nobody" in the NFS server.)
Although there is no completely correct fix for this
(NFS servers check access on every write RPC instead of at
open/mmap time), this patch avoids the common cases by
holding onto a credential that recently opened the file
for writing and uses that credential for the write RPCs
being done by VOP_PUTPAGES() for both NFS clients.
Tested by: Joel Ray Holveck (joelh at juniper.net)
PR: kern/165923
Reviewed by: kib
MFC after: 2 weeks
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.
Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.
Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.
Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.
Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks
appropriate timestamps. Restore the assertions which verify that
NCF_TS is set when timestamp is asked for.
Reviewed by: jhb (previous version)
MFC after: 2 weeks
we will only trust a positive name cache entry for a specified amount of
time before falling back to a LOOKUP RPC, even if the ctime for the file
handle matches the cached copy in the name cache entry. The timeout is
configured via a new 'nametimeo' mount option and defaults to 60 seconds.
It may be set to zero to disable positive name caching entirely.
Reviewed by: rmacklem
MFC after: 1 week
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.
MFC after: 2 weeks
Allow the NFS client to use a max file size larger than 1TB for v3 mounts.
It now allows files up to OFF_MAX subject to whatever limit the server
advertises.
Reviewed by: rmacklem
Approved by: re (kib)
MFC after: 1 week
checking at open time. It may improve performance for read-only
NFS mounts. Use deliberately.
MFC after: 1 week
Reviewed by: rmacklem, jhb (earlier version)
by both clients. Since the NLM uses various fields of the
nfsmount structure, those fields were extracted and put in a
separate nfs_mountcommon structure stored in sys/nfs/nfs_mountcommon.h.
This structure also has a function pointer for a function that
extracts the required information from the mount point and nfs vnode
for that particular client, for information stored differently by the
clients.
Reviewed by: jhb
MFC after: 2 weeks
directories for purposes of validating name cache entries. This
closes races where two updates to a file or directory within the same
second could result in stale entries in the name cache. While here,
remove the 'n_expiry' field as it is no longer used.
Reviewed by: rmacklem
MFC after: 1 week
module that can be used by both the regular and experimental nfs
clients. This fixes the problem reported by jh@ where /dev/nfslock
would be registered twice when both nfs clients were used.
I also defined the size of the lm_fh field to be the correct value,
as it should be the maximum size of an NFSv3 file handle.
Reviewed by: jh
MFC after: 2 weeks
name cache up into nfs_lookup() instead of nfs_open(). Continue this
trend by flushing the attribute cache for leaf nodes in nfs_lookup() during
an open() if we do a LOOKUP RPC. For NFSv3 this should generally be a NOP
as the attributes are flushed before fetching the post-op attributes from
the LOOKUP RPC which most (all?) NFSv3 servers provide, so the post-op
attributes should populate the cache.
Now all NFS open() calls will always clear the cached attributes during the
nfs_lookup() prior to nfs_open() in the !NMODIFIED case to provide CTOC.
As a result, we can remove the conditional flushing of the attribute
cache from nfs_open().
Reviewed by: rmacklem, bde
MFC after: 2 weeks
DIAGNOSTIC and #ifndef DIAGNOSTIC for debug assertions, prefer
KASSERT(). Also change one #ifdef DIAGNOSTIC in the new nfs server.
Submitted by: Mikolaj Golub <to.my.trociny gmail com>
MFC after: 2 weeks
file via NFS. Specifically, to satisfy close-to-open-consistency, the NFS
client always performs at least one RPC on a file during an open(2) to see
if the file has changed. Normally this RPC is an ACCESS or GETATTR RPC
that is forced by flushing a file's attribute cache during nfs_open() and
then requesting new attributes. However, if the file is noticed to be
stale during nfs_open(), the only recourse is to fail the open(2) call
with ESTALE. On the other hand, if the ACCESS or GETATTR RPC is sent
during nfs_lookup(), then the NFS client can fall back to a LOOKUP RPC to
obtain the new file handle in the case that a file has been replaced.
This change causes the NFS client to flush the attribute cache during
nfs_lookup() when validating a name cache hit if the attributes fetched
during nfs_lookup() can be reused in nfs_open(). This allows the client
to open a replaced file via the new file handle the first time that it
notices a replaced file rather than failing with ESTALE in some cases.
Reviewed by: rmacklem, bde
Reviewed by: mohans (older version)
MFC after: 1 week
Without this patch it was possible for a different thread that calls
nfs_asyncio() to snitch a newly created nfsiod thread that was
intended for another caller of nfs_asyncio(), because the nfs_iod_mtx
mutex was unlocked while the new nfsiod thread was created. This patch
labels the newly created nfsiod, so that it is not taken by another
caller of nfs_asyncio(). This is believed to fix the problem reported
on the freebsd-stable email list under the subject:
FreeBSD NFS client/Linux NFS server issue.
Tested by: to DOT my DOT trociny AT gmail DOT com
Reviewed by: jhb
MFC after: 2 weeks
This avoids a bogus negative name cache entry from persisting forever
when another client creates an entry with the same name within the
same NFS server time of day clock tick. The mount option negnametimeo
can be used to override the default timeout interval on a
per-mount-point basis. Setting negnametimeo to 0 disables negative
name caching for the mount point.
I also fixed one obvious typo where args.timeo should be
args.maxgrouplist.
Submitted by: jhb (earlier version)
Reviewed by: jhb
MFC after: 2 weeks
kernel to boot from NFS. [1]
Note: this is not a full virtualization of nfsclient. It is only does
what advertised above and nothing more.
Requested by: public demand [1]
Tested by: kris, ..
MFC after: 5 days
Specifically, clients only trust -ve cache entries while the directory
remains unchanged and discard any -ve cache entries for a directory when
they notice that the modification time of a directory entry changes. The
race involves two concurrent lookups as follows:
- Thread A does a lookup for file 'foo' which sends a lookup RPC to the
server. The lookup fails and the server replies.
- The 'foo' file is created (either by the same client or a different
client) updating the modification time on the parent directory of 'foo'.
- Thread B does a lookup for a different file 'bar' which updates the
cached attributes of the parent directory of 'foo' to reflect the new
modification time after 'foo' was created.
- Thread A finally resumes execution to parse the reply from the NFS
server. It adds a -ve cache entry and sets the cached value of the
directory's modification time that is used for invalidating -ve cached
lookups to the new modification time set by thread B.
At this point, future lookups of 'foo' will honor the -ve cached entry
until the cached entry is pushed out of the name cache's LRU or the
modification time of the parent directory is changed again by some other
change. The fix is to read the directory's modification time before
sending the lookup RPC and use that cached modification time when setting
the directory's cached modification time. Also, we do not add a -ve cache
entry if another thread has added -ve cache entry that set the directory's
cached modification time to a newer value than the value we read before
sending the lookup RPC.
Reviewed by: rmacklem
MFC after: 1 week
context inside the RPC code.
Temporarily set td's cred to mount's cred before calling socreate() via
__rpc_nconf2socket().
Submitted by: rmacklem (in part)
Reviewed by: rmacklem, rwatson
Discussed with: dfr, bz
Approved by: re (rwatson), julian (mentor)
MFC after: 3 days
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
r195704 for the experimental client. The patch avoids calling vn_lock()
for the case where nfs_nget() has acquired the same vnode as dvp,
since nfs_nget() has already locked the vnode.
Reviewed by: kib, jhb
Approved by: re (kensmith), kib (mentor)
kernel resources that block other threads, like vnode locks. The SIGSTOP
sent to such thread (process, rather) shall not stop it until thread
releases the resources.
Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.
Tested by: pho
Reviewed by: jeff
Approved by: re (kensmith)
MFC after: 2 weeks
in_ifaddrhead and INADDR_HASH address lists.
Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).
For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support. As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done. This means that one
class of reader-writer races still exists.
MFC after: 6 weeks
Reviewed by: bz
operating on the unmounted mount point and freed mount data in case of
forced unmount performed while dvp is unlocked to nget the target vnode.
Add missed calls to m_freem(mrep) there on error exits [1].
Submitted by: rmacklem [1]
Tested by: pho
MFC after: 2 weeks
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.
This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.
The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.
The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
This change also introduces a DDB subcommand to show the list of all
vnet instances.
Approved by: julian (mentor)
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.
Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.
Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon
Check the condition and return ENOENT then.
In nfs_lookup(), respect ENOENT return from cache_lookup() when it is caused
by dvp reclaim.
Reported and tested by: pho
- Trace non-error loads into the access cache once, not zero times or
twice.
- Sometimes attr cache loads fail due to a race, in which case they are
aborted leading to an invalidation; in this case, trace only the flush,
not a load.
MFC after: 1 month
Sponsored by: Google, Inc.
events are:
nfsclient:accesscache:flush:done
nfsclient:accesscache:get:hit
nfsclient:accesscache:get:miss
nfsclient:accesscache:load:done
They pass the vnode, uid, and requested or loaded access mode (if any);
the load event may also report a load error if the RPC fails.
The attribute cache events are:
nfsclient:attrcache:flush:done
nfsclient:attrcache:get:hit
nfsclient:attrcache:get:miss
nfsclient:attrcache:load:done
They pass the vnode, optionally the vattr if one is present (hit or load),
and in the case of a load event, also a possible RPC error.
MFC after: 1 month
Sponsored by: Google, Inc.
The number of entries in the cache defaults to 8 but is easily changed in
nfsclient/nfs.h. When the cache is filled, the oldest cache entry is
evicted when space is needed.
I mirrored the changes to the NFSv[23] client in the NFSv4 client to fix
compile breakage. However, the NFSv4 client doesn't actually use the
access cache currently.
Submitted by: rmacklem
hit in the name cache. cache_lookup() doesn't actually return ENOENT
for such requests to force the filesystem to do an explicit lookup, so
this was effectively dead code.
- Grab the nfsnode mutex while writing to n_dmtime. We don't grab the lock
when comparing the time against the cached directory mod time (just as
we don't when comparing ctime's for +ve name cache hits) since the
attribute caching is already racy for NFS clients as it is.
Discussed with: bde
an NFS file. Now the priming is conditional on a new
vfs.nfs.prime_access_cache sysctl. For now I've left the default setting
to disabling the priming.
Requested by: scottl
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked. If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC. This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.
Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by: mohans
opportunistic ACCESS RPC to populate both the access and attribute caches
of the file and instead always use a GETATTR RPC. On many modern NFS
servers, an ACCESS RPC is much more expensive to service than a GETATTR
RPC.
Submitted by: mohans
open() of the same file will load fresh attributes, so they do not need to
be explicitly flushed in close() to guarantee close to open consistency.
However, other file desciptors may still reference this file and clearing
the attributes in close() forces those other file descriptors to fetch
fresh attributes the next time they need them.
Reviewed by: mohans
MFC after: 1 week