cache: add high level overview
Differential Revision: https://reviews.freebsd.org/D28675
This commit is contained in:
parent
dc532884d5
commit
f79bd71def
@@ -82,6 +82,131 @@ __FBSDID("$FreeBSD$");
#include <vm/uma.h>

/*
* High level overview of name caching in the VFS layer.
*
* Originally caching was implemented as part of UFS, later extracted to allow
* use by other filesystems. A decision was made to make it optional and
* completely detached from the rest of the kernel, which comes with limitations
* outlined near the end of this comment block.
*
* This fundamental choice needs to be revisited. In the meantime, the current
* state is described below. The significance of all notable routines is
* explained in comments placed above their implementation. Scattered
* throughout the file are TODO comments indicating shortcomings which can be
* fixed without reworking everything (most of the fixes will likely be
* reusable). Various details are omitted from this explanation so as not to
* clutter the overview; they have to be checked by reading the code and the
* associated commentary.
*
* Keep in mind that it's individual path components which are cached, not full
* paths. That is, for a fully cached path "foo/bar/baz" there are 3 entries,
* one for each name.
*
* I. Data organization
*
* Entries are described by "struct namecache" objects and stored in a hash
* table. See cache_get_hash for more information.
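*
* As an illustration, finding an entry boils down to roughly the following
* sketch (field and macro names are paraphrased from this file and may not
* match the code exactly):
*
*	hash = cache_get_hash(cnp->cn_nameptr, cnp->cn_namelen, dvp);
*	CK_SLIST_FOREACH(ncp, NCHHASH(hash), nc_hash) {
*		if (ncp->nc_dvp == dvp && ncp->nc_nlen == cnp->cn_namelen &&
*		    bcmp(ncp->nc_name, cnp->cn_nameptr, ncp->nc_nlen) == 0)
*			break;
*	}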
*
* "struct vnode" contains pointers to source entries (names which can be found
* when traversing through said vnode), destination entries (names of that
* vnode (see "Limitations" for a breakdown on the subject)) and a pointer to
* the parent vnode.
*
* The (directory vnode; name) tuple reliably determines the target entry if
* it exists.
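*
* For example, with "foo/bar/baz" fully cached starting from the root vnode,
* the 3 entries mentioned earlier are keyed roughly as follows (the vnode
* names below are made up for illustration):
*
*	(rootvnode, "foo")      -> vnode of "foo"
*	(vnode of "foo", "bar") -> vnode of "bar"
*	(vnode of "bar", "baz") -> vnode of "baz"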
*
* Since there are no small locks at this time (all are 32 bytes in size on
* LP64), the code works around the problem by introducing lock arrays to
* protect hash buckets and vnode lists.
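*
* A minimal sketch of the bucket lock scheme (the names and sizing below are
* illustrative, not the exact definitions used here):
*
*	static struct mtx bucketlocks[NUMBUCKETLOCKS];
*	#define HASH2BUCKETLOCK(hash)	(&bucketlocks[(hash) % NUMBUCKETLOCKS])
*
* so that unrelated buckets are likely to hash to different locks, while
* keeping memory usage bounded.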
*
* II. Filesystem integration
*
* Filesystems participating in name caching do the following:
* - set vop_lookup routine to vfs_cache_lookup
* - set vop_cachedlookup to whatever can perform the lookup if the above fails
* - if they support lockless lookup (see below), vop_fplookup_vexec and
*   vop_fplookup_symlink are set along with the MNTK_FPLOOKUP flag on the
*   mount point
* - call cache_purge or cache_vop_* routines to eliminate stale entries as
*   applicable
* - call cache_enter to add entries depending on the MAKEENTRY flag
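*
* As an illustration, a participating filesystem "foofs" (hypothetical, used
* here only as a sketch) would have a vop table along these lines:
*
*	static struct vop_vector foofs_vnodeops = {
*		.vop_default		= &default_vnodeops,
*		.vop_lookup		= vfs_cache_lookup,
*		.vop_cachedlookup	= foofs_lookup,
*	};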
*
* With the above in mind, there are 2 entry points when doing lookups:
* - ... -> namei -> cache_fplookup -- this is the default
* - ... -> VOP_LOOKUP -> vfs_cache_lookup -- normally only called by namei
*   should the above fail
*
* Example code flow for how an entry is added:
* ... -> namei -> cache_fplookup -> cache_fplookup_noentry -> VOP_LOOKUP ->
* vfs_cache_lookup -> VOP_CACHEDLOOKUP -> ufs_lookup_ino -> cache_enter
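*
* The last step above typically amounts to the following in the filesystem
* lookup routine (a sketch of the common pattern, not a verbatim quote):
*
*	if (cnp->cn_flags & MAKEENTRY)
*		cache_enter(dvp, *vpp, cnp);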
*
* III. Performance considerations
*
* In the lockless case, forward lookup avoids any writes to shared areas apart
* from the terminal path component. In other words, non-modifying lookups of
* different files don't suffer any scalability problems in the namecache.
* Looking up the same file is limited by VFS and goes beyond the scope of this
* file.
*
* At least on amd64 the single-threaded bottleneck for long paths is hashing
* (see cache_get_hash). There are cases where the code issues an acquire fence
* multiple times; these can be combined on architectures where fences are
* expensive.
*
* In the locked case, each encountered vnode has to be referenced and locked
* in order to be handed out to the caller (normally that's namei). This
* introduces a significant single-threaded hit and serialization when
* multi-threaded.
*
* Reverse lookup (e.g., "getcwd") fully scales provided it is fully cached --
* it avoids any writes to shared areas for any of the components.
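*
* In-kernel consumers perform reverse lookup through vn_fullpath and friends.
* A sketch of the typical usage (error handling elided; the free convention
* below is an assumption to verify against the header):
*
*	error = vn_fullpath(vp, &fullpath, &freepath);
*	...
*	free(freepath, M_TEMP);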
*
* Unrelated insertions are partially serialized on updating the global entry
* counter and possibly serialized on colliding bucket or vnode locks.
*
* IV. Observability
*
* Note that not everything has an explicit dtrace probe nor should it; thus
* some of the one-liners below depend on implementation details.
*
* Examples:
*
* # Check what lookups failed to be handled in a lockless manner. Column 1 is
* # line number, column 2 is status code (see cache_fpl_status)
* dtrace -n 'vfs:fplookup:lookup:done { @[arg1, arg2] = count(); }'
*
* # Lengths of names added by binary name
* dtrace -n 'fbt::cache_enter_time:entry { @[execname] = quantize(args[2]->cn_namelen); }'
*
* # Same as above but only those which exceed 64 characters
* dtrace -n 'fbt::cache_enter_time:entry /args[2]->cn_namelen > 64/ { @[execname] = quantize(args[2]->cn_namelen); }'
*
* # Who is performing lookups with spurious slashes (e.g., "foo//bar") and what
* # path is it
* dtrace -n 'fbt::cache_fplookup_skip_slashes:entry { @[execname, stringof(args[0]->cnp->cn_pnbuf)] = count(); }'
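*
* # Since fbt can instrument any kernel function, purges are also easy to
* # observe even without a dedicated probe, e.g. who is purging entries
* # (a sketch, same implementation-detail caveats as above)
* dtrace -n 'fbt::cache_purge:entry { @[execname] = count(); }'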
*
* V. Limitations and implementation defects
*
* - since it is possible there is no entry for an open file, tools like
*   "procstat" may fail to resolve fd -> vnode -> path to anything
* - even if a filesystem adds an entry, it may get purged (e.g., due to memory
*   shortage) in which case the above problem applies
* - hardlinks are not tracked, thus if a vnode is reachable in more than one
*   way, resolving a name may return a different path than the one used to
*   open it (even if said path is still valid)
* - by default entries are not added for newly created files
* - adding an entry may need to evict a negative entry first, which happens in
*   2 distinct places (evicting on lookup, adding in a later VOP), making it
*   impossible to simply reuse it
* - there is a simple scheme to evict negative entries as the cache is
*   approaching its capacity, but it is very unclear if doing so is a good
*   idea to begin with
* - vnodes are subject to being recycled even if the target inode is left in
*   memory, which loses the name cache entries when it perhaps should not. In
*   the case of tmpfs, names get duplicated -- kept by the filesystem itself
*   and by the namecache separately
* - struct namecache has a fixed size and comes in 2 variants, often wasting
*   space. It is now hard to replace with malloc due to the dependence on SMR.
* - lack of better integration with the kernel also turns nullfs into a
*   layered filesystem instead of something which can take advantage of
*   caching
*/

static SYSCTL_NODE(_vfs, OID_AUTO, cache, CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
    "Name cache");