separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.
From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.
If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.
Reviewed in general by: mckusick
<sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.
CCD not converted yet, casts to struct buf (still safe)
atapi-cd casts to struct buf to examine B_PHYS
(Much of this done by script)
Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.
Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.
Add bio_queue field for struct bio aware disksort.
Address a lot of stylistic issues brought up by bde.
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)
substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)
This patch is machine generated except for the ccd.c and buf.h parts.
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.
B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.
Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.
Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.
This change is a step in the direction towards a stackable BIO capability.
A lot of this patch were machine generated (Thanks to style(9) compliance!)
Vinum users: Greg has not had time to test this yet, be careful.
was calling nfs_flush() and then clearing the NMODIFIED bit. This is
not legal since there might still be dirty buffers after the nfs_flush
(for example, pending commits). The clearing of this bit in turn prevented
a necessary vinvalbuf() from occuring leaving left over dirty buffers
even after truncating the file in a new operation. The fix is to
simply not clear NMODIFIED.
Also added a sysctl vfs.nfs.nfsv3_commit_on_close which, if set to 1,
will cause close() to do a stage 1 write AND a stage 2 commit
synchronously. By default only the stage 1 write is done synchronously.
Reviewed by: Alfred Perlstein <bright@wintelcom.net>
blocks of zeros could wind up in a file written to over NFS by a client.
The problem only occurs a few times per several gigabytes of data. This
problem turned out to be bug #3 below.
bug #1:
B_CLUSTEROK must be cleared when an NFS buffer is reverted from
stage 2 (ready for commit rpc) to stage 1 (ready for write).
Reversions can occur when a dirty NFS buffer is redirtied with new
data.
Otherwise the VFS/BIO system may end up thinking that a stage 1
NFS buffer is clusterable. Stage 1 NFS buffers are not clusterable.
bug #2:
B_CLUSTEROK was inappropriately set for a 'short' NFS buffer (short
buffers only occur near the EOF of the file). Change to only set
when the buffer is a full biosize (usually 8K). This bug has no
effect but should be fixed in -current anyway. It need not be
backported.
bug #3:
B_NEEDCOMMIT was inappropriately set in nfs_flush() (which is
typically only called by the update daemon). nfs_flush()
does a multi-pass loop but due to the lack of vnode locking it
is possible for new buffers to be added to the dirtyblkhd list
while a flush operation is going on. This may result in nfs_flush()
setting B_NEEDCOMMIT on a buffer which has *NOT* yet gone through its
stage 1 write, causing only the commit rpc to be made and thus
causing the contents of the buffer to be thrown away (never sent to
the server).
The patch also contains some cleanup, which only applies to the commit
into -current.
Reviewed by: dg, julian
Originally Reported by: Dan Nelson <dnelson@emsphone.com>
a 0 error code. The problem occured with NFSv2 mounts and also with
any NFSv3 mount returning an EEXIST error (which is translated to 0
prior to return). The reply to the rpc only contains the file handle
for the no-error case under NFSv3. The error case under NFSv3 and
all cases under NFSv2 do *not* return the file handle. The fix is
to do a secondary lookup to obtain the file handle and thus be able
to generate a return vnode for the situations where the rpc reply
does not contain the required information.
The bug was originally introduced when VOP_SYMLINK semantics were
changed for -CURRENT. The NFS symlink implementation was not properly
modified to go along with the change despite the fact that three
people reviewed the code. It took four attempts to get the current
fix correct with five people. Is NFS obfuscated? Ha!
Reviewed by: Alfred Perlstein <bright@wintelcom.net>
Testing and Discussion: "Viren R.Shah" <viren@rstcorp.com>, Eivind Eklund <eivind@FreeBSD.ORG>, Ian Dowse <iedowse@maths.tcd.ie>
Note: Previous commit to these files (except coda_vnops and devfs_vnops)
that claimed to remove WILLRELE from VOP_RENAME actually removed it from
VOP_MKNOD.
previously issued synchronously even if async daemons (nfsiod's) were
available. The commit has been moved from the strategy code to the doio
code in order to asynchronize it.
Removed use of lastr in preparation for removal of vnode->v_lastr. It
has been replaced with seqcount, which is already supported by the system
and, in fact, gives us a better heuristic for sequential detection then
lastr ever did.
Made major performance improvements to the server side commit. The
server previously fsync'd the entire file for each commit rpc. The
server now bawrite()s only those buffers related to the offset/size
specified in the commit rpc.
Note that we do not commit the meta-data yet. This works still needs
to be done.
Note that a further optimization can be done (and has not yet been done)
on the client: we can merge multiple potential commit rpc's into a
single rpc with a greater file offset/size range and greatly reduce
rpc traffic.
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
I did some tcpdumping the other day and noticed that GETATTR calls
were frequently followed by an ACCESS call to the same file. The
attached patch changes nfs_getattr to fill the access cache as a side
effect. This is accomplished by calling ACCESS rather than
GETATTR. This implies a modest overhead of 4 bytes in the request and
8 bytes in the response compared to doing a vanilla GETATTR.
...
[The patch comprises two parts] The first
is the "real" patch, the second counts misses and hits rather than
fills and hits. The difference is subtle but important because both
nfs_getattr and nfs_access now fill the cache. It also changes the
default value of nfsaccess_cache_timeout to better match the attribute
cache. IMHO, file timestamps change much more frequently than
protection bits.
Submitted by: Bjoern Groenvall <bg@sics.se>
Reviewed by: dillon (partially)
vnodes are locked and never unlocked, which leads to processes starting
to wedge up after doing a mount -o nfsv3,tcp,rdirplus foo:/fs /fs; ls /fs.
The second is that sometimes cnp is accessed without having been
properly initialized: cnp->cn_nameptr points to an earlier name while
"len" contains the length of a current name of different size. This
leads to an attempt to dereference *(cn->cn_nameptr + len) which will
sometimes cause a page fault and a panic.
With these two fixes, client side readdirplus works correctly with
FreeBSD, IRIX 6.5.4 and Solaris 2.5.1 and 2.6 servers.
Submitted by: Matthew Dillon <dillon@backplane.com>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
The following ugly hack to the exit path of nfs_readlinkrpc() circumvents
an Auspex bug: for symlinks longer than 112 (0x70) they return a 1024 byte
xdr string - the correct data with many nulls appended. Without this fix
namei returns ENAMETOOLONG, at least it does on our source base and on
FreeBSD 3.0. Note we do not (and should not) rely upon their null padding.
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
txdr_hyper and fxdr_hyper tweaks to avoid excessive CPU order knowledge.
nfs_serv.c: don't call nfsm_adj() with negative values, windows clients
could crash servers when doing a readdir of a large directory.
nfs_socket.c: Use IP_PORTRANGE to get a priviliged port without a spin
loop trying to bind(). Don't clobber a mbuf pointer or we get panics
on a NFS3ERR_JUKEBOX error from a server when reusing a freed mbuf.
nfs_subs.c: Don't loose st_blocks on NFSv2 mounts when > 2GB.
Obtained from: OpenBSD
udev_t in the kernel but still called dev_t in userland.
Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()
For now they're functions, they will become in-line functions
after one of the next two steps in this process.
Return major/minor/makedev to macro-hood for userland.
Register a name in cdevsw[] for the "filedescriptor" driver.
In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.
In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).
A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.
Without DEVT_FASCIST I belive this patch is a no-op.
Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.
Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
confusing the directory read cookie cache. The nfs_access implementation
for v2 mounts attempts to read from the directory if root is the user
so that root can't access cached files when the server remaps root
to some other user.
Submitted by: Doug Rabson <dfr@nlsystems.com>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
including alan, john, me, luoqi, and kirk
Submitted by: Matt Dillon <dillon@frebsd.org>
This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.
The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)
This makes it possible to change the sysctl tree at runtime.
* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.
Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
the API for freeing up cnp's. This cleanup should not effect nominal
operation one way or the other since NFS VOPs just happen to be called
with flags that match what it actually does to the NAMEI components it
gets. Still, if an NFS error occured, there was probably some memory
leakage of NAMEI components with certain NFS VOP ops.
to fix the problem w/ NFSV3 whereby a make installworld would get into
high-network-bandwidth situations continuously trying to retry nfs writes
that fail with a 'stale file handle' error.
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
cache. If the cached result lets us say "yes", then go with that. If
we're not sure, or we think the answer might be "no", go to the wire to be
certain. This avoids all of the possible false veto cases, and allows us
to key the cached value with just the UID for which the cached value holds,
reducing the bloat of the nfsnode structure from 104 bytes to just 12 bytes.
Since the "yes" case is by far the most common, this should still provide
a substantial performance improvement. Also default the cache to on, with
a conservative timeout (2 seconds). This improves performance if NFS is
loaded as a KLD module, as there's not (yet) code to parse an option out
of the module arguments to set it, and sysctl doesn't work (yet) for OIDs
in modules.
The 'accelerator' mode was suggested by Bjoern Groenvall (bg@sics.se)
Feedback on this would be appreciated as testing has been necessarily
limited by Comdex, and it would be valuable to have this in 2.2.8.
This yields startling performance increases for NFS clients for many
access profiles, due to the fact that ACCESS results are persistently
cached in the namecache in many cases.
Note that the code is somewhat conservative in that it requires an
exact credential match for a cache hit. This bloats the nfsnode
structure by sizeof(struct ucred) (96 bytes). Any less conservative
approach opens the possibility for a false veto in eg. setuid
applications. Alternative suggestions would be welcomed.
The cache is normally disabled, to activate set the sysctl variable
vfs.nfs.access_cache_timeout to a nonzero value. This is the time in
seconds that a cached entry will be considered valid; useful values appear
to be 2-10 seconds. Performance of the cache can be monitored with the
vfs.nfs.access_cache_hits and vfs.nfs.access_cache_hits variables.
second argument. np_size is a 64 bit int, so is the second arg. This
might have caused needless 2G/4G file size problems.
I believe it was Bruce who queried this.