for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.
These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.
Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>
cache. If the cached result lets us say "yes", then go with that. If
we're not sure, or we think the answer might be "no", go to the wire to be
certain. This avoids all of the possible false veto cases, and allows us
to key the cached value with just the UID for which the cached value holds,
reducing the bloat of the nfsnode structure from 104 bytes to just 12 bytes.
Since the "yes" case is by far the most common, this should still provide
a substantial performance improvement. Also default the cache to on, with
a conservative timeout (2 seconds). This improves performance if NFS is
loaded as a KLD module, as there's not (yet) code to parse an option out
of the module arguments to set it, and sysctl doesn't work (yet) for OIDs
in modules.
The 'accelerator' mode was suggested by Bjoern Groenvall (bg@sics.se)
Feedback on this would be appreciated as testing has been necessarily
limited by Comdex, and it would be valuable to have this in 2.2.8.
This yields startling performance increases for NFS clients for many
access profiles, due to the fact that ACCESS results are persistently
cached in the namecache in many cases.
Note that the code is somewhat conservative in that it requires an
exact credential match for a cache hit. This bloats the nfsnode
structure by sizeof(struct ucred) (96 bytes). Any less conservative
approach opens the possibility for a false veto in eg. setuid
applications. Alternative suggestions would be welcomed.
The cache is normally disabled, to activate set the sysctl variable
vfs.nfs.access_cache_timeout to a nonzero value. This is the time in
seconds that a cached entry will be considered valid; useful values appear
to be 2-10 seconds. Performance of the cache can be monitored with the
vfs.nfs.access_cache_hits and vfs.nfs.access_cache_hits variables.
second argument. np_size is a 64 bit int, so is the second arg. This
might have caused needless 2G/4G file size problems.
I believe it was Bruce who queried this.
sure that this is necessary to be a sync write here since a VOP_FSYNC()
follows and it will schedule, sort and complete the writes that the
vm_object_page_clean() started (as I think I understand things).
for some of the fsinfo RPC fields. It is strictly speaking not
wrong to do this, as the spec says that "it is expected that a
server will make a best effort at supporting all the attributes",
but pretty unusual. You guessed it, it's NT servers that do it.'
Obtained from: Frank van der Linden <frank@wins.uva.nl>
is being deleted due to an forcible unmount. The problem is
that vgone calls vclean() which then calls calls nfs_inactive()
with VXLOCK set on the vnode. Nfs_inactive() was calling vget()
to get a reference on the vnode, which in turn hung on VXLOCK.
Nfs_inactive() now checks v_usecount to make sure that the vnode
is not coming from vclean() before it does a vget().
is less than NFS_MINPACKET or greater than NFS_MAXPACKET in size, it
barfs and, I think, drops the connection.
However, there's no guarantee that in a multi-fragment RPC, all the
fragments will be at least as large as NFS_MINPACKET.
In fact, with the version of "tclnfs" we have here, which supports NFS
over TCP, at least when built under SunOS 4.1.3 (i.e., with 4.1.3's
user-mode ONC RPC library), I can *repeatably* cause "tclnfs" to send a
request with more than one fragment, one of which is only 8 bytes long.
I just do a 3877-byte write to a file, at an offset of 0.
The check that "slp->ns_reclen" is greater than or equal to
NFS_MINPACKET serves no useful purpose - if the NFS server code can't
handle packets < NFS_MINPACKET bytes, it can't handle them over *any*
protocol, so the check has to be done above the RPC-over-TCP layer - and
should be removed.
Obtained from: Fix from Guy Harris, forwarded by Rick Macklem.
references to them.
The change a couple of days ago to ignore these numbers in statically
configured vfsconf structs was slightly premature because the cd9660,
cfs, devfs, ext2fs, nfs vfs's still used MOUNT_* instead of the number
in their vfsconf struct.
detachment of vfs sysctls. Unloading of vfs LKMs doesn't actually
work for any vfs, since it leaves garbage pointers to memory
allocation control structures.
type numbers in vfs attach order (modulo incomplete reuse of old
numbers after vfs LKMs are unloaded). This requires reinitializing
the sysctl tree (or at least the vfs subtree) for vfs's that support
sysctls (currently only nfs). sysctl_order() already handled
reinitialization reasonably except it checked for annulled self
references in the wrong place.
Fixed sysctls for vfs LKMs.
when nfs is an LKM. Declare it in a header file. Don't forget to use
it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0
will soon be the mount type number for the first vfs loaded.
NetBSD uses strcmp() to avoid this ugly global.
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.
interface congestion (eg: nfs over a ppp link, etc). Don't log these
for UDP mounts, and don't cause syscalls to fail with EINTR.
This stops the 'nfs send error 55' warnings.
If the error is because the system is really hosed, this is the least
of your problems...
respectively. Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
NFS_*TIMO should possibly be converted to sysctl vars (jkh's suggestion),
but in some cases it looks like nfs keeps a copy of the value in a struct
hash sizes are already ifdef'd KERNEL, so there aren't userland inpact
from them...
NFS_MINATTRTIMO VREG attrib cache timeout in sec
NFS_MAXATTRTIMO
NFS_MINDIRATTRTIMO VDIR attrib cache timeout in sec
NFS_MAXDIRATTRTIMO
NFS_GATHERDELAY Default write gather delay (msec)
NFS_UIDHASHSIZ Tune the size of nfssvc_sock with this
NFS_WDELAYHASHSIZ and with this
NFS_MUIDHASHSIZ Tune the size of nfsmount with this
NFS_NOSERVER (already documented in LINT)
NFS_DEBUG turn on NFS debugging
also, because NFS_ROOT is used by very different files, it has been
renamed to opt_nfsroot.h instead of the old opt_nfs.h....
Pre-2.8 versions of gcc generate a call to __divdi3() for all 64-bit
signed divisions, but egcs optimizes them to a shift and fixup when
the divisor is a constant power of 2. Unfortunately, it generates
a call to __cmpdi2() for the fixup, although all except possibly
ancient versions of gcc and egcs do ordinary 64-bit comparisons
inline.
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
readrpc/writerpc, since they assume it's already been done. This could
break if the first read/write access to a nfs filesystem was an exec() or
mmap() instead of a read(), write() syscall. (or statfs()).
nfs_getpages() could return an errno (EOPNOTSUPP) instead of a VM_PAGER_*
return code. Some layout tweaks for the get/putpages code.
an (over?) conservative assumption about what the client can store in it's
buffer cache using a signed 32-bit 512-byte block number index. Otherwise
it's possible for some file access when maxfilesize = 0 (eg: /usr is nfs
mounted and doing an execve())
Pointed out by: bde
XXX It might make sense to do a preemptive nfs_fsinfo() call at mount time.
it just makes more work. We pass a copy of the uid/gid with the
credentials. (although, this may need to be revisited if a non AUTHUNIX
authentication method (such as NFSKERB) ever gets implemented).
Obtained from: NetBSD
of this part of commits is to minimize unnecessary differences between
the other NFS's of similar origin. Yes, there are gratuitous changes here
that the style folks won't like, but it makes the catch-up less difficult.
that I checked (eg: ufs_link()) do the ABORTOP on the directory rather than
the file itself. After Michael Hancock's patches, the abortop doesn't seem
all that critial now since something else will free the pathname buffer.
rather than assuming 2^64. It may not like files that big. :-)
On the nfs server, calculate and report the max file size as the point
that the block numbers in the cache would turn negative.
(ie: 1099511627775 bytes (1TB)).
One of the things I'm worried about however, is that directory offsets
are really cookies on a NFSv3 server and can be rather large, especially
when/if the server generates the opaque directory cookies by using a local
filesystem offset in what comes out as the upper 32 bits of the 64 bit
cookie. (a server is free to do this, it could save byte swapping
depending on the native 64 bit byte order)
Obtained from: NetBSD
for better packing. This means that we can choose better values for the
various hash entries without having to try and get it all to fit within
an artificial power of two limit for malloc's sake.
cases we ignore it (eg: read/write) to maintain chmod-after-open semantics
but in other cases we do care, eg: creating files, access() etc. Never
ignore errors from VOP_ACCESS() on immutable files.
This apparently comes from BSDI (from Keith Bostic) via NetBSD.
PR: 5148
Submitted by: Yoshiro MIHIRA <sanpei@yy.cs.keio.ac.jp>
We had run out of bits in the nfs mount flags, I have moved the internal
state flags into a seperate variable. These are no longer visible via
statfs(), but I don't know of anything that looks at them.
these two files that are almost-but-not-quite the same leads to false grep
hits, confusion etc.
Only installing one copy with a symlink would be nice but that doesn't
work with SHARED=symlinks (it changes the source tree).
in nfs_vinvalbuf() or the nfs_removeit(), we can have the nfsnode reallocated
from underneath us (eg: replaced by a ufs 'struct inode') which can cause
disk corruption ('freeing free block' when di_db[5] gets trashed).
This is not a cheap fix, but it'll do until the nfsnodes get reference
counting and/or locking.
Apparently NetBSD have a similar fix (apparently from BSDI).
I wish all PR's had this much useful detail. :-)
PR: 6611
Submitted by: Stephen Clawson <sclawson@marker.cs.utah.edu>
---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.
Quality testing was done with testvn, and lat_fs from the lmbench suite.
Some NFS client testing courtesy of Patrik Kudo.
vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------
Submitted by: Michael Hancock <michaelh@cet.co.jp>
Reverse the VFS_VRELE patch. Reference counting of vnodes does not need
to be done per-fs. I noticed this while fixing vfs layering violations.
Doing reference counting in generic code is also the preference cited by
John Heidemann in recent discussions with him.
The implementation of alternative vnode management per-fs is still a valid
requirement for some filesystems but will be revisited sometime later,
most likely using a different framework.
Submitted by: Michael Hancock <michaelh@cet.co.jp>
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now uses the new variable time_second instead.
gettime() changed to getmicrotime(0.
Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeard time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.
Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.
Reviewed by: bde
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
a complement to all ops that return a vpp, VFS_VRELE. This is
initially only for file systems that implement the following ops
that do a WILLRELE:
vop_create, vop_whiteout, vop_mknod, vop_remove, vop_link,
vop_rename, vop_mkdir, vop_rmdir, vop_symlink
This is initial DNA that doesn't do anything yet. VFS_VRELE is
implemented but not called.
A default vfs_vrele was created for fs implementations that use the
standard vnode management routines.
VFS_VRELE implementations were made for the following file systems:
Standard (vfs_vrele)
ffs mfs nfs msdosfs devfs ext2fs
Custom
union umapfs
Just EOPNOTSUPP
fdesc procfs kernfs portal cd9660
These implementations may change as VOP changes are implemented.
In the next phase, in the vop implementations calls to vrele and the vrele
part of vput will be moved to the top layer vfs_vnops and made visible
to all layers. vput will be replaced by unlock in these cases. Unlocking
will still be done in the per fs layer but the refcount decrement will be
triggered at the top because it doesn't hurt to hold a vnode reference a
little longer. This will have minimal impact on the structure of the
existing code.
This will only be done for vnode arguments that are released by the various
fs vop implementations.
Wider use of VFS_VRELE will likely require restructuring of the code.
Reviewed by: phk, dyson, terry et. al.
Submitted by: Michael Hancock <michaelh@cet.co.jp>
to give pollution compatible with <nfs/nqfs.h>. At least mount_nfs.c
previously had to #define KERNEL before including <nfs/nfs.h> to get
this pollution, but this gave other pollution.
Moved comment about NFSINT_SIGMASK to immediately before the code that
it applies to.
depend on <sys/types.h> forward declaring common ones.
Added an underscore to `sin' in prototypes to avoid warnings for the
conflict with the ANSI sin().
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."
Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.
Make vfs_bio buffer mgmt work better.
Buffers were being used after brelse.
Make nfs_getpages work independently of other NFS
interfaces. This eliminates some difficult
recursion problems and decreases pagefault
overhead.
Remove an erroneous vfs_unbusy_pages.
Fix a reentrancy problem, with nfs_vinvalbuf when
vnode is already being rundown.
Reassignbuf wasn't being called when needed under
certain circumstances.
(Thanks to Bill Paul for help.)
Make vfs_bio buffer mgmt work better.
Buffers were being used after brelse.
Make nfs_getpages work independently of other NFS
interfaces. This eliminates some difficult
recursion problems and decreases pagefault
overhead.
Remove an erroneous vfs_unbusy_pages.
Fix a reentrancy problem, with nfs_vinvalbuf when
vnode is already being rundown.
Reassignbuf wasn't being called when needed under
certain circumstances.
(Thanks for help from Bill Paul.)
This will not make any of object files that LINT create change; there
might be differences with INET disabled, but hardly anything compiled
before without INET anyway. Now the 'obvious' things will give a
proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The
only thing that _should_ work (but can't be made to compile reasonably
easily) is sppp :-(
This commit move struct arpcom from <netinet/if_ether.h> to
<net/if_arp.h>.
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:
1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.
nfs_bio.c code worked better than the 1.44. This commit reverts
the important parts of 1.44 to 1.41, and we will fix it when we
can get a handle on the problem.
a mistake (since softnet interrupts may occur if malloc() waits),
and doing it harmlessly but unnecessarily here interfered with
detection of the mistaken cases.
Ever since I first say the way the mount flags were used I've hated the
fact that modes, and events, internal and exported, and short-term
and long term flags are all thrown together. Finally it's annoyed me enough..
This patch to the entire FreeBSD tree adds a second mount flag word
to the mount struct. it is not exported to userspace. I have moved
some of the non exported flags over to this word. this means that we now
have 8 free bits in the mount flags. There are another two that might
well move over, but which I'm not sure about.
The only user visible change would have been in pstat -v, except
that davidg has disabled it anyhow.
I'd still like to move the state flags and the 'command' flags
apart from each other.. e.g. MNT_FORCE really doesn't have the
same semantics as MNT_RDONLY, but that's left for another day.
it in struct proc instead.
This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.
I have not removed the /*ARGSUSED*/, they will require some looking at.
libkvm, ps and other userland struct proc frobbing programs will need
recompiled.
Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.
1. Add defaults for more VOPs
VOP_LOCK vop_nolock
VOP_ISLOCKED vop_noislocked
VOP_UNLOCK vop_nounlock
and remove direct reference in filesystems.
2. Rename the nfsv2 vnop tables to improve sorting order.
1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.
2. Remove VOP_SEEK, it was unused.
3. Add mode default vops:
VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp
And remove identical functionality from filesystems
4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)
5. Try to make system wide VOP functions have vop_* names.
6. Initialize the um_* vectors in LFS.
(Recompile your LKMS!!!)
1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.
2. Change VOP_BLKATOFF to a normal function in cd9660.
3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.
4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.
5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
1. Use the default function to access all the specfs operations.
2. Use the default function to access all the fifofs operations.
3. Use the default function to access all the ufs operations.
4. Fix VCALL usage in vfs_cache.c
5. Use VOCALL to access specfs functions in devfs_vnops.c
6. Staticize most of the spec and fifofs vnops functions.
7. Make UFS panic if it lacks bits of the underlying storage handling.
1. Remove comment stating the blatantly obvious.
2. Align in two columns.
3. Sort all but the default element alphabetically.
4. Remove XXX comments pointing out entries not needed.
Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.
A couple of finer points by: bde
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.