reserve, in maximal NFS packets. Originally only 2 packets worth of
space was reserved. The default is now 4, which appears to greatly
improve performance for slow to mid-speed machines on gigabit networks.
Add documentation and correct some prior documentation.
Problem Researched by: Andrew Gallatin <gallatin@cs.duke.edu>
Approved by: jkh
of element [4] in both, which goes beyond the end of the array, leaving
[0], [1], [2], and [3]. This bug did not cause any problems since
the overrun fields are initialized after the bogus array init but
needs to be fixed anyway.
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
-----------------------------
The core of the signalling code has been rewritten to operate
on the new sigset_t. No methodological changes have been made.
Most references to a sigset_t object are through macros (see
signalvar.h) to create a level of abstraction and to provide
a basis for further improvements.
The NSIG constant has not been changed to reflect the maximum
number of signals possible. The reason is that it breaks
programs (especially shells) which assume that all signals
have a non-null name in sys_signame. See src/bin/sh/trap.c
for an example. Instead _SIG_MAXSIG has been introduced to
hold the maximum signal possible with the new sigset_t.
struct sigprop has been moved from signalvar.h to kern_sig.c
because a) it is only used there, and b) access must be done
though function sigprop(). The latter because the table doesn't
holds properties for all signals, but only for the first NSIG
signals.
signal.h has been reorganized to make reading easier and to
add the new and/or modified structures. The "old" structures
are moved to signalvar.h to prevent namespace polution.
Especially the coda filesystem suffers from the change, because
it contained lines like (p->p_sigmask == SIGIO), which is easy
to do for integral types, but not for compound types.
NOTE: kdump (and port linux_kdump) must be recompiled.
Thanks to Garrett Wollman and Daniel Eischen for pressing the
importance of changing sigreturn as well.
txdr_hyper and fxdr_hyper tweaks to avoid excessive CPU order knowledge.
nfs_serv.c: don't call nfsm_adj() with negative values, windows clients
could crash servers when doing a readdir of a large directory.
nfs_socket.c: Use IP_PORTRANGE to get a priviliged port without a spin
loop trying to bind(). Don't clobber a mbuf pointer or we get panics
on a NFS3ERR_JUKEBOX error from a server when reusing a freed mbuf.
nfs_subs.c: Don't loose st_blocks on NFSv2 mounts when > 2GB.
Obtained from: OpenBSD
piecemeal, middle-of-file writes for NFS. These hacks have caused no
end of trouble, especially when combined with mmap(). I've removed
them. Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write. NFS does, however,
optimize piecemeal appends to files. For most common file operations,
you will not notice the difference. The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations. NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write. There is quite a bit of room for further
optimization in these areas.
The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This
is not correct operation. The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid. A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid. This operation is
necessary to properly support mmap(). The zeroing occurs most often
when dealing with file-EOF situations. Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.
getblk() and allocbuf() have been rewritten. B_CACHE operation is now
formally defined in comments and more straightforward in
implementation. B_CACHE for VMIO buffers is based on the validity of
the backing store. B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa). biodone() is now responsible for setting B_CACHE
when a successful read completes. B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated. VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE. This means that bowrite() and bawrite() also
set B_CACHE indirectly.
There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount. These have been fixed. getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.
Major fixes to NFS/TCP have been made. A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain. The server's kernel must be
recompiled to get the benefit of the fixes.
Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
routine was [ab]used for two different things, and you couldn't tell from
the wait channel which one had wedged.
Catch a few things missing from NFS_NOSERVER.
is less than NFS_MINPACKET or greater than NFS_MAXPACKET in size, it
barfs and, I think, drops the connection.
However, there's no guarantee that in a multi-fragment RPC, all the
fragments will be at least as large as NFS_MINPACKET.
In fact, with the version of "tclnfs" we have here, which supports NFS
over TCP, at least when built under SunOS 4.1.3 (i.e., with 4.1.3's
user-mode ONC RPC library), I can *repeatably* cause "tclnfs" to send a
request with more than one fragment, one of which is only 8 bytes long.
I just do a 3877-byte write to a file, at an offset of 0.
The check that "slp->ns_reclen" is greater than or equal to
NFS_MINPACKET serves no useful purpose - if the NFS server code can't
handle packets < NFS_MINPACKET bytes, it can't handle them over *any*
protocol, so the check has to be done above the RPC-over-TCP layer - and
should be removed.
Obtained from: Fix from Guy Harris, forwarded by Rick Macklem.
detachment of vfs sysctls. Unloading of vfs LKMs doesn't actually
work for any vfs, since it leaves garbage pointers to memory
allocation control structures.
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.
interface congestion (eg: nfs over a ppp link, etc). Don't log these
for UDP mounts, and don't cause syscalls to fail with EINTR.
This stops the 'nfs send error 55' warnings.
If the error is because the system is really hosed, this is the least
of your problems...
respectively. Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
of this part of commits is to minimize unnecessary differences between
the other NFS's of similar origin. Yes, there are gratuitous changes here
that the style folks won't like, but it makes the catch-up less difficult.
We had run out of bits in the nfs mount flags, I have moved the internal
state flags into a seperate variable. These are no longer visible via
statfs(), but I don't know of anything that looks at them.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now uses the new variable time_second instead.
gettime() changed to getmicrotime(0.
Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeard time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.
Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.
Reviewed by: bde
Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.
A couple of finer points by: bde
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.
This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.
2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.
3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.
4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.
5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.
As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.
form `tv = time'. Use a new function gettime(). The current version
just forces atomicicity without fixing precision or efficiency bugs.
Simplified some related valid accesses by using the central function.
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.
Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
if a single process is performing a large number of requests (in this
case writing a large file). The writing process could monopolise the
recieve lock and prevent any other processes from recieving their
replies.
It also adds a new sysctl variable 'vfs.nfs.dwrite' which controls the
behaviour which originally pointed out the problem. When a process
writes to a file over NFS, it usually arranges for another process
(the 'iod') to perform the request. If no iods are available, then it
turns the write into a 'delayed write' which is later picked up by the
next iod to do a write request for that file. This can cause that
particular iod to do a disproportionate number of requests from a
single process which can harm performance on some NFS servers. The
alternative is to perform the write synchronously in the context of
the original writing process if no iod is avaiable for asynchronous
writing.
The 'delayed write' behaviour is selected when vfs.nfs.dwrite=1 and
the non-delayed behaviour is selected when vfs.nfs.dwrite=0. The
default is vfs.nfs.dwrite=1; if many people tell me that performance
is better if vfs.nfs.dwrite=0 then I will change the default.
Submitted by: Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp>
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.
Poul mentioned that he thought this was some kind of timing problem, and
that started me thinking. After a little poking around, I found that
nfs_timer() was completely disabled when NFS_NOSERVER was #defined.
But after looking at nfs_timer(), it seemed like it was something
required by both the client and server code, and disabling it outright
just didn't seem to make any sense. Parts of it relate only to the
NFS server side code, so I disabled those, but I re-enabled the rest
of the function and made sure that it would be called from nfs_init()
(in nfs_subs.c).
With nfs_timer() re-enabled, everything seems to work again. The only
other changes I made were to #ifdef away some variable declarations
in the NFS_NOSERVER case so that gcc would stop complaining about
unused variables.
Reviewed by: phk
Submitted by: Bill Paul <wpaul@skynet.ctr.columbia.edu>