878 Commits

Author SHA1 Message Date
Alfred Perlstein
77465d9390 Get rid of qaddr_t.
Requested by: bde
2007-10-16 10:54:55 +00:00
Mohan Srinivasan
faf529dce5 NFS MP scaling changes.
- Eliminate the hideous nfs_sndlock that serialized NFS/TCP request senders
  thru the sndlock.
- Institute a new nfs_connectlock that serializes NFS/TCP reconnects. Add
  logic to wait for pending request senders to finish sending before
  reconnecting. Dial down the sb_timeo for NFS/TCP sockets to 1 sec.
- Break out the nfs xid manipulation under a new nfs xid lock, rather than
  overloading the nfs request lock for this purpose.
- Fix some of the locking in nfs_request.
Many thanks to Kris Kennaway for his help with this and for initiating the
MP scaling analysis and work. Kris also tested this patch thoroughly.
Approved by: re@ (Ken Smith)
2007-10-12 19:12:21 +00:00
Mohan Srinivasan
17c53e4a28 Fix for a very rare race, caused by the nfsiod wakeup and nfsiod idle
timeout occurring at exactly the same time. If this happens, the nfsiod
exits although there may be a queued async IO request for it.

Found by : Kris Kennaway
Approved by: re
2007-09-25 21:08:49 +00:00
Robert Watson
0bf686c125 Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet.  As that
has now been removed, they are no longer required.  Removing them
significantly simplifies error-handling in the socket layer, eliminating
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option.  Clean up some related gotos for
consistency.

Reviewed by:	bz, csjp
Tested by:	kris
Approved by:	re (kensmith)
2007-08-06 14:26:03 +00:00
John Baldwin
f4b65ca5d0 Fix for a race where out of order loading of NFS attrs into the
nfsnode could lead to attrs being stale. One example (that we
ran into) was a READDIR+, WRITE. The responses came back in
order, but the attrs from the WRITE were loaded before the
attrs from the READDIR+, leading to the wrong size being
read on the next stat() call.

MFC after:	1 week
Submitted by:	mohans
Approved by:	re (kensmith)
2007-07-03 18:31:47 +00:00
John Baldwin
03e557fd5a Fix up NFS client write error handling. Errors are split into
recoverable and unrecoverable. For the former, we redirty the
buffer and hang onto it for future retries. For the latter (e.g.
ESTALE), we discard the buffer and return the error back to the
user on the next syscall. This fixes a number of vfs panics and
fixes having a large number of dirty buffers (that cannot be
written out and reclaimed) from hanging around. Thanks to ups@
for discussions on this issue.

Reported by:	kris, Kai, others
Approved by:	re (kensmith)
2007-07-03 18:30:55 +00:00
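A minimal sketch of the recoverable/unrecoverable split described in 03e557fd5a, written as a user-space Python model rather than kernel code. The names `Buffer`, `Node`, `handle_write_error` and `next_syscall` are hypothetical, and the classification is an assumption: ESTALE is the only unrecoverable error the commit message names, so everything else is treated as recoverable here purely for illustration.

```python
import errno
from dataclasses import dataclass
from typing import Optional

# Assumed classification for illustration only. ESTALE is the one example the
# commit message gives of an unrecoverable error; the kernel's real list may
# differ.
UNRECOVERABLE = {errno.ESTALE}


@dataclass
class Buffer:
    dirty: bool = False
    discarded: bool = False


@dataclass
class Node:
    pending_error: Optional[int] = None  # reported on the next syscall


def handle_write_error(buf: Buffer, node: Node, err: int) -> None:
    """Redirty the buffer on a recoverable error, else drop it and save the error."""
    if err in UNRECOVERABLE:
        buf.dirty = False
        buf.discarded = True
        node.pending_error = err
    else:
        buf.dirty = True  # keep the data around for a later retry


def next_syscall(node: Node) -> int:
    """Return and clear the saved error (0 if none), as the next syscall would."""
    err, node.pending_error = node.pending_error or 0, None
    return err
```

Discarding the buffer in the unrecoverable case is what keeps unwritable dirty buffers from accumulating, which is the symptom the commit message describes.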
Attilio Rao
b4b7081961 Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distributed loads method for vmmeter (distributed through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)
2007-06-04 21:45:18 +00:00
Jeff Roberson
1c4bcd050a - Move rusage from being per-process in struct pstats to per-thread in
   td_ru.  This removes the requirement for per-process synchronization in
   statclock() and mi_switch().  This was previously supported by
   sched_lock which is going away.  All modifications to rusage are now
   done in the context of the owning thread.  Reads proceed without locks.
 - Aggregate exiting threads' rusage in thread_exit() such that the exiting
   thread's rusage is not lost.
 - Provide a new routine, rufetch(), to fetch an aggregate of all rusage
   structures from all threads in a process.  This routine must be used
   in any place requiring a rusage from a process prior to its exit.  The
   exited process's rusage is still available via p_ru.
 - Aggregate tick statistics only on demand via rufetch() or when a thread
   exits.  Tick statistics are kept in the thread and protected by sched_lock
   until it exits.

Initial patch by:	attilio
Reviewed by:		attilio, bde (some objections), arch (mostly silent)
2007-06-01 01:12:45 +00:00
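The per-thread accounting in 1c4bcd050a can be sketched as a small user-space model. This is an illustration under assumptions, not the kernel code: `Rusage` carries only two made-up counters, and using `p_ru` as the place where exited threads' usage accumulates is a simplification of what the commit message describes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rusage:
    utime: int = 0
    stime: int = 0

    def add(self, other: "Rusage") -> None:
        self.utime += other.utime
        self.stime += other.stime


@dataclass
class Proc:
    threads: List[Rusage] = field(default_factory=list)  # one td_ru per live thread
    p_ru: Rusage = field(default_factory=Rusage)         # usage of exited threads

    def thread_exit(self, td_ru: Rusage) -> None:
        """Fold the exiting thread's usage into the process so it is not lost."""
        self.threads.remove(td_ru)
        self.p_ru.add(td_ru)

    def rufetch(self) -> Rusage:
        """Aggregate on demand: exited threads plus every live thread."""
        total = Rusage(self.p_ru.utime, self.p_ru.stime)
        for td_ru in self.threads:
            total.add(td_ru)
        return total
```

Because each thread only ever writes its own record, the hot paths need no per-process lock; the cost of summing is paid only by callers of `rufetch()`.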
Attilio Rao
2feb50bf7d Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
Robert Watson
5d0dd109f4 In nfs_down(), if rep can be NULL, which we test for, then we should
lock and unlock conditionally, not just set the flag on it conditionally.
In practice, this bug couldn't manifest, as in the current revision of
the code, no callers pass a NULL rep.

CID:		1416
Found with:	Coverity Prevent(tm)
2007-05-18 19:34:54 +00:00
Jeff Roberson
222d01951f - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
   vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
John Baldwin
a1054d5776 Various fixes to the NFS Directio support.
- Fix for a bug where a close would not wait for all (directio)
  dirty buffers to drain. The nfsnode was not marked NMODIFIED
  when there were directio dirtied buffers pending, causing this.
- No reason to vhold/vrele the vp when enqueueing DirectIO requests
  for the nfsiods. The vnode can't really go away since the close
  has to wait for these requests to drain.

MFC after:	1 week
Submitted by:	mohans
2007-04-25 20:34:55 +00:00
Robert Watson
dc4725135d Attempt to rationalize NFS privileges:
- Replace PRIV_NFSD with PRIV_NFS_DAEMON, add PRIV_NFS_LOCKD.

- Use PRIV_NFS_DAEMON in the NFS server.

- In the NFS client, move the privilege check from nfslockdans(), which
  occurs every time a write is performed on /dev/nfslock, and instead do it
  in nfslock_open() just once.  This allows us to avoid checking the saved
  uid for root, and just use the effective on open.  Use PRIV_NFS_LOCKD.
2007-04-21 18:11:19 +00:00
Xin LI
1247688a3e Don't destroy a mutex just before we use it, instead,
destroy it after we have used it.
2007-03-23 08:52:36 +00:00
Tor Egge
61b9d89ff0 Make insmntque() externally visible and allow it to fail (e.g. during
late stages of unmount).  On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque().  Previously,
embryonic vnodes were put onto the list of vnodes belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode.  The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup.  Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by:	re (kensmith)
Reviewed by:	kib
In collaboration with:	kib
2007-03-13 01:50:27 +00:00
Mohan Srinivasan
d9915117c9 Back out a change to nfs_timer() that inadvertently crept into the last checkin :(
2007-03-09 04:07:54 +00:00
Mohan Srinivasan
f9bb753844 Over NFS, an open() call could result in multiple over-the-wire
GETATTRs being generated - one from lookup()/namei() and the other
from nfs_open() (for cto consistency). This change eliminates the
GETATTR in nfs_open() if an otw GETATTR was done from the namei()
path. Instead of extending the vop interface, we timestamp each attr
load, and use this to detect whether a GETATTR was done from namei()
for this syscall. Introduces a thread-local variable that counts the
syscalls made by the thread and uses <pid, tid, thread syscalls> as
the attrload timestamp. Thanks to jhb@ and peter@ for a discussion on
thread state that could be used as the timestamp with minimal overhead.
2007-03-09 04:02:38 +00:00
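The timestamp idea in f9bb753844 can be modelled in a few lines. This is a hedged user-space sketch, not the kernel implementation: `Thread`, `NfsNode`, `load_attrs` and this `nfs_open` are hypothetical stand-ins, and only the `<pid, tid, thread syscalls>` tuple comes from the commit message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Stamp = Tuple[int, int, int]  # (pid, tid, syscalls made by the thread)


@dataclass
class Thread:
    pid: int
    tid: int
    syscalls: int = 0

    def enter_syscall(self) -> None:
        self.syscalls += 1

    def stamp(self) -> Stamp:
        return (self.pid, self.tid, self.syscalls)


@dataclass
class NfsNode:
    attr_stamp: Optional[Stamp] = None
    getattr_rpcs: int = 0

    def load_attrs(self, td: Thread) -> None:
        """Model an over-the-wire GETATTR and remember who loaded the attrs."""
        self.getattr_rpcs += 1
        self.attr_stamp = td.stamp()


def nfs_open(node: NfsNode, td: Thread) -> None:
    """Skip the GETATTR if this same syscall already loaded fresh attrs."""
    if node.attr_stamp != td.stamp():
        node.load_attrs(td)
```

Since the per-thread counter changes on every syscall, a stamp left by an earlier open() can never match a later one, so close-to-open consistency is preserved across syscalls.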
John Baldwin
4d70511ac3 Use pause() rather than tsleep() on stack variables and function pointers.
2007-02-27 17:23:29 +00:00
Mohan Srinivasan
0973754e14 Backing out an earlier change. It seems harmless for NFS to miss the "force
unmount" flag, making the acquisition of the MNT_ILOCK in nfs_request() and
nfs_sigintr() unnecessary. Pointed out by tegge@.
2007-02-16 03:46:55 +00:00
Mohan Srinivasan
024465d002 Add missing MNT_ILOCK around some mnt_kern_flag accesses.
2007-02-11 04:01:10 +00:00
Mohan Srinivasan
4e99994cc9 Fix for a vnode lock leak in nfs_create() in the event of an error.
Spotted by ups@.
2007-01-31 23:10:27 +00:00
Kris Kennaway
410355bf69 Instead of always hard-coding the socket type for the nfs root mount as
SOCK_DGRAM (i.e. UDP), respect the value configured earlier.  This allows
TCP NFS root mounts using e.g. the boot.nfsroot.options="tcp" tunable.

In this case some of the connection parameters like the retry timer were
previously set appropriately for TCP but inappropriately for the UDP
socket that was actually used, leading to e.g. extremely long recovery
times (O(hours)) after a nfs server reboot.

Reviewed by:    mohans
MFC After:      2 weeks
2007-01-30 00:26:04 +00:00
Bruce Evans
e43982a801 Unstaticize nfs_iosize() in nfsclient and use it in nfs4client instead
of duplicating it except for larger style bugs in the copy.

Fix some nearby style bugs (including a harmless type mismatch)
in and near the remaining copy.

This is part of fixing collisions of the 2 nfs*client's names.  Even
static names should have unique prefixes so that they can be debugged
easily.
2007-01-25 13:07:25 +00:00
Konstantin Belousov
2cc7d26f7f Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquisition.

Avoid dealing with buffers in bdwrite() that are from the other side of the
snaplock divisor in the lock order than the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by:	tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by:	Peter Holm
X-MFC after:	3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
Mohan Srinivasan
7f3a6e42c9 NetApp filers return corrupt post op attrs in the wcc on NFS error responses.
This is easy to reproduce for EROFS. I am not sure if the attrs can be corrupt
for other NFS error responses. For now, disabling wcc pre-op attr checks and
post-op attr loads on NFS errors (sysctl'ed).
Reported by: Kris Kennaway
2006-12-11 19:54:25 +00:00
Sam Leffler
49d5157434 consolidate parsing of nfs root mount options in one place
and handle all options (some may require fixes elsewhere)

Reviewed by:	jhb, mohans
MFC after:	1 month
2006-12-06 02:15:25 +00:00
Mohan Srinivasan
594ece53bc In nfs_nget(), we must initialize the fh in the nfsnode before inserting the
vnode into the vfs hash. Otherwise, another thread walking the hash can trip
on an nfsnode with an uninitialized or partially initialized fh.
Thanks to ups@ for spotting this race.
2006-11-29 02:21:40 +00:00
Mohan Srinivasan
d4875805d7 bde@ pointed out that tprintf() acquires Giant so callers of tprintf() don't
have to explicitly acquire Giant (although they need to be aware of this and
not hold any locks at that point). Remove the acquisitions of Giant in the
NFS client wrapping tprintf().
2006-11-27 23:26:06 +00:00
Mohan Srinivasan
88d5725c38 Fix for a bug caused by a race when 2 threads lookup the same
file. Leave the loser's lock(s) initialized, so the reclaim logic can
unconditionally destroy them when that race occurs (or if the vfs hash
insert happened to fail for some other reason). Thanks to ups@ for a
careful review of the code.
Reported by : Kris Kennaway
2006-11-27 19:06:43 +00:00
Mohan Srinivasan
a18c4dc336 1) Fix up locking in nfs_up() and nfs_down().
2) Reduce the acquisitions of the Giant lock in the nfs_socket.c paths significantly.
- We don't need to acquire Giant before tsleeping on lbolt anymore,
  since jhb specialcased lbolt handling in msleep.
- nfs_up() needs to acquire Giant only if printing the "server up"
  message.
- nfs_timer() held Giant for the duration of the NFS timer processing,
  just because the printing of the message in nfs_down() needed it
  (and we acquire other locks in nfs_timer()). The acquisition of
  Giant is moved down into nfs_down() now, reducing the time Giant is
  held in that path.

Reported by: Kris Kennaway
2006-11-20 04:14:23 +00:00
Mohan Srinivasan
3c2fcc3c92 vfs_hash_insert() vputs() the losing vnode before returning, in the event of
a race where a duplicate vnode is entered into the vfs hash. nfs_nget() shouldn't
be releasing the vnode in that case.
2006-11-16 23:03:46 +00:00
Mohan Srinivasan
87c125cecc Fix to readdir+ reply handling. When inserting an entry into the namecache,
initialize the nfsnode's ctime. Otherwise a subsequent lookup purges the
just entered namecache entry.
2006-11-16 23:02:37 +00:00
Sam Leffler
83cc6b9ad2 honor nolockd flag in root mount options
MFC after:	2 weeks
2006-11-07 18:02:45 +00:00
Mohan Srinivasan
88b94fba38 Make EWOULDBLOCK a recoverable error so that the request is retransmitted.
This bug results in data corruption with NFS/TCP. Writes are silently dropped
on EWOULDBLOCK (because socket send buffer is full and sockbuf timer fires).

Reviewed by: ups@
2006-10-31 20:25:37 +00:00
Bruce Evans
35259c2c89 Fixed some style bugs (especially ones involving long lines and use
of __P(())).  There are many more.
2006-10-17 22:07:07 +00:00
Bruce Evans
6a72ff6b09 Don't do null Setattr RPCs for VA_MARK_ATIME. When we added the
VA_MARK_ATIME feature to fix POSIX conformance for execve() and mmap(),
we thought that it was optimized well enough for the one file system
that supports it (ffs) and harmless for other file systems (except
layered ones which already get the layering for VOP_SETATTR() wrong).
However, nfs_setattr() doesn't do much parameter checking, so when
it gets a combination of parameters that it doesn't understand, it
always does a Setattr RPC.  This RPC can't do anything good, and for
VA_MARK_ATIME it is null except for wasting a lot of time.

This is the smallest and easiest to fix of several bugs that have
increased the number of RPCs for kernel builds on nfs by more than
100% since 2004-11-05.  The real-time increase depends on network
latency and parallelization and can also be very large (approaching
the same percentage for unparallelized operations like "make depend"
on systems with fast CPUs and high-latency networks).
2006-10-14 07:25:11 +00:00
Poul-Henning Kamp
f645b0b51c First part of a little cleanup in the calendar/timezone/RTC handling.
Move relevant variables to <sys/clock.h> and fix #includes as necessary.

Use libkern's much more time- & space-efficient BCD routines.
2006-10-02 12:59:59 +00:00
Tor Egge
a1e363f256 Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC.  Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
2006-09-26 04:15:59 +00:00
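A small model of the counter-plus-derived-flag scheme from a1e363f256. This is a sketch under assumptions: the kernel stores MNTK_ASYNC as a real flag and updates it under the mount interlock, whereas this model computes it on demand, and the `noasync_enter`/`noasync_exit` method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mount:
    mnt_async: bool = False   # stands in for MNT_ASYNC in mnt_flag
    mnt_noasync: int = 0      # callers that temporarily forbid async I/O

    @property
    def mntk_async(self) -> bool:
        """Derived flag: async I/O may start only when this is True."""
        return self.mnt_async and self.mnt_noasync == 0

    def noasync_enter(self) -> None:
        self.mnt_noasync += 1

    def noasync_exit(self) -> None:
        assert self.mnt_noasync > 0
        self.mnt_noasync -= 1
```

The counter is what makes interleaving safe: overlapping callers each hold their own increment, and MNT_ASYNC itself is never cleared and restored, so it cannot be lost.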
Tor Egge
5da56ddb21 Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().
2006-09-26 04:12:49 +00:00
Mohan Srinivasan
7d7d9e2242 Fixes up the handling of shared vnode lock lookups in the NFS client,
adds a FS type specific flag indicating that the FS supports shared
vnode lock lookups, adds some logic in vfs_lookup.c to test this flag
and set lock flags appropriately.

- amd on 6.x is a non-starter (without this change). Using amd under
  heavy load results in a deadlock (with cascading vnode locks all the
  way to the root) very quickly.
- This change should also fix the more general problem of cascading
  vnode deadlocks when an NFS server goes down.

Ideally, we wouldn't need these changes, as enabling shared vnode lock
lookups globally would work. Unfortunately, UFS, for example, isn't
ready for shared vnode lock lookups, crashing pretty quickly.

This change is the result of discussions with Stephan Uphoff (ups@).

Reviewed by:	ups@
2006-09-13 18:39:09 +00:00
Mohan Srinivasan
6cd7078919 Fix for a deadlock triggered by a 'umount -f' causing a NFS request to never
retransmit (or return). Thanks to John Baldwin for helping nail this one.

Found by : Kris Kennaway
2006-08-29 22:00:12 +00:00
Thomas Quinot
3401780fa0 Fix typos in comment.
2006-08-16 23:53:05 +00:00
Alan Cox
5786be7cc7 Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags.  Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
2006-08-09 17:43:27 +00:00
Brooks Davis
a36aa44a85 Add a new kernel environment variable "boot.netif.mtu" which is used to
set the MTU prior to mounting root via NFS.  This is required if the
server supports a higher than default MTU because the client will not
see the responses otherwise.

MFC after:	3 weeks
2006-08-09 01:56:17 +00:00
Robert Watson
b0668f7151 soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used universally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod:	sam, gnn, wollman
2006-07-24 15:20:08 +00:00
Konstantin Belousov
c915bcbad2 Signals may be delivered to process as well as to the thread. Check the
thread-delivered signals in addition to the process one.

Reviewed by:	mohan
MFC after:	1 month
Approved by:	kan (mentor)
2006-07-08 15:39:11 +00:00
Konstantin Belousov
201599c3af Always supply curthread as argument to nfs_asyncio and nfs_doio
in nfs_strategy. Otherwise, for some buffers, signals would be ignored
at the intr mounts.

Reviewed by:	mohan
MFC after:	1 month
Approved by:	kan (mentor)
2006-07-08 15:36:51 +00:00
Yaroslav Tykhiy
4b97d7affd There is a consensus that ifaddr.ifa_addr should never be NULL,
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures.  Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
for the system crashing honestly instead of masking possible bugs.

Suggested by:	glebius, jhb, ru
2006-06-29 19:22:05 +00:00
Yaroslav Tykhiy
576cdf4352 Use the elegant TAILQ_FOREACH() in place of a hand-rolled for() loop.
2006-06-29 15:37:39 +00:00
Mohan Srinivasan
64c3892747 Kris Kennaway found that for '/' NFS mounts, the MPSAFE mount flag was
not being set, which means Giant would be acquired for these mounts.
2006-05-30 20:32:44 +00:00
Mohan Srinivasan
1af6f471ca Fix for a potential attempt to sleep while holding nm_mtx. Caught and reported
by Witness (which forces the mbuf allocation flag to M_NOWAIT).

Reported by: "sekes".
2006-05-26 18:45:55 +00:00
Stephan Uphoff
6c1b7d16c2 Call vm_object_page_clean() with the object lock held.
Submitted by:	kensmith@
Reviewed by:	mohans@
MFC after:	6 days
2006-05-25 17:16:11 +00:00
Stephan Uphoff
dcf67e65d2 Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flushing dirty pages in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by:	tegge@
MFC after:	7 days
2006-05-25 01:00:35 +00:00
Mohan Srinivasan
5bbfbd1422 Since NFSv4 is not SMP safe, nfsiod needs to acquire Giant for NFSv4 mounts
before doing the read/write.

Reported by:	Chuck Lever.
2006-05-24 23:06:50 +00:00
Robert Watson
33c6a485bd Adjust minimum iod threads from 4 to 0 -- since we compile the NFS
client into the kernel by default, and many users won't use NFS,
don't start an extra 4 kernel threads that are unused.  Once NFS
becomes active, it will start nfsiod's as it needs them.

We might consider mandating a minimum iod's equal to the number of
active NFS mounts (truncated to some value), which would force some
to remain available without having to create a new one if the file
system is mostly inactive.

PR:		70880
MFC after:	2 weeks
Prodded by:	cel
Head nod:	peter
Pointed out by:	Joe <fbsd_user at a1poweruser dot com>
2006-05-24 21:04:46 +00:00
Chuck Lever
6d0699a5ba NFS over TCP retransmit behavior should default to a 60 second time out,
mimicking the NFS reference implementation.

NFS over TCP does not need fast retransmit timeouts, since network loss
and congestion are managed by the transport (TCP), unlike with NFS over
UDP.  A long timeout prevents the unnecessary retransmission of non-
idempotent NFS requests.

Reviewed by:	mohans, silby, rees?
Sponsored by:	Network Appliance, Incorporated
2006-05-23 18:48:07 +00:00
Chuck Lever
94163ea283 Refactor the NFS over UDP retransmit timeout estimation logic to allow
the estimator to be more easily tuned and maintained.

There should be no functional change except there is now a lower limit
on the retransmit timeout to prevent the client from retransmitting
faster than the server's disks can fill requests, and an upper limit
to prevent the estimator from taking too long to retransmit during a
server outage.

Reviewed by:	mohan, kris, silby
Sponsored by:	Network Appliance, Incorporated
2006-05-23 18:33:58 +00:00
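The lower and upper limits mentioned in 94163ea283 amount to clamping the estimator's output. A minimal sketch follows; the function name `clamp_rto` and its parameters are hypothetical, and the commit message does not state the actual bounds, so none are given here.

```python
def clamp_rto(estimate_ticks: int, min_ticks: int, max_ticks: int) -> int:
    """Bound a retransmit timeout estimate to [min_ticks, max_ticks]."""
    if min_ticks > max_ticks:
        raise ValueError("min_ticks must not exceed max_ticks")
    return max(min_ticks, min(estimate_ticks, max_ticks))
```

The floor keeps the client from retransmitting faster than the server can service requests; the ceiling keeps recovery after a server outage from being delayed indefinitely.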
Mohan Srinivasan
f2c48228fe Vnode locks are recursive and the NFS client supports shared vnode locks.
Found by: Kris Kennaway.
2006-05-23 16:07:23 +00:00
Mohan Srinivasan
f1cdf89911 Changes to make the NFS client MP safe.
Thanks to Kris Kennaway for testing and sending lots of bugs my way.
2006-05-19 00:04:24 +00:00
Mohan Srinivasan
671d06fb2e Fix a snafu caused while patching the previous fix from another branch.
2006-05-05 18:12:13 +00:00
Mohan Srinivasan
9f5b7dea42 Fix for a NFS/TCP client bug which would cause the NFS/TCP stream to get
out of sync under heavy loads, forcing frequent reconnects, causing EBADRPC
errors etc.
2006-05-05 18:04:53 +00:00
Mohan Srinivasan
5ef7d50da5 Keep track of the number of in-progress async direct IO writes in the nfsnode.
Make fsync/close wait until all of these drain. Add a check to nfs_getpage() and
nfs_putpage().
2006-04-06 01:20:30 +00:00
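The "count in-progress writes, make fsync/close wait for the count to reach zero" pattern from 5ef7d50da5 can be sketched with a condition variable. This is a user-space illustration only; the class `DirectIOTracker` and its method names are hypothetical and do not correspond to fields in the nfsnode.

```python
import threading


class DirectIOTracker:
    """Counts in-flight async direct I/O writes and lets a closer wait for zero."""

    def __init__(self) -> None:
        self._cv = threading.Condition()
        self._pending = 0

    def write_started(self) -> None:
        with self._cv:
            self._pending += 1

    def write_done(self) -> None:
        with self._cv:
            self._pending -= 1
            if self._pending == 0:
                self._cv.notify_all()  # wake anyone blocked in drain()

    def drain(self, timeout=None) -> bool:
        """Block until no writes are pending; False if the timeout expired first."""
        with self._cv:
            return self._cv.wait_for(lambda: self._pending == 0, timeout)
```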
Jeff Roberson
b2282f9a3f - Busy the filesystem in nfs_statfs to prevent us from creating a new
   vnode after vflush() has succeeded.  This would cause a dangling vnode
   panic at unmount time otherwise.  Other filesystems may have this problem
   via their VFS_VGET() routines.

Found by:	kris
Sponsored by:	Isilon Systems, Inc.
2006-04-01 01:15:23 +00:00
Kris Kennaway
78e31796c9 Fix a bug in the NFS/TCP retransmission path.
The bug was that earlier, if a request was retransmitted,
we would do subsequent retransmits every 10 msecs.

This can cause data corruption under moderate loads by reordering
operations as seen by the client NFS attribute cache, and on the
server side when the retransmission occurs after the original request
has left the duplicate cache, since the operation will be committed
for a second time.

Further work on retransmission handling is needed (e.g. they are still
being sent too often since they are scaled by HZ, and the size of
the dup cache is too small and easily overwhelmed on busy servers).

Submitted by:	mohans
2006-03-23 22:58:42 +00:00
Pawel Jakub Dawidek
9972deb772 Actually I wanted 'nolockd' here instead of 'lockd'.
MFC after:	2 days
2006-03-19 13:27:37 +00:00
Chuck Lever
a59b03bf0e If an NFS server returns more than a few EJUKEBOX errors for a given RPC
request, the FreeBSD NFS client will quickly back off to an excessively
long wait (days, then weeks) before retrying the request.

Change the behavior of the FreeBSD NFS client to match the behavior of
the reference NFS client implementation (Solaris).  This provides a fixed
delay of 10 seconds between each retry by default.  A sysctl, called
nfs3_jukebox_delay, is now available to tune the delay.  Unlike Solaris,
the sysctl value on FreeBSD is in seconds, rather than in HZ.

Sponsored by:	Network Appliance, Incorporated
Reviewed by:	rick
Approved by:	silby
MFC after:	3 days
2006-03-17 22:14:23 +00:00
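A short sketch of the fixed-delay behaviour described in a59b03bf0e. The 10 second default and the seconds-based unit come from the commit message; the function names and the `hz` parameter are hypothetical, and the conversion to ticks is an assumption about how a seconds value would be turned into a kernel sleep interval.

```python
NFS3_JUKEBOX_DELAY = 10  # seconds; default named in the commit message


def jukebox_wait_ticks(hz: int, delay_seconds: int = NFS3_JUKEBOX_DELAY) -> int:
    """Convert the delay in seconds into scheduler ticks."""
    return delay_seconds * hz


def retry_times(retries: int, delay_seconds: int = NFS3_JUKEBOX_DELAY) -> list:
    """Seconds after the first EJUKEBOX reply at which each retry is sent."""
    return [delay_seconds * (i + 1) for i in range(retries)]
```

With a fixed delay the retry times grow linearly, so even a long run of EJUKEBOX replies never pushes the next attempt out by days.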
Chuck Lever
9f5349f23d Fix a bug in NFSv3 READDIRPLUS reply processing
The client's READDIRPLUS logic skips the attributes and
filehandle of the ".." entry.  If the server doesn't send
attributes but does send a filehandle for "..", the
client's logic doesn't account for the extra "value
follows" field that indicates whether the filehandle is
present, causing the remaining entries in the reply
to be ignored.

Sponsored by:	Network Appliance, Inc.
Reviewed by:	rick, mohans
Approved by:	silby
MFC after:	2 weeks
2006-03-08 01:43:01 +00:00
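The "value follows" handling fixed in 9f5349f23d can be illustrated with a small XDR reader. This is a sketch, not the client's code: the function name `skip_attrs_and_fh` is hypothetical, and the layout assumed is the NFSv3 wire format (RFC 1813), in which an entry's optional attributes and optional filehandle are each preceded by their own 4-byte boolean.

```python
import struct

FATTR3_SIZE = 84  # bytes occupied by an NFSv3 fattr3 on the wire


def skip_attrs_and_fh(buf: bytes, off: int) -> int:
    """Return the offset just past an entry's post_op_attr and post_op_fh3."""
    (attrs_follow,) = struct.unpack_from(">I", buf, off)
    off += 4
    if attrs_follow:
        off += FATTR3_SIZE
    # The filehandle has its own "value follows" word, present even when the
    # attributes were omitted.
    (fh_follows,) = struct.unpack_from(">I", buf, off)
    off += 4
    if fh_follows:
        (fh_len,) = struct.unpack_from(">I", buf, off)
        off += 4 + ((fh_len + 3) & ~3)  # XDR pads opaque data to 4 bytes
    return off
```

Reading the second boolean unconditionally is the point: skipping it when attributes are absent leaves the parser misaligned for every entry that follows.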
Jim Rees
4b81d0eb0f Don't log an error on tcp connection reset, even if we don't get ECONNRESET.
Submitted by:	cel@citi.umich.edu
2006-01-20 15:07:18 +00:00
Alfred Perlstein
92e73f5711 I ran into an nfs client panic a couple of times in a row over the
last few days.  I tracked it down to the fact that nfs_reclaim()
is setting vp->v_data to NULL _before_ calling vnode_destroy_object().
After silence from the mailing list I checked further and discovered
that ufs_reclaim() is unique among FreeBSD filesystems for calling
vnode_destroy_object() early, long before tossing v_data or much
of anything else, for that matter.  The rest, including NFS, appear
to be identical, as if they were just clones of one original routine.

The enclosed patch fixes all file systems in essentially the same
way, by moving the call to vnode_destroy_object() to early in the
routine (before the call to vfs_hash_remove(), if any).  I have
only tested NFS, but I've now run for over eighteen hours with the
patch where I wouldn't get past four or five without it.

Submitted by: Frank Mayhar
Requested by: Mohan Srinivasan
MFC After: 1 week
2006-01-17 17:29:03 +00:00
Robert Watson
63074a901a In nfs_dolock(), GC now under-used ioflg, rendered obsolete when we moved
from using a fifo to talk to rpc.lockd to using a special device node.

Noticed by:	Coverity Prevent analysis tool
MFC after:	3 days
2006-01-13 23:16:29 +00:00
Tor Egge
82be0a5a24 Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by:	truckman
2006-01-09 20:42:19 +00:00
Xin LI
fc9fac4c78 Correct a typo
2005-12-28 10:03:48 +00:00
Paul Saab
fc6ff223c4 Improve upon rev 1.133 where NFS/TCP would not reconnect.
Submitted by:	Mohan Srinivasan
2005-12-12 23:18:05 +00:00
Ruslan Ermilov
2f1b461447 Unexpand LLADDR().
2005-11-29 09:51:47 +00:00
Paul Saab
38b29f71ef Fix for a bug where NFS/TCP would not reconnect (in the case where
the server FIN'ed). Seen with Solaris NFS servers.

Reported by:	TOMITA Yoshinori <yoshint@flab.fujitsu.co.jp>
Submitted by:	Mohan Srinivasan
2005-11-21 19:25:24 +00:00
Paul Saab
3834aac17e - Always return success from NFS strategy. nfs_doio(), in the
  event of an error, does the right thing, in terms of setting
  the error flags in the buf header. That fixes a crash from
  bstrategy().
- Treat ETIMEDOUT as a "recoverable" error, causing the buffer
  to be re-dirtied. ETIMEDOUT can occur on soft mounts, when
  the number of retries is exceeded, and we don't want data loss
  in that case.

Submitted by:	Mohan Srinivasan
2005-11-21 19:23:46 +00:00
Jim Rees
cb156cc603 fix a problem with XID re-use when a server returns NFSERR_JUKEBOX.
Submitted by:	cel@citi.umich.edu
Fixed by:	rick@snowhite.cis.uoguelph.ca
Approved by:	alfred
MFC after:	3 weeks
2005-11-21 18:39:18 +00:00
Jonathan Chen
0b3e7451da fix a crash when an nfsv2 mount fails
MFC after:	1 week
2005-11-10 23:25:16 +00:00
Paul Saab
9c31df40bb Fix for a crash (from nfs_lookup() in an error case).
Submitted by:	Mohan Srinivasan
2005-11-03 19:24:54 +00:00
Paul Saab
41ce2892bb In nfs_flush(), clear the NMODIFIED bit only if there are no dirty
buffers *and* there are no buffers queued up for writing.  The bug
was that NMODIFIED was being cleared even while there were buffers
scheduled to be written out, which leads to all sorts of interesting
bugs - one where the file could shrink (because of a post-op getattr
load, say) causing data in buffer(s) queued for write to be tossed,
resulting in data corruption.

Submitted by:	Mohan Srinivasan
2005-11-03 07:42:15 +00:00
Paul Saab
120c58288c Fix for a race between the thread transmitting the request and the
thread processing the reply.

Submitted by:	Mohan Srinivasan
2005-11-03 07:31:06 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Gleb Smirnoff
c0bc2867c1 - Fix leak of struct nlminfo on process exit.
- Fix malloc type collision, that made the above problem
  difficult to understand.

Reported by:	Vladimir Sharun <sharun ukr.net>
2005-10-26 07:18:37 +00:00
Pawel Jakub Dawidek
df71afde00 - Use strsep() instead of strtok().
- strdup() uses M_WAITOK, so we don't need to check its return value
  against NULL.

MFC after:	2 weeks
2005-10-06 19:04:08 +00:00
Pawel Jakub Dawidek
720f3948c0 Add boot.nfsroot.options loader tunable.
It allows specifying options for the NFS root file system.
Currently supported options are: soft, intr, conn, lockd.

I'm adding this functionality mostly for 'lockd' option, which is only
honored when performing the initial mount and will be silently ignored
if used while updating the mount options.

This will allow using flock(2) without the need of using varmfs or
rpc.lockd and friends.

Example of use:
boot.nfsroot.options="intr,lockd"

MFC after:	2 weeks
2005-10-06 11:18:34 +00:00
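A hedged sketch of how a comma-separated `boot.nfsroot.options` value such as the example in 720f3948c0 could be turned into flags. The option list is the one given in that commit message (a later entry in this log, 9972deb772, notes that 'nolockd' was the intended name); the function `parse_nfsroot_options` is hypothetical, and silently ignoring unknown options is an assumption.

```python
SUPPORTED = ("soft", "intr", "conn", "lockd")  # list given in the commit message


def parse_nfsroot_options(value: str) -> dict:
    """Map each supported option name to True if it appears in the tunable."""
    flags = {name: False for name in SUPPORTED}
    for opt in value.split(","):
        if opt in flags:
            flags[opt] = True
    return flags
```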
Robert Watson
84d2b7df26 Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions.  In most cases, this means adding an
acquisition of Giant immediately around the function.  In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset.  This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad idea, as is having
non-MPSAFE tty code.

MFC after:	1 week
2005-09-19 16:51:43 +00:00
Paul Saab
250614c5ab Fix for a bug in the change that made nfs_timer() MPSAFE. We need to
grab Giant before calling pru_send() (if running with mpsafenet = 0).

Found by:	Jeremie Le Hen.
Fixed by:	Maxime Henrion
2005-07-27 15:06:26 +00:00
Paul Saab
4fb48d10b0 In nfs_nget() if two threads race on the same filehandle, the loser should
cause the nfsnode to get freed. This fixes a potential vnode (and nfsnode)
leak in that path.

Submitted by:	Mohan Srinivasan
Reviewed by:	phk
2005-07-27 15:05:31 +00:00
Paul Saab
865b5cc7fd Remove the NFS client rslock. The rslock was used to serialize
writers that want to extend the file. It was also used to serialize
readers that might want to read the last block of the file (with a
writer extending the file).  Now that we support vnode locking for
NFS, the rslock is unnecessary. Writers grab the exclusive vnode
lock before writing and readers grab the shared (or in some cases
the exclusive) lock.

Submitted by:	Mohan Srinivasan
2005-07-21 22:46:56 +00:00
Paul Saab
4321eae6b7 Make nfs_timer() MPSAFE. With this change, the bottom half of the NFS
client (the interface with the protocol stack and callouts) is
Giant-free.

Submitted by:	Mohan Srinivasan.
2005-07-19 21:27:25 +00:00
Paul Saab
38b8570c55 Fix for an NFS soft mount bug where, if the number of retries exceeds
the max rexmits, the request was not being bounced back with an
ETIMEDOUT error.

Reported by:	Oliver Lehmann
Submitted by:	Mohan Srinivasan
2005-07-18 02:12:17 +00:00
Paul Saab
0e38f5365b Fixes for NFS crashes on architectures that require strict alignment.
- Fix nfsm_disct() so that after pulling up data, the remaining data
  is aligned if necessary.
- Fix nfs_clnt_tcp_soupcall() to bcopy() the rpc length out of the
  mbuf (instead of casting m_data to a uint32).

Submitted by:	Pyun YongHyeon
Reviewed by:	Mohan Srinivasan
2005-07-14 20:08:27 +00:00
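The second fix in commit 0e38f5365b (copying the RPC length out of the buffer instead of casting the data pointer) can be illustrated in userland. This is a sketch under assumptions: `read_rpc_len()` is a hypothetical helper, memcpy() stands in for the kernel's bcopy(), and masking off the top bit follows the standard ONC RPC record-marking layout rather than anything stated in the commit.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Read the 4-byte RPC record mark from a buffer that may not be
 * 4-byte aligned. Copying into a local variable avoids the unaligned
 * load that a cast such as *(const uint32_t *)p would perform, which
 * faults on strict-alignment architectures.
 */
static uint32_t
read_rpc_len(const unsigned char *p)
{
	uint32_t mark;

	assert(p != NULL);
	memcpy(&mark, p, sizeof(mark));
	/* The mark is big-endian; the top bit flags the last fragment. */
	return (ntohl(mark) & 0x7fffffffU);
}
```

The copy is byte-wise, so it works at any offset into the buffer, at the cost of a few extra instructions compared with a direct load.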
Brian Feldman
6979a7592a Ifdef out the incomplete non-blocking IO implementation for NFS
pending discussion of how implementation would proceed.  Applications
like -lc_r expect select(3) to match the EAGAIN-status of IO
functions.

Approved by:	re
2005-06-16 15:43:17 +00:00
Brian Feldman
cc3149b1ea Fix a serious deadlock with the NFS client. Given a large enough
atomic write request, it can fill the buffer cache with the entirety
of that write in order to handle retries.  However, it never drops
the vnode lock, or else it wouldn't be atomic, so it ends up waiting
indefinitely for more buf memory that cannot be gotten as it has it
all, and it waits in an uncancellable state.

To fix this, hibufspace is exported and scaled to a reasonable
fraction.  This is used as the limit of how much of an atomic write
request by the NFS client will be handled asynchronously.  If the
request is larger than this, it will be turned into a synchronous
request which won't deadlock the system.  It's possible this value is
far off from what is required by some, so it shall be tunable as soon
as mount_nfs(8) learns of the new field.

The slowdown between an asynchronous and a synchronous write on NFS
appears to be on the order of 2x-4x.

General nod by:	gad
MFC after:	2 weeks
More testing:	wes
PR:		kern/79208
2005-06-10 23:50:41 +00:00
Dag-Erling Smørgrav
3f54cc0505 Ugh. Previous commit got the logic exactly backward.
Submitted by:	bland
Pointy hat to:	des
2005-05-17 18:23:03 +00:00
Dag-Erling Smørgrav
ff17c7a727 Revision 1.173 broke updating a mount from ro to rw. Fix that by clearing
the MNT_RDONLY flag if MNT_UPDATE is set and "ro" was not specified.

Suggested by:	cognet
2005-05-17 12:00:43 +00:00
Jim Rees
3785bdbe7f set R_MUSTRESEND flag in mark_for_reconnect so re-connected requests get
re-sent instead of timing out.

don't log an error message on reconnection, which is not an error.

remove unused nfs_mrep_before_tsleep.

Reviewed by:	Mohan Srinivasan
Approved by:	alfred
2005-05-10 14:25:14 +00:00
Paul Saab
15ec3fe2f0 Fix a bug in NFS/TCP where retransmissions would not reliably happen
if the server rebooted or tore down the connection for any reason.

Found by:	Jonathan Noack.
Submitted by:	Mohan Srinivasan.
2005-05-04 16:37:31 +00:00
Ian Dowse
2c443c417c Don't copy the NFSMNT_* flags into struct statfs's f_flags field,
as they have no connection with the expected MNT_* flags. This bug
was exposed 18 months ago when the assignments to f_flags in
vfs_syscalls.c were moved to before the VFS_STATFS() call. It was
fixed in the CSRG source 10 years ago, but we never picked up that
change.

PR:		kern/80390
MFC after:	1 week
2005-05-02 15:57:10 +00:00
Dag-Erling Smørgrav
4104e6bc1d When NFS was converted to the new mount syscall, code was written that sets
the MNT_RDONLY flag if the "ro" option was passed in from userland, and
clears it otherwise.  In the diskless case, the MNT_RDONLY flag is already
set when this code is reached, but there are no mount options, so it was
incorrectly cleared.  Change the logic so the MNT_RDONLY flag is set if the
"ro" option was specified, and left alone otherwise.

Note that the NFS code will still happily let you mount a filesystem RW
even if the server exports it RO.  I'm not sure how to fix that.
2005-04-27 14:46:02 +00:00
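The flag handling that commit 4104e6bc1d describes (set the read-only flag when "ro" was given, leave it alone otherwise) reduces to a one-sided update. The sketch below is illustrative only: `SKETCH_MNT_RDONLY` and `apply_ro_option()` are hypothetical names, and the bit value is not the real MNT_RDONLY from <sys/mount.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit value for illustration only. */
#define SKETCH_MNT_RDONLY	0x1u

/*
 * Return the updated mount flags. The read-only bit is set when "ro"
 * was specified and is otherwise left exactly as it was, so a flag
 * already set before option parsing (the diskless case) survives.
 */
static unsigned int
apply_ro_option(unsigned int mnt_flags, bool ro_specified)
{
	if (ro_specified)
		mnt_flags |= SKETCH_MNT_RDONLY;
	/* No else branch: clearing here was the bug being fixed. */
	assert(!ro_specified || (mnt_flags & SKETCH_MNT_RDONLY) != 0);
	return (mnt_flags);
}
```

The later commits in this log (ff17c7a727 and 3f54cc0505) adjust this logic for the MNT_UPDATE case; the sketch covers only the behavior described in this one message.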