Commit Graph

975 Commits

Author SHA1 Message Date
rwatson
23574c8673 Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet.  As that
has now been removed, they are no longer required.  Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option.  Clean up some related gotos for
consistency.

Reviewed by:	bz, csjp
Tested by:	kris
Approved by:	re (kensmith)
2007-08-06 14:26:03 +00:00
jhb
7c4c058305 Fix for a race where out of order loading of NFS attrs into the
nfsnode could lead to attrs being stale. One example (that we
ran into) was a READDIR+, WRITE. The responses came back in
order, but the attrs from the WRITE were loaded before the
attrs from the READDIR+, leading to the wrong size from being
read on the next stat() call.

MFC after:	1 week
Submitted by:	mohans
Approved by:	re (kensmith)
2007-07-03 18:31:47 +00:00
jhb
788b235895 Fix up NFS client write error handling. Errors are split into
recoverable and unrecoverable. For the former, we redirty the
buffer and hang onto it for future retries. For the latter (eg.
ESTALE), we discard the buffer and return the error back to the
user on the next syscall. This fixes a number of vfs panics and
fixes having a large number of dirty buffers (that cannot be
written out and reclaimed) from hanging around. Thanks to ups@
for discussions on this issue.

Reported by:	kris, Kai, others
Approved by:	re (kensmith)
2007-07-03 18:30:55 +00:00
attilio
9bd4fdf7ce Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)
2007-06-04 21:45:18 +00:00
jeff
a7a8bac81f - Move rusage from being per-process in struct pstats to per-thread in
td_ru.  This removes the requirement for per-process synchronization in
   statclock() and mi_switch().  This was previously supported by
   sched_lock which is going away.  All modifications to rusage are now
   done in the context of the owning thread.  reads proceed without locks.
 - Aggregate exiting threads rusage in thread_exit() such that the exiting
   thread's rusage is not lost.
 - Provide a new routine, rufetch() to fetch an aggregate of all rusage
   structures from all threads in a process.  This routine must be used
   in any place requiring a rusage from a process prior to it's exit.  The
   exited process's rusage is still available via p_ru.
 - Aggregate tick statistics only on demand via rufetch() or when a thread
   exits.  Tick statistics are kept in the thread and protected by sched_lock
   until it exits.

Initial patch by:	attilio
Reviewed by:		attilio, bde (some objections), arch (mostly silent)
2007-06-01 01:12:45 +00:00
attilio
7dd8ed88a9 Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
rwatson
3b7f8bde24 In nfs_down(), if rep can be NULL, which we test for, then we should
lock and unlock conditionally, not just set the flag on it conditionally.
In practice, this bug couldn't manifest, as in the current revision of
the code, no callers pass a NULL rep.

CID:		1416
Found with:	Coverity Prevent(tm)
2007-05-18 19:34:54 +00:00
jeff
e1996cb960 - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
jhb
961de35e83 Various fixes to the NFS Directio support.
- Fix for a bug where a close would not wait for all (directio)
  dirty buffers to drain. The nfsnode was not marked NMODIFIED
  when there were directio dirtied buffers pending, causing this.
- No reason to vhold/vrele the vp when enqueueing DirectIO requests
  for the nfsiods. The vnode can't really go way since the close
  has to wait for these requests to drain.

MFC after:	1 week
Submitted by:	mohans
2007-04-25 20:34:55 +00:00
rwatson
32f12b60cc Attempt to rationalize NFS privileges:
- Replace PRIV_NFSD with PRIV_NFS_DAEMON, add PRIV_NFS_LOCKD.

- Use PRIV_NFS_DAEMON in the NFS server.

- In the NFS client, move the privilege check from nfslockdans(), which
  occurs every time a write is performed on /dev/nfslock, and instead do it
  in nfslock_open() just once.  This allows us to avoid checking the saved
  uid for root, and just use the effective on open.  Use PRIV_NFS_LOCKD.
2007-04-21 18:11:19 +00:00
delphij
e4435eebba Don't destroy a mutex just before we use it, instead,
destroy it after we have used it.
2007-03-23 08:52:36 +00:00
tegge
214bc5723c Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount).  On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque().  Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode.  The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup.  Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by:	re (kensmith)
Reviewed by:	kib
In collaboration with:	kib
2007-03-13 01:50:27 +00:00
mohans
f2213cb325 Back out a chance to nfs_timer() that inadvertantly crept in the last checkin :( 2007-03-09 04:07:54 +00:00
mohans
a332cb00d5 Over NFS, an open() call could result in multiple over-the-wire
GETATTRs being generated - one from lookup()/namei() and the other
from nfs_open() (for cto consistency). This change eliminates the
GETATTR in nfs_open() if an otw GETATTR was done from the namei()
path. Instead of extending the vop interface, we timestamp each attr
load, and use this to detect whether a GETATTR was done from namei()
for this syscall. Introduces a thread-local variable that counts the
syscalls made by the thread and uses <pid, tid, thread syscalls> as
the attrload timestamp. Thanks to jhb@ and peter@ for a discussion on
thread state that could be used as the timestamp with minimal overhead.
2007-03-09 04:02:38 +00:00
jhb
9081d44243 Use pause() rather than tsleep() on stack variables and function pointers. 2007-02-27 17:23:29 +00:00
mohans
fd33eed884 Backing out an earlier change. It seems harmless for NFS to miss the "force
unmount" flag, making the acquisition of the MNT_ILOCK in nfs_request() and
nfs_sigintr() unnecessary. Pointed out by tegge@.
2007-02-16 03:46:55 +00:00
mohans
6d93e14637 Add missing MNT_ILOCK around some mnt_kern_flag accesses. 2007-02-11 04:01:10 +00:00
mohans
5f0bd46234 Fix for a vnode lock leak in nfs_create() in the event of an error.
Spotted by ups@.
2007-01-31 23:10:27 +00:00
kris
365d5c4ba7 Instead of always hard-coding the socket type for the nfs root mount as
SOCK_DGRAM (i.e. UDP), respect the value configured earlier.  This allows
TCP NFS root mounts using e.g. the boot.nfsroot.options="tcp" tunable.

In this case some of the connection parameters like the retry timer were
previously set appropriately for TCP but inappropriately for the UDP
socket that was actually used, leading to e.g. extremely long recovery
times (O(hours)) after a nfs server reboot.

Reviewed by:    mohans
MFC After:      2 weeks
2007-01-30 00:26:04 +00:00
bde
5a952c0766 Unstaticize nfs_iosize() in nfsclient and use it in nfs4client instead
of duplicating it except for larger style bugs in the copy.

Fix some nearby style bugs (including a harmless type mismatch)
in and near the remaining copy.

This is part of fixing collisions of the 2 nfs*client's names.  Even
static names should have a unique prefixes so that they can be debugged
easily.
2007-01-25 13:07:25 +00:00
kib
fdd50404d1 Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.

Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by:	tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by:	Peter Holm
X-MFC after:	3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
mohans
a3a728eabb NetApp filers return corrupt post op attrs in the wcc on NFS error responses.
This is easy to reproduce for EROFS. I am not sure if the attrs can be corrupt
for other NFS error responses. For now, disabling wcc pre-op attr checks and
post-op attr loads on NFS errors (sysctl'ed).
Reported by: Kris Kennaway
2006-12-11 19:54:25 +00:00
sam
17d1a5f84e consolidate parsing of nfs root mount options in one place
and handle all options (some may require fixes elsewhere)

Reviewed by:	jhb, mohans
MFC after:	1 month
2006-12-06 02:15:25 +00:00
mohans
9a2d94c135 In nfs_nget(), we must initialize the fh in the nfsnode before inserting the
vnode into the vfs hash. Otherwise, another thread walking the hash can trip
on an nfsnode with an uninitialized or partially initialized fh.
Thanks to ups@ for spotting this race.
2006-11-29 02:21:40 +00:00
mohans
ef00496038 bde@ pointed out that tprintf() acquires Giant so callers of tprintf() don't
have to explicitly acquire Giant (although they need to be aware of this and
not hold any locks at that point). Remove the acquisitions of Giant in the
NFS client wrapping tprintf().
2006-11-27 23:26:06 +00:00
mohans
9ca7ca48b8 Fix for a bug caused by a race when 2 threads lookup the same
file. Leave the loser's lock(s) initialized, so the reclaim logic can
unconditionally destroy them when that race occurs (or if the vfs hash
insert happened to fail for some other reason). Thanks to ups@ for a
careful review of the code.
Reported by : Kris Kennaway
2006-11-27 19:06:43 +00:00
mohans
4800f71e23 1) Fix up locking in nfs_up() and nfs_down.
2) Reduce the acquisitions of the Giant lock in the nfs_socket.c paths significantly.
- We don't need to acquire Giant before tsleeping on lbolt anymore,
  since jhb specialcased lbolt handling in msleep.
- nfs_up() needs to acquire Giant only if printing the "server up"
  message.
- nfs_timer() held Giant for the duration of the NFS timer processing,
  just because the printing of the message in nfs_down() needed it
  (and we acquire other locks in nfs_timer()). The acquisition of
  Giant is moved down into nfs_down() now, reducing the time Giant is
  held in that path.

Reported by: Kris Kennaway
2006-11-20 04:14:23 +00:00
mohans
755a61c0b7 vfs_hash_insert() vputs() the losing vnode before returning, in the event of
a race where a duplicate vnode is entered into the vfs hash. nfs_nget() shouldn't
be releasing the vnode in that case.
2006-11-16 23:03:46 +00:00
mohans
3ef54eed4c Fix to readdir+ reply handling. When inserting an entry into the namecache,
initialize the nfsnode's ctime. Otherwise a subsequent lookup purges the
just entered namecache entry.
2006-11-16 23:02:37 +00:00
sam
2619dbffd2 honor nolockd flag in root mount options
MFC after:	2 weeks
2006-11-07 18:02:45 +00:00
mohans
6afec858bc Make EWOULDBLOCK a recoverable error so that the request is retransmitted.
This bug results in data corruption with NFS/TCP. Writes are silently dropped
on EWOULDBLOCK (because socket send buffer is full and sockbuf timer fires).

Reviewed by: ups@
2006-10-31 20:25:37 +00:00
bde
fea5a64567 Fixed some style bugs (especially ones involving long lines and use
of __P(())).  There are many more.
2006-10-17 22:07:07 +00:00
bde
d0962ab7a5 Don't do null Setattr RPCs for VA_MARK_ATIME. When we added the
VA_MARK_ATIME feature to fix POSIX conformance fore execve() and mmap(),
we thought that it was optimized well enough for the one file system
that supports it (ffs) and harmless for other file systems (except
layered ones which already get the layering for VOP_SETATTR() wrong).
However, nfs_setattr() doesn't do much parameter checking, so when
it gets a combination of parameters that it doesn't understand, it
always does a Setattr RPC.  This RPC can't do anything good, and for
VA_MARK_ATIME it is null except for wasting a lot of time.

This is the smallest and easiest to fix of several bugs that have
increased the number of RPCs for kernel builds on nfs by more than
100% since 2004-11-05.  The real-time increase depends on network
latency and parallelization and can also be very large (approaching
the same percentage for unparallelized operations like "make depend"
on systems with fast CPUs and high-latency networks).
2006-10-14 07:25:11 +00:00
phk
50c81b8a9a First part of a little cleanup in the calendar/timezone/RTC handling.
Move relevant variables to <sys/clock.h> and fix #includes as necessary.

Use libkern's much more time- & spamce-efficient BCD routines.
2006-10-02 12:59:59 +00:00
tegge
f42473d76b Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC.  Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
2006-09-26 04:15:59 +00:00
tegge
83154f853d Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().
2006-09-26 04:12:49 +00:00
mohans
21daa650a9 Fixes up the handling of shared vnode lock lookups in the NFS client,
adds a FS type specific flag indicating that the FS supports shared
vnode lock lookups, adds some logic in vfs_lookup.c to test this flag
and set lock flags appropriately.

- amd on 6.x is a non-starter (without this change). Using amd under
  heavy load results in a deadlock (with cascading vnode locks all the
  way to the root) very quickly.
- This change should also fix the more general problem of cascading
  vnode deadlocks when an NFS server goes down.

Ideally, we wouldn't need these changes, as enabling shared vnode lock
lookups globally would work. Unfortunately, UFS, for example isn't
ready for shared vnode lock lookups, crashing pretty quickly.

This change is the result of discussions with Stephan Uphoff (ups@).

Reviewed by:	ups@
2006-09-13 18:39:09 +00:00
mohans
0b0d785f6a Fix for a deadlock triggered by a 'umount -f' causing a NFS request to never
retransmit (or return). Thanks to John Baldwin for helping nail this one.

Found by : Kris Kennaway
2006-08-29 22:00:12 +00:00
thomas
2f330cd31b Fix typos in comment. 2006-08-16 23:53:05 +00:00
alc
b98eae58a6 Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags.  Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
2006-08-09 17:43:27 +00:00
brooks
f9c7fb948d Add a new kernel environment variable "boot.netif.mtu" which is used to
set the MTU prior to mounting root via NFS.  This is required if the
server supports a higher than default MTU because the client will not
see the responses otherwise.

MFC after:	3 weeks
2006-08-09 01:56:17 +00:00
rwatson
40868fda8a soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod:	sam, gnn, wollman
2006-07-24 15:20:08 +00:00
kib
574ae24779 Signals may be delivered to process as well as to the thread. Check the
thread-delivered signals in addition to the process one.

Reviewed by:	mohan
MFC after:	1 month
Approved by:	kan (mentor)
2006-07-08 15:39:11 +00:00
kib
54b7822070 Always supply curthread as argument to nfs_asyncio and nfs_doio
in nfs_strategy. Otherwise, for some buffers, signals would be ignored
at the intr mounts.

Reviewed by:	mohan
MFC after:	1 month
Approved by:	kan (mentor)
2006-07-08 15:36:51 +00:00
yar
ba19b1ecd4 There is a consensus that ifaddr.ifa_addr should never be NULL,
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures.  Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
to the system crashing honestly instead of masking possible bugs.

Suggested by:	glebius, jhb, ru
2006-06-29 19:22:05 +00:00
yar
af09d16f73 Use the elegant TAILQ_FOREACH() in place of a hand-rolled for() loop. 2006-06-29 15:37:39 +00:00
mohans
b2a8ee65bc Kris Kennaway found that for '/' NFS mounts, the MPSAFE mount flag was
not being set, which means Giant would be acquired for these mounts.
2006-05-30 20:32:44 +00:00
mohans
ac7b19c8b2 Fix for a potential attempt to sleep while holding nm_mtx. Caught and reported
by Witness (which forces the mbuf allocation flag to M_NOWAIT).

Reported by: "sekes".
2006-05-26 18:45:55 +00:00
ups
d193a40233 Call vm_object_page_clean() with the object lock held.
Submitted by:	kensmith@
Reviewed by:	mohans@
MFC after:	6 days
2006-05-25 17:16:11 +00:00
ups
4eb5a7d9ee Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by:	tegge@
MFC after:	7 days
2006-05-25 01:00:35 +00:00
mohans
365e894b0f Since NFSv4 is not SMP safe, nfsiod needs to acquire Giant for NFSv4 mounts
before doing the read/write.

Reported by:	Chuck Lever.
2006-05-24 23:06:50 +00:00
rwatson
726780d475 Adjust minimum iod threads from 4 to 0 -- since we compile the NFS
client into the kernel by default, and many users won't use NFS,
don't start an extra 4 kernel threads that are unused.  Once NFS
becomes active, it will start nfsiod's as it needs them.

We might consider mandating a minimum iod's equal to the number of
active NFS mounts (truncated to some value), which would force some
to remain available without having to create a new one if the file
system is mostly inactive.

PR:		70880
MFC after:	2 weeks
Prodded by:	cel
Head nod:	peter
Pointed out by:	Joe <fbsd_user at a1poweruser dot com>
2006-05-24 21:04:46 +00:00
cel
8b94e52728 NFS over TCP retransmit behavior should default to a 60 second time out,
mimicing the NFS reference implementation.

NFS over TCP does not need fast retransmit timeouts, since network loss
and congestion are managed by the transport (TCP), unlike with NFS over
UDP.  A long timeout prevents the unnecessary retransmission of non-
idempotent NFS requests.

Reviewed by:	mohans, silby, rees?
Sponsored by:	Network Appliance, Incorporated
2006-05-23 18:48:07 +00:00
cel
ec80996e6b Refactor the NFS over UDP retransmit timeout estimation logic to allow
the estimator to be more easily tuned and maintained.

There should be no functional change except there is now a lower limit
on the retransmit timeout to prevent the client from retransmitting
faster than the server's disks can fill requests, and an upper limit
to prevent the estimator from taking to long to retransmit during a
server outage.

Reviewed by:	mohan, kris, silby
Sponsored by:	Network Appliance, Incorporated
2006-05-23 18:33:58 +00:00
mohans
f29eefcdec Vnode locks are recursive and the NFS client support shared vnode locks.
Found by: Kris Kennaway.
2006-05-23 16:07:23 +00:00
mohans
60ef615733 Changes to make the NFS client MP safe.
Thanks to Kris Kennaway for testing and sending lots of bugs my way.
2006-05-19 00:04:24 +00:00
mohans
cbd19813ff Fix a snafu caused while patching the previous fix from another branch. 2006-05-05 18:12:13 +00:00
mohans
a10657a423 Fix for a NFS/TCP client bug which would cause the NFS/TCP stream to get
out of sync under heavy loads, forcing frequent reconnets, causing EBADRPC
errors etc.
2006-05-05 18:04:53 +00:00
mohans
8a6c02da7d Keep track of the number of in-progress async direct IO writes in the nfsnode.
Make fsync/close wait until all of these drain. Add a check to nfs_getpage() and
nfs_putpage().
2006-04-06 01:20:30 +00:00
jeff
9fb762d231 - Busy the filesystem in nfs_statfs to prevent us from creating a new
vnode after vflush() has succeeded.  This would cause a dangling vnode
   panic at unmount time otherwise.  Other filesystems may have this problem
   via their VFS_VGET() routines.

Found by:	kris
Sponsored by:	Isilon Systems, Inc.
2006-04-01 01:15:23 +00:00
kris
79229ffb86 Fix a bug in the NFS/TCP retransmission path.
The bug was that earlier, if a request was retransmitted,
we would do subsequent retransmits every 10 msecs.

This can cause data corruption under moderate loads by reordering
operations as seen by the client NFS attribute cache, and on the
server side when the retransmission occurs after the original request
has left the duplicate cache, since the operation will be committed
for a second time.

Further work on retransmission handling is needed (e.g. they are still
being done sent too often since they are scaled by HZ, and the size of
the dup cache is too small and easily overwhelmed on busy servers).

Submitted by:	mohans
2006-03-23 22:58:42 +00:00
pjd
e93fc5d214 Actually I wanted 'nolockd' here instead of 'lockd'.
MFC after:	2 days
2006-03-19 13:27:37 +00:00
cel
6441a8bff9 If an NFS server returns more than a few EJUKEBOX errors for a given RPC
request, the FreeBSD NFS client will quickly back off to a excessively
long wait (days, then weeks) before retrying the request.

Change the behavior of the FreeBSD NFS client to match the behavior of
the reference NFS client implementation (Solaris).  This provides a fixed
delay of 10 seconds between each retry by default.  A sysctl, called
nfs3_jukebox_delay, is now available to tune the delay.  Unlike Solaris,
the sysctl value on FreeBSD is in seconds, rather than in HZ.

Sponsored by:	Network Appliance, Incorporated
Reviewed by:	rick
Approved by:	silby
MFC after:	3 days
2006-03-17 22:14:23 +00:00
cel
85c8b06cbe Fix a bug in NFSv3 READDIRPLUS reply processing
The client's READDIRPLUS logic skips the attributes and
filehandle of the ".." entry.  If the server doesn't send
attributes but does send a filehandle for "..", the
client's logic doesn't account for the extra "value
follows" field that indicates whether the filehandle is
present, causing the remaining entries in the reply
to be ignored.

Sponsored by:	Network Appliance, Inc.
Reviewed by:	rick, mohans
Approved by:	silby
MFC after:	2 weeks
2006-03-08 01:43:01 +00:00
rees
fc1f7144ff Don't log an error on tcp connection reset, even if we don't get ECONNRESET.
Submitted by:	cel@citi.umich.edu
2006-01-20 15:07:18 +00:00
alfred
a0282ebc04 I ran into an nfs client panic a couple of times in a row over the
last few days.  I tracked it down to the fact that nfs_reclaim()
is setting vp->v_data to NULL _before_ calling vnode_destroy_object().
After silence from the mailing list I checked further and discovered
that ufs_reclaim() is unique among FreeBSD filesystems for calling
vnode_destroy_object() early, long before tossing v_data or much
of anything else, for that matter.  The rest, including NFS, appear
to be identical, as if they were just clones of one original routine.

The enclosed patch fixes all file systems in essentially the same
way, by moving the call to vnode_destroy_object() to early in the
routine (before the call to vfs_hash_remove(), if any).  I have
only tested NFS, but I've now run for over eighteen hours with the
patch where I wouldn't get past four or five without it.

Submitted by: Frank Mayhar
Requested by: Mohan Srinivasan
MFC After: 1 week
2006-01-17 17:29:03 +00:00
rwatson
a062d55bf0 In nfs_dolock(), GC now under-used ioflg, rendered obsolete when we moved
from using a fifo to talk to rpc.lockd to using a special device node.

Noticed by:	Coverity Prevent analysis tool
MFC after:	3 days
2006-01-13 23:16:29 +00:00
tegge
d344c11861 Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by:	truckman
2006-01-09 20:42:19 +00:00
delphij
fc7124b631 Correct a typo 2005-12-28 10:03:48 +00:00
ps
3b0631d180 Improve upon rev 1.133 where NFS/TCP would not reconnect.
Submitted by:	Mohan Srinivasan
2005-12-12 23:18:05 +00:00
ru
573fa22624 Unexpand LLADDR(). 2005-11-29 09:51:47 +00:00
ps
3278e302f0 Fix for a bug where NFS/TCP would not reconnect (in the case where
the server FIN'ed). Seen with Solaris NFS servers.

Reported by:	TOMITA Yoshinori <yoshint@flab.fujitsu.co.jp>
Submitted by:	Mohan Strinivasan
2005-11-21 19:25:24 +00:00
ps
6364b280f8 - Always return success from NFS strategy. nfs_doio(), in the
event of an error, does the right thing, in terms of setting
  the error flags in the buf header. That fixes a crash from
  bstrategy().
- Treat ETIMEDOUT as a "recoverable" error, causing the buffer
  to be re-dirtied. ETIMEDOUT can occur on soft mounts, when
  the number of retries are exceeded, and we don't want data loss
  in that case.

Submitted by:	Mohan Srinivasan
2005-11-21 19:23:46 +00:00
rees
1a3808ebdf fix a problem with XID re-use when a server returns NFSERR_JUKEBOX.
Submitted by:	cel@citi.umich.edu
Fixed by:	rick@snowhite.cis.uoguelph.ca
Approved by:	alfred
MFC after:	3 weeks
2005-11-21 18:39:18 +00:00
jon
9b47705fc0 fix a crash when an nfsv2 mount fails
MFC after:	1 week
2005-11-10 23:25:16 +00:00
ps
e5615c0136 Fix for a crash (from nfs_lookup() in an error case).
Submitted by:	Mohan Srinivasan
2005-11-03 19:24:54 +00:00
ps
7b692a9f88 In nfs_flush(), clear the NMODIFIED bit only if there are no dirty
buffers *and* there are no buffers queued up for writing.  The bug
was that NMODIFIED was being cleared even while there were buffers
scheduled to be written out, which leads to all sorts of interesting
bugs - one where the file could shrink (because of a post-op getattr
load, say) causing data in buffer(s) queued for write to be tossed,
resulting in data corruption.

Submitted by:	Mohan Srinivasan
2005-11-03 07:42:15 +00:00
ps
65fc4c8f22 Fix for a race between the thread transmitting the request and the
thread processing the reply.

Submitted by:	Mohan Srinivasan
2005-11-03 07:31:06 +00:00
rwatson
be4f357149 Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
glebius
d9ad5313fd - Fix leak of struct nlminfo on process exit.
- Fix malloc type collision, that made the above problem
  difficult to understand.

Reported by:	Vladimir Sharun <sharun ukr.net>
2005-10-26 07:18:37 +00:00
pjd
79642efc0e - Use strsep() instead of strtok().
- strdup() uses M_WAITOK, so we don't need to check it's return value
  against NULL.

MFC after:	2 weeks
2005-10-06 19:04:08 +00:00
pjd
a73b4cad05 Add boot.nfsroot.options loader tunable.
It allows to specify options for NFS root file system.
Currently supported options are: soft, intr, conn, lockd.

I'm adding this functionality mostly for 'lockd' option, which is only
honored when performing the initial mount and will be silently ignored
if used while updating the mount options.

This will allow to use flock(2) without the need of using varmfs or
rpc.lockd and friends.

Example of use:
boot.nfsroot.options="intr,lockd"

MFC after:	2 weeks
2005-10-06 11:18:34 +00:00
rwatson
c479a90eb8 Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions.  In most cases, this means adding an
acquisition of Giant immediately around the function.  In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset.  This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having
non-MPSAFE tty code.

MFC after:	1 week
2005-09-19 16:51:43 +00:00
ps
142d92ef25 FIx for a bug in the change that made nfs_timer() MPSAFE. We need to
grab Giant before calling pru_send() (if running with mpsafenet = 0).

Found by:	Jeremie Le Hen.
Fixed by:	Maxime Henrion
2005-07-27 15:06:26 +00:00
ps
c30cfa7b9b In nfs_nget() if two threads race on the same filehandle, the loser should
cause the nfsnode to get freed. This fixes a potential vnode (and nfsnode)
leak in that path.

Submitted by:	Mohan Srinivasan
Reviewed by:	phk
2005-07-27 15:05:31 +00:00
ps
3c7af06c7a Remove the NFS client rslock. The rslock was used to serialize
writers that want to extend the file. It was also used to serialize
readers that might want to read the last block of the file (with a
writer extending the file).  Now that we support vnode locking for
NFS, the rslock is unnecessary. Writers grab the exclusive vnode
lock before writing and readers grab the shared (or in some cases
the exclusive) lock.

Submitted by:	Mohan Srinivasan
2005-07-21 22:46:56 +00:00
ps
46ea7f6a70 Make nfs_timer() MPSAFE. With this change, the bottom half of the NFS
client (the interface with the protocol stack and callouts) is
Giant-free.

Submitted by:	Mohan Srinivasan.
2005-07-19 21:27:25 +00:00
ps
f2e72f270b Fix for a NFS soft mounts bug where if the number of retries exceeds
the max rexmits, the request was not being bounced back with a
ETIMEDOUT error.

Reported by:	Oliver Lehmann
Submitted by:	Mohan Srinivasan
2005-07-18 02:12:17 +00:00
ps
3494397059 Fixes for NFS crashes on architectures that require strict alignment.
- Fix nfsm_disct() so that after pulling up data, the remaining data
  is aligned if necessary.
- Fix nfs_clnt_tcp_soupcall() to bcopy() the rpc length out of the
  mbuf (instead of casting m_data to a uint32).

Submitted by:	Pyun YongHyeon
Reviewed by:	Mohan Srinivasan
2005-07-14 20:08:27 +00:00
green
b4b5044eed Ifdef out the incomplete non-blocking IO implementation for NFS
pending discussion of how implementation would proceed.  Applications
like -lc_r expect select(3) to match the EAGAIN-status of IO
functions.

Approved by:	re
2005-06-16 15:43:17 +00:00
green
ff904ffb64 Fix a serious deadlock with the NFS client. Given a large enough
atomic write request, it can fill the buffer cache with the entirety
of that write in order to handle retries.  However, it never drops
the vnode lock, or else it wouldn't be atomic, so it ends up waiting
indefinitely for more buf memory that cannot be gotten as it has it
all, and it waits in an uncancellable state.

To fix this, hibufspace is exported and scaled to a reasonable
fraction.  This is used as the limit of how much of an atomic write
request by the NFS client will be handled asynchronously.  If the
request is larger than this, it will be turned into a synchronous
request which won't deadlock the system.  It's possible this value is
far off from what is required by some, so it shall be tunable as soon
as mount_nfs(8) learns of the new field.

The slowdown between an asynchronous and a synchronous write on NFS
appears to be on the order of 2x-4x.

General nod by:	gad
MFC after:	2 weeks
More testing:	wes
PR:		kern/79208
2005-06-10 23:50:41 +00:00
des
0bbbcadeb1 Ugh. Previous commit got the logic exactly backward.
Submitted by:	bland
Pointy hat to:	des
2005-05-17 18:23:03 +00:00
des
d3a9750001 Revision 1.173 broke updating a mount from ro to rw. Fix that by clearing
the MNT_RDONLY flag if MNT_UPDATE is set and "ro" was not specified.

Suggested by:	cognet
2005-05-17 12:00:43 +00:00
rees
59c5573379 set R_MUSTRESEND flag in mark_for_reconnect so re-connected requests get
re-sent instead of timing out.

don't log an error message on reconnection, which is not an error.

remove unused nfs_mrep_before_tsleep.

Reviewed by:	Mohan Srinivasan
Approved by:	alfred
2005-05-10 14:25:14 +00:00
ps
40a0d434da Fix a bug in NFS/TCP where retransmissions would not reliably happen
if the server rebooted or tore down the connection for any reason.

Found by:	Jonathan Noack.
Submitted by:	Mohan Srinivasan.
2005-05-04 16:37:31 +00:00
iedowse
2593dad93c Don't copy the NFSMNT_* flags into struct statfs's f_flags field,
as they have no connection with the expected MNT_* flags. This bug
was exposed 18 months ago when the assignments to f_flags in
vfs_syscalls.c were moved to before the VFS_STATFS() call. It was
fixed in the CSRG source 10 years ago, but we never picked up that
change.

PR:		kern/80390
MFC after:	1 week
2005-05-02 15:57:10 +00:00
des
5e15cfc3fa When NFS was converted to the new mount syscall, code was written that sets
the MNT_RDONLY flag if the "ro" option was passed in from userland, and
clears it otherwise.  In the diskless case, the MNT_RDONLY flag is already
set when this code is reached, but there are no mount options, so it was
incorrectly cleared.  Change the logic so the MNT_RDONLY flag is set if the
"ro" option was specified, and left alone otherwise.

Note that the NFS code will still happily let you mount a filesystem RW
even if the server exports it RO.  I'm not sure how to fix that.
2005-04-27 14:46:02 +00:00
des
de2d951ab7 While I'm here, list the new kenv (boot.netif.name) along with the others. 2005-04-26 20:47:59 +00:00
des
37881dde0f When netbooting, as soon as we've figured out which interface we booted
from, store its name in a kenv variable.
2005-04-26 20:45:29 +00:00
rees
3e9035accc TCP reconnect is not an error.
Change the message from LOG_ERR to LOG_INFO.

Approved by:	alfred
2005-04-18 13:42:13 +00:00
jeff
e4eab9fb69 - cache_lookup() relocks the parent in the DOTDOT case for us.
Spotted by:	phk
Sponsored by:	Isilon Systems, Inc.
2005-04-14 07:08:34 +00:00
jeff
afab3762a0 - Change all filesystems and vfs_cache to relock the dvp once the child is
locked in the ISDOTDOT case.  Se vfs_lookup.c r1.79 for details.

Sponsored by:	Isilon Systems, Inc.
2005-04-13 10:59:09 +00:00
jeff
97c40ebd49 - LK_NOPAUSE is a nop now.
Sponsored by:   Isilon Systems, Inc.
2005-03-31 04:37:09 +00:00
jeff
ca1e4c2fe0 - Remove wantparent, it is no longer necessary. An assert in vfs_lookup.c
prevents any callers from doing a modifying op without
   LOCKPARENT or WANTPARENT.
2005-03-29 13:09:42 +00:00
jeff
141aba2c7b - cache_lookup() now locks the new vnode for us to prevent some races.
Remove redundant code.

Sponsored by:	Isilon Systems, Inc.
2005-03-29 13:00:37 +00:00
jeff
5f8bc80203 - We no longer have to bother with PDIRUNLOCK, lookup() handles it for us.
- Network filesystems are written with a special idiom that checks the
   cache first, and may even unlock dvp before discovering that a network
   round-trip is required to resolve the name.  I believe dvp is prevented
   from being recycled even in the forced unmount case by the shared lock
   on the mount point.  If not, this code should grow checks for VI_DOOMED
   after it relocks dvp or it will access NULL v_data fields.

Sponsored by:	Isilon Systems, Inc.
2005-03-28 09:29:58 +00:00
jeff
56f1fc7189 - Update vfs_root implementations to match the new prototype. None of
these filesystems will support shared locks until they are explicitly
   modified to do so.  Careful review must be done to ensure that this
   is safe for each individual filesystem.

Sponsored by:   Isilon Systems, Inc.
2005-03-24 07:39:03 +00:00
ps
114057c633 - The NFS client was incorrectly masking SIGSTOP (which is
non-maskable).
- The NFS client needs to guard against spurious wakeups
  while waiting for the response. ltrace causes the process
  under question to wakeup (possibly from ptrace()), which
  causes NFS to wakeup from tsleep without the response being
  delivered.

Submitted by:	Mohan Srinivasan
2005-03-23 22:10:10 +00:00
das
89bc04ad2d Don't brelse(bp) if bp is null. Also, eliminate some redundancy
and dead code.

Found by:	Coverity Prevent analysis tool
2005-03-18 21:23:32 +00:00
phk
172eba2632 Use vfs_hash. 2005-03-16 11:28:19 +00:00
jmg
64c69bfb4e MFp4: use the function to fix the packet header length instead of rolling
our own...
2005-03-16 08:13:08 +00:00
jeff
29a4f75b9b - VOP_INACTIVE should no longer drop the vnode lock.
Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:15:36 +00:00
jeff
5bd51ec6e6 - The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem.  Check that rather than VI_XLOCK.

Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:14:56 +00:00
jeff
5f59e0cd19 - It is no longer necessary to lock and unlock the vnode in nfs_close() as
the top level does this for us now.

Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:11:23 +00:00
ps
d4a5a3bc89 Minor cleanup in nfs_request() and removal of a comment that doesn't
reflect reality.

Submitted by:	Mohan Srinivasan
2005-02-26 18:55:36 +00:00
phk
33d6741fda vp->v_id is a private field for the vfs namecache and it is a big mistake
that NFS ever started using it.  Long time ago I added the necessary
vhold()/vdrop() calls to replace it, but forgot to remove the v_id code.

Do it now.
2005-02-22 14:52:00 +00:00
phk
66dfd63961 Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close.  This is not yet a generally safe function, but for this very
specific use it is safe.  This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
2005-02-19 11:44:57 +00:00
ps
f6b334da2c Fix for a potential NFS client race where shared data is updated from
base context as well as the socket callback.

Submitted by:	Mohan Srinivasan
2005-02-18 23:41:39 +00:00
jhb
685dd13b54 Drop Giant before calling kthread_exit(). 2005-02-07 18:21:50 +00:00
rwatson
39c4afac56 Style cleanup for O_DIRECT sysctl comment introduced in nfs_vnops.c:1.242. 2005-01-29 23:19:08 +00:00
phk
1b21636022 Make filesystems get rid of their own vnodes vnode_pager object in
VOP_RECLAIM().
2005-01-28 14:42:17 +00:00
phk
d0599e9c31 Create a vnode_pager object when a file is opened. 2005-01-24 23:03:29 +00:00
phk
8dba90be16 Remove unused cred arg from nfs_vinvalbuf() and many bogus arguments
passed for it.
2005-01-24 12:31:06 +00:00
peter
e4129c1fb1 Mostly back out rev 1.33 from quite some time ago, and the followup fixes
and tweaks.  The code was actually quite broken because it discarded the
upper bits of the 64 bit division.  We only had a 50% chance of scaling up
the blocksize for large NFS client mounts when it was needed.  For 5.x and
beyond, this was harmless because we could represent the result in either
case.  For 4.x this was a big problem though.  (4.x also has a df(1) bug to
compound the problem)
2005-01-18 21:59:44 +00:00
phk
cc0cbc6b34 Eliminate unused and unnecessary "cred" argument from vinvalbuf() 2005-01-14 07:33:51 +00:00
brian
2b05c4cf78 Include opt_bootp.h for BOOTP_NFSROOT
PR:		73183
Submitted by:	Darrin Smith sdar at salseast dot org
MFC after:	7 days
2005-01-12 12:42:46 +00:00
phk
5a497775d6 Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.
2005-01-11 10:43:08 +00:00
phk
da2718f1af Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().
I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

	The credentials for syncing a file (ability to write to the
	file) should be checked at the system call level.

	Credentials for syncing one or more filesystems ("none")
	should be checked at the system call level as well.

	If the filesystem implementation needs a particular credential
	to carry out the syncing it would logically have to the
	cached mount credential, or a credential cached along with
	any delayed write data.

Discussed with:	rwatson
2005-01-11 07:36:22 +00:00
imp
a50ffc2912 /* -> /*- for license, minor formatting changes 2005-01-07 01:45:51 +00:00
ps
86f9a5d44a If the NFS/TCP stream is out of sync between the client and server,
and if the client (erroneously) reads the RPC length as 0 bytes, the
client can loop around in the socket callback. Explicitly check for
the length being 0 case and teardown/re-connect.

Submitted by:	Mohan Srinivasan
2005-01-05 23:21:13 +00:00
ps
ad001884ff Turn NFS directio off until the stability issues are resolved. 2004-12-23 21:30:30 +00:00
ps
0a2e8227c4 Change the NFS sillyrename convention so that we won't run out
of sillyrenames (which were limited to 58 per pid per directory,
for no good reason). The new format of sillyrenames looks like

	.nfs.0000b31a.00d24.4
	     ^^^^^^^^ ^^^^^
	     ticks    pid

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from:	Yahoo!
2004-12-16 19:28:37 +00:00
ps
7c0944d56c First cut of NFS direct IO support.
- NFS direct IO completely bypasses the buffer and page caches.
  If a file is open for direct IO all caching is disabled.
- Direct IO for Directories will be addressed later.
- 2 new NFS directio related sysctls are added. One is a knob to
  disable NFS direct IO completely (direct IO is enabled by default).
  The other is to disallow mmaped IO on a file that has at least one
  O_DIRECT open (see the comment in nfs_vnops.c for more details).
  The default is to allow mmaps on a file that has O_DIRECT opens.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from:	Yahoo!
2004-12-15 22:20:22 +00:00
marcel
4b90107750 Revert rev 1.233. The null-pointer function call (a dereference on
ia64) was not the result of a change in the vector operations. It
was caused by the NFS locking code using a FIFO and those bypassing
the vnode. This indirectly caused the panic. The NFS locking code has
been changed.

Requested by: phk
2004-12-11 21:36:29 +00:00
ps
b4a200824a In nfs_rename(), skip the otw rename operation if the fsync (to
either src or dst) fails. This closes a potential data loss case
(where the fsync failed with ENOSPC, for example).

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from:	Yahoo!
2004-12-10 03:29:02 +00:00
ps
f46c52047f Store a hint in the nfsnode to detect sequential access of the file.
Kick off a readahead only when sequential access is detected.  This
eliminates wasteful readaheads in random file access.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from:	Yahoo!
2004-12-10 03:27:12 +00:00
ps
81f484b21d Fix for a Lock Order Reversal in the nfs_flush() path, between the
vnode interlock and the proc lock.

Reported by:	marcel
Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
2004-12-07 21:16:32 +00:00
phk
849729c60b Don't clobber mnt_stat.f_mntonname 2004-12-07 14:26:39 +00:00
phk
4a639d6164 The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
	Convert to nmount.
	Add omount compat shims.
	Remove dedicated rootfs mounting code.
	Use vfs_mountedfrom()
	Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
	Convert to nmount (the simple way, mount_nfs(8) is still necessary).
	Add omount compat shims.
	Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
	Convert to nmount.
	Add omount compat shims.
	Remove dedicated rootfs mounting code.
	Use vfs_mountedfrom()
	Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
	Mount devfs on /.  Devfs needs no 'from' so this is clean.
	symlink /dev to /.  This makes it possible to lookup /dev/foo.
	Mount "real" root filesystem on /.
	Surgically move the devfs mountpoint from under the real root
	filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
	Don't do devfs mounting and rootvnode assignment here, it was
	already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias().  Put the
few necessary lines in devfs where they belong.  This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway.  Correct information is provided by
statfs(/).
2004-12-07 08:15:41 +00:00
ps
1d9d717d90 Always issue wakeups() to the NFS requestors under the mutex
to close all potential cases of missed wakeups.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
2004-12-07 03:39:52 +00:00
ps
eeccf3813d Rewrite of the NFS client's reply handling. We now have NFS socket
upcalls which do RPC header parsing and match up the reply with the
request. NFS calls now sleep on the nfsreq structure. This enables
us to eliminate the NFS recvlock.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
2004-12-06 21:11:15 +00:00
ps
8eaa4f53e4 2 fixes that improve on the consistency of the NFS client cache.
- Change the cached mtime to a 'struct timespec' from a
  time_t. Improving the precision of the cached mtime tightens up
  NFS' "close-to-open" consistency considerably.
- Always force an over-the-wire consistency check from nfs_open()
  (unless the file is marked modified). This further improves
  NFS' "close-to-open" consistency.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
2004-12-06 19:18:00 +00:00
ps
aa4aa62af0 Serialize NFS vinvalbuf operations by acquiring/upgrading to the
vnode EXCLUSIVE lock. This prevents threads from adding pages to
the vnode while an invalidation is in progress, closing potential
races. In the bioread() path, callers acquire the SHARED vnode lock
- so while an invalidate was in progress, it was possible to fault
in new pages onto the vnode causing the invalidation to take a while
or fail. We saw these races at Yahoo! with very large files+heavy
concurrent access. Forcing an upgrade to EXCLUSIVE lock before doing
the invalidation closes all these races.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
2004-12-06 18:52:28 +00:00
ps
5feadd3eba Add non-blocking versions of nfsm_dissect() and friends, for use from
socket callbacks or similar callers, from both the NFS client and the
server.
Instituted nfsm_dissect_nonblock(), nfsm_dissect_xx_nonblock(). And
nfsm_disct() now takes an extra M_TRYWAIT/M_DONTWAIT argument.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
2004-12-06 17:33:52 +00:00
ps
ebd6438ae1 - If all data has been committed to stable storage on the server, it
is safe to turn off the nfsnode's NMODIFIED flag.
- Move the check for signals to the top of the loop where we loop
  around the dirty buffers on the vnode, scheduling writes. This
  ensures that we'll break ouf of the flush operation on reception of
  a signal.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
2004-12-06 16:35:58 +00:00
rwatson
a98750ac1b Correct a typo in a comment. 2004-12-06 16:11:25 +00:00
phk
753d615ec0 For reasons unknown, the nfs locking code used a fifo to send requests to
userland and a dedicated system call to get replies.

The vnode-bypass of fifos broke this into a panic.

Ditch all the magic and create a device /dev/nfslock instead, and
use that for both directions apart from the shorter path, this is
also faster because the device driver runs Giant free using the
vnode bypass.

Noticed by:	marcel
2004-12-06 08:31:32 +00:00
rwatson
6b017b90b9 Convert GIANT_REQUIRED; in nfs_mountroot() to NET_ASSERT_GIANT(),
and annotate that nfs_mountroot assumes it is OK to step on the
values in the global NFSv3 diskless structure as the mountroot
function is called during a serialized part of the boot, before
any other NFS client activity occurs.

MFC after:	2 weeks
2004-12-05 22:53:17 +00:00
rwatson
22be685755 Convert a GIANT_REQUIRED; into a NET_ASSERT_GIANT();, as sockets are
now only conditionally protected by Giant based on debug.mpsafenet.
2004-12-05 22:50:09 +00:00
phk
6c14f71ef7 VFS_STATFS(mp, ...) is mostly called with &mp->mnt_stat, but a few cases
doesn't.  Most of the implementations have grown weeds for this so they
copy some fields from mnt_stat if the passed argument isn't that.

Fix this the cleaner way:  Always call the implementation on mnt_stat
and copy that in toto to the VFS_STATFS argument if different.
2004-12-05 22:41:02 +00:00
marcel
8b42e21d12 Fix null-pointer indirect function calls introduced in the previous
commit. In the new world order, the transitive closure on the vector
operations is not precomputed. As such, it's unsafe to actually use
any of the function pointers in an indirect function call. They can
be null, and we need to use the default vector in that case.
This is mostly a quick fix for the four function pointers that are
ed explicitly. A more generic or scalable solution is likely to see
the light of day.

No pathos on: current@
2004-12-05 22:30:28 +00:00
phk
59f305606c Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

	Replace the entire vop_t frobbing thing with properly typed
	structures.  The only casualty is that we can not add a new
	VOP_ method with a loadable module.  History has not given
	us reason to belive this would ever be feasible in the the
	first place.

	Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

	Give coda correct prototypes and function definitions for
	all vop_()s.

	Generate a bit more data from the vnode_if.src file:  a
	struct vop_vector and protype typedefs for all vop methods.

	Add a new vop_bypass() and make vop_default be a pointer
	to another struct vop_vector.

	Remove a lot of vfs_init since vop_vector is ready to use
	from the compiler.

	Cast various vop_mumble() to void * with uppercase name,
	for instance VOP_PANIC, VOP_NULL etc.

	Implement VCALL() by making vdesc_offset the offsetof() the
	relevant function pointer in vop_vector.  This is disgusting
	but since the code is generated by a script comparatively
	safe.  The alternative for nullfs etc. would be much worse.

	Fix up all vnode method vectors to remove casts so they
	become typesafe.  (The bulk of this is generated by scripts)
2004-12-01 23:16:38 +00:00
phk
cb64ed501e Remove redundant functions (repo-copied from nfsclient) for dealing with
fifos.
2004-12-01 20:18:56 +00:00
phk
ab549174e2 Scripted modification of vop_* prototypes to use typedefs. 2004-12-01 19:08:40 +00:00
phk
4eaab0b383 Add missing #include 2004-12-01 07:34:08 +00:00
ps
3601987765 Fix for a race between lookup and readdirplus, that causes
a deadlock (with NFS exclusive vnode locks enabled). Lookup
grabs the parent's lock and wants to lock child. Readdirplus
locks the child and wants to lock parent (for loading the attrs
for ".."). The fix is to not load the attrs for ".." in
readdirplus.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by:	rwatson
2004-12-01 06:51:07 +00:00
ps
531cb416ae Clean all dirty pages (dirtied by mmap'ed writes) in nfs_close().
This closes a major hole in close-to-open consistency support.
Added a new sysctl so that this can be disabled for single NFS
client applications with very large amounts of mmap'ed IO (for
performance).

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by:	rwatson
2004-12-01 06:48:54 +00:00
ps
69d7e65011 Fix for a (blocks) underrun bug where negative values were being
returned back to df from a statfs call. Causing df to print negative
values.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by:	rwatson
2004-12-01 06:42:21 +00:00
ps
2b85447398 Fix for a bug in nfs_mkdir() that called vrele() instead of vput()
in the error cases, causing panics.

Submitted by:	Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by:	rwatson
2004-11-29 23:05:30 +00:00
jeff
9caab2e843 - Eliminate the acquisition and release of the bqlock in bremfree() by
setting the B_REMFREE flag in the buf.  This is done to prevent lock order
   reversals with code that must call bremfree() with a local lock held.
   This also reduces overhead by removing two lock operations per buf for
   fsync() and similar.
 - Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock
   has been acquired so that we may remove ourself from the free-list.
 - Provide a bremfreef() function to immediately remove a buf from a
   free-list for use only by NFS.  This is done because the nfsclient code
   overloads the b_freelist queue for its own async. io queue.
 - Simplify the numfreebuffers accounting by removing a switch statement
   that executed the same code in every possible case.
 - getnewbuf() can encounter locked bufs on free-lists once Giant is removed.
   Remove a panic associated with this condition and delay asserts that
   inspect the buf until after it is locked.

Reviewed by:	phk
Sponsored by:	Isilon Systems, Inc.
2004-11-18 08:44:09 +00:00
phk
5eae02ee76 Detect root mount attempts on the flag, not on the NULL path. 2004-11-09 22:21:52 +00:00
phk
e5715b2cc1 Retire b_magic now, we have the bufobj containing the same hint. 2004-11-04 09:48:18 +00:00
phk
1b25a59886 Move the buffer method vector (buf->b_op) to the bufobj.
Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
2004-10-24 20:03:41 +00:00
phk
52a089c526 Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.
Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.
2004-10-22 08:47:20 +00:00
phk
3833976d12 Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT
Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj.  Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
2004-10-21 15:53:54 +00:00
pjd
5515efcf61 Add a missing newline character. 2004-10-14 19:00:44 +00:00
das
c32ecae436 nfsclient/nfs_bio.c has a PHOLD() without a PRELE(). Neither should
be necessary here.  Also, use killproc() instead of psignal().
2004-10-01 05:01:41 +00:00
phk
d3ceec948f Remove support for using NFS device nodes. 2004-09-28 08:50:01 +00:00
phk
5c67a82c63 Remove NFS4 vop method vector for devices: we are desupporing device nodes
on anything but DEVFS and in this case it was not even used (see below).

Put the NFS4 vop method for fifo's behind "#if 0" because it is unused.
Add a XXX comment to say that I think the unusedness is a bug.
2004-09-27 20:02:50 +00:00
phk
46bdd46105 style consistency. 2004-09-27 19:44:39 +00:00
phk
02df7323ee Remove unused B_WRITEINPROG flag 2004-09-15 21:49:22 +00:00
phk
9f1a2f23b2 Explicitly pass vnode to nfs_doio() and mountpoint to nfs_asyncio(). 2004-09-07 08:56:43 +00:00
rwatson
68779f8b5e In nfs_timer(), pass curthread rather than &thread0 into the protocol
send routine.  In IPv6 UDP, the thread will be passed to suser(), which
asserts that if a thread is used for a super user check, it be
curthread.  Many of these protocol entry points probably need to
accept credentials instead of threads.

MT5 candidate.

Noticed/tested by:	kuriyama
2004-08-25 01:23:38 +00:00
phk
2d868d02cf Put a version element in the VFS filesystem configuration structure
and refuse initializing filesystems with a wrong version.  This will
aid maintenance activites on the 5-stable branch.

s/vfs_mount/vfs_omount/

s/vfs_nmount/vfs_mount/

Name our filesystems mount function consistently.

Eliminate the namiedata argument to both vfs_mount and vfs_omount.
It was originally there to save stack space.  A few places abused
it to get hold of some credentials to pass around.  Effectively
it is unused.

Reorganize the root filesystem selection code.
2004-07-30 22:08:52 +00:00
phk
98d8f3741c Move a relic to its correct location(s): Put nfs diskless initialization
calls with the code they call.  (Yet another example of mindless copy&paste).
2004-07-28 21:54:57 +00:00
phk
075684f5fd Remove global variable rootdevs and rootvp, they are unused as such.
Add local rootvp variables as needed.

Remove checks for miniroot's in the swappartition.  We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.
2004-07-28 20:21:04 +00:00
phk
5297516e02 Eliminate unused second argument to reassignbuf() and simplify it
accordingly.
2004-07-25 21:24:23 +00:00
alfred
b4f778e20c Turn off SO_REUSEADDR and SO_REUSEPORT, they were causing EADDRINUSE
to be returned from the protocol stack.

Pointy hat to me for not groking what those options _really_ mean.
2004-07-13 05:42:59 +00:00
dwmalone
6ff1185c1d Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a
a better name. I have a kern_[sg]etsockopt which I plan to commit
shortly, but the arguments to these function will be quite different
from so_setsockopt.

Approved by:	alfred
2004-07-12 21:42:33 +00:00
alfred
8a1713aada Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
2004-07-12 08:14:09 +00:00
alfred
031e087d2c Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts.
Tune the timeout from 5 seconds to 12 seconds.
Provide a sysctl to show how many reconnects the NFS client has done.

Seems to fix IPv6 from: kuriyama
2004-07-12 06:22:42 +00:00
brian
aae31dbf32 Change the following environment variables to kernel options:
bootp -> BOOTP
    bootp.nfsroot -> BOOTP_NFSROOT
    bootp.nfsv3 -> BOOTP_NFSV3
    bootp.compat -> BOOTP_COMPAT
    bootp.wired_to -> BOOTP_WIRED_TO

- i.e. back out the previous commit.  It's already possible to
pxeboot(8) with a GENERIC kernel.

Pointed out by: dwmalone
2004-07-08 22:35:36 +00:00
brian
2821a50eaa Change the following kernel options to environment variables:
BOOTP -> bootp
    BOOTP_NFSROOT -> bootp.nfsroot
    BOOTP_NFSV3 -> bootp.nfsv3
    BOOTP_COMPAT -> bootp.compat
    BOOTP_WIRED_TO -> bootp.wired_to

This lets you PXE boot with a GENERIC kernel by putting this sort of thing
in loader.conf:

    bootp="YES"
    bootp.nfsroot="YES"
    bootp.nfsv3="YES"
    bootp.wired_to="bge1"

or even setting the variables manually from the OK prompt.
2004-07-08 13:40:33 +00:00
rwatson
7e85c099fc Acquire socket lock in nfs_connect() connection/sleep loop to protect
socket state and avoid missed wakeups.
2004-07-06 16:55:41 +00:00
alfred
864fa13b59 use vfs_suser() to restrict access to the nfs mount's timeout. 2004-07-06 09:40:44 +00:00
alfred
8fd8b8c57f NFS mobility Phase VI:
Export NFS mount state via sysctl.
Export timeout via sysctl.
2004-07-06 09:23:17 +00:00
alfred
97a6f04270 NFS mobility PHASE I, II & III (phase VI, and V pending):
Rebind the client socket when we experience a timeout.  This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice
versa.

Add a placeholder vfs_sysctl where we will put status reporting
shortly.

Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.
2004-07-06 09:12:03 +00:00
phk
070a613a48 When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint.  If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

		MNT_ILOCK(mp);
	loop:
		for (vp = TAILQ_FIRST(&mp...);
		    (vp = nvp) != NULL;
		    nvp = TAILQ_NEXT(vp,...)) {
			if (vp->v_mount != mp)
				goto loop;
			MNT_IUNLOCK(mp);
			...
			MNT_ILOCK(mp);
		}
		MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

	MNT_ILOCK(vp->v_mount);
	...
	TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
	...
	MNT_IUNLOCK(vp->v_mount);
	...
	vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go.  This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
2004-07-04 08:52:35 +00:00
rwatson
6b9af88e9d When updating sb_flags, acquire the socket buffer lock to prevent
races.
2004-06-24 03:12:13 +00:00
phk
40dd98a3bd Second half of the dev_t cleanup.
The big lines are:
	NODEV -> NULL
	NOUDEV -> NODEV
	udev_t -> dev_t
	udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
2004-06-17 17:16:53 +00:00
rwatson
10cdb7ab20 Remove bad cookie vp kernel printf; while it does notify about an
interesting event, there's little or nothing the user can do about
it.
2004-06-17 00:15:37 +00:00
rwatson
65f0bd9a10 Convert GIANT_REQUIRED to NET_ASSERT_GIANT where Giant is used to
protect socket operations.  Leave one "as-is" as it also frobs
rootvp.
2004-06-16 03:12:50 +00:00
alc
b57e5e03fd Make vm_page's PG_ZERO flag immutable between the time of the page's
allocation and deallocation.  This flag's principal use is shortly after
allocation.  For such cases, clearing the flag is pointless.  The only
unusual use of PG_ZERO is in vfs_bio_clrbuf().  However, allocbuf() never
requests a prezeroed page.  So, vfs_bio_clrbuf() never sees a prezeroed
page.

Reviewed by:	tegge@
2004-05-06 05:03:23 +00:00
peadar
5617193e91 Let the NFS client notice a file's size changing as a modification.
This avoids presenting invalid data to the client's applications
when the file is modified, and then extended within the window of
the resolution of the modifcation timestamp.

Reviewed By:	iedowse
PR:		kern/64091
2004-04-14 23:23:55 +00:00
marcel
6dbee1d482 Unbreak build: s/TAILQ_ISEMPTY/TAILQ_EMPTY/g 2004-04-11 17:15:36 +00:00
peadar
9bb40b73ee Clean up properly when unloading NFS client module.
This includes a modified form of some code from Thomas Moestl (tmm@)
to properly clean up the UMA zone and the "nfsnodehashtbl" hash
table.

Reviewed By:	iedowse
PR:		16299
2004-04-11 13:30:20 +00:00
imp
ebf059d1df Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson
2004-04-07 05:00:01 +00:00
rwatson
18b25cc43d Spell 2 as SHUT_RDWR when used as an argument to soshutdown(). 2004-04-04 19:24:08 +00:00
peadar
461541ae31 Flush cached access mode after modifying a files attributes for
NFSv3. It's likely that modifying the attributes will affect the
file's accessibility. This version of the patch is one suggested
by Ian Dowse after reviewing my original attempt in the PR

Reviewed By: iedowse
PR: kern/44336
MFC after: 3 days
2004-04-03 17:23:46 +00:00
kan
97b7fb767e Reset callout if in nfs_timeout and rpcclnt_timeout functions. Timer
are supposed to continue firing as long as there is work to do, not
stop after the first invocation.

This is damage control after a patch that has been committed prematurely.

Tested by:	kris
2004-03-28 05:55:27 +00:00