Commit Graph

1048 Commits

Author SHA1 Message Date
Konstantin Belousov
3f54086eba Cache_lookup() for DOTDOT drops dvp vnode lock, allowing dvp to be reclaimed.
Check the condition and return ENOENT then.

In nfs_lookup(), respect ENOENT return from cache_lookup() when it is caused
by dvp reclaim.

Reported and tested by:	pho
2009-04-10 10:22:44 +00:00
John Baldwin
3429118095 When a stale file handle is encountered, purge all cached information about
an NFS node including the access and attribute caches.  Previously the NFS
client only purged any name cache entries associated with the file.

PR:		kern/123755
Submitted by:	Jaakko Heinonen  jh of saunalahti fi
Reported by:	Timo Sirainen  tss of iki fi
Reviewed by:	rwatson, rmacklem
MFC after:	1 month
2009-04-06 21:11:08 +00:00
John Baldwin
a58158ef08 Change the default timeout for caching attributes of a directory in the NFS
client from 30 seconds to 3 seconds.  After the recent changes to add
caching of negative name cache lookups, a negative name cache hit will
persist until the client notices the parent directory has changed.  The
higher the attribute cache timeout on directories, the longer that can take,
so lower the default timeout for directories to match that of regular files.

Suggested by:	bde, mohans
MFC after:	1 month
2009-04-06 19:12:47 +00:00
Robert Watson
455f3aa24f Move dtnfsclient.c in the cddl tree to nfs_kdtrace.c in the nfsclient
directory, since it's under a BSD license, and this keeps NFS internals-
aware tracing parts close to NFS.

MFC after:	1 month
Suggested by:	jhb
2009-03-25 17:47:22 +00:00
Robert Watson
346ef8cd59 Fix two bugs in DTrace tracing of accesscache and attrcache load events:
- Trace non-error loads into the access cache once, not zero times or
  twice.
- Sometimes attr cache loads fail due to a race, in which case they are
  aborted leading to an invalidation; in this case, trace only the flush,
  not a load.

MFC after:	1 month
Sponsored by:	Google, Inc.
2009-03-24 23:16:48 +00:00
Robert Watson
10263f0832 Add DTrace probes to the NFS access and attribute caches. Access cache
events are:

  nfsclient:accesscache:flush:done
  nfsclient:accesscache:get:hit
  nfsclient:accesscache:get:miss
  nfsclient:accesscache:load:done

They pass the vnode, uid, and requested or loaded access mode (if any);
the load event may also report a load error if the RPC fails.

The attribute cache events are:

  nfsclient:attrcache:flush:done
  nfsclient:attrcache:get:hit
  nfsclient:attrcache:get:miss
  nfsclient:attrcache:load:done

They pass the vnode, optionally the vattr if one is present (hit or load),
and in the case of a load event, also a possible RPC error.

MFC after:	1 month
Sponsored by:	Google, Inc.
2009-03-24 17:14:34 +00:00
Robert Watson
47294818f9 Add dtnfsclient, a first cut at an NFSv2/v3 client reuest DTrace
provider.  The NFS client exposes 'start' and 'done' probes for NFSv2
and NFSv3 RPCs when using the new RPC implementation, passing in the
vnode, mbuf chain, credential, and NFSv2 or NFSv3 procedure number.
For 'done' probes, the error number is also available.

Probes are named in the following way:

  ...
  nfsclient:nfs2:write:start
  nfsclient:nfs2:write:done
  ...
  nfsclient:nfs3:access:start
  nfsclient:nfs3:access:done
  ...

Access to the unmarshalled arguments is not easily available at this
point in the stack, but the passed probe arguments are sufficient to
to a lot of interesting things in practice.  Technically, these probes
may cover multiple RPC retransmits, and even transactions if the
transaction ID change as a result of authentication failure or a
jukebox error from the server, but usefully capture the intent of a
single NFS request, such as access, getattr, write, etc.

Typical use might involve profiling RPC latency by system call, number
of RPCs, how often a getattr leads to a call to access, when failed
access control checks occur, etc.  More detailed RPC information might
best be provided by adding a krpc provider.  It would also be useful
to add NFS client probes for events such as the access cache or
attribute cache satisfying requests without an RPC.

Sponsored by:	Google, Inc.
MFC after:	1 month
2009-03-22 22:07:52 +00:00
Robert Watson
06bd99086d In nfs_request(), always exit using the nfsmout label once we're
definitely doing an NFSv2 or NFSv3 RPC, rather than sometimes doing
so and sometimes not.  This makes it easier to add a DTrace return
probe at a single point in the function.

MFC after:	1 week
2009-03-21 21:49:07 +00:00
John Baldwin
2a3f3a09c6 Expand the per-node access cache to cache permissions for multiple users.
The number of entries in the cache defaults to 8 but is easily changed in
nfsclient/nfs.h.  When the cache is filled, the oldest cache entry is
evicted when space is needed.

I mirrored the changes to the NFSv[23] client in the NFSv4 client to fix
compile breakage.  However, the NFSv4 client doesn't actually use the
access cache currently.

Submitted by:	rmacklem
2009-03-20 21:12:38 +00:00
John Baldwin
667c6c197e - Remove code to set SAVENAME for CREATE or RENAME requests that get a -ve
hit in the name cache.  cache_lookup() doesn't actually return ENOENT
  for such requests to force the filesystem to do an explicit lookup, so
  this was effectively dead code.
- Grab the nfsnode mutex while writing to n_dmtime.  We don't grab the lock
  when comparing the time against the cached directory mod time (just as
  we don't when comparing ctime's for +ve name cache hits) since the
  attribute caching is already racy for NFS clients as it is.

Discussed with:	bde
2009-03-10 18:41:06 +00:00
Bjoern A. Zeeb
33553d6e99 For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
2009-02-27 14:12:05 +00:00
John Baldwin
fdd1a964d1 Bring back the code to prime the ACCESS cache when fetching attributes for
an NFS file.  Now the priming is conditional on  a new
vfs.nfs.prime_access_cache sysctl.  For now I've left the default setting
to disabling the priming.

Requested by:	 scottl
2009-02-24 16:01:56 +00:00
John Baldwin
03964c8e09 Enable caching of negative pathname lookups in the NFS client. To avoid
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked.  If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC.  This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.

Submitted by:	mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by:	mohans
2009-02-19 22:28:48 +00:00
John Baldwin
5428c66130 When fetching attributes for a file for NFSv3 mounts, do not perform an
opportunistic ACCESS RPC to populate both the access and attribute caches
of the file and instead always use a GETATTR RPC.  On many modern NFS
servers, an ACCESS RPC is much more expensive to service than a GETATTR
RPC.

Submitted by:	mohans
2009-02-19 22:18:00 +00:00
John Baldwin
3e057a2477 Don't clear the attribute cache of a file when it is closed. A subsequent
open() of the same file will load fresh attributes, so they do not need to
be explicitly flushed in close() to guarantee close to open consistency.
However, other file desciptors may still reference this file and clearing
the attributes in close() forces those other file descriptors to fetch
fresh attributes the next time they need them.

Reviewed by:	mohans
MFC after:	1 week
2009-02-19 22:10:39 +00:00
John Baldwin
093e877818 Reindent a small bit of code that was not 8-space indented like the rest
of the nfs_lookup() function.
2009-02-18 16:34:13 +00:00
Ed Schouten
a4611ab612 Last step of splitting up minor and unit numbers: remove minor().
Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().

We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.

Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.

I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.
2009-01-28 17:57:16 +00:00
Craig Rodrigues
e4f9e894d4 Fix parsing of acregmin, acregmax, acdirmin and acdirmax NFS mount options
when passed as strings via nmount().

Submitted by: Jaakko Heinonen <jh saunalahti fi>
2009-01-28 07:46:35 +00:00
John Baldwin
beace17649 Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.

Inspired by:	ups
Tested by:	pho
2009-01-21 14:42:00 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Doug Rabson
335317d291 Switch the default rpc implementation for NFS back to the new code. I believe
I have fixed the reported problems - if you still have trouble with it, please
contact me with as much detail as possible so that I can track down any other
issues as quickly as possible.
2008-11-14 11:27:53 +00:00
Doug Rabson
0eec5c87bf Temporarily switch NFS back to the old RPC code while I try to diagnose and
fix the problems a few people have noticed with the new code. People who want
to continue testing the new code or who need RPCSEC_GSS support should use
the new option NFS_NEWRPC to select it.
2008-11-13 11:35:18 +00:00
Doug Rabson
a9148abd9d Implement support for RPCSEC_GSS authentication to both the NFS client
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager.  I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.

The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.

To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.

As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.

Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.

The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.

Sponsored by:	Isilon Systems
MFC after:	1 month
2008-11-03 10:38:00 +00:00
Tom Rhodes
8b4acb0cc0 Document a few sysctls in the NFS client and server code.
Minor style(9) where applicable.

Approved by:	alfred (slightly older version)
2008-11-02 17:00:23 +00:00
Attilio Rao
83b3bdbc8a Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
  mnt_lock and using instead a refcount in order to keep track of lock
  requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
  useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
  Change so its KPI by removing the interlock argument and defining 2 new
  flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
  old version (which was unlinked from the lockmgr alredy) and
  MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
  once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
  mnt_ref if running for more than 3 seconds, make it totally useless.
  Remove it as it was thought to work into older versions.
  If a problem of "refcount held never going away" should appear, we will
  need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
  leaving the MNTK_MWAIT flag on even if it the waiters were actually
  woken up. Just a place in vfs_mount_destroy() is left because it is
  going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with:	kib
Tested by:	pho
2008-11-02 10:15:42 +00:00
Edward Tomasz Napierala
15bc6b2bd8 Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by:	rwatson (mentor)
2008-10-28 13:44:11 +00:00
Dag-Erling Smørgrav
e11e3f187d Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.
2008-10-23 20:26:15 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Attilio Rao
0d7935fd01 Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by:	kib
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-10 21:23:50 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
John Baldwin
d2722d704c Part 1 of making shared lookups more resilient with respect to forced
unmounts.  When we upgrade a vnode lock from shared to exclusive during
a name cache lookup, fail the lookup with EBADF if the vnode is invalidated
while we are waiting for the exclusive lock.

Also, for correctness (though I'm not sure it can occur in practice),
downgrade an exclusively locked vnode if it should be share locked.

Tested by:	pho
2008-09-24 18:51:33 +00:00
Konstantin Belousov
caf8aec886 fdescfs, devfs, mqueuefs, nfs, portalfs, pseudofs, tmpfs and xfs
initialize the vattr structure in VOP_GETATTR() with VATTR_NULL(),
vattr_null() or by zeroing it. Remove these to allow preinitialization
of fields work in vn_stat(). This is needed to get birthtime initialized
correctly.

Submitted by:   Jaakko Heinonen <jh saunalahti fi>
Discussed on:   freebsd-fs
MFC after:	1 month
2008-09-20 19:50:52 +00:00
Craig Rodrigues
b968101764 Add code to parse NFS mount options passed as individual
items of the nmount() iovec.  This will allow us to move
away from gathering up all the NFS mount options as a single
"struct nfs_args" to be passed down through nmount().
This will make adding new NFS mount options much easier.
Many, many thanks to Doug Rabson, who took my initial patches and
cleaned them up.

Reviewed by:	dfr
MFC after:	3 months
2008-09-13 18:57:47 +00:00
Attilio Rao
59d4932531 Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.
Manpages are updated accordingly.

Tested by:	Diego Sardina <siarodx at gmail dot com>
2008-08-31 14:26:08 +00:00
Attilio Rao
0359a12ead Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Doug Rabson
d3508f91ee Try again not to use a userspace pointer in the kernel when trying to record
the hostname which we need for NLM requests. The previous patch was incomplete.

PR:		125849
Pointy hat:	dfr
2008-07-24 14:02:03 +00:00
Doug Rabson
db428b786c Don't use a userspace pointer in the kernel when trying to record the hostname
which we need for NLM requests.

PR:		125849
2008-07-24 13:42:28 +00:00
Ed Schouten
8c2ceafebf Move the NFS/RPC code away from lbolt.
The kernel has a special wchan called `lbolt', which is triggered each
second. It doesn't seem to be used a lot and it seems pretty redundant,
because we can specify a timeout value to the *sleep() routines. In an
attempt to eventually remove lbolt, make the NFS/RPC code use a timeout
of `hz' when trying to reconnect.

Only the TTY code (not MPSAFE TTY) and the VFS syncer seem to use lbolt
now.

Reviewed by:	attilio, jhb
Approved by:	philip (mentor), alfred, dfr
2008-07-22 21:27:22 +00:00
Robert Watson
4f7d1876d5 Introduce a new lock, hostname_mtx, and use it to synchronize access
to global hostname and domainname variables.  Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout().  A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.

Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.

MFC after:	3 weeks
2008-07-05 13:10:10 +00:00
Doug Rabson
c675522fc4 Re-implement the client side of rpc.lockd in the kernel. This implementation
provides the correct semantics for flock(2) style locks which are used by the
lockf(1) command line tool and the pidfile(3) library. It also implements
recovery from server restarts and ensures that dirty cache blocks are written
to the server before obtaining locks (allowing multiple clients to use file
locking to safely share data).

Sponsored by:	Isilon Systems
PR:		94256
MFC after:	2 weeks
2008-06-26 10:21:54 +00:00
Attilio Rao
edf4632700 Once the ENOLCK is detected we expect to retry the acquisition.
Anyway, in the edge case the flushing happens and the while is no more
executed, nfs_flush() (and nfs4_flush()) can return with a wrong
err value of ENOLCK.
Bring it back to 0, as we expect to have for that case.

Reported by:	kris
Reviewed by:	kib
2008-05-27 00:20:19 +00:00
Benno Rice
aea75fde62 Allow the block size used when booting over NFS to be overridden. It defaults
to 8192 bytes which is the size currently used.
2008-05-16 06:27:03 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
Poul-Henning Kamp
9b4a8ab7ba Now that all platforms use genclock, shuffle things around slightly
for better structure.

Much of this is related to <sys/clock.h>, which should really have
been called <sys/calendar.h>, but unless and until we need the name,
the repocopy can wait.

In general the kernel does not know about minutes, hours, days,
timezones, daylight savings time, leap-years and such.  All that
is theoretically a matter for userland only.

Parts of kernel code does however care: badly designed filesystems
store timestamps in local time and RTC chips almost universally
track time in a YY-MM-DD HH:MM:SS format, and sometimes in local
timezone instead of UTC.  For this we have <sys/clock.h>

<sys/time.h> on the other hand, deals with time_t, timeval, timespec
and so on.  These know only seconds and fractions thereof.

Move inittodr() and resettodr() prototypes to <sys/time.h>.
Retain the names as it is one of the few surviving PDP/VAX references.

Move startrtclock() to <machine/clock.h> on relevant platforms, it
is a MD call between machdep.c/clock.c.  Remove references to it
elsewhere.

Remove a lot of unnecessary <sys/clock.h> includes.

Move the machdep.disable_rtc_set sysctl to subr_rtc.c where it belongs.
XXX: should be kern.disable_rtc_set really, it's not MD.
2008-04-22 19:38:30 +00:00
Konstantin Belousov
eab626f110 Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by:	kris
Tested by:	kris, pho
Discussed with:	jeff, dfr
MFC after:	2 weeks
2008-04-16 11:33:32 +00:00
Doug Rabson
dfdcada31e Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
  client handle safely with replies being de-multiplexed at the socket
  upcall (typically driven directly by the NIC interrupt) and handed
  off to whichever thread matches the reply. For UDP sockets, many RPC
  clients can share the same socket. This allows the use of a single
  privileged UDP port number to talk to an arbitrary number of remote
  hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
  server would be relatively straightforward and would follow
  approximately the Solaris KPI. A single thread should be sufficient
  for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
  callbacks. I've tested the NLM server reasonably extensively - it
  passes both my own tests and the NFS Connectathon locking tests
  running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
  support for the local NFS client's locking needs, it does have to
  field async replies and granted callbacks from remote NLMs that the
  local client has contacted. We relay these replies to the userland
  rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
  it will detect deadlocks caused by a lock request that covers more
  than one blocking request. As required by the NLM protocol, all
  deadlock detection happens synchronously - a user is guaranteed that
  if a lock request isn't rejected immediately, the lock will
  eventually be granted. The old system allowed for a 'deferred
  deadlock' condition where a blocked lock request could wake up and
  find that some other deadlock-causing lock owner had beaten them to
  the lock.

* Since both local and remote locks are managed by the same kernel
  locking code, local and remote processes can safely use file locks
  for mutual exclusion. Local processes have no fairness advantage
  compared to remote processes when contending to lock a region that
  has just been unlocked - the local lock manager enforces a strict
  first-come first-served model for both local and remote lockers.

Sponsored by:	Isilon Systems
PR:		95247 107555 115524 116679
MFC after:	2 weeks
2008-03-26 15:23:12 +00:00
Ruslan Ermilov
ea26d58729 Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by:	arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
2008-03-25 09:39:02 +00:00
Jeff Roberson
698b1a6643 - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00
Robert Watson
237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
Craig Rodrigues
771ba39d26 Expand the nfs_opts array to include all possible string
mount options that mount_nfs could pass down, if it passed
down string mount options.  Right now, mount_nfs jut passes
down a single mount option named "nfs_args" with a fully
initialized 'struct nfs_args'.

In future commits, we will add code to the kernel for parsing stringified
NFS mount options, so that we can convert mount_nfs to pass string options
from userspace to kernel, instead of an initialized struct nfs_args.
2008-03-05 10:09:29 +00:00
Craig Rodrigues
c25215a737 In nfs_mount(), default initialize struct nfs_args
the same way that it is default initialized in revision 1.77 of mount_nfs.c.

Right now, this is a no-op, because currently we initialize
struct nfs_args in mount_nfs in userspace, and pass it
down into the kernel via nmount(), so we overwrite whatever we initialize
here with the value passed in from userspace.

However, this lays the groundwork for moving away from passing
struct nfs_args from userspace to kernel via nmount(), so that we
can instead pass string mount options via nmount() which can be parsed in
the kernel.  This will make it easier to add new NFS mount options.
2008-03-05 09:41:22 +00:00
Attilio Rao
81c794f998 Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by:	Andrea Barberio <insomniac at slackware dot it>
2008-02-25 18:45:57 +00:00
Attilio Rao
628f51d275 Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.
2008-02-24 16:38:58 +00:00
Yaroslav Tykhiy
39fb1e8f88 Prevent the NFS client from losing MNT_ROOTFS on the root
file system.  In particular, stop overwriting mount point
flags in nfs_mountdiskless() because now they are set
elsewhere.  (They were _initialized_ by that function in
the 4.4BSD days, when mount structures were not allocated
in a centralized manner -- see rev. 1.1 of this file.)

Fix nfs_mount(), which happened to depend on the loss of
MNT_ROOTFS when it came to update handling.

Also note that mountnfs() no longer handles updates.  Now
they shouldn't reach this function, so printf a diagnostic
message if that happens due to a coding error.
2008-02-17 22:32:08 +00:00
Attilio Rao
84887fa362 - Add real assertions to lockmgr locking primitives.
A couple of notes for this:
  * WITNESS support, when enabled, is only used for shared locks in order
    to avoid problems with the "disowned" locks
  * KA_HELD and KA_UNHELD only exists in the lockmgr namespace in order
    to assert for a generic thread (not curthread) owning or not the
    lock.  Really, this kind of check is bogus but it seems very
    widespread in the consumers code.  So, for the moment, we cater this
    untrusted behaviour, until the consumers are not fixed and the
    options could be removed (hopefully during 8.0-CURRENT lifecycle)
  * Implementing KA_HELD and KA_UNHELD (not surported natively by
    WITNESS) made necessary the introduction of LA_MASKASSERT which
    specifies the range for default lock assertion flags
  * About other aspects, lockmgr_assert() follows exactly what other
    locking primitives offer about this operation.

- Build real assertions for buffer cache locks on the top of
  lockmgr_assert().  They can be used with the BUF_ASSERT_*(bp)
  paradigm.

- Add checks at lock destruction time and use a cookie for verifying
  lock integrity at any operation.

- Redefine BUF_LOCKFREE() in order to not use a direct assert but
  let it rely on the aforementioned destruction time check.

KPI results evidently broken, so __FreeBSD_version bumping and
manpage update result necessary and will be committed soon.

Side note: lockmgr_assert() will be used soon in order to implement
real assertions in the vnode namespace replacing the legacy and still
bogus "VOP_ISLOCKED()" way.

Tested by:      kris (earlier version)
Reviewed by:    jhb
2008-02-13 20:44:19 +00:00
John Baldwin
3156ea2d33 Consolidate the code to generate a new XID for a NFS request into a
nfs_xid_gen() function instead of duplicating the logic in both
nfsm_rpchead() and the NFS3ERR_JUKEBOX handling in nfs_request().

MFC after:	1 week
Submitted by:	mohans (a long while ago)
2008-02-13 00:04:58 +00:00
Kris Kennaway
2427fa1998 Switch the default NFS mount mode from UDP to TCP. UDP mounts are a
historical relic, and are no longer appropriate for either LAN or WAN
mounting.  At modern (gigabit and 10 gigabit) LAN speeds packet loss
from socket buffer fill events is common, and sequence numbers wrap
quickly enough that data corruption is possible.  TCP solves both of
these problems without imposing significant overhead.

MFC after:     1 month
2008-02-11 23:23:21 +00:00
Attilio Rao
a3c14ce5d9 namei() can call underlying nfs_readlink() passing a struct uio pointer
owned by a NULL owner. This will lead consequent VOP_ISLOCKED() present
into nfs_upgrade_vnlock() to panic as it only acquire curthread now.
Fix nfs_upgrade_vnlock() and nfs_downgrade_vnlock() in order to not use
more the struct thread pointer passed as argument (as it is really nomore
required there as vn_lock() and VOP_UNLOCK doesn't get the lock more).
Using curthread, in place, doesn't get ambiguity as LK_EXCLOTHER should
be handled as a "not locked" request by both functions.

Reported by: kris
Tested by: kris
Reviewed by: ups
2008-02-09 20:13:19 +00:00
Attilio Rao
2433c4883e Conver all explicit instances to VOP_ISLOCKED(arg, NULL) into
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only acquire curthread as argument; this will lead in axing the additional
argument from both functions, making the code cleaner.

Reviewed by: jeff, kib
2008-02-08 21:45:47 +00:00
Attilio Rao
0e9eb108f0 Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
  always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
  Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
  present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
  curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo
2008-01-24 12:34:30 +00:00
Attilio Rao
d638e093d6 - Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
  locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
  VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)
2008-01-19 17:36:23 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
John Baldwin
b3c56ffc39 The previous revision broke the case of reconnecting to a TCP NFS server
via a new socket during an NFS operation as that reconnect takes place in
the context of an arbitrary thread with an arbitrary credential.  Ideally
we would like to use the mount point's credential for the entire process
of setting up the socket to connect to the NFS server.  Since some of the
APIs (sobind(), etc.) only take a thread pointer and infer the credential
from that instead of a direct credential, work around the problem by
temporarily changing the current thread's credential to that of the mount
point while connecting the socket and then reverting back to the original
credential when we are done.

Reviewed by:	rwatson
Tested on:	UDP, TCP, TCP with forced reconnect
2008-01-11 23:57:39 +00:00
John Baldwin
0ca54a0fc8 Pass curthread to various socket routines (socreate(), sobind(), and
soconnect()) instead of &thread0 when establishing a connection to the NFS
server.  Otherwise inconsistent credentials may be used when setting up
the NFS socket.

MFC after:	1 week
Reviewed by:	rwatson
2008-01-10 23:36:00 +00:00
Attilio Rao
cb05b60a89 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
Robert Watson
34f2db4676 Remove hacks from the NFSv2/3 client intended to handle a lack of a
server-side RPC retranmission cache for non-idempotent operations: these
hacks substituted 0 (success) for the expected EEXIST in the event that
a target name already existed for LINK, SYMLINK, and MKDIR operations,
under the assumption that EEXIST represented a second application of the
original RPC rather than a true failure.

Background: certain NFS operations (in this case, LINK, SYMLINK, and
MKDIR) are not idempotent, as they leave behind persisting state on the
server that prevents them from being replayed without an error;if an UDP
RPC reply is lost leading to a retransmission by theclient, the second
reply will return EEXIST rather than success, asthe new object has
already been created.  The NFS client previouslysilently mapped the
EEXIST return into success to paper over thisproblem.

However, in all modern NFS server implementations, a reply cache is kept
in order to retransmit the original reply to a retransmitted request,
rather than performing the operation a second time, allowing this hack
to be avoided.  This allows link()-based filelocking over NFS to operate
correctly, as an application requestingthe creation of a new link for a
file to tell if it succeededatomically or not.

Other NFS clients, including Solaris and Linux, generally follow this
behavior for the same reasons.  Most clients also now default to TCP,
which also helps avoid the issue of retransmitted but non-idempotent
requests in most cases.

Reported by:	Adam McDougall <mcdouga9 at egr dot msu dot edu>,
		Timo Sirainen <tss at iki dot fi>
Reviewed by:	mohans
MFC after:	1 week
2007-11-19 16:03:21 +00:00
Craig Rodrigues
fab2013cb7 Add the following mount options to the nfs_opts array:
noatime, noexec, suiddir, nosuid, nosymfollow, union,
noclusterr, noclusterw, multilabel, acls, force, update,
async.  These options correspond to MOPT_STDOPTS, MOPT_FORCE, MOPT_UPDATE,
and MOPT_ASYNC.

Currently, mount_nfs converts these "-o" options from strings
to MNT_ flags via getmntopts(),
and passes the flags from userspace to the kernel.
This change will allow us in future to pass these mount options
as strings directly to the kernel via nmount() when doing NFS mounts.
2007-10-27 16:28:05 +00:00
Julian Elischer
3745c395ec Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0  so that we can eventually MFC the
new kthread_xxx() calls.
2007-10-20 23:23:23 +00:00
John Baldwin
813947b737 Add a -z flag to nfsstat which zeros the NFS statistics after displaying
them.

MFC after:	1 week
Requested by:	ps
Submitted by:	ps (6 years ago)
2007-10-18 16:38:07 +00:00
Alfred Perlstein
77465d9390 Get rid of qaddr_t.
Requested by: bde
2007-10-16 10:54:55 +00:00
Mohan Srinivasan
faf529dce5 NFS MP scaling changes.
- Eliminate the hideous nfs_sndlock that serialized NFS/TCP request senders
  thru the sndlock.
- Institute a new nfs_connectlock that serializes NFS/TCP reconnects. Add
  logic to wait for pending request senders to finish sending before
  reconnecting. Dial down the sb_timeo for NFS/TCP sockets to 1 sec.
- Break out the nfs xid manipulation under a new nfs xid lock, rather than
  over loading the nfs request lock for this purpose.
- Fix some of the locking in nfs_request.
Many thanks to Kris Kennaway for his help with this and for initiating the
MP scaling analysis and work. Kris also tested this patch thorougly.
Approved by: re@ (Ken Smith)
2007-10-12 19:12:21 +00:00
Mohan Srinivasan
17c53e4a28 Fix for a very rare race, caused by the nfsiod wakeup and nfsiod idle
timeout occurring at exactly the same time. If this happens, the nfsiod
exits although there may be a queued async IO request for it.

Found by : Kris Kennaway
Approved by: re
2007-09-25 21:08:49 +00:00
Robert Watson
0bf686c125 Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet.  As that
has now been removed, they are no longer required.  Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option.  Clean up some related gotos for
consistency.

Reviewed by:	bz, csjp
Tested by:	kris
Approved by:	re (kensmith)
2007-08-06 14:26:03 +00:00
John Baldwin
f4b65ca5d0 Fix for a race where out of order loading of NFS attrs into the
nfsnode could lead to attrs being stale. One example (that we
ran into) was a READDIR+, WRITE. The responses came back in
order, but the attrs from the WRITE were loaded before the
attrs from the READDIR+, leading to the wrong size from being
read on the next stat() call.

MFC after:	1 week
Submitted by:	mohans
Approved by:	re (kensmith)
2007-07-03 18:31:47 +00:00
John Baldwin
03e557fd5a Fix up NFS client write error handling. Errors are split into
recoverable and unrecoverable. For the former, we redirty the
buffer and hang onto it for future retries. For the latter (eg.
ESTALE), we discard the buffer and return the error back to the
user on the next syscall. This fixes a number of vfs panics and
fixes having a large number of dirty buffers (that cannot be
written out and reclaimed) from hanging around. Thanks to ups@
for discussions on this issue.

Reported by:	kris, Kai, others
Approved by:	re (kensmith)
2007-07-03 18:30:55 +00:00
Attilio Rao
b4b7081961 Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)
2007-06-04 21:45:18 +00:00
Jeff Roberson
1c4bcd050a - Move rusage from being per-process in struct pstats to per-thread in
td_ru.  This removes the requirement for per-process synchronization in
   statclock() and mi_switch().  This was previously supported by
   sched_lock which is going away.  All modifications to rusage are now
   done in the context of the owning thread.  reads proceed without locks.
 - Aggregate exiting threads rusage in thread_exit() such that the exiting
   thread's rusage is not lost.
 - Provide a new routine, rufetch() to fetch an aggregate of all rusage
   structures from all threads in a process.  This routine must be used
   in any place requiring a rusage from a process prior to it's exit.  The
   exited process's rusage is still available via p_ru.
 - Aggregate tick statistics only on demand via rufetch() or when a thread
   exits.  Tick statistics are kept in the thread and protected by sched_lock
   until it exits.

Initial patch by:	attilio
Reviewed by:		attilio, bde (some objections), arch (mostly silent)
2007-06-01 01:12:45 +00:00
Attilio Rao
2feb50bf7d Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
Robert Watson
5d0dd109f4 In nfs_down(), if rep can be NULL, which we test for, then we should
lock and unlock conditionally, not just set the flag on it conditionally.
In practice, this bug couldn't manifest, as in the current revision of
the code, no callers pass a NULL rep.

CID:		1416
Found with:	Coverity Prevent(tm)
2007-05-18 19:34:54 +00:00
Jeff Roberson
222d01951f - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
John Baldwin
a1054d5776 Various fixes to the NFS Directio support.
- Fix for a bug where a close would not wait for all (directio)
  dirty buffers to drain. The nfsnode was not marked NMODIFIED
  when there were directio dirtied buffers pending, causing this.
- No reason to vhold/vrele the vp when enqueueing DirectIO requests
  for the nfsiods. The vnode can't really go way since the close
  has to wait for these requests to drain.

MFC after:	1 week
Submitted by:	mohans
2007-04-25 20:34:55 +00:00
Robert Watson
dc4725135d Attempt to rationalize NFS privileges:
- Replace PRIV_NFSD with PRIV_NFS_DAEMON, add PRIV_NFS_LOCKD.

- Use PRIV_NFS_DAEMON in the NFS server.

- In the NFS client, move the privilege check from nfslockdans(), which
  occurs every time a write is performed on /dev/nfslock, and instead do it
  in nfslock_open() just once.  This allows us to avoid checking the saved
  uid for root, and just use the effective on open.  Use PRIV_NFS_LOCKD.
2007-04-21 18:11:19 +00:00
Xin LI
1247688a3e Don't destroy a mutex just before we use it, instead,
destroy it after we have used it.
2007-03-23 08:52:36 +00:00
Tor Egge
61b9d89ff0 Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount).  On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque().  Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode.  The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup.  Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by:	re (kensmith)
Reviewed by:	kib
In collaboration with:	kib
2007-03-13 01:50:27 +00:00
Mohan Srinivasan
d9915117c9 Back out a chance to nfs_timer() that inadvertantly crept in the last checkin :( 2007-03-09 04:07:54 +00:00
Mohan Srinivasan
f9bb753844 Over NFS, an open() call could result in multiple over-the-wire
GETATTRs being generated - one from lookup()/namei() and the other
from nfs_open() (for cto consistency). This change eliminates the
GETATTR in nfs_open() if an otw GETATTR was done from the namei()
path. Instead of extending the vop interface, we timestamp each attr
load, and use this to detect whether a GETATTR was done from namei()
for this syscall. Introduces a thread-local variable that counts the
syscalls made by the thread and uses <pid, tid, thread syscalls> as
the attrload timestamp. Thanks to jhb@ and peter@ for a discussion on
thread state that could be used as the timestamp with minimal overhead.
2007-03-09 04:02:38 +00:00
John Baldwin
4d70511ac3 Use pause() rather than tsleep() on stack variables and function pointers. 2007-02-27 17:23:29 +00:00
Mohan Srinivasan
0973754e14 Backing out an earlier change. It seems harmless for NFS to miss the "force
unmount" flag, making the acquisition of the MNT_ILOCK in nfs_request() and
nfs_sigintr() unnecessary. Pointed out by tegge@.
2007-02-16 03:46:55 +00:00
Mohan Srinivasan
024465d002 Add missing MNT_ILOCK around some mnt_kern_flag accesses. 2007-02-11 04:01:10 +00:00
Mohan Srinivasan
4e99994cc9 Fix for a vnode lock leak in nfs_create() in the event of an error.
Spotted by ups@.
2007-01-31 23:10:27 +00:00
Kris Kennaway
410355bf69 Instead of always hard-coding the socket type for the nfs root mount as
SOCK_DGRAM (i.e. UDP), respect the value configured earlier.  This allows
TCP NFS root mounts using e.g. the boot.nfsroot.options="tcp" tunable.

In this case some of the connection parameters like the retry timer were
previously set appropriately for TCP but inappropriately for the UDP
socket that was actually used, leading to e.g. extremely long recovery
times (O(hours)) after a nfs server reboot.

Reviewed by:    mohans
MFC After:      2 weeks
2007-01-30 00:26:04 +00:00
Bruce Evans
e43982a801 Unstaticize nfs_iosize() in nfsclient and use it in nfs4client instead
of duplicating it except for larger style bugs in the copy.

Fix some nearby style bugs (including a harmless type mismatch)
in and near the remaining copy.

This is part of fixing collisions of the 2 nfs*client's names.  Even
static names should have a unique prefixes so that they can be debugged
easily.
2007-01-25 13:07:25 +00:00
Konstantin Belousov
2cc7d26f7f Cylinder group bitmaps and blocks containing inode for a snapshot
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.

Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.

Reviewed by:	tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by:	Peter Holm
X-MFC after:	3 weeks (if ever: it changes ABI)
2007-01-23 10:01:19 +00:00
Mohan Srinivasan
7f3a6e42c9 NetApp filers return corrupt post op attrs in the wcc on NFS error responses.
This is easy to reproduce for EROFS. I am not sure if the attrs can be corrupt
for other NFS error responses. For now, disabling wcc pre-op attr checks and
post-op attr loads on NFS errors (sysctl'ed).
Reported by: Kris Kennaway
2006-12-11 19:54:25 +00:00
Sam Leffler
49d5157434 consolidate parsing of nfs root mount options in one place
and handle all options (some may require fixes elsewhere)

Reviewed by:	jhb, mohans
MFC after:	1 month
2006-12-06 02:15:25 +00:00
Mohan Srinivasan
594ece53bc In nfs_nget(), we must initialize the fh in the nfsnode before inserting the
vnode into the vfs hash. Otherwise, another thread walking the hash can trip
on an nfsnode with an uninitialized or partially initialized fh.
Thanks to ups@ for spotting this race.
2006-11-29 02:21:40 +00:00
Mohan Srinivasan
d4875805d7 bde@ pointed out that tprintf() acquires Giant so callers of tprintf() don't
have to explicitly acquire Giant (although they need to be aware of this and
not hold any locks at that point). Remove the acquisitions of Giant in the
NFS client wrapping tprintf().
2006-11-27 23:26:06 +00:00
Mohan Srinivasan
88d5725c38 Fix for a bug caused by a race when 2 threads lookup the same
file. Leave the loser's lock(s) initialized, so the reclaim logic can
unconditionally destroy them when that race occurs (or if the vfs hash
insert happened to fail for some other reason). Thanks to ups@ for a
careful review of the code.
Reported by : Kris Kennaway
2006-11-27 19:06:43 +00:00
Mohan Srinivasan
a18c4dc336 1) Fix up locking in nfs_up() and nfs_down.
2) Reduce the acquisitions of the Giant lock in the nfs_socket.c paths significantly.
- We don't need to acquire Giant before tsleeping on lbolt anymore,
  since jhb specialcased lbolt handling in msleep.
- nfs_up() needs to acquire Giant only if printing the "server up"
  message.
- nfs_timer() held Giant for the duration of the NFS timer processing,
  just because the printing of the message in nfs_down() needed it
  (and we acquire other locks in nfs_timer()). The acquisition of
  Giant is moved down into nfs_down() now, reducing the time Giant is
  held in that path.

Reported by: Kris Kennaway
2006-11-20 04:14:23 +00:00
Mohan Srinivasan
3c2fcc3c92 vfs_hash_insert() vputs() the losing vnode before returning, in the event of
a race where a duplicate vnode is entered into the vfs hash. nfs_nget() shouldn't
be releasing the vnode in that case.
2006-11-16 23:03:46 +00:00
Mohan Srinivasan
87c125cecc Fix to readdir+ reply handling. When inserting an entry into the namecache,
initialize the nfsnode's ctime. Otherwise a subsequent lookup purges the
just entered namecache entry.
2006-11-16 23:02:37 +00:00
Sam Leffler
83cc6b9ad2 honor nolockd flag in root mount options
MFC after:	2 weeks
2006-11-07 18:02:45 +00:00
Mohan Srinivasan
88b94fba38 Make EWOULDBLOCK a recoverable error so that the request is retransmitted.
This bug results in data corruption with NFS/TCP. Writes are silently dropped
on EWOULDBLOCK (because socket send buffer is full and sockbuf timer fires).

Reviewed by: ups@
2006-10-31 20:25:37 +00:00
Bruce Evans
35259c2c89 Fixed some style bugs (especially ones involving long lines and use
of __P(())).  There are many more.
2006-10-17 22:07:07 +00:00
Bruce Evans
6a72ff6b09 Don't do null Setattr RPCs for VA_MARK_ATIME. When we added the
VA_MARK_ATIME feature to fix POSIX conformance fore execve() and mmap(),
we thought that it was optimized well enough for the one file system
that supports it (ffs) and harmless for other file systems (except
layered ones which already get the layering for VOP_SETATTR() wrong).
However, nfs_setattr() doesn't do much parameter checking, so when
it gets a combination of parameters that it doesn't understand, it
always does a Setattr RPC.  This RPC can't do anything good, and for
VA_MARK_ATIME it is null except for wasting a lot of time.

This is the smallest and easiest to fix of several bugs that have
increased the number of RPCs for kernel builds on nfs by more than
100% since 2004-11-05.  The real-time increase depends on network
latency and parallelization and can also be very large (approaching
the same percentage for unparallelized operations like "make depend"
on systems with fast CPUs and high-latency networks).
2006-10-14 07:25:11 +00:00
Poul-Henning Kamp
f645b0b51c First part of a little cleanup in the calendar/timezone/RTC handling.
Move relevant variables to <sys/clock.h> and fix #includes as necessary.

Use libkern's much more time- & spamce-efficient BCD routines.
2006-10-02 12:59:59 +00:00
Tor Egge
a1e363f256 Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC.  Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
2006-09-26 04:15:59 +00:00
Tor Egge
5da56ddb21 Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().
2006-09-26 04:12:49 +00:00
Mohan Srinivasan
7d7d9e2242 Fixes up the handling of shared vnode lock lookups in the NFS client,
adds a FS type specific flag indicating that the FS supports shared
vnode lock lookups, adds some logic in vfs_lookup.c to test this flag
and set lock flags appropriately.

- amd on 6.x is a non-starter (without this change). Using amd under
  heavy load results in a deadlock (with cascading vnode locks all the
  way to the root) very quickly.
- This change should also fix the more general problem of cascading
  vnode deadlocks when an NFS server goes down.

Ideally, we wouldn't need these changes, as enabling shared vnode lock
lookups globally would work. Unfortunately, UFS, for example isn't
ready for shared vnode lock lookups, crashing pretty quickly.

This change is the result of discussions with Stephan Uphoff (ups@).

Reviewed by:	ups@
2006-09-13 18:39:09 +00:00
Mohan Srinivasan
6cd7078919 Fix for a deadlock triggered by a 'umount -f' causing a NFS request to never
retransmit (or return). Thanks to John Baldwin for helping nail this one.

Found by : Kris Kennaway
2006-08-29 22:00:12 +00:00
Thomas Quinot
3401780fa0 Fix typos in comment. 2006-08-16 23:53:05 +00:00
Alan Cox
5786be7cc7 Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags.  Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
2006-08-09 17:43:27 +00:00
Brooks Davis
a36aa44a85 Add a new kernel environment variable "boot.netif.mtu" which is used to
set the MTU prior to mounting root via NFS.  This is required if the
server supports a higher than default MTU because the client will not
see the responses otherwise.

MFC after:	3 weeks
2006-08-09 01:56:17 +00:00
Robert Watson
b0668f7151 soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod:	sam, gnn, wollman
2006-07-24 15:20:08 +00:00
Konstantin Belousov
c915bcbad2 Signals may be delivered to process as well as to the thread. Check the
thread-delivered signals in addition to the process one.

Reviewed by:	mohan
MFC after:	1 month
Approved by:	kan (mentor)
2006-07-08 15:39:11 +00:00
Konstantin Belousov
201599c3af Always supply curthread as argument to nfs_asyncio and nfs_doio
in nfs_strategy. Otherwise, for some buffers, signals would be ignored
at the intr mounts.

Reviewed by:	mohan
MFC after:	1 month
Approved by:	kan (mentor)
2006-07-08 15:36:51 +00:00
Yaroslav Tykhiy
4b97d7affd There is a consensus that ifaddr.ifa_addr should never be NULL,
except in places dealing with ifaddr creation or destruction; and
in such special places incomplete ifaddrs should never be linked
to system-wide data structures.  Therefore we can eliminate all the
superfluous checks for "ifa->ifa_addr != NULL" and get ready
to the system crashing honestly instead of masking possible bugs.

Suggested by:	glebius, jhb, ru
2006-06-29 19:22:05 +00:00
Yaroslav Tykhiy
576cdf4352 Use the elegant TAILQ_FOREACH() in place of a hand-rolled for() loop. 2006-06-29 15:37:39 +00:00
Mohan Srinivasan
64c3892747 Kris Kennaway found that for '/' NFS mounts, the MPSAFE mount flag was
not being set, which means Giant would be acquired for these mounts.
2006-05-30 20:32:44 +00:00
Mohan Srinivasan
1af6f471ca Fix for a potential attempt to sleep while holding nm_mtx. Caught and reported
by Witness (which forces the mbuf allocation flag to M_NOWAIT).

Reported by: "sekes".
2006-05-26 18:45:55 +00:00
Stephan Uphoff
6c1b7d16c2 Call vm_object_page_clean() with the object lock held.
Submitted by:	kensmith@
Reviewed by:	mohans@
MFC after:	6 days
2006-05-25 17:16:11 +00:00
Stephan Uphoff
dcf67e65d2 Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by:	tegge@
MFC after:	7 days
2006-05-25 01:00:35 +00:00
Mohan Srinivasan
5bbfbd1422 Since NFSv4 is not SMP safe, nfsiod needs to acquire Giant for NFSv4 mounts
before doing the read/write.

Reported by:	Chuck Lever.
2006-05-24 23:06:50 +00:00
Robert Watson
33c6a485bd Adjust minimum iod threads from 4 to 0 -- since we compile the NFS
client into the kernel by default, and many users won't use NFS,
don't start an extra 4 kernel threads that are unused.  Once NFS
becomes active, it will start nfsiod's as it needs them.

We might consider mandating a minimum iod's equal to the number of
active NFS mounts (truncated to some value), which would force some
to remain available without having to create a new one if the file
system is mostly inactive.

PR:		70880
MFC after:	2 weeks
Prodded by:	cel
Head nod:	peter
Pointed out by:	Joe <fbsd_user at a1poweruser dot com>
2006-05-24 21:04:46 +00:00
Chuck Lever
6d0699a5ba NFS over TCP retransmit behavior should default to a 60 second time out,
mimicing the NFS reference implementation.

NFS over TCP does not need fast retransmit timeouts, since network loss
and congestion are managed by the transport (TCP), unlike with NFS over
UDP.  A long timeout prevents the unnecessary retransmission of non-
idempotent NFS requests.

Reviewed by:	mohans, silby, rees?
Sponsored by:	Network Appliance, Incorporated
2006-05-23 18:48:07 +00:00
Chuck Lever
94163ea283 Refactor the NFS over UDP retransmit timeout estimation logic to allow
the estimator to be more easily tuned and maintained.

There should be no functional change except there is now a lower limit
on the retransmit timeout to prevent the client from retransmitting
faster than the server's disks can fill requests, and an upper limit
to prevent the estimator from taking to long to retransmit during a
server outage.

Reviewed by:	mohan, kris, silby
Sponsored by:	Network Appliance, Incorporated
2006-05-23 18:33:58 +00:00
Mohan Srinivasan
f2c48228fe Vnode locks are recursive and the NFS client support shared vnode locks.
Found by: Kris Kennaway.
2006-05-23 16:07:23 +00:00
Mohan Srinivasan
f1cdf89911 Changes to make the NFS client MP safe.
Thanks to Kris Kennaway for testing and sending lots of bugs my way.
2006-05-19 00:04:24 +00:00
Mohan Srinivasan
671d06fb2e Fix a snafu caused while patching the previous fix from another branch. 2006-05-05 18:12:13 +00:00
Mohan Srinivasan
9f5b7dea42 Fix for a NFS/TCP client bug which would cause the NFS/TCP stream to get
out of sync under heavy loads, forcing frequent reconnets, causing EBADRPC
errors etc.
2006-05-05 18:04:53 +00:00
Mohan Srinivasan
5ef7d50da5 Keep track of the number of in-progress async direct IO writes in the nfsnode.
Make fsync/close wait until all of these drain. Add a check to nfs_getpage() and
nfs_putpage().
2006-04-06 01:20:30 +00:00
Jeff Roberson
b2282f9a3f - Busy the filesystem in nfs_statfs to prevent us from creating a new
vnode after vflush() has succeeded.  This would cause a dangling vnode
   panic at unmount time otherwise.  Other filesystems may have this problem
   via their VFS_VGET() routines.

Found by:	kris
Sponsored by:	Isilon Systems, Inc.
2006-04-01 01:15:23 +00:00
Kris Kennaway
78e31796c9 Fix a bug in the NFS/TCP retransmission path.
The bug was that earlier, if a request was retransmitted,
we would do subsequent retransmits every 10 msecs.

This can cause data corruption under moderate loads by reordering
operations as seen by the client NFS attribute cache, and on the
server side when the retransmission occurs after the original request
has left the duplicate cache, since the operation will be committed
for a second time.

Further work on retransmission handling is needed (e.g. they are still
being done sent too often since they are scaled by HZ, and the size of
the dup cache is too small and easily overwhelmed on busy servers).

Submitted by:	mohans
2006-03-23 22:58:42 +00:00
Pawel Jakub Dawidek
9972deb772 Actually I wanted 'nolockd' here instead of 'lockd'.
MFC after:	2 days
2006-03-19 13:27:37 +00:00
Chuck Lever
a59b03bf0e If an NFS server returns more than a few EJUKEBOX errors for a given RPC
request, the FreeBSD NFS client will quickly back off to a excessively
long wait (days, then weeks) before retrying the request.

Change the behavior of the FreeBSD NFS client to match the behavior of
the reference NFS client implementation (Solaris).  This provides a fixed
delay of 10 seconds between each retry by default.  A sysctl, called
nfs3_jukebox_delay, is now available to tune the delay.  Unlike Solaris,
the sysctl value on FreeBSD is in seconds, rather than in HZ.

Sponsored by:	Network Appliance, Incorporated
Reviewed by:	rick
Approved by:	silby
MFC after:	3 days
2006-03-17 22:14:23 +00:00
Chuck Lever
9f5349f23d Fix a bug in NFSv3 READDIRPLUS reply processing
The client's READDIRPLUS logic skips the attributes and
filehandle of the ".." entry.  If the server doesn't send
attributes but does send a filehandle for "..", the
client's logic doesn't account for the extra "value
follows" field that indicates whether the filehandle is
present, causing the remaining entries in the reply
to be ignored.

Sponsored by:	Network Appliance, Inc.
Reviewed by:	rick, mohans
Approved by:	silby
MFC after:	2 weeks
2006-03-08 01:43:01 +00:00
Jim Rees
4b81d0eb0f Don't log an error on tcp connection reset, even if we don't get ECONNRESET.
Submitted by:	cel@citi.umich.edu
2006-01-20 15:07:18 +00:00
Alfred Perlstein
92e73f5711 I ran into an nfs client panic a couple of times in a row over the
last few days.  I tracked it down to the fact that nfs_reclaim()
is setting vp->v_data to NULL _before_ calling vnode_destroy_object().
After silence from the mailing list I checked further and discovered
that ufs_reclaim() is unique among FreeBSD filesystems for calling
vnode_destroy_object() early, long before tossing v_data or much
of anything else, for that matter.  The rest, including NFS, appear
to be identical, as if they were just clones of one original routine.

The enclosed patch fixes all file systems in essentially the same
way, by moving the call to vnode_destroy_object() to early in the
routine (before the call to vfs_hash_remove(), if any).  I have
only tested NFS, but I've now run for over eighteen hours with the
patch where I wouldn't get past four or five without it.

Submitted by: Frank Mayhar
Requested by: Mohan Srinivasan
MFC After: 1 week
2006-01-17 17:29:03 +00:00
Robert Watson
63074a901a In nfs_dolock(), GC now under-used ioflg, rendered obsolete when we moved
from using a fifo to talk to rpc.lockd to using a special device node.

Noticed by:	Coverity Prevent analysis tool
MFC after:	3 days
2006-01-13 23:16:29 +00:00
Tor Egge
82be0a5a24 Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by:	truckman
2006-01-09 20:42:19 +00:00
Xin LI
fc9fac4c78 Correct a typo 2005-12-28 10:03:48 +00:00
Paul Saab
fc6ff223c4 Improve upon rev 1.133 where NFS/TCP would not reconnect.
Submitted by:	Mohan Srinivasan
2005-12-12 23:18:05 +00:00
Ruslan Ermilov
2f1b461447 Unexpand LLADDR(). 2005-11-29 09:51:47 +00:00
Paul Saab
38b29f71ef Fix for a bug where NFS/TCP would not reconnect (in the case where
the server FIN'ed). Seen with Solaris NFS servers.

Reported by:	TOMITA Yoshinori <yoshint@flab.fujitsu.co.jp>
Submitted by:	Mohan Strinivasan
2005-11-21 19:25:24 +00:00
Paul Saab
3834aac17e - Always return success from NFS strategy. nfs_doio(), in the
event of an error, does the right thing, in terms of setting
  the error flags in the buf header. That fixes a crash from
  bstrategy().
- Treat ETIMEDOUT as a "recoverable" error, causing the buffer
  to be re-dirtied. ETIMEDOUT can occur on soft mounts, when
  the number of retries are exceeded, and we don't want data loss
  in that case.

Submitted by:	Mohan Srinivasan
2005-11-21 19:23:46 +00:00
Jim Rees
cb156cc603 fix a problem with XID re-use when a server returns NFSERR_JUKEBOX.
Submitted by:	cel@citi.umich.edu
Fixed by:	rick@snowhite.cis.uoguelph.ca
Approved by:	alfred
MFC after:	3 weeks
2005-11-21 18:39:18 +00:00
Jonathan Chen
0b3e7451da fix a crash when an nfsv2 mount fails
MFC after:	1 week
2005-11-10 23:25:16 +00:00
Paul Saab
9c31df40bb Fix for a crash (from nfs_lookup() in an error case).
Submitted by:	Mohan Srinivasan
2005-11-03 19:24:54 +00:00
Paul Saab
41ce2892bb In nfs_flush(), clear the NMODIFIED bit only if there are no dirty
buffers *and* there are no buffers queued up for writing.  The bug
was that NMODIFIED was being cleared even while there were buffers
scheduled to be written out, which leads to all sorts of interesting
bugs - one where the file could shrink (because of a post-op getattr
load, say) causing data in buffer(s) queued for write to be tossed,
resulting in data corruption.

Submitted by:	Mohan Srinivasan
2005-11-03 07:42:15 +00:00
Paul Saab
120c58288c Fix for a race between the thread transmitting the request and the
thread processing the reply.

Submitted by:	Mohan Srinivasan
2005-11-03 07:31:06 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Gleb Smirnoff
c0bc2867c1 - Fix leak of struct nlminfo on process exit.
- Fix malloc type collision, that made the above problem
  difficult to understand.

Reported by:	Vladimir Sharun <sharun ukr.net>
2005-10-26 07:18:37 +00:00
Pawel Jakub Dawidek
df71afde00 - Use strsep() instead of strtok().
- strdup() uses M_WAITOK, so we don't need to check it's return value
  against NULL.

MFC after:	2 weeks
2005-10-06 19:04:08 +00:00
Pawel Jakub Dawidek
720f3948c0 Add boot.nfsroot.options loader tunable.
It allows to specify options for NFS root file system.
Currently supported options are: soft, intr, conn, lockd.

I'm adding this functionality mostly for 'lockd' option, which is only
honored when performing the initial mount and will be silently ignored
if used while updating the mount options.

This will allow to use flock(2) without the need of using varmfs or
rpc.lockd and friends.

Example of use:
boot.nfsroot.options="intr,lockd"

MFC after:	2 weeks
2005-10-06 11:18:34 +00:00
Robert Watson
84d2b7df26 Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions.  In most cases, this means adding an
acquisition of Giant immediately around the function.  In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset.  This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having
non-MPSAFE tty code.

MFC after:	1 week
2005-09-19 16:51:43 +00:00
Paul Saab
250614c5ab FIx for a bug in the change that made nfs_timer() MPSAFE. We need to
grab Giant before calling pru_send() (if running with mpsafenet = 0).

Found by:	Jeremie Le Hen.
Fixed by:	Maxime Henrion
2005-07-27 15:06:26 +00:00
Paul Saab
4fb48d10b0 In nfs_nget() if two threads race on the same filehandle, the loser should
cause the nfsnode to get freed. This fixes a potential vnode (and nfsnode)
leak in that path.

Submitted by:	Mohan Srinivasan
Reviewed by:	phk
2005-07-27 15:05:31 +00:00
Paul Saab
865b5cc7fd Remove the NFS client rslock. The rslock was used to serialize
writers that want to extend the file. It was also used to serialize
readers that might want to read the last block of the file (with a
writer extending the file).  Now that we support vnode locking for
NFS, the rslock is unnecessary. Writers grab the exclusive vnode
lock before writing and readers grab the shared (or in some cases
the exclusive) lock.

Submitted by:	Mohan Srinivasan
2005-07-21 22:46:56 +00:00
Paul Saab
4321eae6b7 Make nfs_timer() MPSAFE. With this change, the bottom half of the NFS
client (the interface with the protocol stack and callouts) is
Giant-free.

Submitted by:	Mohan Srinivasan.
2005-07-19 21:27:25 +00:00
Paul Saab
38b8570c55 Fix for a NFS soft mounts bug where if the number of retries exceeds
the max rexmits, the request was not being bounced back with a
ETIMEDOUT error.

Reported by:	Oliver Lehmann
Submitted by:	Mohan Srinivasan
2005-07-18 02:12:17 +00:00
Paul Saab
0e38f5365b Fixes for NFS crashes on architectures that require strict alignment.
- Fix nfsm_disct() so that after pulling up data, the remaining data
  is aligned if necessary.
- Fix nfs_clnt_tcp_soupcall() to bcopy() the rpc length out of the
  mbuf (instead of casting m_data to a uint32).

Submitted by:	Pyun YongHyeon
Reviewed by:	Mohan Srinivasan
2005-07-14 20:08:27 +00:00
Brian Feldman
6979a7592a Ifdef out the incomplete non-blocking IO implementation for NFS
pending discussion of how implementation would proceed.  Applications
like -lc_r expect select(3) to match the EAGAIN-status of IO
functions.

Approved by:	re
2005-06-16 15:43:17 +00:00
Brian Feldman
cc3149b1ea Fix a serious deadlock with the NFS client. Given a large enough
atomic write request, it can fill the buffer cache with the entirety
of that write in order to handle retries.  However, it never drops
the vnode lock, or else it wouldn't be atomic, so it ends up waiting
indefinitely for more buf memory that cannot be gotten as it has it
all, and it waits in an uncancellable state.

To fix this, hibufspace is exported and scaled to a reasonable
fraction.  This is used as the limit of how much of an atomic write
request by the NFS client will be handled asynchronously.  If the
request is larger than this, it will be turned into a synchronous
request which won't deadlock the system.  It's possible this value is
far off from what is required by some, so it shall be tunable as soon
as mount_nfs(8) learns of the new field.

The slowdown between an asynchronous and a synchronous write on NFS
appears to be on the order of 2x-4x.

General nod by:	gad
MFC after:	2 weeks
More testing:	wes
PR:		kern/79208
2005-06-10 23:50:41 +00:00
Dag-Erling Smørgrav
3f54cc0505 Ugh. Previous commit got the logic exactly backward.
Submitted by:	bland
Pointy hat to:	des
2005-05-17 18:23:03 +00:00
Dag-Erling Smørgrav
ff17c7a727 Revision 1.173 broke updating a mount from ro to rw. Fix that by clearing
the MNT_RDONLY flag if MNT_UPDATE is set and "ro" was not specified.

Suggested by:	cognet
2005-05-17 12:00:43 +00:00
Jim Rees
3785bdbe7f set R_MUSTRESEND flag in mark_for_reconnect so re-connected requests get
re-sent instead of timing out.

don't log an error message on reconnection, which is not an error.

remove unused nfs_mrep_before_tsleep.

Reviewed by:	Mohan Srinivasan
Approved by:	alfred
2005-05-10 14:25:14 +00:00
Paul Saab
15ec3fe2f0 Fix a bug in NFS/TCP where retransmissions would not reliably happen
if the server rebooted or tore down the connection for any reason.

Found by:	Jonathan Noack.
Submitted by:	Mohan Srinivasan.
2005-05-04 16:37:31 +00:00
Ian Dowse
2c443c417c Don't copy the NFSMNT_* flags into struct statfs's f_flags field,
as they have no connection with the expected MNT_* flags. This bug
was exposed 18 months ago when the assignments to f_flags in
vfs_syscalls.c were moved to before the VFS_STATFS() call. It was
fixed in the CSRG source 10 years ago, but we never picked up that
change.

PR:		kern/80390
MFC after:	1 week
2005-05-02 15:57:10 +00:00
Dag-Erling Smørgrav
4104e6bc1d When NFS was converted to the new mount syscall, code was written that sets
the MNT_RDONLY flag if the "ro" option was passed in from userland, and
clears it otherwise.  In the diskless case, the MNT_RDONLY flag is already
set when this code is reached, but there are no mount options, so it was
incorrectly cleared.  Change the logic so the MNT_RDONLY flag is set if the
"ro" option was specified, and left alone otherwise.

Note that the NFS code will still happily let you mount a filesystem RW
even if the server exports it RO.  I'm not sure how to fix that.
2005-04-27 14:46:02 +00:00
Dag-Erling Smørgrav
c6acf6d557 While I'm here, list the new kenv (boot.netif.name) along with the others. 2005-04-26 20:47:59 +00:00
Dag-Erling Smørgrav
8f0aecc01f When netbooting, as soon as we've figured out which interface we booted
from, store its name in a kenv variable.
2005-04-26 20:45:29 +00:00
Jim Rees
dcee1d0771 TCP reconnect is not an error.
Change the message from LOG_ERR to LOG_INFO.

Approved by:	alfred
2005-04-18 13:42:13 +00:00
Jeff Roberson
5b5f16b5a8 - cache_lookup() relocks the parent in the DOTDOT case for us.
Spotted by:	phk
Sponsored by:	Isilon Systems, Inc.
2005-04-14 07:08:34 +00:00
Jeff Roberson
4585e3ac5a - Change all filesystems and vfs_cache to relock the dvp once the child is
locked in the ISDOTDOT case.  Se vfs_lookup.c r1.79 for details.

Sponsored by:	Isilon Systems, Inc.
2005-04-13 10:59:09 +00:00
Jeff Roberson
f247a5240d - LK_NOPAUSE is a nop now.
Sponsored by:   Isilon Systems, Inc.
2005-03-31 04:37:09 +00:00
Jeff Roberson
da1c9cb2b5 - Remove wantparent, it is no longer necessary. An assert in vfs_lookup.c
prevents any callers from doing a modifying op without
   LOCKPARENT or WANTPARENT.
2005-03-29 13:09:42 +00:00
Jeff Roberson
5c5e51fd9a - cache_lookup() now locks the new vnode for us to prevent some races.
Remove redundant code.

Sponsored by:	Isilon Systems, Inc.
2005-03-29 13:00:37 +00:00
Jeff Roberson
f6576f194e - We no longer have to bother with PDIRUNLOCK, lookup() handles it for us.
- Network filesystems are written with a special idiom that checks the
   cache first, and may even unlock dvp before discovering that a network
   round-trip is required to resolve the name.  I believe dvp is prevented
   from being recycled even in the forced unmount case by the shared lock
   on the mount point.  If not, this code should grow checks for VI_DOOMED
   after it relocks dvp or it will access NULL v_data fields.

Sponsored by:	Isilon Systems, Inc.
2005-03-28 09:29:58 +00:00
Jeff Roberson
a176ceb322 - Update vfs_root implementations to match the new prototype. None of
these filesystems will support shared locks until they are explicitly
   modified to do so.  Careful review must be done to ensure that this
   is safe for each individual filesystem.

Sponsored by:   Isilon Systems, Inc.
2005-03-24 07:39:03 +00:00
Paul Saab
cae2d2c61f - The NFS client was incorrectly masking SIGSTOP (which is
non-maskable).
- The NFS client needs to guard against spurious wakeups
  while waiting for the response. ltrace causes the process
  under question to wakeup (possibly from ptrace()), which
  causes NFS to wakeup from tsleep without the response being
  delivered.

Submitted by:	Mohan Srinivasan
2005-03-23 22:10:10 +00:00
David Schultz
938838feb1 Don't brelse(bp) if bp is null. Also, eliminate some redundancy
and dead code.

Found by:	Coverity Prevent analysis tool
2005-03-18 21:23:32 +00:00
Poul-Henning Kamp
8b5505c013 Use vfs_hash. 2005-03-16 11:28:19 +00:00
John-Mark Gurney
7f76b06b35 MFp4: use the function to fix the packet header length instead of rolling
our own...
2005-03-16 08:13:08 +00:00
Jeff Roberson
8d8d331063 - VOP_INACTIVE should no longer drop the vnode lock.
Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:15:36 +00:00
Jeff Roberson
c0f681c21d - The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem.  Check that rather than VI_XLOCK.

Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:14:56 +00:00
Jeff Roberson
30144f05f0 - It is no longer necessary to lock and unlock the vnode in nfs_close() as
the top level does this for us now.

Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:11:23 +00:00
Paul Saab
6ff1ccae7f Minor cleanup in nfs_request() and removal of a comment that doesn't
reflect reality.

Submitted by:	Mohan Srinivasan
2005-02-26 18:55:36 +00:00
Poul-Henning Kamp
33822d53bf vp->v_id is a private field for the vfs namecache and it is a big mistake
that NFS ever started using it.  Long time ago I added the necessary
vhold()/vdrop() calls to replace it, but forgot to remove the v_id code.

Do it now.
2005-02-22 14:52:00 +00:00
Poul-Henning Kamp
dfd4be14bd Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close.  This is not yet a generally safe function, but for this very
specific use it is safe.  This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
2005-02-19 11:44:57 +00:00
Paul Saab
7a5147540a Fix for a potential NFS client race where shared data is updated from
base context as well as the socket callback.

Submitted by:	Mohan Srinivasan
2005-02-18 23:41:39 +00:00
John Baldwin
01660e7bc2 Drop Giant before calling kthread_exit(). 2005-02-07 18:21:50 +00:00
Robert Watson
38aa565976 Style cleanup for O_DIRECT sysctl comment introduced in nfs_vnops.c:1.242. 2005-01-29 23:19:08 +00:00
Poul-Henning Kamp
a369f34d76 Make filesystems get rid of their own vnodes vnode_pager object in
VOP_RECLAIM().
2005-01-28 14:42:17 +00:00
Poul-Henning Kamp
b3a4d73ebe Create a vnode_pager object when a file is opened. 2005-01-24 23:03:29 +00:00
Poul-Henning Kamp
56dd36b1a6 Remove unused cred arg from nfs_vinvalbuf() and many bogus arguments
passed for it.
2005-01-24 12:31:06 +00:00
Peter Wemm
bcbfb8bc3d Mostly back out rev 1.33 from quite some time ago, and the followup fixes
and tweaks.  The code was actually quite broken because it discarded the
upper bits of the 64 bit division.  We only had a 50% chance of scaling up
the blocksize for large NFS client mounts when it was needed.  For 5.x and
beyond, this was harmless because we could represent the result in either
case.  For 4.x this was a big problem though.  (4.x also has a df(1) bug to
compound the problem)
2005-01-18 21:59:44 +00:00
Poul-Henning Kamp
7c0745eeae Eliminate unused and unnecessary "cred" argument from vinvalbuf() 2005-01-14 07:33:51 +00:00
Brian Somers
3e056862a2 Include opt_bootp.h for BOOTP_NFSROOT
PR:		73183
Submitted by:	Darrin Smith sdar at salseast dot org
MFC after:	7 days
2005-01-12 12:42:46 +00:00
Poul-Henning Kamp
6ef8480a88 Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.
2005-01-11 10:43:08 +00:00