Commit Graph

485 Commits

Author SHA1 Message Date
kib
8f845e475e Fix the mis-handling of the VV_TEXT on the nullfs vnodes.
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by:	pho (previous version)
MFC after:	2 weeks
2012-09-28 11:25:02 +00:00
delphij
51d47afd85 Honor NFSv3 commit call (RFC 1813, Section 3.3.21) where when count is 0,
the full length from offset is being flushed.  Note that for now VOP_FSYNC
does not support offset and length parameters so we still do the same full
VOP_FSYNC.  This issue was reported at FreeNAS support site as FreeNAS
ticket #1096.

Submitted by:	"ceckerle" <ce.freenas eckerle net>
Prodded by:	gcooper
Reviewed by:	rmacklem
MFC after:	2 weeks
2011-12-15 02:26:53 +00:00
jhb
6dededc9d9 Enhance the sequential access heuristic used to perform readahead in the
NFS server and reuse it for writes as well to allow writes to the backing
store to be clustered.
- Use a prime number for the size of the heuristic table (1017 is not
  prime).
- Move the logic to locate a heuristic entry from the table and compute
  the sequential count out of VOP_READ() and into a separate routine.
- Use the logic from sequential_heuristic() in vfs_vnops.c to update the
  seqcount when a sequential access is performed rather than just
  increasing seqcount by 1.  This lets the clustering count ramp up
  faster.
- Allow for some reordering of RPCs and if it is detected leave the current
  seqcount as-is rather than dropping back to a seqcount of 1.  Also,
  when out of order access is encountered, cut seqcount in half rather than
  dropping it all the way back to 1 to further aid with reordering.
- Fix the new NFS server to properly update the next offset after a
  successful VOP_READ() so that the readahead actually works.

Some of these changes came from an earlier patch by Bjorn Gronwall that was
forwarded to me by bde@.

Discussed with:	bde, rmacklem, fs@
Submitted by:	Bjorn Gronwall (1, 4)
MFC after:	2 weeks
2011-12-01 18:46:28 +00:00
rmacklem
3e62df9adb Fix the NFS servers so that they can do a Lookup of "..",
which requires that ni_strictrelative be set to 0, post-r224810.

Tested by:	swills (earlier version), geo dot liaskos at gmail.com
Approved by:	re (kib)
2011-09-03 00:28:53 +00:00
rwatson
4af919b491 Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
rmacklem
be57e75b00 Fix the kgssapi so that it can be loaded as a module. Currently
the NFS subsystems use five of the rpcsec_gss/kgssapi entry points,
but since it was not obvious which others might be useful, all
nineteen were included. Basically the nineteen entry points are
set in a structure called rpc_gss_entries and inline functions
defined in sys/rpc/rpcsec_gss.h check for the entry points being
non-NULL and then call them. A default value is returned otherwise.
Requested by rwatson.

Reviewed by:	jhb
MFC after:	2 weeks
2011-06-19 22:08:55 +00:00
rmacklem
fbb8a5e8ec Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by:	kib
2011-05-22 01:07:54 +00:00
netchild
cc4128c6b1 Add some FEATURE macros for various features (AUDIT/CAM/IPC/KTR/MAC/NFS/NTP/
PMC/SYSV/...).

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.

Sponsored by:   Google Summer of Code 2010
Submitted by:   kibab
Reviewed by:    arch@ (parts by rwatson, trasz, jhb)
X-MFC after:    to be determined in last commit with code from this project
2011-02-25 10:11:01 +00:00
alc
11491a4c5e Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are
incorrectly calling vm_object_page_clean().  They are passing the length of
the range rather than the ending offset of the range.

Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the
callers.

Reviewed by:	kib
MFC after:	3 weeks
2011-02-05 21:21:27 +00:00
pjd
76df586660 ZFS might not return monotonically increasing directory offset cookies,
so turn off UFS-specific hack that assumes so in ZFS case.
Before the change we can miss returning some directory entries to a
NFS client.

I believe that the hack should be moved to ufs_readdir(), but until we find
somebody who will do it, turn it off for ZFS in NFS server code.

Submitted by:	rmacklem
Discussed with:	rmacklem, mckusick
MFC after:	3 days
2010-12-28 21:12:15 +00:00
pjd
f4e75b41ae Use newly added NFSRV_FLAG_BUSY flag for nfsrv_fhtovp() to keep mount point
busy. This fixes a race where we can pass invalid mount point to VFS_VGET()
via vp->v_mount when exported file system was forcibly unmounted between
nfsrv_fhtovp() and VFS_VGET().

Reviewed by:	kib
MFC after:	5 days
2010-12-21 23:15:40 +00:00
pjd
8503dc84a4 - Move pubflag and lockflag handling from nfsrv_fhtovp() to nfs_namei() -
this is the only place that is different from all the other nfsrv_fhtovp()
  consumers.
  This simplifies nfsrv_fhtovp() a bit and also eliminates one
  vn_lock/VOP_UNLOCK() cycle in case of NFSv3.
- Implement NFSRV_FLAG_BUSY flag for nfsrv_fhtovp() that tells it to leave
  mount point busy.

Reviewed by:	kib
MFC after:	5 days
2010-12-21 23:12:45 +00:00
pjd
c6bae56829 On error, unbusy file system and jump to the end, so we won't try to unlock
NULL *vpp.

Reviewed by:	kib
MFC after:	5 days
2010-12-21 23:04:04 +00:00
pjd
fef0f93e5f After r216626 no extra { } are needed with VFS_UNLOCK_GIANT(). 2010-12-21 22:01:26 +00:00
pjd
c1aba2a1e9 Reduce lock scope a little. 2010-12-19 18:06:20 +00:00
kib
d0d6cc47b6 VOP_ISLOCKED() should not be used to determine if the vnode is locked.
Explicitely track the locked status of the vnode.

Reviewed by:	pjd
Tested by:	avg
MFC after:	1 week
2010-12-15 12:46:53 +00:00
kib
5fa1e0e510 Fix a bug in r214049. The nvp == vp case shall be handled specially
only for !usevget case. If VFS_VGET is working, the vnode shared lock
is obtained recursively and vput() shall be done, not vunref().

Submitted by:	rmacklem
Tested by:	Josh Carroll <josh.carroll gmail com>
MFC after:	3 days
2010-11-05 21:13:16 +00:00
kib
c4752b1717 When readdirplus() is handled on the exported filesystem that does
not support VFS_VGET, like msdosfs, do not call VOP_LOOKUP() for
dotdot on the root directory. Our filesystems expect that VFS handles
dotdot lookups on root on its own.

Reported and tested by:	kevlo
MFC after:   2 weeks
2010-10-19 08:55:31 +00:00
pjd
f8dd61b4d9 - When VFS_VGET() is not supported, switch to VOP_LOOKUP().
- We are fine by only share-locking the vnode.
- Remove assertion that doesn't hold for ZFS where we cross mount points
  boundaries by going into .zfs/snapshot/<name>/.

Reviewed by:	rmacklem
MFC after:	1 month
2010-08-26 23:41:40 +00:00
rmacklem
dad418e560 Patch the regular NFS server so that it returns ESTALE to the client
for all errors returned by VFS_FHTOVP(). This is required to ensure
that EIO doesn't get returned to the client when ZFS is used as the
server file system.

Tested by:	korvus AT comcast.net
Reviewed by:	jhb
MFC after:	2 weeks
2010-03-26 01:19:29 +00:00
marius
e51e443f9c Factor out the code shared between NFS client and server into its own
module. With r203732 it became apparent that creating the sysctl nodes
twice causes at least a warning, however the whole code shouldn't be
present twice in the first place.

Discussed with:	rmacklem
2010-02-16 20:00:21 +00:00
marius
efffc45c73 - Move nfs_realign() from the NFS client to the shared NFS code and
remove the NFS server version in order to reduce code duplication.
  The shared version now uses a second parameter how, which is passed
  on to m_get(9) and m_getcl(9) as the server used M_WAIT while the
  client requires M_DONTWAIT, and replaces the the previously unused
  parameter hsiz.
- Change nfs_realign() to use nfsm_aligned() so as with other NFS code
  the alignment check isn't actually performed on platforms without
  strict alignment requirements for performance reasons because as the
  comment suggests unaligned data only occasionally occurs with TCP.
- Change fha_extract_info() to use nfs_realign() with M_DONTWAIT rather
  than M_WAIT because it's called with the RPC sp_lock held.

Reviewed by:	jhb, rmacklem
MFC after:	1 week
2010-02-09 23:45:14 +00:00
marius
c5e9510afa Some style(9) fixes in order to fabricate a commit to denote that
the commit message for r201896 actually should have read:

As nfsm_srvmtofh_xx() assumes the 4-byte alignment required by XDR
ensure the mbuf data is aligned accordingly by calling nfs_realign()
in fha_extract_info(). This fix is orthogonal to the problem solved
by r199274/r199284.

PR:		142102 (second part)
MFC after:	1 week
2010-01-09 15:59:15 +00:00
marius
5acbf107b3 Exclude options COMPAT_FREEBSD4 now that the MD freebsd4_sigreturn()
is gone since r201396 and which is also in line with the fact that
FreeBSD 4 didn't supported sparc64.

PR:		142102 (second part)
MFC after:	1 week
2010-01-09 15:31:27 +00:00
jhb
045df5c8fd Properly return an error reply if an NFS remove or link operation fails.
Previously the failing operation would allocate an mbuf and construct an
error reply, but because the function did not return 0, the NFS server
assumed it had failed to generate a reply and would leak the reply mbuf as
well as not sending the reply to the NFS client.

PR:		kern/140853
Submitted by:	Ted Faber  faber at isi edu (remove)
Reviewed by:	rmacklem (remove)
MFC after:	1 week
2009-12-03 20:59:28 +00:00
marcel
98727c096e Revert previous change and fix misalignment by using bcopy()
to copy the file handle from fid_data into fh. This eliminates
conditional compilation.

Pointed out by:	imp
2009-11-15 03:09:50 +00:00
marcel
272f28263b Fix an obvious panic by not casting from a pointer that is 4-bytes
alignment to a type that needs 8-byte alignment, and thus causing
misaligned memory references.

MFC after:	1 week
2009-11-14 18:14:07 +00:00
pjd
62c08fc476 Ensure that tv_sec is between INT32_MIN and INT32_MAX, so ZFS won't object.
This completes the fix from r185586.

PR:		kern/139059
Reported by:	Daniel Braniss <danny@cs.huji.ac.il>
Submitted by:	Jaakko Heinonen <jh@saunalahti.fi>
Tested by:	Daniel Braniss <danny@cs.huji.ac.il>
MFC after:	3 days
2009-09-26 18:23:16 +00:00
pjd
f02d7edecd Correct typo after manual patching.
Noticed by:	b. f.
2009-09-09 13:23:26 +00:00
pjd
ef2355e38d Fix usecount leak in mknod(2) on file system exported over NFS.
While I'm here, correct typo in comment.

Reviewed by:	kan, kib
MFC after:	3 days
2009-09-09 12:56:05 +00:00
dfr
5d248bb05f Remove the old kernel RPC implementation and the NFS_LEGACYRPC option.
Approved by: re
2009-06-30 19:03:27 +00:00
jhb
ccb2608d06 Fix build with NFS_LEGACYRPC enabled after the socket upcall locking
changes.

Approved by:	re (kensmith)
2009-06-30 03:18:51 +00:00
brooks
f53c1c309d Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively.  (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer.  Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively.  Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary.  In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups.  When feasible, truncate
the group list rather than generating an error.

Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since
    they are immutable once referenced by more than one entity.

Submitted by:	Isilon Systems (initial implementation)
X-MFC after:	never
PR:		bin/113398 kern/133867
2009-06-19 17:10:35 +00:00
rmacklem
d88296a89f Since svc_[dg|vc|tli|tp]_create() did not hold a reference count on the
SVCXPTR structure returned by them, it was possible for the structure
to be free'd before svc_reg() had been completed using the structure.
This patch acquires a reference count on the newly created structure
that is returned by svc_[dg|vc|tli|tp]_create(). It also
adds the appropriate SVC_RELEASE() calls to the callers, except the
experimental nfs subsystem. The latter will be committed separately.

Submitted by:	dfr
Tested by:	pho
Approved by:	kib (mentor)
2009-06-17 22:50:26 +00:00
rmacklem
9d4e4b79d9 Add a #include <sys/jail.h> so that it builds when
options KGSSAPI is specified in the kernel configuration.

Approved by:	kib (mentor)
2009-06-12 20:18:08 +00:00
rwatson
f4934662e5 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
jhb
a1af9ecca4 Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
  locked.  It is not permissible to call soisconnected() with this lock
  held; however, so socket upcalls now return an integer value.  The two
  possible values are SU_OK and SU_ISCONNECTED.  If an upcall returns
  SU_ISCONNECTED, then the soisconnected() will be invoked on the
  socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls.  The
  API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
  the receive socket buffer automatically.  Note that a SO_SND upcall
  should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
  instead of calling soisconnected() directly.  They also no longer need
  to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
  state machine, but other accept filters no longer have any explicit
  knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
  while invoking soreceive() as a temporary band-aid.  The plan for
  the future is to add a new flag to allow soreceive() to be called with
  the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
  buffer locked.  Previously sowakeup() would drop the socket buffer
  lock only to call aio_swake() which immediately re-acquired the socket
  buffer lock for the duration of the function call.

Discussed with:	rwatson, rmacklem
2009-06-01 21:17:03 +00:00
jamie
572db1408a Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex.  Jails may
have their own host information, or they may inherit it from the
parent/system.  The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL.  The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by:	bz (mentor)
2009-05-29 21:27:12 +00:00
jamie
a013e0afcb Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails.  Child jails may be restricted more than their parents,
but never less.  Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system.  Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings.  The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by:	bz (mentor)
2009-05-27 14:11:23 +00:00
dfr
4bf244123e Fix build of KGSSAPI bits post-vimage. 2009-05-24 11:10:27 +00:00
attilio
1dcb84131b Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled.  Bump __FreeBSD_version in order to signal such
situation.
2009-05-11 15:33:26 +00:00
kan
7b57a857b7 Do not embed struct ucred into larger netcred parent structures.
Credential might need to hang around longer than its parent and be used
outside of mnt_explock scope controlling netcred lifetime. Use separate
reference-counted ucred allocated separately instead.

While there, extend mnt_explock coverage in vfs_stdexpcheck and clean-up
some unused declarations in new NFS code.

Reported by:	John Hickey
PR:		kern/133439
Reviewed by:	dfr, kib
2009-05-09 18:09:17 +00:00
rmacklem
bb19ddd3ae Change nfsserver so that it uses the nfssvc() system call provided
in sys/nfs/nfs_nfssvc.c by registering with it using the
	nfsd_call_nfsserver function pointer. Also, add the build glue for
	nfs_nfssvc.c optionally based on "nfsserver" and also as a loadable
	module.

Submitted by:	rmacklem
Reviewed by:	kib
Approved by:	kib (mentor)
2009-04-12 19:04:27 +00:00
dfr
41382eee31 Fix an mbuf leak in the error path.
Submitted by:	Rick Macklem <rick at snowhite dot cis dot uoguelph dot ca>
2009-03-19 14:13:18 +00:00
rwatson
ceb08f5693 Include audit.h so that the system call path protected by NFS_LEGACYRPC
can audit its arguments.

Submitted by:	Jaakko Heinonen <jh at saunalahti.fi>
MFC after:	1 week
X-MFC-note:	MFC with r188311
2009-02-23 23:04:15 +00:00
jhb
26e338d6fc Use shared vnode locks when invoking VOP_READDIR().
MFC after:	1 month
2009-02-13 18:18:14 +00:00
rwatson
4349e4002b Audit the flag argument to the nfssvc(2) system call.
Obtained from:	TrustedBSD Project
Sponsored by:	Apple, Inc.
2009-02-08 14:04:08 +00:00
ed
a964306db9 Last step of splitting up minor and unit numbers: remove minor().
Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().

We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.

Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.

I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.
2009-01-28 17:57:16 +00:00
kensmith
7d088b7fca Handle VFS_VGET() failing with an error other than EOPNOTSUPP in addition
to failing with that error.

PR:		125149
Submitted by:	Jaakko Heinonen (jh <at> saunalahti <dot> fi)
Reviewed by:	mohans, kan
MFC after:	3 days
2008-12-16 04:34:09 +00:00
dfr
75eeed4f1f We need to pass a structure with enough space for an NFSv2 filehandle to
nfs_srvmtofh_xx otherwise bad things happen when an NFSv2 client tries to
make a request.
2008-12-10 14:49:54 +00:00