a directory to a subdir of the root directory from somewhere else.
For all directory moves that change the parent directory, the dotdot
entry must be fixed up. For msdosfs, the root directory is magic for
non-FAT32. It is less magic for FAT32, but needs the same magic for
the dotdot fixup. It didn't have it.
Both chkdsk and fsck_msdosfs fix the corrupt directory entries with no
problems.
The fix is to use the same magic for dotdot in msdosfs_rename() as in
msdosfs_mkdir().
For msdosfs_mkdir(), document the magic. When writing the dotdot entry
in mkdir, use explicitly set pcl variable instead on relying on the
start cluster of the root directory typically has a value < 65536.
Submitted by: bde
MFC after: 1 week
Trying FAT32 on a small partition failed to mount because
pmp->pm_Sectors was nonzero. Normally, FAT32 file systems are so
large that the 16-bit pm_Sectors can't hold the size. This is
indicated by setting it to 0 and using only pm_HugeSectors. But at
least old versions of newfs_msdos use the 16-bit field if possible,
and msdosfs supports this except for breaking its own support in the
sanity check. This is quite different from the handling of pm_FATsecs
-- now the 16-bit value is always ignored for FAT32 except for
checking that it is 0, and newfs_msdos doesn't use the 16-bit value
for FAT32.
Submitted by: bde
MFC after: 1 week
this check is somewhere in the network code, but this assertion
already proven to be useful in catching what seems to be driver bugs
causing NFS scrambling random memory.
Discussed with: rmacklem
MFC after: 1 week
requested from the server for the read operation. Server shall not
reply with too large size, but client should be resilent too.
Reviewed by: rmacklem
MFC after: 1 week
- Use NFSD_MONOSEC (which maps to time_uptime) instead of the seconds
portion of wall-time stamps to manage timeouts on events.
- Remove unused nd_starttime from the per-request structure in the new
NFS server.
- Use nanotime() for the modification time on a delegation to get as
precise a time as possible.
- Use time_second instead of extracting the second from a call to
getmicrotime().
Submitted by: bde (3)
Reviewed by: bde, rmacklem
MFC after: 2 weeks
The previous change accidentally left the substraction we
were trying to avoid in case that i_blocks could become
negative.
Reported by: bde
MFC after: 4 days
Ext2fs uses unsigned fields in its dinode struct.
FreeBSD can have negative values in some of those
fields and the inode is meant to interact with the
system so we have never respected the unsigned
nature of most of those fields.
Block numbers and the NFS generation number do
not need to be signed so redefine them as
unsigned to better match the on-disk information.
MFC after: 1 week
Testing with fsx has revealed problems and in order to
hunt the bugs properly we need reduce the complexity.
This seems to help but is not a complete solution.
MFC after: 3 days
to the old one's nfs.nfsrv.async.
Please note that by enabling this option (default is disabled), the system
could potentionally have silent data corruption if the server crashes
before write is committed to non-volatile storage, as the client side have
no way to tell if the data is already written.
Submitted by: rmacklem
MFC after: 2 weeks
in r245004. Although the report was for noatime option which is
non-functional for the nullfs, other standard options like nosuid or
noexec are useful with it.
Reported by: Dewayne Geraghty <dewayne.geraghty@heuristicsystems.com.au>
MFC after: 3 days
by returning an error of EINTR rather than EACCES.
- While here, bring back some (but not all) of the NFS RPC statistics lost
when krpc was committed.
Reviewed by: rmacklem
MFC after: 1 week
existing nullfs vnode by the lower vnode is only 16 slots. Since the
default mode for the nullfs is to cache the vnodes, hash has extremely
huge chains.
Size the nullfs hashtbl based on the current value of
desiredvnodes. Use vfs_hash_index() to calculate the hash bucket for a
given vnode.
Pointy hat to: kib
Diagnosed and reviewed by: peter
Tested by: peter, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
get back the leased write reference from the lower vnode. There is no
other path which can correct v_writecount on the lowervp.
Reported by: flo
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
It was plagued with style errors and the offsets had been lost.
While here took the time to update the fields according to the
latest ext4 documentation.
Reviewed by: bde
MFC after: 3 days
Use file name hash as a tree key, handle duplicate keys. Both VOP_LOOKUP
and VOP_READDIR operations utilize same tree for search. Directory
entry offset (cookie) is either file name hash or incremental id in case
of hash collisions (duplicate-cookies). Keep sorted per directory list
of duplicate-cookie entries to facilitate cookie number allocation.
Don't fail if previous VOP_READDIR() offset is no longer valid, start
with next dirent instead. Other file system handle it similarly.
Workaround race prone tn_readdir_last[pn] fields update.
Add tmpfs_dir_destroy() to free all dirents.
Set NFS cookies in tmpfs_dir_getdents(). Return EJUSTRETURN from
tmpfs_dir_getdents() instead of hard coded -1.
Mark directory traversal routines static as they are no longer
used outside of tmpfs_subr.c
the free nullfs vnodes, switching nullfs behaviour to pre-r240285.
The option is mostly intended as the last-resort when higher pressure
on the vnode cache due to doubling of the vnode counts is not
desirable.
Note that disabling the cache costs more than 2x wall time in the
metadata-hungry scenarious. The default is "cache".
Tested and benchmarked by: pho (previous version)
MFC after: 2 weeks
We also try to make better use of the fs flags instead of
trying adapt the code according to the fs structures. In
the case of subsecond timestamps and birthtime we now
check that the feature is explicitly enabled: previously
we only checked that the reserved space was available and
silently wrote them.
This approach is much safer, especially if the filesystem
happens to use embedded inodes or support EAs.
Discussed with: Zheng Liu
MFC after: 5 days
to head. I don't think the NFS client behaviour will change unless
the new "minorversion=1" mount option is used. It includes basic
NFSv4.1 support plus support for pNFS using the Files Layout only.
All problems detecting during an NFSv4.1 Bakeathon testing event
in June 2012 have been resolved in this code and it has been tested
against the NFSv4.1 server available to me.
Although not reviewed, I believe that kib@ has looked at it.
which dumps out the actual options being used by an NFS mount.
This will be used to implement a "-m" option for nfsstat(1).
Reviewed by: alfred
MFC after: 2 weeks
Bring several definitions required for newer ext4 features.
Rename EXT2F_COMPAT_HTREE to EXT2F_COMPAT_DIRHASHINDEX since it
is not being used yet and the new name is more compatible with
NetBSD and Linux.
This change is purely cosmetic and has no effect on the real
code.
Obtained from: NetBSD
MFC after: 3 days
When a file is first being written, the dynamic block reallocation
(implemented by ext2_reallocblks) relocates the file's blocks
so as to cluster them together into a contiguous set of blocks on
the disk.
When the cluster crosses the boundary into the first indirect block,
the first indirect block is initially allocated in a position
immediately following the last direct block. Block reallocation
would usually destroy locality by moving the indirect block out of
the way to keep the data blocks contiguous.
The issue was diagnosed long ago by Bruce Evans on ffs and surfaced
on ext2fs when block reallocaton was ported. This is only a partial
solution based on the similarities with FFS. We still require more
review of the allocation details that vary in ext2fs.
Reported by: bde
MFC after: 1 week
the vnode use count, and this might cause the kernel to panic if compiled
with WITNESS enable.
- Be sure to put the '\0' terminator to the rpath string.
Sponsored by: iXsystems inc.
useful and has the side effect of obfuscating the code a bit.
- Remove spurious references to simple_lock.
Reported by: attilio [1]
Sponsored by: iXsystems inc.
vnode and following back the chain of n_parent pointers up to the root,
without acquiring the locks of the n_parent vnodes analyzed during the
computation. This is immediately wrong because if the vnode lock is not
held there's no guarantee on the validity of the vnode pointer or the data.
In order to fix, store the whole path in the smbnode structure so that
smbfs_fullpath() can use this information.
Discussed with: kib
Reported and tested by: pho
Sponsored by: iXsystems inc.
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.
Removes comments that makes no sense now.
Reviewed by: kib
MFC after: 3 days
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.
Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.
Requested and reviewed by: pjd
Tested by: pho
MFC after: 3 weeks
on timing of the operations and not real lookup, bringing too many
false positives. Remove the whole mechanism. If it needs to be
implemented, next time it should really be done in the proper way.
- Fix VOP_GETATTR() in order to cope with userland bugs that would
change the type of file and not panic. Instead it gets the entry as
if it is not existing.
Reported and tested by: flo
MFC after: 2 months
X-MFC: 241519, 242536,242616
was still possible to open for write from the lower filesystem. There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.
Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.
Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.
Tested by: pho
MFC after: 3 weeks
- Change the code so that it relies on vfs_hash rather than on a
home-made hashtable.
- There's no need to inline fnv_32_buf().
Reviewed by: delphij
Tested by: pho
Sponsored by: iXsystems inc.
VM pressure. The reason is that in some codepaths pointers to stack
variables were passed from one thread to another.
In collaboration with: pho
Reported by: pho's stress2 suite
Sponsored by: iXsystems inc.
- Use M_ZERO flag in malloc() rather than bzero()
- malloc() with M_NOWAIT can't return NULL so there's no need to check
Reviewed by: alc
Approved by: alc
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho
processes running as root to suspend/resume execution
of the kernel nfsd threads. An earlier version of this
patch was tested by Vincent Hoffman (vince at unsane.co.uk)
and John Hickey (jh at deterlab.net).
Reviewed by: kib
MFC after: 2 weeks
This has been developed during 2 summer of code mandates and being revived
by gnn recently.
The functionality in this commit mirrors entirely content of fusefs-kmod
port, which doesn't need to be installed anymore for -CURRENT setups.
In order to get some sparse technical notes, please refer to:
http://lists.freebsd.org/pipermail/freebsd-fs/2012-March/013876.html
or to the project branch:
svn://svn.freebsd.org/base/projects/fuse/
which also contains granular history of changes happened during port
refinements. This commit does not came from the branch reintegration
itself because it seems svn is not behaving properly for this functionaly
at the moment.
Partly Sponsored by: Google, Summer of Code program 2005, 2011
Originally submitted by: ilya, Csaba Henk <csaba-ml AT creo DOT hu >
In collabouration with: pho
Tested by: flo, gnn, Gustau Perez,
Kevin Oberman <rkoberman AT gmail DOT com>
MFC after: 2 months
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.
Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.
Tested by: pho (previous version)
MFC after: 2 weeks
and owner_group strings that consist entirely of
digits, interpreting them as the uid/gid number.
This change was needed since new (>= 3.3) Linux
servers reply with these strings by default.
This change is mandated by the rfc3530bis draft.
Reported on freebsd-stable@ under the Subject
heading "Problem with Linux >= 3.3 as NFSv4 server"
by Norbert Aschendorff on Aug. 20, 2012.
Tested by: norbert.aschendorff at yahoo.de
Reviewed by: jhb
MFC after: 2 weeks
which is an elaborate way to cause kernel panic. Change the VOPs
implementation to return EBADF for a reclaimed vnode.
While the calls to vop_bmap should not reach deadfs, it is indeed
possible for vop_ioctl, because the VOP locking protocol is to pass
the vnode to VOP unlocked. The actual panic was observed when ioctl
was called on procfs filedescriptor which pointed to an exited
process.
Reported by: zont
Tested by: pho
MFC after: 1 week
Use it for a printf() that can be harmlessly generated for mmap()'d
files. It will be used extensively for the NFSv4.1 client.
Debugging printf()s are enabled by setting vfs.nfs.debuglevel to
a non-zero value. The higher the value, the more debugging printf()s.
Reviewed by: jhb
MFC after: 2 weeks
it. There are two problems which shall be addressed for shared
lookups use to have measurable effect on nullfs scalability:
1. When vfs_lookup() calls VOP_LOOKUP() for nullfs, which passes lookup
operation to lower fs, resulting vnode is often only shared-locked. Then
null_nodeget() cannot instantiate covering vnode for lower vnode, since
insmntque1() and null_hashins() require exclusive lock on the lower.
Change the assert that lower vnode is exclusively locked to only
require any lock. If null hash failed to find pre-existing nullfs
vnode for lower vnode and the vnode is shared-locked, the lower vnode
lock is upgraded.
2. Nullfs reclaims its vnodes on deactivation. This is due to nullfs
inability to detect reclamation of the lower vnode. Reclamation of a
nullfs vnode at deactivation time prevents a reference to the lower
vnode to become stale.
Change nullfs VOP_INACTIVE to not reclaim the vnode, instead use the
VFS_RECLAIM_LOWERVP to get notification and reclaim upper vnode
together with the reclamation of the lower vnode.
Note that nullfs reclamation procedure calls vput() on the lowervp
vnode, temporary unlocking the vnode being reclaimed. This seems to be
fine for MPSAFE filesystems, but not-MPSAFE code often put partially
initialized vnode on some globally visible list, and later can decide
that half-constructed vnode is not needed. If nullfs mount is created
above such filesystem, then other threads might catch such not
properly initialized vnode. Instead of trying to overcome this case,
e.g. by recursing the lower vnode lock in null_reclaim_lowervp(), I
decided to rely on nearby removal of the support for non-MPSAFE
filesystems.
In collaboration with: pho
MFC after: 3 weeks
1) It is not useful to call "devfs_clear_cdevpriv()" from
"d_close" callbacks, hence for example read, write, ioctl and
so on might be sleeping at the time of "d_close" being called
and then then freed private data can still be accessed.
Examples: dtrace, linux_compat, ksyms (all fixed by this patch)
2) In sys/dev/drm* there are some cases in which memory will
be freed twice, if open fails, first by code in the open
routine, secondly by the cdevpriv destructor. Move registration
of the cdevpriv to the end of the drm open routines.
3) devfs_clear_cdevpriv() is not called if the "d_open" callback
registered cdevpriv data and the "d_open" callback function
returned an error. Fix this.
Discussed with: phk
MFC after: 2 weeks
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.
Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.
Noted and reviewed by: alc
MFC after: 1 week
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.
Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.
Suggested and reviewed by: alc
MFC after: 2 weeks
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().
Reviewed by: alc
MFC after: 1 week
I found 8.3 is a history BSD version using socket to implement FIFO
pipe, it uses per-file seqcount to compare with writer generation
stored in per-pipe object. The concept is after all writers are gone,
the pipe enters next generation, all old readers have not closed the
pipe should get the indication that the pipe is disconnected, result
is they should get EPIPE, SIGPIPE or get POLLHUP in poll().
But newcomer should not know that previous writters were gone, it
should treat it as a fresh session.
I am trying to bring back FIFO pipe to history behavior. It is still
unclear that if single EOF flag can represent SBS_CANTSENDMORE and
SBS_CANTRCVMORE which socket-based version is using, but I have run
the poll regression test in tool directory, output is same as the one
on 8.3-STABLE now.
I think the output "not ok 18 FIFO state 6b: poll result 0 expected 1.
expected POLLHUP; got 0" might be bogus, because newcomer should not
know that old writers were gone. I got the same behavior on Linux.
Our implementation always return POLLIN for disconnected pipe even it
should return POLLHUP, but I think it is not wise to remove POLLIN for
compatible reason, this is our history behavior.
Regression test: /usr/src/tools/regression/poll
flag but not PIPE_WANTW, but FIFO pipe code does not understand this internal
state, when a FIFO peer reader closes the pipe, it wants to notify the writer,
it checks PIPE_WANTW, if not set, it skips calling wakeup(), so blocked writer
never noticed the case, but in general, the writer should return from the
syscall with EPIPE error code and may get SIGPIPE signal. Setting the
PIPE_WANTW fixed problem, or you can turn off direct write, it should fix the
problem too. This bug is found by PR/170203.
Another bug in FIFO pipe code is when peer closes the pipe, another end which
is being blocked in select() or poll() is not notified, it missed to call
pipeselwakeup().
Third problem is found in poll regression test, the existing code can not
pass 6b,6c,6d tests, but FreeBSD-4 works. This commit does not fix the
problem, I still need to study more to find the cause.
PR: 170203
Tested by: Garrett Copper < yanegomi at gmail dot com >
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().
Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.
The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).
Tested by: pho
No objections from: jhb
MFC after: 3 weeks
bwrite(). VFS needs to know about EFAULT from uiomove() and does not
care much that partially filled block writeback after EFAULT was
successfull. Early return without error causes short write to be
reported to usermode.
Reported and tested by: andreast
MFC after: 3 weeks
written, but not msync'd by a process. A VOP_PUTPAGES()
called when VOP_RECLAIM() happens will usually fail, since
the NFSv4 Open has already been closed by VOP_INACTIVE().
Add a vm_object_page_clean() call to the NFSv4 client's
VOP_INACTIVE(), so that the write happens before the NFSv4
Open is closed. kib@ suggested using vgone() instead and
I will explore this, but this patch fixes things in the
meantime. For some reason, the VOP_PUTPAGES() is still
attaempted in VOP_RECLAIM(), but having this fail doesn't
cause any problems except a "stateid0 in write" being logged.
Reviewed by: kib
MFC after: 1 week
Do not brelse() the buffer unconditionally with BIO_ERROR set if
uiomove() failed. The brelse() treats most buffers with BIO_ERROR as
B_INVAL, dropping their content. Instead, if the write request
covered the whole buffer, remember the cached state and brelse() with
BIO_ERROR set only if the buffer was not cached previously.
Update the buffer dirtyoff/dirtyend based on the progress recorded by
uiomove() in passed struct uio, even in the presence of
error. Otherwise, usermode could see changed data in the backed pages,
but later the buffer is destroyed without write-back.
If uiomove() failed for IO_UNIT request, try to truncate the vnode
back to the pre-write state, and rewind the progress in passed uio
accordingly, following the FFS behaviour.
Reviewed by: rmacklem (some time ago)
Tested by: pho
MFC after: 1 month
Revamp the CAM enclosure services driver.
This updated driver uses an in-kernel daemon to track state changes and
publishes physical path location information\for disk elements into the
CAM device database.
Sponsored by: Spectra Logic Corporation
Sponsored by: iXsystems, Inc.
Submitted by: gibbs, will, mav
to freebsd-fs@, where the setfacl of an NFSv4 acl would fail.
This was caused by the VOP_ACLCHECK() call for ZFS replying
EOPNOTSUPP. After discussion with rwatson@, it was determined
that a call to VOP_ACLCHECK() before doing VOP_SETACL() is not
required. This patch fixes the problem by deleting the
VOP_ACLCHECK() call.
Tested by: Andrew Leonard (previous version)
MFC after: 1 week
The NAND Flash environment consists of several distinct components:
- NAND framework (drivers harness for NAND controllers and NAND chips)
- NAND simulator (NANDsim)
- NAND file system (NAND FS)
- Companion tools and utilities
- Documentation (manual pages)
This work is still experimental. Please use with caution.
Obtained from: Semihalf
Supported by: FreeBSD Foundation, Juniper Networks
8.x code:
- If the lock cannot be acquired immediately unlocks 'bar' vnode
and then locks both vnodes in order.
- wrong vnode type panics from cache_enter_time after calls by
ext2_lookup.
The fix merges the fixes from ufs/ufs_lookup.c.
Submitted by: Mateusz Guzik
Approved by: jhb@ (mentor)
Reviewed by: kib@
MFC after: 1 week
Entries with zero inode number are considered placeholders by libc and
UFS. Fix remaining uses of VOP_READDIR in kernel: vop_stdvptocnp,
unionfs.
Sponsored by: Google Summer of Code 2011
used, when the code should actually protect the tested
variable with a mutex. Since the tsleep()s had a 10sec
timeout, the race would have only delayed the allocation
of a new clientid for a client. The sleeps will also
rarely occur, since having a callback in progress when
a client acquires a new clientid, is unlikely.
in practice, since having a callback in progress when
a fresh clientid is being acquired by a client is unlikely.
MFC after: 1 month
memory mapped pages being written back on an NFS mount.
Since any thread can call VOP_PUTPAGES() to write back a
dirty page, the credentials of that thread may not have
write access to the file on an NFS server. (Often the uid
is 0, which may be mapped to "nobody" in the NFS server.)
Although there is no completely correct fix for this
(NFS servers check access on every write RPC instead of at
open/mmap time), this patch avoids the common cases by
holding onto a credential that recently opened the file
for writing and uses that credential for the write RPCs
being done by VOP_PUTPAGES() for both NFS clients.
Tested by: Joel Ray Holveck (joelh at juniper.net)
PR: kern/165923
Reviewed by: kib
MFC after: 2 weeks
- fixed an incorrect lock status issue.
- fixed an incorrect lock issue of unionfs root vnode removed.
(pointed out by keith)
- fixed an infinity loop issue.
(pointed out by dumbbell)
- changed to do LK_RELEASE expressly when unlocked.
Submitted by: ozawa@ongs.co.jp
do not include file attributes in the reply to an NFS create RPC
under certain circumstances.
This resulted in a vnode of type VNON that was not usable.
This patch adds an NFS getattr RPC to nfs_create() for this case,
to fix the problem. It was tested by the person that reported
the problem and confirmed to fix this case for their server.
Tested by: Steven Haber (steven.haber at isilon.com)
MFC after: 2 weeks
ZFS volume is exported via the new NFS server. The leak occurred
because the new NFS server code didn't handle the case where
a file system sets the SAVENAME flag in its VOP_LOOKUP() and
ZFS does this for the DELETE case.
Tested by: Oliver Brandmueller (ob at gruft.de), hrs
PR: kern/167266
MFC after: 1 month
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.
To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).
This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).
To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.
The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
Since r230208 update mounts were allowed if the list of mount options
contained the "export" option. This is not correct as tmpfs doesn't
really support updating all options.
Reviewed by: kevlo, trociny
Return EPERM from ext2_setattr() when an user without PRIV_VFS_SYSFLAGS
privilege attempts to toggle SF_SETTABLE flags.
Flags are now stored to ip->i_flags in one place after all checks.
Also, remove SF_NOUNLINK from the checks because ext2fs doesn't support
that flag.
Reviewed by: bde
- Use more natural ip->i_flags instead of vap->va_flags in the final
flags check.
- Style improvements.
No functional change intended.
MFC after: 2 weeks
accesses of the cache member of vm_object objects.
- Use novel vm_page_is_cached() for checks outside of the vm subsystem.
Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039
allocator.
Replace UINT32_MAX checks with INT_MAX. Keeping more than 2^31 nodes in
memory is not likely to become possible in foreseeable feature and would
require new unit number allocator.
Discussed with: delphij
MFC after: 2 weeks
behaviour on error from write RPC back to behaviour of old nfs client.
When set to not zero, the pages for which write failed are kept dirty.
PR: kern/165927
Reviewed by: alc
MFC after: 2 weeks
Doomed vnode is hardly of any use here, besides all callers handle error
case. vfs_hash_get() does the same.
Don't mess with vnode holdcount, vget() takes care of it already.
Approved by: mdf (mentor)
When using big inodes there is sufficient space in ext3 to
keep extra resolution and birthtime (creation) timestamps.
The appropriate fields in the on-disk inode have been approved
for a long time but support for this in ext3 has not been
widely distributed.
In preparation for ext4 most linux distributions have enabled
by default such bigger inodes and some people use nanosecond
timestamps in ext3. We now support those when the inode is big
enough and while we do recognize the EXT4F_ROCOMPAT_EXTRA_ISIZE,
we maintain the extra timestamps even when they are not used.
An additional note by Bruce Evans:
We blindly accept unrepresentable tv_nsec in VOP_SETATTR(), but
all file systems have always done that. When POSIX gets around
to specifying the behaviour, it will probably require certain
rounding to the fs's resolution and not rejecting the request.
This unfortunately means that syscalls that set times can't
really tell if they succeeded without reading back the times
using stat() or similar and checking that they were set close
enough.
Reviewed by: bde
Approved by: jhb (mentor)
MFC after: 2 weeks
fifo_iseof() condition, allowing the v_fifoinfo to be reset and freed
by fifo_cleanup().
Precalculate EOF at the places were fo_wgen is changed, and cache the
state in a new pipe state flag PIPE_SAMEWGEN.
Reported and tested by: bf
Submitted by: gianni
MFC after: 1 week (a backport)
in the new NFS server for NFSv4, where it would report ENOENT
when the file actually existed on the server. This turned out
to be caused by not initializing ni_topdir before calling lookup()
and there was a rare case where the value on the stack location
assigned to ni_topdir happened to be a pointer to a ".." entry,
such that "dp == ndp->ni_topdir" succeeded in lookup().
This patch initializes ni_topdir to fix the problem.
MFC after: 5 days
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.
Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks
associated with the previous vnode (if any) associated with the target of
a rename(). Otherwise, a lookup of the target pathname concurrent with a
rename() could re-add a name cache entry after the namei(RENAME) lookup
in kern_renameat() had purged the target pathname.
MFC after: 2 weeks
instead of accepting half-constructed vnode. Previous code cannot decide
what to do with such vnode anyway, and although processing it for hash
removal, paniced later when getting rid of nullfs reference on lowervp.
While there, remove initializations from the declaration block.
Tested by: pho
MFC after: 1 week
function null_destroy_proto() from null_insmntque_dtr(). Also
apply null_destroy_proto() in null_nodeget() when we raced and a vnode
is found in the hash, so the currently allocated protonode shall be
destroyed.
Lock the vnode interlock around reassigning the v_vnlock.
In fact, this path will not be exercised after several later commits,
since null_nodeget() cannot take shared-locked lowervp at all due to
insmntque() requirements.
Reported by: rea
Tested by: pho
MFC after: 1 week
a new jail parameter node with the following parameters:
allow.mount.devfs:
allow mounting the devfs filesystem inside a jail
allow.mount.nullfs:
allow mounting the nullfs filesystem inside a jail
Both parameters are disabled by default (equals the behavior before
devfs and nullfs in jails). Administrators have to explicitly allow
mounting devfs and nullfs for each jail. The value "-1" of the
devfs_ruleset parameter is removed in favor of the new allow setting.
Reviewed by: jamie
Suggested by: pjd
MFC after: 2 weeks
"panic in 8.3-PRERELEASE" on Feb. 22, 2012. This panic was caused
by use of a mix of tsleep() and msleep() calls on the same event
in the new NFS server DRC code. It did "mtx_unlock(); tsleep();"
in two places, which kib@ noted introduced a slight risk that the
wakeup() would occur before the tsleep(), resulting in a 10sec
delay before waking up. This patch fixes the problem by replacing
"mtx_unlock(); tsleep();" with mtx_sleep(..PDROP..). It also
changes a nfsmsleep() call to mtx_sleep() so that the code uses
mtx_sleep() consistently within the file.
Tested by: hrs (in progress)
Reviewed by: jhb
MFC after: 5 days
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.
Discussed with: bde, das (previous versions)
MFC after: 1 month
A new jail(8) option "devfs_ruleset" defines the ruleset enforcement for
mounting devfs inside jails. A value of -1 disables mounting devfs in
jails, a value of zero means no restrictions. Nested jails can only
have mounting devfs disabled or inherit parent's enforcement as jails are
not allowed to view or manipulate devfs(8) rules.
Utilizes new functions introduced in r231265.
Reviewed by: jamie
MFC after: 1 month
Add support for updating the devfs mount (currently only changing the
ruleset number is supported).
Check mnt_optnew with vfs_filteropt(9).
This new option sets the specified ruleset number as the active ruleset
of the new devfs mount and applies all its rules at mount time. If the
specified ruleset doesn't exist, a new empty ruleset is created.
MFC after: 1 month
ext4 but that can be used in ext3 mode.
Also adjust the internal inode to carry the birthtime,
like in UFS, which is starting to get some use when
big inodes are available.
Right now these are just placeholders for features
to come.
Approved by: jhb (mentor)
MFC after: 2 weeks
a credential structure would corrupt it. This happened when the
p argument was != NULL. However, I now realize that the copying of
open credentials should only happen for p == NULL, since that indicates
that it is a read-ahead or write-behind. This patch fixes this.
After this commit, r228827 could be reverted, but I think the code is
clearer and safer with the patch, so I am going to leave it in.
Without this patch, it was possible that a NFSv4 VOP_SETATTR() could have
changed the credentials of the caller. This would have happened if
the process doing the VOP_SETATTR() did not have the file open, but
some other process running as a different uid had the file open for writing
at the same time.
MFC after: 5 days
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.
Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.
Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.
Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks
the original IPv4 implementation from r178888:
- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
ease MFCs. Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
are locked to the default FIB. Individual IPv6 addresses will only
appear in the default FIB, however redirect information and prefixes
of connected subnets are automatically propagated to all FIBs by
default (mimicking IPv4 behavior as closely as possible).
Sponsored by: Cisco Systems, Inc.
any thread doing an I/O RPC with a transfer size greater than
NFS_UDPMAXDATA will be hung indefinitely, retrying the RPC.
After a discussion on freebsd-fs@, I decided to add a warning
message for this case, as suggested by Jeremy Chadwick.
Suggested by: freebsd at jdc.parodius.com (Jeremy Chadwick)
MFC after: 2 weeks
NFS clients was reported to freebsd-fs@ under the subject "NFS
corruption in recent HEAD" on Nov. 26, 2011. This problem occurred when
a TCP mounted root fs was changed to using UDP. I believe that this
problem was caused by the change in mnt_stat.f_iosize that occurred
because rsize was decreased to the maximum supported by UDP. This
patch fixes the problem by using v_bufobj.bo_bsize instead of f_iosize,
since the latter is set to f_iosize when the vnode is allocated, but
does not change for a given vnode when f_iosize changes.
Reported by: pjd
Reviewed by: kib
MFC after: 2 weeks
appropriate timestamps. Restore the assertions which verify that
NCF_TS is set when timestamp is asked for.
Reviewed by: jhb (previous version)
MFC after: 2 weeks
we will only trust a positive name cache entry for a specified amount of
time before falling back to a LOOKUP RPC, even if the ctime for the file
handle matches the cached copy in the name cache entry. The timeout is
configured via a new 'nametimeo' mount option and defaults to 60 seconds.
It may be set to zero to disable positive name caching entirely.
Reviewed by: rmacklem
MFC after: 1 week
from TCP to UDP and the rsize/wsize/readdirsize is greater
than NFS_MAXDGRAMDATA, it is possible for a thread doing an
I/O RPC to get stuck repeatedly doing retries. This happens
because the RPC will use a resize/wsize/readdirsize that won't
work for UDP and, as such, it will keep failing indefinitely.
This patch returns an error for this case, to avoid the problem.
A discussion on freebsd-fs@ seemed to indicate that returning
an error was preferable to silently ignoring the "udp"/"mntudp"
option.
This problem was discovered while investigating a problem reported
by pjd@ via email.
MFC after: 2 weeks
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.
MFC after: 2 weeks
subject "Data corruption over NFS in -current". During investigation
of this, I came across an ugly bogusity in the new NFS client where
it replaced the cr_uid with the one used for the mount. This was
done so that "system operations" like the NFSv4 Renew would be
performed as the user that did the mount. However, if any other
thread shares the credential with the one doing this operation,
it could do an RPC (or just about anything else) as the wrong cr_uid.
This patch fixes the above, by using the mount credentials instead of
the one provided as an argument for this case. It appears
to have fixed Martin's problem.
This patch is needed for NFSv4 mounts and NFSv3 mounts against
some non-FreeBSD servers that do not put post operation attributes
in the NFSv3 Statfs RPC reply.
Tested by: Martin Cracauer (cracauer at cons.org)
Reviewed by: jhb
MFC after: 2 weeks
to read strings completely to know the actual size.
As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.
Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted procces stack.
Suggested by: kib
MFC after: 3 days
vm_object_pip_{add,subtract}() on the swap object because the swap
object can't be destroyed while the vnode is exclusively locked.
Moreover, even if the swap object could have been destroyed during
tmpfs_nocacheread() and tmpfs_mappedwrite() this code is broken
because vm_object_pip_subtract() does not wake up the sleeping thread
that is trying to destroy the swap object.
Free invalid pages after an I/O error. There is no virtue in keeping
them around in the swap object creating more work for the page daemon.
(I believe that any non-busy page in the swap object will now always
be valid.)
vm_pager_get_pages() does not return a standard errno, so its return
value should not be returned by tmpfs without translation to an errno
value.
There is no reason for the wakeup on vpg in tmpfs_mappedwrite() to
occur with the swap object locked.
Eliminate printf()s from tmpfs_nocacheread() and tmpfs_mappedwrite().
(The swap pager already spam your console if data corruption is
imminent.)
Reviewed by: kib
MFC after: 3 weeks
the new NFSv4 server where the code follows the wrong list.
Fortunately, for these fairly rare cases, the lc_stateid[]
lists are normally empty. This patch fixes the code to
follow the correct list.
Reported by: tai.horgan at isilon.com
Discussed with: zack
MFC after: 2 weeks
would go negative after using the "-z" option to zero out the stats.
This patch fixes that by not zeroing out the srvcache_size field
for "-z", since it is the size of the cache and not a counter.
MFC after: 2 weeks
operation on POSIX shared memory objects and tmpfs. Previously, neither of
these modules correctly handled the case in which the new size of the object
or file was not a multiple of the page size. Specifically, they did not
handle partial page truncation of data stored on swap. As a result, stale
data might later be returned to an application.
Interestingly, a data inconsistency was less likely to occur under tmpfs
than POSIX shared memory objects. The reason being that a different mistake
by the tmpfs truncation operation helped avoid a data inconsistency. If the
data was still resident in memory in a PG_CACHED page, then the tmpfs
truncation operation would reactivate that page, zero the truncated portion,
and leave the page pinned in memory. More precisely, the benevolent error
was that the truncation operation didn't add the reactivated page to any of
the paging queues, effectively pinning the page. This page would remain
pinned until the file was destroyed or the page was read or written. With
this change, the page is now added to the inactive queue.
Discussed with: jhb
Reviewed by: kib (an earlier version)
MFC after: 3 weeks
The effect of this was, for clients mounted via inet6 addresses,
that the DRC cache would never have a hit in the server. It also
broke NFSv4 callbacks when an inet6 address was the only one available
in the client. This patch fixes the above, plus deletes opt_inet6.h
from a couple of files it is not needed for.
MFC after: 2 weeks
consequence sbuf_len() will return -1 for buffers which had the error
status set prior to sbuf_finish() call. This causes a problem in
pfs_read() which purposely uses a fixed size sbuf to discard bytes which
are not needed to fulfill the read request.
Work around the problem by using the full buffer length when
sbuf_finish() indicates an overflow. An overflowed sbuf with fixed size
is always full.
PR: kern/163076
Approved by: des
MFC after: 2 weeks
Several callers of null_nodeget() did the cleanup itself, but several
missed it, most prominent being null_bypass(). Remove the cleanup from
the callers, now null_nodeget() handles lowervp free itself.
Reported and tested by: pho
MFC after: 1 week
tmpfs_nocacheread(). It is both unnecessary and a pessimization. It
results in either the page being zeroed twice or zeroed first and then
overwritten by an I/O operation.
MFC after: 3 weeks
It seems strchr() and strrchr() are used more often than index() and
rindex(). Therefore, simply migrate all kernel code to use it.
For the XFS code, remove an empty line to make the code identical to
the code in the Linux kernel.
While there, remove a useless check from the code. memcchr() always
returns characters unequal to 0xff in this case, so inosused[i] ^ 0xff
can never be equal to zero. Also, the fact that memcchr() returns a
pointer instead of the number of bytes until the end, makes conversion
to an offset far more easy.
jhb@ spotted that nfscl_getstateid() might modify credentials when
called from nfsrpc_read() for the case where p != NULL, whereas
nfsrpc_read() only did a crdup() to get new credentials for p == NULL.
This bug was introduced by r195510, since pre-r195510 nfscl_getstateid()
only modified credentials for the p == NULL case. This patch modifies
nfsrpc_read()/nfsrpc_write() so that they do crdup() for the p != NULL case.
It is conceivable that this bug caused the crash reported by glebius@, but
that will not be determined for some time, since the crash occurred after
about 1month of operation.
Tested by: glebius
Reviewed by: jhb
MFC after: 2 weeks
get a reply of EEXIST from an NFS server when a Mkdir RPC was retried,
for an NFS over UDP mount.
Upon investigation, it was found that the client was retransmitting
the Mkdir RPC request over UDP, but with a different xid. As such,
the retransmitted message would miss the Duplicate Request Cache
in the server, causing it to reply EEXIST. The kernel client side
UDP rpc code has two timers. The first one causes a retransmit using
the same xid and socket and was set to a fixed value of 3seconds.
(The default can be overridden via CLSET_RETRY_TIMEOUT.)
The second one creates a new socket and xid and should be larger
than the first. However, both NFS clients were setting the second
timer to nm_timeo ("timeout=<value>" mount argument), which defaulted to
1second, so the first timer would never time out.
This patch fixes both NFS clients so that they set the first timer
using nm_timeo and makes the second timer larger than the first one.
Reported by: jwd
Tested by: jwd
Reviewed by: jhb
MFC after: 2 weeks
Fix a comment from the previous commit.
Use M_ZERO instead of bzero() in ext2_vfsops.c
Add include guards from PR.
PR: 162564
Approved by: jhb (mentor)
MFC after: 2 weeks
The feature has been standard for a while in UFS as a means to reduce
fragmentation, therefore maintaining consistent performance with
filesystem aging. This is also very similar to what ext4 calls
"delayed allocation".
In his 2010 GSoC, Zheng Liu ported and benchmarked the missing
FANCY_REALLOC code to find more consistent performance improvements than
with the preallocation approach.
PR: 159233
Author: Zheng Liu <gnehzuil AT SPAMFREE gmail DOT com>
Sponsored by: Google Inc.
Approved by: jhb (mentor)
MFC after: 2 weeks
This removes the obfuscations mentioned in ext2_readwrite and
places the clustering funtion in a location similar to other
UFS-based implementations.
No performance or functional changeses are expected from
this move.
PR: kern/159232
Suggested by: bde
Approved by: jhb (mentor)
MFC after: 2 weeks
of a devfs file descriptor in devfs_close_f(). The passed in td argument
may be NULL if the close was invoked by garbage collection of open
file descriptors in pending control messages in the socket buffer of a
UNIX domain socket after it was closed.
PR: kern/151758
Submitted by: Andrey Shidakov andrey shidakov ru
Submitted by: Ruben van Staveren ruben verweg com
Reviewed by: kib
MFC after: 2 weeks
difference between fi_wgen and f_seqcount, so the change is purely
cosmetic, but it makes the code easier to understand.
Submitted by: gianni
MFC after: 2 weeks
check for a UTF-8 compliant file name. Enabling this sysctl results in
an NFSv4 server that is non-RFC3530 compliant, therefore it is not enabled
by default. However, enabling this sysctl results in NFSv3 compatible
behaviour and fixes the problem reported by "dan at sunsaturn.com"
to freebsd-current@ on Nov. 14, 2011 under the subject "NFSV4 readlink_stat".
Tested by: dan at sunsaturn.com
Reviewed by: zack
MFC after: 2 weeks
of the same lock_owner4 string. As such, the handling of cleanup
of lock_owners could be simplified. This simplification permitted
the client to do a ReleaseLockOwner operation when the process that
the lock_owner4 string represents, has exited. This permits the
server to release any storage related to the lock_owner4 string
before the associated open is closed. Without this change, it
is possible to exhaust a server's storage when a long running
process opens a file and then many child processes do locking
on the file, because the open doesn't get closed. A similar patch
was applied to the Linux NFSv4 client recently so that it wouldn't
exhaust a server's storage.
Reviewed by: zack
MFC after: 2 weeks
NFS server and reuse it for writes as well to allow writes to the backing
store to be clustered.
- Use a prime number for the size of the heuristic table (1017 is not
prime).
- Move the logic to locate a heuristic entry from the table and compute
the sequential count out of VOP_READ() and into a separate routine.
- Use the logic from sequential_heuristic() in vfs_vnops.c to update the
seqcount when a sequential access is performed rather than just
increasing seqcount by 1. This lets the clustering count ramp up
faster.
- Allow for some reordering of RPCs and if it is detected leave the current
seqcount as-is rather than dropping back to a seqcount of 1. Also,
when out of order access is encountered, cut seqcount in half rather than
dropping it all the way back to 1 to further aid with reordering.
- Fix the new NFS server to properly update the next offset after a
successful VOP_READ() so that the readahead actually works.
Some of these changes came from an earlier patch by Bjorn Gronwall that was
forwarded to me by bde@.
Discussed with: bde, rmacklem, fs@
Submitted by: Bjorn Gronwall (1, 4)
MFC after: 2 weeks
vnode locking for read, readdir, readlink, getattr and access.
It is hoped that this will improve server performance for these
operations, since they will no longer be serialized for a given
file/vnode.
- Don't deduct wired pages from total usable counts because it does not
make any sense. To make things worse, on systems where swap size is
smaller than physical memory and use a lot of wired pages (e.g. ZFS),
tmpfs can suddenly have free space of 0 because of this;
- Count cached pages as available; [1]
- Don't count inactive pages as available, technically we could but that
might be too aggressive; [1]
[1] Suggested by kib@
MFC after: 1 week
client. This does not change the client's behaviour, but prepares
the code so that nfsrpc_rellockown() can be called elsewhere in a
future commit.
MFC after: 2 weeks
head nfsc_defunctlockowner. This patch simply removes the code that
loops through this always empty list, since the code no longer does
anything useful. It should not have any effect on the client's
behaviour.
MFC after: 2 weeks
nullfs. The problem is that resulting vnode is only required to be
held on return from the successfull call to vop, instead of being
referenced.
Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or like.
Change the interface for VOP_VPTOCNP(), now the dvp must be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(),
if any, should have no trouble with the fix.
Tested by: pho
Reviewed by: mckusick
MFC after: 3 weeks (subject of re approval)
for regular files. Since other file types don't write into the
buffer cache, calling ncl_flush() is almost a no-op. However, it does
clear the NMODIFIED flag and this shouldn't be done by nfs_fsync() for
directories.
MFC after: 2 weeks
before the nfs_decode_args() call in the new NFS client, so
that a specfied command line value won't be overwritten.
Also, modify the calculation for small values of desiredvnodes
to avoid an unusually large value or a divide by zero crash.
It seems that the default value for nm_wcommitsize is very
conservative and may need to change at some time.
PR: kern/159351
Submitted by: onwahe at gmail.com (earlier version)
Reviewed by: jhb
MFC after: 2 weeks
-- highly experimental even. So far the closest to a bug in TMPFS that people
have gotten to relates to how ZFS can take away from the memory that TMPFS
needs. One can argue that such is not a bug in TMPFS. Irrespective, even if
there is a bug here and there in TMPFS, it's not in our own advantage to
scare people away from using TMPFS. I for one have been using it, even with
ZFS, very successfully.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
file descriptor drops to zero out of _fdrop() and into devfs_close_f()
as it is only relevant for devfs file descriptors.
Reviewed by: kib
MFC after: 1 week
failure (the getnewvnode cannot return an error). In this case, the
null_insmntque_dtr() already unlocked the reclaimed vnode, so VOP_UNLOCK()
in the nullfs_mount() after null_nodeget() failure is wrong.
Tested by: pho
MFC after: 1 week
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.
Document the changes to flags field to only require the page lock.
Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.
Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)
which requires that ni_strictrelative be set to 0, post-r224810.
Tested by: swills (earlier version), geo dot liaskos at gmail.com
Approved by: re (kib)
an attempt to do an Open operation on any type of file other
than VREG is done. A recent discussion on the IETF working group's
mailing list (nfsv4@ietf.org) decided that NFSERR_SYMLINK
should be returned for all non-regular files and not just symlinks,
so that the Linux client would work correctly.
This change does not affect the FreeBSD NFSv4 client and is not
believed to have a negative effect on other NFSv4 clients.
Reviewed by: zkirsch
Approved by: re (kib)
MFC after: 2 weeks
to resolve the path of the text vnode of the process. The behaviour is
very confusing for any consumer of the procfs, in particular, java.
Reported and tested by: bf
MFC after: 2 weeks
Approved by: re (bz)
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.
Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)
r224086 added "goto out"-style error handling to nfssvc_nfsd(), in order
to reliably call NFSEXITCODE() before returning. Our Capsicum changes,
based on the old "return (error)" model, did not merge nicely.
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.
Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc
This was reported to the mailing list freebsd-net@freebsd.org
on July 21, 2011 under the subject "LOR with nfsclient sillyrename".
The LOR occurred when nfs_inactive() called vrele(sp->s_dvp)
while holding the vnode lock on the file in s_dvp. This patch
modifies the client so that it performs the vrele(sp->s_dvp)
as a separate task to avoid the LOR. This fix was discussed
with jhb@ and kib@, who both proposed variations of it.
Tested by: pho, jlott at averesystems.com
Submitted by: jhb (earlier version)
Reviewed by: kib
Approved by: re (kib)
MFC after: 2 weeks
recursive vnode lock on the directory for the case where the
new file name is in the same directory as the old one. The patch
handles this as a special case, recognized by the new directory
having the same file handle as the old one and just VREF()s the old
dir vnode for this case, instead of doing a second VFS_FHTOVP() to get it.
This is required so that the server will work for file systems like
msdosfs, that do not support recursive vnode locking.
This problem was discovered during recent testing by pho@
when exporting an msdosfs file system via the new NFS server.
Tested by: pho
Reviewed by: zkirsch
Approved by: re (kib)
MFC after: 2 weeks
failed after the file was created in nfs_create(). This would
probably only happen during a forced dismount. The old NFS client
does have a vput() for this case. Detected by pho during recent
testing, where an open syscall returned with a vnode still locked.
Tested by: pho
Approved by: re (kib)
MFC after: 2 weeks
32 bits to 64 bits and eliminates the unused mnt_xflag field. The
existing mnt_flag field is completely out of bits, so this update
gives us room to expand. Note that the f_flags field in the statfs
structure is already 64 bits, so the expanded mnt_flag field can
be exported without having to make any changes in the statfs structure.
Approved by: re (bz)
for the remove and rename operations. Some NFSv4 servers will
report NFSERR_GRACE for these operations. This patch changes
the behaviour of the client so that it handles NFSERR_GRACE
like NFSERR_DELAY for non-state related operations like
remove and rename. It also exempts the delegreturn operation
from handling within newnfs_request() for NFSERR_DELAY/NFSERR_GRACE
so that it can handle NFSERR_GRACE in the same manner as before.
This problem was resolved thanks to discussion with bfields at fieldses.org.
The problem was identified at the recent NFSv4 ineroperability
bakeathon.
MFC after: 2 weeks
Isilon has the concept of an in-memory exit-code ring that saves the last exit
code of a function and allows for stack tracing. This is very helpful when
debugging tough issues.
This patch is essentially a no-op for BSD at this point, until we upstream
the dexitcode logic itself. The patch adds DEXITCODE calls to every NFS
function that returns an errno error code. A number of code paths were also
reorganized to have single exit paths, to reduce code duplication.
Submitted by: David Kwan <dkwan@isilon.com>
Reviewed by: rmacklem
Approved by: zml (mentor)
MFC after: 2 weeks
r179247 introduced a possibility of devfs_allocv() returning spurious
ENOENT. If the vnode is selected by vnlru daemon for reclamation, then
devfs_allocv() can get ENOENT from vget() due to devfs_close() dropping
vnode lock around the call to cdevsw d_close method.
Use LK_RETRY in the vget() call, and do some part of the devfs_reclaim()
work in devfs_allocv(), clearing vp->v_data and de->de_vnode. Retry the
allocation of the vnode, now with de->de_vnode == NULL.
The check vp->v_data == NULL at the start of devfs_close() cannot be
affected by the change, since vnode lock must be held while VI_DOOMED
is set, and only dropped after the check.
Reported and tested by: Kohji Okuno <okuno.kohji jp panasonic com>
Reviewed by: attilio
MFC after: 3 weeks
loop in nfscl_getcl() when a forced dismount is in progress,
because nfsv4_lock() will return 0 without sleeping when
MNTK_UNMOUNTF is set.
This patch fixes it so it won't loop calling nfsv4_lock()
for this case.
MFC after: 2 weeks
This comment refers to CAP_NT_SMBS, which does not exist; it should refer to SMB_CAP_NT_SMBS.
Fixing this comment makes it easier for people interested in Capsicum to grep around for
capability rights, whose identifiers are of the form 'CAP_[A-Z_]'.
Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
multiple instances of the same lock_owner when a process both
inherited an open file descriptor plus opened the same file itself.
Since some NFSv4 servers cannot handle multiple instances of
the same lock_owner string, this patch changes the algorithm
used by nfscl_getopen() in the new NFSv4 client to keep that
from happening. The new algorithm is simpler, since there is
no longer any need to ascend the process's parentage tree because
all NFSv4 Closes for a file are done at VOP_INACTIVE()/VOP_RECLAIM(),
making the Opens indistinct w.r.t. use with Lock Ops.
This problem was discovered at the recent NFSv4 interoperability
Bakeathon.
MFC after: 2 weeks
to the lock_owner4 string that goes on the wire. Also, add
code to do a ReleaseLockOwner Op on the lock_owner4 string
before a Close. Apparently not all NFSv4 servers handle multiple
instances of the same lock_owner4 string, at least not in a
compatible way. This patch avoids having multiple instances,
except for one unusual case, which will be fixed by a future commit.
Found at the recent NFSv4 interoperability Bakeathon.
Tested by: tdh at excfb.com
MFC after: 2 weeks
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.
This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.
Update all of the existing assertions on pmap_remove_all() to reflect this
change.
Reviewed by: kib
mode attribute in as 0 when doing writes. The change adds
the Mode attribute plus the others except Owner and Owner_group
to the list requested by the NFSv4 Write Operation. This fixed
a problem where an executable file built by "cc" would get mode
0111 instead of 0755 for some NFSv4 servers.
Found at the recent NFSv4 interoperability Bakeathon.
Tested by: tdh at excfb.com
MFC after: 2 weeks
server replied NFS3ERR_JUKEBOX/NFS4ERR_DELAY to an rpc.
This affected both NFSv3 and NFSv4. Found during testing
at the recent NFSv4 interoperability Bakeathon.
MFC after: 2 weeks
was used for doing a mount when performing system operations
on AUTH_SYS mounts. This resolved an issue when mounting
a Linux server. Found during testing at the recent
NFSv4 interoperability Bakeathon.
MFC after: 2 weeks
operations while traversing non-exported file systems. This is
required for some non-FreeBSD clients to do NFSv4 mounts. Found during
the recent NFSv4 interoperability Bakeathon.
MFC after: 2 weeks
lock the mutex when manipulating rc_flag in the DRC cache.
This is believed to fix a hung server that was reported
to the freebsd-fs@ list on June 9 under the subject heading
"New NFS server stress test hang", where all the threads
were waiting for the RC_LOCKED flag to clear.
Tested by: jwd at slowblink.com
MFC after: 2 weeks
the NFS subsystems use five of the rpcsec_gss/kgssapi entry points,
but since it was not obvious which others might be useful, all
nineteen were included. Basically the nineteen entry points are
set in a structure called rpc_gss_entries and inline functions
defined in sys/rpc/rpcsec_gss.h check for the entry points being
non-NULL and then call them. A default value is returned otherwise.
Requested by rwatson.
Reviewed by: jhb
MFC after: 2 weeks
cloned from the old NFS client, plus additions for NFSv4. A
review of this code is in progress, however it was felt by the
reviewer that it could go in now, before code slush. Any changes
required by the review can be committed as bug fixes later.
should be ok, since the client now delays NFSv4 Close operations
until VOP_INACTIVE()/VOP_RECLAIM(). As such, there should be no
risk that the NFSv4 Open is closed while an associated byte range lock
still exists.
Tested by: avg
MFC after: 2 weeks
"p_leader" for the "id" for POSIX byte range locking. I think
this would only have affected processes created by rfork(2)
with the RFTHREAD flag specified. This patch fixes that by
passing the "id" down through the various functions from
nfs_advlock().
MFC after: 2 weeks