both node pointer and name component. This does the right thing for
hardlinks to the same node in the same directory.
Submitted by: Yoshihiro Ota <ota j email ne jp>
PR: kern/131356
MFC after: 2 weeks
sequence of pathname components. We walk the list building a string in
the caller's passed in buffer. Currently this only handles path names
in CS8 (character set 8) as that is what mkisofs generates for UDF images.
MFC after: 1 month
- Add a separate set of vnode operations that inherits from the fifo ops
and use it for fifo nodes.
- Add a VOP_SETATTR() method that allows setting the size (by silently
ignoring the requests) of fifos. This is to allow O_TRUNC opens of
fifo devices (e.g. I/O redirection in shells using ">").
- Add a VOP_PRINT() handler while I'm here.
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
from vn_printf().
So we are no longer interested in the error returned from
the *fs_doio() functions. With that we can remove the
error variable as its value is unused now.
Submitted by: Christoph Mallon christoph.mallon@gmx.de
After running a `make buildkernel', I noticed most of the Giant locks in
sysctl are only caused by a very small amount of sysctl's:
- sysctl.name2oid. This one is locked by SYSCTL_LOCK, just like
sysctl.oidfmt.
- kern.ident, kern.osrelease, kern.version, etc. These are just constant
strings.
- kern.arandom, used by the stack protector. It is already protected by
arc4_mtx.
I also saw the following sysctl's show up. Not as often as the ones
above, but still quite often:
- security.jail.jailed. Also mark security.jail.list as MPSAFE. They
don't need locking or already use allprison_lock.
- kern.devname, used by devname(3), ttyname(3), etc.
This seems to reduce Giant locking inside sysctl by ~75% in my primitive
test setup.
pathname lookups.
- Remove 'i_offset' and 'i_ino' from the ISO node structure and replace
them with local variables in the lookup routine instead.
- Cache a copy of 'i_diroff' for use during a lookup in a local variable.
- Save a copy of the found directory entry in a malloc'd buffer after a
successfull lookup before getting the vnode. This allows us to release
the buffer holding the directory block before calling vget() which
otherwise resulted in a LOR between "bufwait" and the vnode lock.
- Use an inlined version of vn_vget_ino() to handle races with ..
lookups. I had to inline the code here since cd9660 uses an internal
vget routine to save a disk I/O that would otherwise re-read the
directory block.
- Honor the requested locking flags during lookups to allow for shared
locking.
- Honor the requested locking flags passed to VFS_ROOT() and VFS_VGET()
similar to UFS.
- Don't make every ISO 9660 vnode hold a reference on the vnode of the
underlying device vnode of the mountpoint. The mountpoint already
holds a suitable reference.
Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().
We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.
Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.
I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.
without corresponding number of fifo_open(). This causes assertion
failure in fifo_close() due to vp->v_fifoinfo being NULL for kernel
with INVARIANTS, or NULL pointer dereference otherwise. In fact, we may
ignore excess calls to fifo_close() without bad consequences.
Turn KASSERT() into the return, and print warning for now.
Tested by: pho
Reviewed by: rwatson
MFC after: 2 weeks
to be caused by a metadata corruption that occurs quite often after
unplugging a pendrive during write activity.
Reviewed by: scottl
Approved by: rwatson (mentor)
Sponsored by: FreeBSD Foundation
pfs_vncache_free() that removes pvd from the list, while it is not yet
put on the list.
Prevent the invalid removal from the list by clearing pvd_next and
pvd_prev for the newly allocated pvd, and only move pfs_vncache list
head when the pvd was at the head.
Suggested and approved by: des
MFC after: 2 weeks
compare map->timestamp with saved timestamp after map read lock is
reacquired, not with saved timestamp + 1. The only consequence of the +1
was unconditional lookup of the next map entry, though.
Tested by: pho
Approved by: des
MFC after: 2 weeks
may need to lock arbitrary vnodes, causing either lock order reversal or
recursive vnode lock acquisition.
Tested by: pho
Approved by: des
MFC after: 2 weeks
do pfs_vncache_alloc() for the same pfs_node and pid. In this case, we
could end up with two vnodes for the pair. Recheck the cache under the
locked pfs_vncache_mutex after all sleeping operations are done [1].
This case mostly cannot happen now because pseudofs uses exclusive vnode
locking for lookup. But it does drop the vnode lock for dotdot lookups,
and Marcus' pseudofs_vptocnp implementation is vulnerable too.
Do not call free() on the struct pfs_vdata after insmntque() failure,
because vp->v_data points to the structure, and pseudofs_reclaim()
frees it by the call to pfs_vncache_free().
Tested by: pho [1]
Approved by: des
MFC after: 2 weeks
anything other than 0. Make it so. This fixes
"panic: VOP_STRATEGY failed bp=0xc320dd90 vp=0xc3b9f648",
encountered when writing to an orphaned filesystem. Reason
for the panic was the following assert:
KASSERT(i == 0, ("VOP_STRATEGY failed bp=%p vp=%p", bp, bp->b_vp));
at vfs_bio:bufstrategy().
Reviewed by: scottl, phk
Approved by: rwatson (mentor)
Sponsored by: FreeBSD Foundation
message for r185765.
Noted by: rdivacky
Requested by: des
Commit message for r185765 should be:
In procfs map handler, and in linprocfs maps handler, do not call
vn_fullpath() while having vm map locked. This is done in anticipation
of the vop_vptocnp commit, that would make vn_fullpath sometime
acquire vnode lock.
Also, in linprocfs, maps handler already acquires vnode lock.
No objections from: des
MFC after: 2 week
sbuf instead of doing uiomove. This allows for reads from non-zero
offsets to work.
Patch is forward-ported des@' one, and was adopted to current code
by dchagin@ and me.
Reviewed by: des (linprocfs part)
PR: kern/101453
MFC after: 1 week
ask NDINIT to return a locked vnode instead of letting it drop the
lock and return a referenced vnode and then relock the vnode a few
lines down. This matches the behavior of other filesystem mount routines.
vnode in question does not need to be held. All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.
In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).
For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.
MFC after: 1 month
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.
The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.
To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.
As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.
Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.
The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.
Sponsored by: Isilon Systems
MFC after: 1 month
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.
Approved by: rwatson (mentor)
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()
and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.
As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP
Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
sbuf instead of doing uiomove. This allows for reads from non-zero
offsets to work.
Patch is forward-ported des@' one, and was adopted to current code
by dchagin@ and me.
Reviewed by: des (linprocfs part)
PR: kern/101453
MFC after: 1 week
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
filedescriptor into it. Make sure that td_fpop is NULL when calling
d_mmap from dev_pager_getpages().
Change guards against td_fpop field being non-NULL with private state
for another device, and against sudden clearing the td_fpop. This
could occur when either a driver method calls another driver through
the filedescriptor operation, or a page fault happen while driver is
writing to a memory backed by another driver.
Noted by: rwatson
Tested by: rnoland
MFC after: 3 days
When I changed kern_conf.c three months ago I made device unit numbers
equal to (unneeded) device minor numbers. We used to require
bitshifting, because there were eight bits in the middle that were
reserved for a device major number. Not very long after I turned
dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
The unit2minor() and minor2unit() macro's were no-ops.
We'd better not remove these four macro's from the kernel, because there
is a lot of (external) code that may still depend on them. For now it's
harmless to remove all invocations of unit2minor() and minor2unit().
Reviewed by: kib
and bcmp are not the same thing. 'man bcmp' states that the return is
"non-zero" if the two byte strings are not identical. Where as,
'man memcmp' states that the return is the "difference between the
first two differing bytes (treated as unsigned char values" if the
two byte strings are not identical.
So provide a proper memcmp(9), but it is a C implementation not a tuned
assembly implementation. Therefore bcmp(9) should be preferred over memcmp(9).
In the MPSAFE TTY branch I noticed the vfs timestamps inside devfs were
allocated with 0, where the getattr() routine bumps the timestamps to
boottime if the value is below 3600. The reason why it has been designed
like this, is because timestamps during boot are likely to be invalid.
This means that device nodes that are created on demand (posix_openpt())
have timestamps with a value of boottime, which is not what we want.
Solve this by calling vfs_timestamp() inside devfs_alloc().
Discussed with: kib
initialize the vattr structure in VOP_GETATTR() with VATTR_NULL(),
vattr_null() or by zeroing it. Remove these to allow preinitialization
of fields work in vn_stat(). This is needed to get birthtime initialized
correctly.
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Discussed on: freebsd-fs
MFC after: 1 month
NODEV is more appropriate when va_rdev doesn't have a meaningful value.
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Suggested by: bde
Discussed on: freebsd-fs
MFC after: 1 month
initialize va_vaflags and va_spare because they are not part of the
VOP_GETATTR() API. Also don't initialize birthtime to ctime or zero.
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Reviewed by: bde
Discussed on: freebsd-fs
MFC after: 1 month
replaced by file relative sector numbers as the buffer block number when
zero-padding a file during extension. Revert the change, it causes wrong
blocks filled with zeroes on seeking beyond end of file.
PR: kern/47628
Submitted by: tegge
MFC after: 3 days
The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:
- Improved driver model:
The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.
If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.
- Improved hotplugging:
With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).
The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.
- Improved performance:
One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.
Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.
Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
to global hostname and domainname variables. Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout(). A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.
Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.
MFC after: 3 weeks
performed. Otherwise if ruleset is used by given mountpoint and is empty
it's freed by devfs_ruleset_reap and pointer becomes bogus.
Submitted by: Mateusz Guzik <mjguzik@gmail.com>
PR: kern/124853
needed to promote cdev to cdev_priv, the si_priv pointer was followed.
Use member2struct() to calculate address of the wrapping cdev_priv.
Rename si_priv to __si_reserved.
Tested by: pho
Reviewed by: ed
MFC after: 2 weeks
Except for the case where we use the cloner library (clone_create() and
friends), there is no reason to enforce a unique device minor number
policy. There are various drivers in the source tree that allocate unr
pools and such to provide minor numbers, without using them themselves.
Because we still need to support unique device minor numbers for the
cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's
that are used in combination with the cloner library should be marked
with this flag to make the cloning work.
This means drivers can now freely use si_drv0 to store their own flags
and state, making it effectively the same as si_drv1 and si_drv2. We
still keep the minor() and dev2unit() routines around to make drivers
happy.
The NTFS code also used the minor number in its hash table. We should
not do this anymore. If the si_drv0 field would be changed, it would no
longer end up in the same list.
Approved by: philip (mentor)
The while loop that is assumed to initialize the uio_off later, may
be not entered at all, causing uninitialized value to be returned in
uio->uio_offset.
PR: 122925
Submitted by: Jaakko Heinonen <jh saunalahti fi>
MFC after: 1 weeks
the dm_lock is held while the newly allocated vnode is locked. Since no
other threads may try to lock the new vnode yet, the LOR there cannot
result in the deadlock.
Shut down the witness warning to note this fact.
Tested by: pho
Prodded by: attilio
As discussed with Robert Watson and John Baldwin, it would be better if
PTY's are created with proper permissions, turning grantpt() into a
no-op.
Bypassing security frameworks like MAC by passing NOCRED to
VOP_SETATTR() will only make things more complex.
Approved by: philip (mentor)
In the mpsafetty branch, PTY's are allocated through the posix_openpt()
system call. The controller side of a PTY now uses its own file
descriptor type (just like sockets, vnodes, pipes, etc).
To remain compatible with existing FreeBSD and Linux C libraries, we can
still create PTY's by opening /dev/ptmx or /dev/ptyXX. These nodes
implement d_fdopen(). Devfs has been slightly changed here, to allow
finit() to be called from d_fdopen().
The routine grantpt() has also been moved into the kernel. This routine
is a little odd, because it needs to bypass standard UNIX permissions.
It needs to change the owner/group/mode of the slave device node, which
may often not be possible. The old implementation solved this by
spawning a setuid utility.
When VOP_SETATTR() is called with NOCRED, devfs_setattr() dereferences
ap->a_cred, causing a kernel panic. Change the de_{uid,gid,mode} code to
allow changes when a->a_cred is set to NOCRED.
Approved by: philip (mentor)
- Use proper synhronization primitives to protect the internal fdesc node cache
used in fdescfs.
- Properly initialize and uninitalize hash.
- Remove unused functions.
Since fdescfs might recurse on itself, adding proper locking to it needed some
tricky workarounds in some parts to make it work. For instance, a descriptor in
fdescfs could refer to an open descriptor to itself, thus forcing the thread to
recurse on vnode locks. Because of this, other race conditions also had to be
fixed.
Tested by: pho
Reviewed by: kib (mentor)
Approved by: kib (mentor)
sense to loop trying to vget() the vnode again.
PR: 122977
Submitted by: Arthur Hartwig <arthur.hartwig nokia com>
Tested by: pho
Reviewed by: jhb
MFC after: 1 week
The patch does not change the cdevsw KBI. Management of the data is
provided by the functions
int devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr);
int devfs_get_cdevpriv(void **datap);
void devfs_clear_cdevpriv(void);
All of the functions are supposed to be called from the cdevsw method
contexts.
- devfs_set_cdevpriv assigns the priv as private data for the file
descriptor which is used to initiate currently performed driver
operation. dtr is the function that will be called when either the
last refernce to the file goes away, the device is destroyed or
devfs_clear_cdevpriv is called.
- devfs_get_cdevpriv is the obvious accessor.
- devfs_clear_cdevpriv allows to clear the private data for the still
open file.
Implementation keeps the driver-supplied pointers in the struct
cdev_privdata, that is referenced both from the struct file and struct
cdev, and cannot outlive any of the referee.
Man pages will be provided after the KPI stabilizes.
Reviewed by: jhb
Useful suggestions from: jeff, antoine
Debugging help and tested by: pho
MFC after: 1 month
the mentioned PR:
- bounds check time->month as it is used as an array index
- fix usage of time->month as array index (month is 1-12)
- fix calculation based on time->day (day is 1-31)
- fix the speedup code as it doesn't calculate correct timestamps before
the year 2000 and reduce the number of calculation in the year-by-year code
- speedup month calculations by replacing the array content with cumulative
values
- add microseconds calculation
- fix an endian problem
PR: kern/97786
Submitted by: Andriy Gapon <avg@topspin.kiev.ua>
Reviewed by: scottl (earlier version)
Approved by: emax (mentor)
MFC after: 1 week
VSOCK has been added as cache target. Now they process
not only VDIR but also VSOCK.
- fixed panic issue caused by cache incorrect free process
by "umount -f"
Submitted by: Masanori OZAWA <ozawa@ongs.co.jp>
MFC after: 1 week
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.
Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.
The implementation of the lf_purgelocks() is submitted by dfr.
Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
alredy do and direct accesses to the sleepqueue(9) primitive.
In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the correspective per-thread
shared lockmgrs counter.
This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw(). These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock. In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.
The patch drops the support for WITNESS atm, but it will be probabilly
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wakeup, are in the runqueue they can contend the lock with
the exclusive drainer. This is hard to be fixed but the now committed
code mitigate this issue a lot better than the (past) CVS version.
In addition assertive KA_HELD and KA_UNHELD have been made mute
assertions because they are dangerous and they will be nomore supported
soon.
In order to avoid namespace pollution, stack.h is splitted into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI. In this way, newly added _lockmgr.h can
just include _stack.h.
Kernel ABI results heavilly changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.
Tested by: kris, pho, jeff, danger
Reviewed by: jeff
Sponsored by: Google, Summer of Code program 2007
the fdesc_allocvp(). The caller of the fdesc_allocvp() expects that the
returned vnode is not reclaimed. Do lock the vnode exclusive and drop
the lock after.
Reported by: pho
Reviewed by: jeff
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.
Highlights include:
* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.
* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.
* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.
* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.
* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.
* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.
Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.
Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)
obtain the reference. In particular, this fixes the panic reported in
the PR. Remove the comments stating that this needs to be done.
PR: kern/119422
MFC after: 1 week
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() instead than BUF_LOCKWAITERS() in order to check
for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used
This patch breaks KPI so __FreBSD_version will be bumped and manpages
updated by further commits. Additively, 'struct buf' changes results in
a disturbed ABI also.
[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.
[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
always curthread.
As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.
Tested by: Andrea Barberio <insomniac at slackware dot it>
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode
In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock
Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.
macros. The only semantic change was the need to add a vc_opened field
to struct vcomm since we can no longer use the request queue returning
to an uninitialized state to hold whether or not the device is open.
MFC after: 1 month
fundamentally fairly confused about how signals work and when it is
appropriate for upcalls to be interrupted. In particular, we should
be exempting certain upcalls from interruption, we should not always
eventually time out sleeping on a upcall, and we should not be
interrupting the sleep for certain signals that we currently are
(including SIGINFO). This code needs to be reworked in the style of
NFS interruptible mounts.
MFC after: 1 month
access cache improvements:
- Flush just access control state on CODA_PURGEUSER, not the full
namecache for /coda.
- When replacing a fid on a cnode as a result of, e.g.,
reintegration after offline operation, we no longer need to
purge the namecache entries associated with its vnode.
MFC after: 1 month
modeled on the access cache found in NFS, smbfs, and the Linux coda
module. This is a positive access cache of a single entry per file,
tracking recently granted rights, but unlike NFS and smbfs,
supporting explicit invalidation by the distributed file system.
For each cnode, maintain a C_ACCCACHE flag indicating the validity
of the cache, and a cached uid and mode tracking recently granted
positive access control decisions.
Prefer the cache to venus_access() in VOP_ACCESS() if it is valid,
and when we must fall back to venus_access(), update the cache.
Allow Venus to clear the access cache, either the whole cache on
CODA_FLUSH, or just entries for a specific uid on CODA_PURGEUSER.
Unlike the Coda module on Linux, we don't flush all entries on a
user purge using a generation number, we instead walk present
cnodes and clear only entries for the specific user, meaning it is
somewhat more expensive but won't hit all users.
Since the Coda module is agressive about not keeping around
unopened cnodes, the utility of the cache is somewhat limited for
files, but works will for directories. We should make Coda less
agressive about GCing cnodes in VOP_INACTIVE() in order to improve
the effectiveness of in-kernel caching of attributes and access
rights.
MFC after: 1 month
VFS namecache, as is done by the Coda module on Linux. Unlike the Coda
namecache, the global VFS namecache isn't tagged by credential, so use
ore conservative flushing behavior (for now) when CODA_PURGEUSER is
issued by Venus.
This improves overall integration with the FreeBSD VFS, including
allowing __getcwd() to work better, procfs/procstat monitoring, and so
on. This improves shell behavior in many cases, and improves ".."
handling. It may lead to some slowdown until we've implemented a
specific access cache, which should net improve performance, but in the
mean time, lookup access control now always goes to Venus, whereas
previously it didn't.
MFC after: 1 month
When ntfs_ntput() reaches 0 in the refcount the inode lockmgr is not
released and directly destroyed. Fix this by unlocking the lockmgr() even
in the case of zero-refcount.
Reported by: dougb, yar, Scot Hetzel <swhetzel at gmail dot com>
Submitted by: yar
tree, restyle everything but coda.h (which is more explicitly shared
across systems) into a closer approximation to style(9).
Remove a few more unused function prototypes.
Add or clarify some comments.
MFC after: 1 month
- Rename print_vattr to coda_print_vattr and make static, rename
print_cred to coda_print_cred.
- Remove unused coda_vop_nop.
- Add XXX comment because coda_readdir forwards to the cache vnode's
readdir rather than venus_readdir, and annotate venus_readdir as
unused.
- Rename vc_nb_* to vc_*.
- Use d_open_t, d_close_t, d_read_t, d_write_t, d_ioctl_t and d_poll_t
for prototyping vc_* as that is the intent, don't use our own
definitions.
- Rename coda_nb_statfs to coda_statfs, rename NB_SFS_SIZ to
CODA_SFS_SIZ.
- Replace one more OBE reference to NetBSD with a reference to FreeBSD.
- Tidy up a little vertical whitespace here and there.
- Annotate coda_nc_zapvnode as unused.
- Remove unused vcodattach.
- Annotate VM_INTR as unused.
- Annotate that coda_fhtovp is unused and doesn't match the FreeBSD
prototype, so isn't hooked up to vfs_fhtovp. If we want NFS export of
Coda to work someday, this needs to be fixed.
- Remove unused getNewVnode.
- Remove unused coda_vget, coda_init, coda_quotactl prototypes.
MFC after: 1 month
the mountpoint for a specific device. This was implemented incorrectly,
a bad idea in a fundamental sense, and also never used, so presumably
a long-idle debugging function.
MFC after: 1 month
for vop_bmap; delete the existing stub that returned either EINVAL
or EOPNOTSUPP, and had unreachable calls to VOP_BMAP on the cache
vnode.
MFC after: 1 month
then later to FreeBSD. Update various NetBSD-related comments: in some
cases delete them because they don't appply, in others update to say
FreeBSD as they still apply but in FreeBSD (and might for that matter
no longer apply on NetBSD), and flag one case where I'm not sure
whether it applies.
MFC after: 1 month
locks of those vnodes. Probably, Coda should do the same lock sharing/
pass-through that is done for nullfs, but in the mean time this ensures
that locks are adequately held to prevent corruption of data structures
in the cache file system.
Assuming most operations came from the top layer of Coda and weren't
performed directly on the cache vnodes, in practice this corruption was
relatively unlikely as the Coda vnode locks were ensuring exclusive
access for most consumers.
This causes WITNESS to squeal like a pig immediately when Coda is used,
rather than waiting until file close; I noticed these problems because
of the lack of said squealing.
MFC after: 1 month
vget() calls using inode numbers to query the root of /coda, which is not
needed since we now cache the root vnode with the mountpoint.
MFC after: 1 month
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only acquire curthread as argument; this will lead in axing the additional
argument from both functions, making the code cleaner.
Reviewed by: jeff, kib
bug that caused us to reintroduce it is believed to be fixed, and Kris
says he no longer sees problems with fifofs in highly parallel builds.
If this works out, we'll MFC it for 7.1.
MFC after: 3 months
Pointed out by: kris
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present
Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h
KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.
Tested by: matteo
to files, such as ktrace output, under CODA_VERBOSE. Otherwise, each
such call to VOP_WRITE() results in a kernel printf.
MFC after: 3 days
Obtained from: NetBSD
- Don't specify vnode operations for mknod, lease, and advlock--let them
fall through to vop_default.
- Implement vop_default with &default_vnodeops, rather than with VOP_PANIC,
so that unimplemented vnode operations are handled in more sensible ways
than panicking, such as EOPNOTSUPP on ACL queries generated by bsdtar,
or mknod.
MFC after: 3 days
fill out all fields, just fill out the ones the file system knows
about. Among other things, this causes the outpuf of "mount" and
"df" to make quite a bit more sense as /dev/cfs0 is specified as the
mountfrom name.
MFC after: 3 days
vnodes during coda_unmount() in order to detect errant use of them
after the vnode references may no longer be valid.
No need to clear the VV_ROOT flag on mi_rootvp flag (especially after
the vnode reference is no longer valid) as this isn't done on other
file systems.
MFC after: 3 days
and then release it when it is closed: we rely on the caller to keep the
vnode around with a valid reference. This avoids vrele() destroying the
vnode vop_close() is being called from during a call to vop_close(), and
a crash due to lockmgr recursing the vnode lock when a Coda unmount
occurs.
MFC after: 3 days
Move all extern variable definitions to associated .h files, move some
extern variable definitions between include files to place them more
appropriately.
MFC after: 3 days
Coda vnode derived from it, in the style of nullfs. This allows files
in the Coda file system to be memory-mapped, such as with execve(2) or
mmap(2).
MFC after: 3 days
Reported by: Rune <u+openafsdev-sr55 at chalmers dot se>
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.
KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.
Manpage and FreeBSD_version will be updated through further commits.
As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
Now, lockmgr() function can only be called passing curthread and the
KASSERT() is upgraded according with this.
In order to support on-the-fly owner switching, the new function
lockmgr_disown() has been introduced and gets used in BUF_KERNPROC().
KPI, so, results changed and FreeBSD version will be bumped soon.
Differently from previous code, we assume idle thread cannot try to
acquire the lockmgr as it cannot sleep, so loose the relative check[1]
in BUF_KERNPROC().
Tested by: kris
[1] kib asked for a KASSERT in the lockmgr_disown() about this
condition, but after thinking at it, as this is a well known general
rule, I found it not really necessary.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.
Submitted by: rwatson
add correct locking to the operation of unmounting.
This will prevent debugging kernels from panicking if mounting a
non-hpfs partition (I'm not sure if this can be a problem with a
successful mounting operation though).
MFC: 3 days
- Introduce a finit() which is used to initailize the fields of struct file
in such a way that the ops vector is only valid after the data, type,
and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires
the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.
Tested by: kris, pho
This option just adds complexity and the new implementation no longer
will support it, so axing it now that it is unused is probabilly the
better idea.
FreeBSD version is bumped in order to reflect the KPI breakage introduced
by this patch.
In the ports tree, kris found that only old OSKit code uses it, but as
it is thought to work only on 2.x kernels serie, version bumping will
solve any problem.
for that argument. This will allow DDB to detect the broad category of
reason why the debugger has been entered, which it can use for the
purposes of deciding which DDB script to run.
Assign approximate why values to all current consumers of the
kdb_enter() interface.
7.2.3, bytes 0-3 and 5-15 are used to calculate the checksum of a descriptor
tag.
PR: kern/90521
Submitted by: Björn König <bkoenig@cs.tu-berlin.de>
Reviewed by: scottl
Approved by: emax (mentor)
strees2 suite, to quote his letter, this change:
1. It removes the tn_lookup_dirent stuff. I think this cannot be fixed,
because nothing protects vnode/tmpfs node between lookup is done, and
actual operation is performed, in the case the vnode lock is dropped.
At least, this is the case with the from vnode for rename.
For now, we do the linear lookup in the parent node. This has its own
drawbacks. Not mentioning speed (that could be fixed by using hash), the
real problem is the situation where several hardlinks exist in the dvp.
But, I think this is fixable.
2. The patch restores the VV_ROOT flag on the root vnode after it became
reclaimed and allocated again. This fixes MPASS assertion at the start
of the tmpfs_lookup() reported by many.
Submitted by: kib
with ACCESSPERMS. Document in mount_ntfs(8) only the nine
low-order bits of mask are used (taken from mount_msdosfs(8)).
PR: kern/114856
Submitted by: Ighighi
MFC after: 1 month
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
- markvoldirty() needs to write to underlying GEOM provider. We
have to do that *before* g_access() which sets the GEOM provider
to read-only.
- Remove dirty flag before free'ing iconv related resources. The
dirty flag removal could fail, and it is hard to revert the
iconv-free after the fail.
- Mark volume as dirty if we have failed to mark it clean for safe.
- Other style fixes to the touched functions.
This is much simpler than for ffs since there are many fewer places
where we need to choose between a delayed write and a sync write --
just 5 in msdosfs and more than 30 in ffs.
This is more complete and correct than in ffs. Several places in ffs
are are still missing the choice. ffs_update() has a layering violation
that breaks callers which want to force a sync update (mainly fsync(2)
and O_SYNC write(2)).
However, fsync(2) and O_SYNC write(2) are still more broken than in
ffs, since they are broken for default (non-sync non-async) mounts
too. Both fail to sync the FAT in all cases, and both fail to sync
the directory entry in some cases after losing a race. Async everything
is probably safer than the half-baked sync of metadata given by default
mounts.
leaving space for adding missing options. Negative options are sorted
after removing their "no" prefix, and generic options are sorted before
msdosfs-specific ones.
(except indirectly for the size pseudo-attribute). If anything deserves
a sync update, then it is ids and immutable flags, since these are
related to security, but ffs never synced these and msdosfs doesn't
support them. (ufs_setattr() only does an update in one case where
it is least needed (for timestamps); it did pessimal sync updates for
timestamps until 1998/03/08 but was changed for unlogged reasons related
to soft updates.)
Now msdosfs calls deupdat() with waitfor == 0, which normally gives a
delayed update to disk but always gives a sync update of timestamps
in core, while for ffs everything is delayed until the syncer daemon
or other activity causes an update (except for timestamps).
This gives a large optimization mainly for things like cp -p, where
attribute adjustment could easily triple the number of physical I/O's
if it is done synchronously (but cp -p to msdosfs is not as bad as
that, since msdosfs doesn't support many attributes so null adjustments
are more common, and msdosfs doesn't support ctimes so even if cp
doesn't weed out null adjustments they don't become non-null after
clobbering the ctime).
(it is established practice) and ``-o whiteout=whenneeded'' is less
disk-space using mode especially for resource restricted environments
like embedded environments. (Contributed by Ed Schouten. Thanks)
Submitted by: Masanori Ozawa <ozawa@ongs.co.jp> (unionfs developer)
Reviewed by: jeff, kensmith
Approved by: re (kensmith)
MFC after: 1 week
Some folks who have reported some issues have solved with transparent mode.
We guess it is time to change the default copy mode. The transparent-mode is
the best in most situations.
Submitted by: Masanori Ozawa <ozawa@ongs.co.jp> (unionfs developer)
Reviewed by: jeff, kensmith
Approved by: re (kensmith)
MFC after: 1 week
applications that use procfs on unionfs.
- Removed unionfs internal cache mechanism because it has
vfs_cache support instead. As a result, it just simplified code of
unionfs.
- Fixed kern/111262 issue.
Submitted by: Masanori Ozawa <ozawa@ongs.co.jp> (unionfs developer)
Reviewed by: jeff, kensmith
Approved by: re (kensmith)
MFC after: 1 week
directory itself (rather than any of its contents) is visible to the
current thread.
MFC after: 1 week
PR: kern/90063
Submitted by: john of 8192.net
Approved by: re (kensmith)
All active fields in fsi are advisory/optional, so we shouldn't do
extra work to make them valid at all times, but instead we write to
the fsi too often (we still do), and we searched for a free cluster
for fsinxtfree too often.
This commit just removes the whole search and its results, so that we
write out our in-core copy of fsinxtfree instead of writing a "fixed"
copy and clobbering our in-core copy. This saves fixing 3 bugs:
- off-by-1 error for the end of the search, resulting in fsinxtfree
not actually being adjusted iff only the last cluster is free.
- missing adjustment when no clusters are free.
- off-by-many error for the start of the search. Starting the search
at 0 instead of at (the in-core copy of) fsinxtfree did more than
defeat the reasons for existence of fsinxtfree. fsinxtfree exists
mainly to avoid having to start at 0 for just the first search per
mount, but has the side effect of reducing bias towards allocating
near cluster 0. The bias would normally only be generated by the
first search per mount (if fsinxtfree is not supported), but since
we also adjusted the in-core copy of fsinxtfree here, we were doing
extra work to maximize the bias.
Approved by: re (kensmith)
Eliminates panics due to locking issues.
Idea taken from src/sys/gnu/fs/xfs/FreeBSD/xfs_super.c.
PR: 89966, 92000, 104393
Reported by: H. Matsuo <hiroshi50000 yahoo co jp>,
Chris <m2chrischou gmail.com>,
Andrey V. Elsukov <bu7cher yandex ru>,
Jan Henrik Sylvester <me janh de>
Approved by: re (kensmith)
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.
Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)
can easily block in bread(), and then there was nothing to prevent the
static buffer (nambuf_{ptr,len,last_id}) being clobbered by another
thread.
The effects of the bug seem to have been limited to failed lookups and
mangled names in readdir(), since Giant locking provides enough
serialization to prevent concurrent calls to the functions that access
the buffer. They were very obvious for multiple concurrent tree walks,
especially with a small cluster size.
The bug was introduced in msdosfs_conv.c 1.34 and associated changes,
and is in all releases starting with 5.2.
The fix is to allocate the buffer as a local variable and pass around
pointers to it like "_r" functions in libc do. Stack use from this
is large but not too large. This also fixes a memory leak on module
unload.
Reviewed by: kib
Approved by: re (kensmith)
% mount | grep home
/dev/ad4s1e on /home (ufs, local, noatime, soft-updates)
% mount -u -o atime /home
% mount | grep home
/dev/ad4s1e on /home (ufs, local, soft-updates)
Restore this behavior for on 7.x for the following mount options:
noatime, noclusterr, noclusterw, noexec, nosuid, nosymfollow
In addition, on 7.x, the following are equivalent:
mount -u -o atime /home
mount -u -o nonoatime /home
Ideally, when we introduce new mount options, we should avoid
options starting with "no". :)
Requested by: jhb
Reported by: Karol Kwiat <karol.kwiat gmail com>, Scott Hetzel <swhetzel gmail com>
Approved by: re (bmah)
Proxy commit for: rodrigc
- LK_RETRY prohibits vget() and vn_lock() to return error.
Remove associated code. [1]
- Properly use vhold() and vdrop() instead of their unlocked
versions, we are guaranteed to have the vnode's interlock
unheld. [1]
- Fix a pseudo-infinite loop caused by 64/32-bit arithmetic
with the same way used in modern NetBSD versions. [2]
- Reorganize tmpfs_readdir to reduce duplicated code.
Submitted by: kib [1]
Obtained from: NetBSD [2]
Approved by: re (tmpfs blanket)
- Respect cnflag and don't lock vnode always as LK_EXCLUSIVE [1]
- Properly lock around tn_vnode to avoid NULL deference
- Be more careful handling vnodes (*)
(*) This is a WIP
[1] by pjd via howardsu
Thanks kib@ for his valuable VFS related comments.
Tested with: fsx, fstest, tmpfs regression test set
Found by: pho's stress2 suite
Approved by: re (tmpfs blanket)
(uio_offset < 0) since this can't happen. If this happens, then the
general code handles the problem safely (better than before for reading,
returning 0 (EOF) instead of the bogus errno EINVAL, and the same as
before for writing, returning EFBIG).
In msdosfs_read(), don't check for (uio_resid < 0). msdosfs_write()
already didn't check.
In msdosfs_read(), document in a comment our assumptions that the caller
passed a valid uio_offset and uio_resid. ffs checks using KASSERT(),
and that is enough sanity checking. In the same comment, partly document
there is no need to check for the EOVERFLOW case, unlike in ffs where this
case can happen at least in theory.
In msdosfs_write(), add a comment about why the checking of
(uio_resid == 0) is explicit, unlike in ffs.
In msdosfs_write(), check for impossibly large final offsets before
checking if the file size rlimit would be exceeded, so that we don't
have an overflow bug in the rlimit check and are consistent with ffs.
We now return EFBIG instead of EFBIG plus a SIGXFSZ signal if the final
offset would be impossibly large but not so large as to cause overflow.
Overflow normally gave the benign behaviour of no signal.
Approved by: re (kensmith) (blanket)