Provide a custom lock around initializing and tearing down EA area,
to prevent both memory leaks and double-free of it. Count the number
of EA area accessors.
Lock protocol requires either holding exclusive vnode lock to modify
i_ea_area, or shared vnode lock and owning IN_EA_LOCKED flag in i_flag.
Noted by: YAMAMOTO, Taku <taku tackymt homeip net>
Tested by: pho (previous version)
MFC after: 2 weeks
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.
Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month
address space sizes to be longs instead of ints. Specifically, the follow
values are now longs: runningbufspace, bufspace, maxbufspace,
bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace,
hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a
relatively small number (~ 44000) of buffers set in kern.nbuf would result
in integer overflows resulting either in hangs or bogus values of
hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see
such problems. There was a check for a nbuf setting that would cause
overflows in the auto-tuning of nbuf. I've changed it to always check and
cap nbuf but warn if a user-supplied tunable would cause overflow.
Note that this changes the ABI of several sysctls that are used by things
like top(1), etc., so any MFC would probably require a some gross shims
to allow for that.
MFC after: 1 month
msdosfs_unmount() and ffs_unmount() exit early after getting ENXIO.
However, dounmount() treats ENXIO as a success and proceeds with
unmounting. In effect, the filesystem gets unmounted without closing
GEOM provider etc.
Reviewed by: kib
Approved by: rwatson (mentor)
Tested by: dho
Sponsored by: FreeBSD Foundation
cleanup. Before the GEOM consumer would not have been closed.
- Bump the reference on the character device being mounted while the
associated devfs vnode is locked.
Reviewed by: kib
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.
Approved by: rwatson (mentor)
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()
and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.
As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP
Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.
Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.
Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.
Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month
Show the b_dep value for the buffer in the show buffer command.
Add a comand to dump the dirty/clean buffer list for vnode.
Reviewed by: tegge
Tested and used by: pho
MFC after: 1 month
MNT_RDONLY flag before the VFS_MOUNT() is called. In ufs_inactive()
and ufs_itimes_locked(), UFS verifies whether the fs is read-only by
checking MNT_RDONLY, but this may cause loss of the IN_MODIFIED flag
for inode on the fs being remounted rw->ro.
Introduce UFS_RDONLY() struct ufsmount' method that reports the value
of the fs_ronly. The later is set to 1 only after the remount is
finished.
Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month
dependencies. In particular, it may need this while syncing filesystem
being unmounted. Since during unmount MNTK_NOINSMNTQUE flag is set,
that could sometimes disallow insertion of the vnode into the vnode
mount list, softdep code needs to overwrite the MNTK_NOINSMNTQUE flag.
Create the ffs_vgetf() function that sets the VV_FORCEINSMQ flag for
new vnode and use it consistently from the softdep code instead of
ffs_vget().
Add the retry logic to the softdep_flushfiles() to flush the vnodes
that could be instantiated while flushing softdep dependencies.
Tested by: pho, kris
Reviewed by: tegge
MFC after: 1 month
performed with snapshot option, while the mp->mnt_opt is NULL.
Protect against NULL pointer dereference.
Noted by: Mateusz Guzik <mjguzik gmail com>
MFC after: 3 days
delete "snapshot" from the persistent mount options list.
This should fix problems with doing a mount -o snapshot of a file system, followed by
an NFS export of the same file system.
PR: 122833
Reported by: Leon Kos <leon.kos lecad fs uni-lj si>,
Jaakko Heinonen <jh saunalahti fi>
MFC after: 1 month
here, because we already do them further up in vfs_donmount() in vfs_mount.c
async -> MNT_ASYNC
force -> MNT_FORCE
multilabel -> MNT_MULTILABEL
noatime -> MNT_NOATIME
noclusterr -> MNT_NOCLUSTERR
noclusterw -> MNT_NOCLUSTERW
MFC after: 1 month
(such as 'atime' vs 'noatime'). The filesystems will always see either
'nofoo' or 'nonofoo', never plain 'foo'. As such, their list of valid
mount options should include 'nofoo' instead of 'foo'. With this fix,
you can do 'mount -u -o atime' on a FFS filesystem that isn't marked as
noatime without getting an error. You can also update a noatime FFS
filesystem mounted via mount(2) (e.g. 6.x /sbin/mount binary) to 'atime'
using nmount(2) (e.g. 7.x /sbin/mount binary).
MFC after: 1 week
Reviewed by: crodig
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.
Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)
It is normally initialized by ffs_statfs() after ffs_mount finished.
The extattr autostart code calls the ufs_lookup(), that uses value above
to iterate over the directory blocks, see bmask initialization in the
ufs_lookup() and ufsdirhash. Having the filesystem with root directory
spanning more then one block would result in reading a random kernel
memory.
PR: kern/120781
Test case provided by: rwatson
MFC after: 1 week
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode
In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock
Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present
Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h
KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.
Tested by: matteo
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj
BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.
KPI results, obviously, broken so further commits will update manpages
and freebsd version.
Tested by: kris (on UFS and NFS)
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.
KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.
Manpage and FreeBSD_version will be updated through further commits.
As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
Each struct dquot gets dq_lock mutex to protect dq_flags and to interlock
with DQ_LOCK. qhash, dqfreelist and dq.dq_cnt are protected by global
dqhlock mutex.
i_dquot array for inode is protected by lockmgr' vnode lock, corresponding
assert added to the dqget(). Access to struct ufsmount quota-related fields
(um_quotas and um_qflags) is protected by um_lock.
Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)
This work were not possible without enormous amount of help given by
Tor Egge and Peter Holm. Tor reviewed each version of patch, pointed out
numerous errors and provided invaluable suggestions. Peter did tireless
testing of the patch as it was developed.
late stages of unmount). On failure, the vnode is recycled.
Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.
Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.
Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.
Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.
Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.
Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.
VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.
Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.
Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups looks for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.
Sponsored by: home.pl
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
bread() the UFS superblock. Should eliminate crashes when trying
to do: mount -t ufs on an audio CD.
PR: kern/85893
Reported by: Russell Francis <rfrancis at ev dot net>
MFC after: 1 week
buffers to go on the buf daemon's DIRTYGIANT queue.
- Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
runs in the context of the buf daemon and may require Giant.
enables multilabel, or any option for that matter, most likely they have
a reason. This will allow users to see that mulilabel is enabled via an
issued "mount" command and remove an annoying warning - printed only when
a MAC kernel is not installed - on boot up.
Discussed with: green, brueffer, Samy Al Bahra.
Probably ran past: csjp (though I can't remember).
Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.
Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).
Perform deferred inactive handling after the file system is resumed.
replacement for vn_write_suspend_wait() to better account for secondary write
processing.
Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.
Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.
for export structure and pass that to vfs_export().
Currently in userland mount(8), an export structure is unconditionally
passed in, only for UFS. This is an attempt to move that UFS-specific
behavior out of mount(8) and into the UFS filesystem code.
callpath via vfs_getopt(), and set the appropriate MNT_* flag:
-> acls, async, force, multilabel, noasync, noatime,
-> noclusterr, noclusterw, snapshot, update
- Allow errmsg as a valid mount option via vfs_getopt(),
so we can later add a hook to propagate mount errors back
to userspace via vfs_mount_error().
to (max block - 1) * bsize. For DEV_BSIZE, this doubles the limit from
0.5 TB to 1 TB. For the old 4.4 FFS case, decrease the limit from 0.5 TB
to 2 GB - 1. Older systems had a 32 bit off_t so they couldn't access the
larger files anyway.
Collaboration with: bde
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().
Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.
Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks
are subtle differences in the read and write completion path. Instead,
grab an extra write ref so the write path can drop it when we recursively
call bufdone(). I believe this may be the source of the wrong bufobj
panics.
Reported by: pho, kkenn
because it may change identities while we're sleeping on the lock.
Otherwise we may bail out of ffs_sync() early due to an error from
deadfs.
- Collapse a VOP_UNLOCK, vrele into a single vput().
two bugs.
- ffs_disk_prewrite was pulling the vp from the buf and checking for
COPYONWRITE, when really it wanted the vp from the bufobj that we're
writing to, which is the devvp. This lead to us skipping the copy on
write to all file data, which significantly broke snapshots for the
last few months.
- When the SOFTUPDATES option was not included in the kernel config we
would also skip the copy on write check, which would effectively disable
snapshots.
- Remove an invalid mp_fixme().
Debugging tips from: mckusick
Reported by: iedowse, others
Discussed with: phk
the filesystem. Check that rather than VI_XLOCK.
- Shorten ffs_reload by one step. The old check for an inactive vnode
was slightly racey, and the code which deals with still active vnodes
is not much more expensive.
Sponsored by: Isilon Systems, Inc.
patch from kan@).
Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.
Remove all the background write related stuff from the normal bufwrite.
This drags the softdep_move_dependencies() back into FFS.
Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.
The "business class upgrade" was implemented in UFS's VOP_LOCK
implementation ufs_lock() which is the wrong layer, so move it to
ffs_lock().
Also, as long as we have not abandonned advanced vfs-stacking we
should not preclude it from happening: instead of implementing a
copy locally, use the VOP_LOCK_APV(&ufs) to correctly arrive at
vop_stdlock() at the bottom.
- Use the buffer lock on the superblock buf to serialize calls to
sbupdate.
- Set the MNTK_MPSAFE flag when QUOTA is not defined in the kernel.
Sponsored By: Isilon Systems, Inc.
Without this fix, when ACLs are set via tunefs(8) on the root file system,
they are removed on boot when 'mount -a' is called, because mount(8)
called for the root file system always add MNT_UPDATE flag and MNT_UPDATE
flag isn't perfect.
Now, one cannot remove ACLs stored in superblock (configured with tunefs(8))
via 'mount -a' nor 'mount -u -o noacls <file system>', but it is still
possible to mount file system which doesn't have ACLs in superblock via
'mount -o acls <file system>' or /etc/fstab's 'acls' option.
Reported by: Lech Lorens/pl.comp.os.bsd
Discussed with: phk, rwatson
Reviewed by: rwatson
MFC after: 2 weeks
I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:
The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.
Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.
If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.
Discussed with: rwatson
four different locations on a prospective filesystem.
If we found none, we forgot to invalidate the four buffers, thus the
following sequence would fails:
(md0 = blank disk)
mount /dev/md0 /mnt
(fails, no superblocks)
newfs /dev/md0
(writes using physio which does not go through buffercache).
mount /dev/md0 /mnt
(still fails, the four cached buffers still contain no superblocks)
Found by: ru
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:
cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()
nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.
ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()
Remove vfs_omount() method, all filesystems are now converted.
Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.
Change rootmounting to use DEVFS trampoline:
vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.
Remove now unnecessary getdiskbyname().
kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.
Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.
Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway. Correct information is provided by
statfs(/).
doesn't. Most of the implementations have grown weeds for this so they
copy some fields from mnt_stat if the passed argument isn't that.
Fix this the cleaner way: Always call the implementation on mnt_stat
and copy that in toto to the VFS_STATFS argument if different.
initializations but we did have lofty goals and big ideals.
Adjust to more contemporary circumstances and gain type checking.
Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.
Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.
Give coda correct prototypes and function definitions for
all vop_()s.
Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.
Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.
Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.
Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.
Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.
Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)
This eliminates a bunch of vnode overhead (approx 1-2 % speed
improvement) and gives us more control over the access to the storage
device.
Access counts on the underlying device are not correctly tracked and
therefore it is possible to read-only mount the same disk device multiple
times:
syv# mount -p
/dev/md0 /var ufs rw 2 2
/dev/ad0 /mnt ufs ro 1 1
/dev/ad0 /mnt2 ufs ro 1 1
/dev/ad0 /mnt3 ufs ro 1 1
Since UFS/FFS is not a synchrousely consistent filesystem (ie: it caches
things in RAM) this is not possible with read-write mounts, and the system
will correctly reject this.
Details:
Add a geom consumer and a bufobj pointer to ufsmount.
Eliminate the vnode argument from softdep_disk_prewrite().
Pick the vnode out of bp->b_vp for now. Eventually we
should find it through bp->b_bufobj->b_private.
In the mountcode, use g_vfs_open() once we have used
VOP_ACCESS() to check permissions.
When upgrading and downgrading between r/o and r/w do the
right thing with GEOM access counts. Remove all the
workarounds for not being able to do this with VOP_OPEN().
If we are the root mount, drop the exclusive access count
until we upgrade to r/w. This allows fsck of the root
filesystem and the MNT_RELOAD to work correctly.
Set bo_private to the GEOM consumer on the device bufobj.
Change the ffs_ops->strategy function to call g_vfs_strategy()
In ufs_strategy() directly call the strategy on the disk
bufobj. Same in rawread.
In ffs_fsync() we will no longer see VCHR device nodes, so
remove code which synced the filesystem mounted on it, in
case we came there. I'm not sure this code made sense in
the first place since we would have taken the specfs route
on such a vnode.
Redo the highly bogus readblock() function in the snapshot
code to something slightly less bogus: Constructing an uio
and using physio was really quite a detour. Instead just
fill in a bio and ship it down.
Give ffs it's own bufobj->bo_ops vector and create a private strategy
routine, (currently misnamed for forwards compatibility), which is
just a copy of the generic bufstrategy routine except we call
softdep_disk_prewrite() directly instead of through the buf_prewrite()
indirection.
Teach UFS about the need for softdep_disk_prewrite() and call the
function directly in FFS.
Remove buf_prewrite() from the default bufstrategy() and from the
global bio_ops method vector.
We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.
It can be used to delay mounting root partition to give a chance to GEOM
providers to show up.
Now, when there is no needed provider, vfs_rootmount() function will look
for it every second and if it can't be find in defined time, it'll ask
for root device name (before this change it was done immediately).
This will allow to boot from gmirror device in degraded mode.
and refuse initializing filesystems with a wrong version. This will
aid maintenance activites on the 5-stable branch.
s/vfs_mount/vfs_omount/
s/vfs_nmount/vfs_mount/
Name our filesystems mount function consistently.
Eliminate the namiedata argument to both vfs_mount and vfs_omount.
It was originally there to save stack space. A few places abused
it to get hold of some credentials to pass around. Effectively
it is unused.
Reorganize the root filesystem selection code.
Add local rootvp variables as needed.
Remove checks for miniroot's in the swappartition. We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.
Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:
MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);
The code which takes vnodes off a mountpoint looks like this:
MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;
(Take a moment and try to spot the locking error before you read on.)
On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.
Fix:
Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.
Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.
Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
things which compare /etc/fstab entries to results from
getfsstat(). The real way to fix this is to make 'ufs2'
a recognized filesystem (for real, no beating around the
bush).
This should fix things like 'umount -a -t ufs' now.
Appologies for the previous breakage.
libufs, which only works for Charlie root.
This change reverts the introduction of libufs and moves the
check into the kernel. Since the f_fstypename is the same
for both ufs and ufs2, we check fs_magic for presence of
ufs2 and copy "ufs2" explicitly instead.
Submitted by: Christian S.J. Peron <maneo@bsdpro.com>
license, per letter dated July 22, 1999 and irc message from Robert
Watson saying that clause 3 can be removed from those files with an
NAI copyright that also have only a University of California
copyrights.
Approved by: core, rwatson
- don't unlock the vnode after vinvalbuf() only to have to relock it
almost immediately.
- don't refer to devices classified by vn_isdisk() as block devices.
operators) in and near revs.1.169-1.170 (open mode bandaid). This
(or better a proper fix) should have been done before cloning the
bandaid to many other file systems.
system super block after fsck has repaired the file system. The value of
fs_ronly was getting overwritten, which caused ffs_update() to attempt to
update inode timestamps even though the file system was still mounted
read-only.
This fixes the "giving up on N buffers" error that is triggered by running
fsck on the root file system and then rebooting without mounting the file
system read-write.
accurate reporting of multi-terabyte filesystem sizes.
You should build and boot a new kernel BEFORE doing a `make world'
as the new kernel will know about binaries using the old statfs
structure, but an old kernel will not know about the new system
calls that support the new statfs structure. Running an old kernel
after a `make world' will cause programs such as `df' that do a
statfs system call to fail with a bad system call.
Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Tim Robbins <tjr@freebsd.org>
Reviewed by: Julian Elischer <julian@elischer.org>
Reviewed by: the hoards of <arch@freebsd.org>
Sponsored by: DARPA & NAI Labs.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.
Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.
Discussed with: jeff
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.
The only case there td != curthread known at the moment is
boot() calling sync with thread0 pointer.
This fixes the panic on shutdown people have reported.
so make the code slightly more uniform. The vnode lock is acquired in
all cases and now the only difference between VCHR and other is we
call UFS_UPDATE instead of VOP_FSYNC().
- Slightly rewrite the fsync loop to be more lock friendly. We must
acquire the vnode interlock before dropping the mnt lock. We must
also check XLOCK to prevent vclean() races.
- Use LK_INTERLOCK in the vget() in ffs_sync to further prevent vclean()
races.
- Use a local variable to store the results of the nvp == TAILQ_NEXT
test so that we do not access the vp after we've vrele()d it.
- Add an XXX comment about UFS_UPDATE() not being protected by any lock
here. I suspect that it should need the VOP lock.
we release the mntvnode_mtx.
- Call vgonel() directly instead of going through vrecycle() since we own
the interlock now.
- Remove a few cases where we locked the interlock just so that we could
call VOP_UNLOCK with interlock held.
get the same value from ip->i_ump->um_devvp.
This saves a pointer in the memory copies of inodes, which can
easily run into several hundred kilobytes.
The extra indirection is unmeasurable in benchmarks.
Approved by: mckusick
contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.