from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.
All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.
I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.
is given (with newfs or tunefs) and dirsize overflows.
In case dirsize is <= 0 because of an overflow set maxcontigdirs
to 0 so it will be 1 later. This is what would happen for large
fs_avgfilesize. [1]
Identified with help from: roberto, pjd
Submitted by: pjd [1]
Approved by: re (rwatson)
MFC after: 8 days
some quota limit was exceeded. Sequence of UFS_VALLOC()/UFS_VFREE()
call there could cause inodeblock to have both freefile and inodedep
dependencies without any inode in the block being marked for write.
Then, softdep_check_suspend() would return EAGAIN forewer.
Force write of inodeblock with allocated freefile softdependency by
setting IN_MODIFIED flag in softdep_freefile and unconditionally calling
UFS_UPDATE() in ufs_reclaim.
Reported by: kris
Debug help and tested by: Peter Holm
Approved by: re (kensmith)
MFC after: 3 weeks
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.
Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.
We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.
Reviewed by: csjp
Obtained from: TrustedBSD Project
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
argument from being file descriptor index into the pointer to struct file:
part 2. Convert calls missed in the first big commit.
Noted by: rwatson
Pointy hat to: kib
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.
Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.
Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)
It's hard to measure performance improvement on my test machine, but the
change won't degrade performance for sure. I can measure slight improvement
for debugging kernel and it can also be a win for machines where atomic
operation is more expensive.
Reviewed by: kib
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.
does not prevent handle_workitem_remove() from recursing into a blocking
version. Add the dirrem to worklist instead of processing it now if this
is the case.
Reported and tested by: kris
Submitted by: tegge
MFC after: 2 weeks
defined with VFS_LOCK_GIANT(NULL) call.
This shall fix softdep operation when mpsafe_vfs = 0.
Reported and tested by: kris
Submitted by: tegge
MFC after: 1 week
already been deleted. The assertion is important to show that
we won't end up accounting for extended attribute blocks (using
fs_pendingblocks) in our subsequent call to fs_alloc().
Agreed verbally by: mckusick
MFC after: 3 weeks
Each struct dquot gets dq_lock mutex to protect dq_flags and to interlock
with DQ_LOCK. qhash, dqfreelist and dq.dq_cnt are protected by global
dqhlock mutex.
i_dquot array for inode is protected by lockmgr' vnode lock, corresponding
assert added to the dqget(). Access to struct ufsmount quota-related fields
(um_quotas and um_qflags) is protected by um_lock.
Tested by: Peter Holm
Reviewed by: tegge
Approved by: re (kensmith)
This work were not possible without enormous amount of help given by
Tor Egge and Peter Holm. Tor reviewed each version of patch, pointed out
numerous errors and provided invaluable suggestions. Peter did tireless
testing of the patch as it was developed.
late stages of unmount). On failure, the vnode is recycled.
Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.
Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.
Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.
Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.
Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib
<sys/extattr.h> to <ufs/ufs/extattr.h>. Move description
of extended attributes in UFS from man9/extattr.9 to
man5/fs.5.
Note that restore will not compile until <sys/extattr.h>
and <ufs/ufs/extattr.h> have been updated.
Suggested by: Robert Watson
inode's i_flag.
It's possible that after ufs_infactive() calls softdep_releasefile(),
i_nlink stays >0 for a considerable amount of time (> 60 seconds here).
During this period, any ffs allocation routines that alter di_blocks
must also account for the blocks in the filesystem's fs_pendingblocks
value.
This change fixes an eventual df/du discrepency that will happen as
the result of fs_pendingblocks being reduced to <0.
The only manifestation of this that people may recognise is the
following message on boot:
/somefs: update error: blocks -N files M
at which point the negative pending block count is adjusted to zero.
Reviewed by: tegge
MFC after: 3 weeks
VFS privilege namespace: exceedquota, getquota, and setquota. Leave
UFS-specific quota configuration privileges in the UFS name space.
This renumbers VFS and UFS privileges, so requires rebuilding modules
if you are using security policies aware of privilege identifiers.
This is likely no one at this point since none of the committed MAC
policies use the privilege checks.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.
Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.
VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.
Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs
a user or group, when the kernel first sees this, it will update
the grace time value. However, it never flags the quota as modified
and the updated value never makes it to the quota data file unless
the user actually makes some other change that would write the
data out.
Fixed to flag the quota as modified if the soft limit has actually
been reached and should be now enforced.
using the callers UID instead of the GID when performing group
operations. This could allow users to determine group quota
information for groups they are not a member of in some cases.
Rename the "uid" parameter in ufs_quotactl to "id" to better show
that it is used for more than just the uid, and to be more in line
with the naming conventions in the other quota routines.
PR: kern/33940
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.
Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
1) Do not do quota accounting for the actual quota data files
or for file system snapshot files ("system" files). This
prevents a deadlock descibed in PR kern/30958 if the kernel
ever has to grow the quota file. Snapshot files were already
exempt from the quota checks, but this change generalized the check.
2) Fix a cast that caused extremely large uids/gids to incorrectly
write the quota information to the data file at a truncated
value for a uint_t32 id value. The incorrect cast caused quota
files in this case to be around 4GB in size, with the correct cast
they can now be 131GB in size. Also related to PR kern/30958.
3) Check for what appear to be negative UIDs/GIDs and not account
for them. This prevents the quota files from becoming 131GB in
size and causing quotacheck to run forever at bootup. This could
also cause the kernel to try and expand the quota file, which might
deadlock due to the issue in #1. kern/30958 and kern/38156
(and some much older closed PR's).
4) With the deadlock problems gone, the kernel can now expand the
size of the quota database files if it needs to.
5) Pass in the i-node count change value to chkiq and chkiqchg as an
int, like it used to be before the common routine was split up
into 2 different routines to increase / decrease the i-node in-use
count. Prevents an underflow on the i-node count. Related
to PR kern/89247.
6) Prevent the block usage from growing slowly if a file system is
full and the write was denied due to that fact. PR kern/89247.
Some of these changes require an updated quotacheck to prevent
the creation of huge (131GB) quota data files (item #3).
#1/#4 probably fixes a lot of the random hangs when quotas are enabled,
possibly some of the jail hangs.
specific privilege names to a broad range of privileges. These may
require some future tweaking.
Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups looks for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.
Sponsored by: home.pl
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
is suspending/suspended. Doing so may result in deadlock. Instead, set the
(new) IN_LAZYACCESS flag, that becomes IN_MODIFIED when suspend is lifted.
Change the locking protocol in order to set the IN_ACCESS and timestamps
without upgrading shared vnode lock to exclusive (see comments in the
inode.h). Before that, inode was modified while holding only shared
lock.
Tested by: Peter Holm
Reviewed by: tegge, bde
Approved by: pjd (mentor)
MFC after: 3 weeks
closing a window where a file system using softupdates could be async
for a short while if both MNT_UPDATE and MNT_ASYNC were passed as flags
to nmount(). Add MNTK_SOFTDEP flag to ensure that softdep_mount()
doesn't increase mnt_noasync multiple times.
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
switch by worklist type contains two for() loops, for D_INDIRDEP and
D_PAGEDEP. On error, these loops are exited by break, where the switch
actually shall be leaved. Use goto instead of break to reach the error
handling code.
Reported by: Peter Holm
Reviewed by: tegge
Approved by: pjd (mentor)
MFC after: 2 weeks
with other commonly used sysctl name spaces, rather than declaring them
all over the place.
MFC after: 1 month
Sponsored by: nCircle Network Security, Inc.
first filter out metadata update. Otherwise, devfs vnode could be
erronously interpreted as ufs one, causing further check of i_flags
to use random memory.
PR: kern/100365
Debugged and fix described by: tegge
Approved by: pjd (mentor)
MFC after: 2 weeks
process. UFS must check a whiteout name when it uses the
whiteout, but the current implementation does not check
the whileout name, so sometimes UFS writes over a wrong
whtieout. UFS *MUST* check the whiteout name to use a
corrent whiteout. This bug leads unionfs. panic.
This commit fixes this trouble.
Submitted by: Masanori Ozawa <ozawa@ongs.co.jp> (unionfs developer)
Reviewed by: tegge & rodrigc (mentor)
Approved by: rodrigc (mentor)
MFC after: 2 weeks
bread() the UFS superblock. Should eliminate crashes when trying
to do: mount -t ufs on an audio CD.
PR: kern/85893
Reported by: Russell Francis <rfrancis at ev dot net>
MFC after: 1 week
for file types other than VREG, VDIR and shared memory objects.
We already handle VREG, VLNK and VDIR cases. Silently ignore
truncate requests for all the rest. Adjust comments.
PR: kern/98064
Submitted by: bde
Security: local DoS
Regress. test: regression/fifo/fifo_misc
MFC after: 2 weeks
assuming them to be inflight write buffers. This is not always the case.
bufdaemon might hold the buffer lock and give up writing the buffer due to it
having dependencies, the file system being suspended or the vnode lock being
held by another thread. When bufdaemon decides to write the buffer there is
still a window before bufobj_wref() has been called, allowing other threads to
believe that the vnode has no dirty buffers or inflight writes.
Try harder to flush first block of new subdirectory to get rid of MKDIR_BODY
dependency.
same time as it is changed back into a normal file. The locker would
get the shared "snaplk" lock which would no longer be the correct lock
for the vnode.
buffers to go on the buf daemon's DIRTYGIANT queue.
- Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
runs in the context of the buf daemon and may require Giant.
enables multilabel, or any option for that matter, most likely they have
a reason. This will allow users to see that mulilabel is enabled via an
issued "mount" command and remove an annoying warning - printed only when
a MAC kernel is not installed - on boot up.
Discussed with: green, brueffer, Samy Al Bahra.
Probably ran past: csjp (though I can't remember).
This does not do what I wanted as all dirty buffers must be flushed
by the call to ffs_sync and any remaining dependency work would mean
that this failed.
Pointed out by: tegge
This does not do what I wanted as all dirty buffers must be flushed
by the call to ffs_sync and any remaining dependency work would mean
that this failed.
Pointed out by: tegge
Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.
Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).
Perform deferred inactive handling after the file system is resumed.
replacement for vn_write_suspend_wait() to better account for secondary write
processing.
Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.
Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.
Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris
directory. vrele() may lock the passed vnode, which in these cases would
give an invalid lock order of child -> parent. These situations are
deadlock prone although do not typically deadlock because the vrele
is typically not releasing the last reference to the vnode. Users of
vrele must consider it as a call to vn_lock() and order it appropriately.
MFC After: 1 week
Sponsored by: Isilon Systems, Inc.
Tested by: kkenn
that NetBSD implemented it independently of them (don't know which one
was actually first). This saves about 24k for those times you don't
need snapshot support (like when running off a ram disk, or in an
embedded environment where size matters).
for export structure and pass that to vfs_export().
Currently in userland mount(8), an export structure is unconditionally
passed in, only for UFS. This is an attempt to move that UFS-specific
behavior out of mount(8) and into the UFS filesystem code.
callpath via vfs_getopt(), and set the appropriate MNT_* flag:
-> acls, async, force, multilabel, noasync, noatime,
-> noclusterr, noclusterw, snapshot, update
- Allow errmsg as a valid mount option via vfs_getopt(),
so we can later add a hook to propagate mount errors back
to userspace via vfs_mount_error().
- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some
memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
to (max block - 1) * bsize. For DEV_BSIZE, this doubles the limit from
0.5 TB to 1 TB. For the old 4.4 FFS case, decrease the limit from 0.5 TB
to 2 GB - 1. Older systems had a 32 bit off_t so they couldn't access the
larger files anyway.
Collaboration with: bde
case by saving the value of dp->i_ino before unlocking the vnode
for the current directory and passing the saved value to VFS_VGET().
Without this change, another thread can overwrite dp->i_ino after
the current directory is unlocked, causing ufs_lookup() to lock
and return the wrong vnode in place of the vnode for its parent
directory. A deadlock can occur if dp->i_ino was changed to a
subdirectory of the current directory because the root to leaf vnode
lock ordering will be violated. A vnode lock can be leaked if
dp->i_ino was changed to point to the current directory, which
causes the current vnode lock for the current directory to be
recursed, which confuses lookup() into calling vrele() when it
should be calling vput().
The probability of this bug being triggered seems to be quite low
unless the sysctl variable debug.vfscache is set to 0.
Reviewed by: jhb
MFC after: 2 weeks
auto-start, set cnp.cn_lkflags to LK_EXCLUSIVE. This flag must now
be set so that lockmgr knows what kind of lock to acquire, and it
will panic if not specified. This resulted in a panic when using
extended attributes on UFS1 as of locking work present in the 6.x
branch.
This is a RELENG_6_0 merge candidate.
Reported by: lofi
MFC after: 3 days
updated by a process holding the snapshot lock. Another process updating a
different inode in the same inodeblock will do copy on write checks and lock in
the opposite direction.
The snapshot code force a copy on write of these blocks manually (cf. start of
expunge_ufs[12]) and these inode blocks are later put on snapblklist.
This partial fix is to 'drain' the relevant ffs_copyonwrite() operation after
installing new snapblklist. This is not a 100% solution since a failed block
allocation can cause implicit fsync() which might deadlock before the new
snapblklist has been installed.
file is flushed by a process not holding snaplk (e.g. bufdaemon). Another
process might hold snaplk and try to access the block due to ffs_copyonwrite
processing.
the cg map buffer being held when writing indirect blocks. The process ends up
in ffs_copyonwrite(), attempting to get snaplk while holding the cg map buffer
lock.
Another process might be in ffs_copyonwrite(), trying to allocate a new block
for a copy. It would hold snaplk while trying to get the cg map buffer lock.
Release the cg map buffer early and use the copy for most of the cgaccount
processing to avoid this deadlock.
skipping the call from ffs_snapremove() if the block number is zero.
Simplify snapshot locking in ffs_copyonwrite() and ffs_snapblkfree() by using
the same locking protocol for low block numbers as for larger block numbers.
This removes a lock leak that could happen if vn_lock() succeeded after
lockmgr() failed in ffs_snapblkfree().
Check if snapshot is gone before retrying a lock in ffs_copyonwrite().
reclamation. If the vnode previously was a fifo then v_op would point to
ffs_fifoops[12] instead of the expected ffs_vnodeops[12], causing a panic at
the end of ffsext_strategy.
stale flag bits left over from before the inode was recycled.
Without this change, a leftover IN_SPACECOUNTED flag could prevent
softdep_freefile() and softdep_releasefile() from incrementing
fs_pendinginodes. Because handle_workitem_freefile() unconditionally
decrements fs_pendinginodes, a negative value could be reported at
file system unmount time with a message like:
unmount pending error: blocks 0 files -3
The pending block count in fs_pendingblocks could also be negative
for similar reasons. These errors can cause the data returned by
statfs() to be slightly incorrect. Some other cleanup code in
softdep_releasefile() could also be incorrectly bypassed.
MFC after: 3 days