separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.
From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.
If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.
Reviewed in general by: mckusick
on directories.
o Allow privileged processes in jail() to create inodes with the
setgid bit set even if they are not a member of the group denoted
by the file creation gid. This occurs due to inherited gid's from
parent directories on file creation, allowing a user to create a
file with a gid that is not in the creating process's credentials.
Obtained from: TrustedBSD Project
and UFS file flags. Here's what the comment says, for reference:
Privileged processes in jail() are permitted to modify
arbitrary user flags on files, but are not permitted
to modify system flags.
In other words, privilege does allow a process in jail to modify user
flags for objects that the process does not own, but privilege will
not permit the setting of system flags on the file.
Obtained from: TrustedBSD Project
remove the setuid/setgid bits by virtue of a change to a file with those
bits set, even if the process doesn't own the file, or isn't a group
member of the file's gid.
Obtained from: TrustedBSD Project
safe as suser() no longer sets ASU.
o Note that in some cases, the PRISON_ROOT flag is used even though no
process structure is passed, to indicate that if a process structure
(and hence jail) was available, it would be ok. In the long run,
the jail identifier should probably be moved to ucred, as the uidinfo
information was.
o Some uid 0 checks remain relating to the quota code, which I'll leave
for another day.
Reviewed by: phk, eivind
Obtained from: TrustedBSD Project
filesystem lookup() routine if it unlocks parent directory. This flag should
be carefully tracked by filesystems if they want to work properly with nullfs
and other stacked filesystems.
VFS takes advantage of this flag to perform symantically correct usage
of vrele() instead of vput() if parent directory already unlocked.
If filesystem fails to track this flag then previous codepath in VFS left
unchanged.
Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystmes will
be changed after some period of testing.
Reviewed in general by: mckusick, dillon, adrian
Obtained from: NetBSD
- In ufs_extattr_enable(), return EEXIST instead of EOPNOTSUPP
if the caller tries to configure an attribute name that is
already configured
- Throughout, add IO_NODELOCKED to VOP_{READ,WRITE} calls to
indicate lock status of passed vnode. Apparently not a
problem, but worth fixing.
- For all writes, make use of IO_SYNC consistent. Really,
IO_UNIT and combining of VOP_WRITE's should happen, but I
don't have that tested. At least with this, it's
consistent usage. (pointed out by: bde)
- In ufs_extattr_get(), fixed nested locking of backing
vnode (fine due to recursive lock support, but make it
more consistent with other code)
- In ufs_extattr_get(), clean up return code to set uio_resid
more consistently with other pieces of code (worked fine,
this is just a cleanup)
- Fix ufs_extattr_rm(), which was broken--effectively a nop.
- Minor comment and whitespace fixes.
Obtained from: TrustedBSD Project
include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).
Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
attribute namespace and DAC protection on file:
- Attribute names beginning with '$' are in the system namespace
- The attribute name "$" is reserved
- System namespace attributes may only be read/set by suser()
or by kernel (cred == NULL)
- Other attribute names are in the application namespace
- The attribute name "" is reserved
- Application namespace attributes are protected in the manner
of the target file permission
o Kernel changes
- Add ufs_extattr_valid_attrname() to check whether the requested
attribute "set" or "enable" is appropriate (i.e., non-reserved)
- Modify ufs_extattr_credcheck() to accept target file vnode, not
to take inode uid
- Modify ufs_extattr_credcheck() to check namespace, then enforce
either kernel/suser for system namespace, or vaccess() for
application namespace
o EA backing file format changes
- Remove permission fields from extended attribute backing file
header
- Bump extended attribute backing file header version to 3
o Update extattrctl.c and extattrctl.8
- Remove now deprecated -r and -w arguments to initattr, as
permissions are now implicit
- (unrelated) fix error reporting and unlinking during failed
initattr to remove duplicate/inaccurate error messages, and to
only unlink if the failure wasn't in the backing file open()
Obtained from: TrustedBSD Project
object before falling back on privilege. Make vaccess() accept an
additional optional argument, privused, to determine whether
privilege was required for vaccess() to return 0. Add commented
out capability checks for reference. Rename some variables to make
it more clear which modes/uids/etc are associated with the object,
and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
privused argument. Once additional patches are applied, suser()
will no longer set ASU, so privused will permit passing of
privilege information up the stack to the caller.
Reviewed by: bde, green, phk, -security, others
Obtained from: TrustedBSD Project
Remove old DEVFS support fields from dev_t.
Make uid, gid & mode members of dev_t and set them in make_dev().
Use correct uid, gid & mode in make_dev in disk minilayer.
Add support for registering alias names for a dev_t using the
new function make_dev_alias(). These will show up as symlinks
in DEVFS.
Use makedev() rather than make_dev() for MFSs magic devices to prevent
DEVFS from noticing this abuse.
Add a field for DEVFS inode number in dev_t.
Add new DEVFS in fs/devfs.
Add devfs cloning to:
disk minilayer (ie: ad(4), sd(4), cd(4) etc etc)
md(4), tun(4), bpf(4), fd(4)
If DEVFS add -d flag to /sbin/inits args to make it mount devfs.
Add commented out DEVFS to GENERIC
This allows ffs_fsync() to break out of a loop that might otherwise
be infinite on kernels compiled without the SOFTUPDATES option.
The observed symptom was a system hang at the first unmount attempt.
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.
Submitted by: Robert Watson <rwatson@FreeBSD.org>
with the new snapshot code.
Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.
Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().
Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.
Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.
Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.
Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.
Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.
Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.
Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
attribute data size.
o Fortunately it turned out to be an unused constant left over from an
earlier implementation, and is therefore being removed so as not to
confuse casual observers.
Submitted by: mbendiks@eunet.no
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.
Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
in mount.h instead of ffs_extern.h. The correct solution is to use
an indirect function pointer so that the kernel does not have to be
built with options FFS, but that will be left for another day.
advance preparation for them to get migrated into place so that
subsequent changes in utilities will not fail to compile for lack
of up-to-date header files in /usr/include.
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.
Obtained from: BSD/OS
the system would panic when a user's inode quota was exceeded (see
PR 18959 for details). This fixes that problem.
PR: 18959
Submitted by: Jason Godsey <jason@unixguy.fidalgo.net>
check to see if it has been committed to disk. If it has never
been written, it can be freed immediately. For short lived files
this change allows the same inode to be reused repeatedly.
Similarly, when upgrading a fragment to a larger size, if it
has never been claimed by an inode on disk, it too can be freed
immediately making it available for reuse often in the next slowly
growing block of the same file.