Don't grab Giant in the upper syscall/wrapper code
NET_LOCK_GIANT in the socket code (sockets/fifos).
mtx_lock(&Giant) in the vnode code.
mtx_lock(&Giant) in the opencrypto code. (This may actually not be
needed, but better safe than sorry).
Devfs grabs Giant if the driver is marked as needing Giant.
Don't grab Giant in the upper syscall/wrapper code
NET_LOCK_GIANT in the socket code (sockets/fifos).
mtx_lock(&Giant) in the vnode code.
Devfs grabs Giant if the driver is marked as needing Giant.
We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.
Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).
Reviewed by: green, rwatson (both earlier versions)
catch leaking into VFS without Giant.
Inch Giant a little lower in several file descriptor operations on
vnodes to cover only VFS operations that need it, rather than file
flag reading, etc.
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op. Don't acquire Giant in fstat() system call.
Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.
individual file object implementations can optionally acquire Giant if
they require it:
- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)
Notes:
Giant is still acquired in close() even when closing MPSAFE objects
due to kqueue requiring Giant in the calling closef() code.
Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
of pipe create/destroy pairs from user space with SMP compiled into
the kernel.
The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
test it extensively and so have left Giant over fo_close(). It can
probably be removed given some testing and review.
size_t and size_t *, respectively. Update callers for the new interface.
This is a better fix for overflows that occurred when dumping segments
larger than 2GB to core files.
descriptors out of fdrop_locked() and into vn_closefile(). This
removes all knowledge of vnodes from fdrop_locked(), since the lock
behavior was specific to vnodes. This also removes the specific
requirement for Giant in fdrop_locked(), it's now only required by
code that it calls into.
Add GIANT_REQUIRED to vn_closefile() since VFS requires Giant.
the last chunk are misaligned relative to a MAXBSIZE byte boundary.
vn_rdwr_inchunks() is used mainly for elf core dumps, and elf sections
are usually perfectly misaligned relative to MAXBSIZE, and chunking
prevents the file system from doing much realigning.
This gives a surprisingly large speedup for core dumps -- from 50 to
13 seconds for a 512MB core dump here. The pessimization was mostly
from an interaction of the misalignment with IO_DIRECT. It increased
the number of i/o's for each chunk by a factor of 5 (3 writes and 2
read-before-writes instead of 1 write).
kernel ACL interfaces and system call names.
Break out UFS2 and FFS extattr delete and list vnode operations from
setextattr and getextattr to deleteextattr and listextattr, which
cleans up the implementations, and makes the results more readable,
and makes the APIs more clear.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.
attributes from objects over vop_setextattr() with a NULL uio; if
the file system doesn't support the vop_rmextattr() method, fall
back to the vop_setextattr() method.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.
By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.
At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.
pointer types, and remove a huge number of casts from code using it.
Change struct xfile xf_data to xun_data (ABI is still compatible).
If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.
which expects it to be NULL unless the return value was 0 will work.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.
Sponsored by: DARPA & NAI Labs.
LK_INTERLOCK. The interlock will never be held on return from these
functions even when there is an error. Errors typically only occur when
the XLOCK is held which means this isn't the vnode we want anyway. Almost
all users of these interfaces expected this behavior even though it was
not provided before.
with interlock held in error conditions when the caller did not specify
LK_INTERLOCK.
- Add several comments to vn_lock() describing the rational behind the code
flow since it was not immediately obvious.
for mac_check_vnode_{poll,read,stat,write}(). Pass in fp->f_cred
when calling these checks with a struct file available. Otherwise,
pass NOCRED. All currently MAC policies use active_cred, but
could now offer the cached credential semantic used for the base
system security model.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
mac_check_vnode_poll(), mac_check_vnode_read(), mac_check_vnode_write().
This improves the consistency with other existing vnode checks, and
allows policies to avoid implementing switch statements to determine
what operations they do and do not want to authorize.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.
- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.
Trickle this change down into fo_stat/poll() implementations:
- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.
- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.
Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:
- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.
For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:
- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred
Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.
Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.
These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.
Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
'options MAC') as long as IO_NOMACCHECK is not set in the IO flags.
If IO_NOMACCHECK is set, bypass MAC checks in vn_rdwr(). This allows
vn_rdwr() to be used as a utility function inside of file systems
where MAC checks have already been performed, or where the operation
is being done on behalf of the kernel not the user.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI LAbs
VOP wrapper is called from within file systems so can result in odd
loopback effects when MAC enforcement is use with the active (as
opposed to saved) credential. These checks will be moved elsewhere.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.
Idea stolen from: BSD/OS
kernel access control
Invoke appropriate MAC framework entry points to authorize a number
of vnode operations, including read, write, stat, poll. This permits
MAC policies to revoke access to files following label changes,
and to limit information spread about the file to user processes.
Note: currently the file cached credential is used for some of
these authorization check. We will need to expand some of the
MAC entry point APIs to permit multiple creds to be passed to
the access control check to allow diverse policy behavior.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
kernel access control.
Restructure the vn_open_cred() access control checks to invoke
the MAC entry point for open authorization. Note that MAC can
reject open requests where existing DAC code skips the open
authorization check due to O_CREAT. However, the failure mode
here is the same as other failure modes following creation,
wherein an empty file may be left behind.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
sysctl purposes. Also add two fields to struct vnode, v_cachedfs and
v_cachedid, which hold the vnode's device and file id and are filled in
by vn_open_cred() and vn_stat().
Sponsored by: DARPA, NAI Labs
open() of fhopen(). Currently this has no actual affect due to the
treatment of VAPPEND in vaccess() and vaccess_acl() as a subset of
VWRITE, but when MAC comes in, MAC will distinguish the two. Note:
if any file systems are cutting their own permission models, they
may wish to now take this into account.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
64-bit architectures that was introduced in the UFS2 code
merge two days ago. The stat structure change that caused
the problem was the addition of the file create time.
Submitted by: Bruce Evans <bde@zeta.org.au>
Sponsored by: DARPA & NAI Labs.
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.
Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.
Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).
Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
behavior by default. Also, change the options line to reflect this.
If there are no problems reported this will become the only behavior and the
knob will be removed in a month or so.
Demanded by: obrien
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
kern/kern_descrip.c:
Aquire Giant in fdrop_locked when file refcount hits zero, this removes
the requirement for the caller to own Giant for the most part.
kern/kern_ktrace.c:
Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls.
kern/vfs_bio.c:
Aquire Giant in bwillwrite if needed.
kern/sys_generic.c
Giant pushdown, remove Giant for:
read, pread, write and pwrite.
readv and writev aren't done yet because of the possible malloc calls
for iov to uio processing.
kern/sys_socket.c
Grab giant in the socket fo_read/write functions.
kern/vfs_vnops.c
Grab giant in the vnode fo_read/write functions.
The stat() and open() calls have been changed to make use of this new functionality. Using shared locks in
these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes. Also,
this reduces the number of exclusive locks that are floating around in the system, which helps reduce the
number of deadlocks that occur.
A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so this patch can be turned on for
testing, and should eventually go away once it is proven to be stable. I have personally been running this
patch for over a year now, so it is believed to be fully stable.
Reviewed by: jake, obrien
Approved by: jake
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g, grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)
Reviewed by: phk (but all errors are mine)
New locks are:
- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.
Please refer to sys/proc.h for the coverage of these locks.
Changes on the pgrp/session interface:
- pgfind() needs the pgrpsess_lock held.
- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.
- Call enterthispgrp() in order to enter an existing pgrp.
- pgsignal() requires a pgrp lock held.
Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)
call VOP_CLOSE() with vp unlocked; clean up the return path a little,
in as much as our namei/vnode operation return paths can be cleared
up. For a return case that was apparently never taken, this sure
is ugly.
Reviewed by: jeffr
VOP_CLOSE() on the vnode, so that VOP_OPEN() and VOP_CLOSE() calls
are symmetric in all failure cases. This prevents an 'open' reference
from being leaked in that unlikely failure scenario.
o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
as not to use the scatter gather API (which appeared not to be used
by any consumers, and be less portable), rather, accepts 'data'
and 'nbytes' in the style of other simple read/write interfaces.
This changes the API and ABI.
o Modify system call semantics so that extattr_get_{fd,file}() return
a size_t. When performing a read, the number of bytes read will
be returned, unless the data pointer is NULL, in which case the
number of bytes of data are returned. This changes the API only.
o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
argument so as to return the size, if desirable. If set to NULL,
the size will not be returned.
o Update various filesystems (pseodofs, ufs) to DTRT.
These changes should make extended attributes more useful and more
portable. More commits to rebuild the system call files, as well
as update userland utilities to follow.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
operation. The vgonel() code has always called vclean() but until we
started proactively freeing vnodes it would never actually be called with
a dirty vnode, so this situation did not occur prior to the vnlru() code.
Now that we proactively free vnodes when kern.maxvnodes is hit, however,
vclean() winds up with work to do and improperly generates the warnings.
Reviewed by: peter
Approved by: re (for MFC)
MFC after: 1 day
Seigo Tanimura (tanimura) posted the initial delta.
I've polished it quite a bit reducing the need for locking and
adapting it for KSE.
Locks:
1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.
1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.
1 sx lock for the global filelist.
struct file * fhold(struct file *fp);
/* increments reference count on a file */
struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */
struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */
struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */
I still have to smp-safe the fget cruft, I'll get to that asap.
by me to make it more efficient. The original code had serious balancing
problems and could also deadlock easily. This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes. This code still doesn't balance as well
as it could, but it does a much better job then the original code.
Approved by: re@freebsd.org
Obtained from: ps, peter, dillon
MFS Assuming: Assuming no problems crop up in Yahoo testing
MFC after: 7 days
one to perform a vn_open using temporary/other/fake credentials.
Modify the nfs client side locking code to use vn_open_cred() passing
proc0's ucred instead of the old way which was to temporary raise
privs while running vn_open(). This should close the race hopefully.
all successful calls to VOP_OPEN() might be reflected in a call to
VOP_CLOSE(). For now, simply add a comment reflecting this problem;
this should be fixed at some point.
in vn_rdwr_inchunks(), allowing other processes to gain an exclusive
lock on the vnode. Specifically: directory scanning, to avoid a race to the
root directory, and multiple child processes coring simultaniously so they
can figure out that some other core'ing child has an exclusive adv lock and
just exit instead.
This completely fixes performance problems when large programs core. You
can have hundreds of copies (forked children) of the same binary core all
at once and not notice.
MFC after: 3 days
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that
if you have large process images core dumping, or you have a large number of
forked processes all core dumping at the same time, the original coredump code
would leave the vnode locked throughout. This can cause the directory vnode
to get locked up, which can cause the parent directory vnode to get locked
up, and so on all the way to the root node, locking the entire machine up
for extremely long periods of time.
This patch solves the problem in two ways. First it uses an advisory
non-blocking lock to abort multiple processes trying to core to the same
file. Second (my contribution) it chunks up the writes and uses bwillwrite()
to avoid holding the vnode locked while blocking in the buffer cache.
Submitted by: ps
Reviewed by: dillon
MFC after: 2 weeks
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.
Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.
I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.
Submitted by: tegge, dillon
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.
For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.
Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.
o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to advoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.
o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.
o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.
o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.
o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.
Obtained from: TrustedBSD Project
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
similarily, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.
Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)
Removed most of the hacks that were trying to deal with low-memory
situations prior to now.
The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.
Code has been added to stall in a low-memory situation prior to a vnode
being locked.
Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.
Implement a number of VFS/BIO fixes
(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.
In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.
Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.
In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.
There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.
Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
vn_extattr_get() and vn_extattr_set(). vn_extattr_rm() removes the
specified extended attribute from a vnode, authorizing the change as
the kernel (NULL cred).
Obtained from: TrustedBSD Project
and VOP_SETEXTATTR to simplify calling from in-kernel consumers,
such as capability code. Both accept a vnode (optionally locked,
with ioflg to indicate that), attribute name, and a buffer + buffer
length in UIO_SYSSPACE. Both authorize the call as a kernel request,
with cred set to NULL for the actual VOP_ calls.
Obtained from: TrustedBSD Project
with the new snapshot code.
Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.
Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().
Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.
Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.
Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.
Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.
Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.
Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.
Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.
Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.
Obtained from: BSD/OS
<sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
async I/O's. The sequential read heuristic has been extended to
cover writes as well. We continue to call cluster_write() normally,
thus blocks in the file will still be reallocated for large (but still
random) I/O's, but I/O will only be initiated for truely sequential
writes.
This solves a number of annoying situations, especially with DBM (hash
method) writes, and also has the side effect of fixing a number of
(stupid) benchmarks.
Reviewed-by: mckusick
returned to user mode in the spare fields of the stat structure.
PR: kern/14966
Reviewed by: dillon@freebsd.org
Submitted by: Kelly Yancey kbyanc@posi.net
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.
Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)
Reviewed by: peter
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.
The reversion requested was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.
the highly non-recommended option ALLOW_BDEV_ACCESS is used.
(bdev access is evil because you don't get write errors reported.)
Kill si_bsize_best before it kills Matt :-)
Use the specfs routines rather having cloned copies in devfs.
now used in f_ops in place of NULL, and modifications to the files
are more carefully ordered. f_ops should also be set to &badfileops
upon "close" of a file.
This does not fix other problems mentioned in this PR than the first
one.
PR: 11629
Reviewed by: peter
large (1G) memory machine configurations. I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.
* buffer cache hash table now dynamically allocated. This will
have no effect on memory consumption for smaller systems and
will help scale the buffer cache for larger systems.
* minor enhancement to pmap_clearbit(). I noticed that
all the calls to it used constant arguments. Making
it an inline allows the constants to propogate to
deeper inlines and should produce better code.
* removal of inherent vfs_ioopt support through the emplacement
of appropriate #ifdef's, with John's permission. If we do not
find a use for it by the end of the year we will remove it entirely.
* removal of getnewbufloops* counters & sysctl's - no longer
necessary for debugging, getnewbuf() is now optimal.
* buffer hash table functions removed from sys/buf.h and localized
to vfs_bio.c
* VFS_BIO_NEED_DIRTYFLUSH flag and support code added
( bwillwrite() ), allowing processes to block when too many dirty
buffers are present in the system.
* removal of a softdep test in bdwrite() that is no longer necessary
now that bdwrite() no longer attempts to flush dirty buffers.
* slight optimization added to bqrelse() - there is no reason
to test for available buffer space on B_DELWRI buffers.
* addition of reverse-scanning code to vfs_bio_awrite().
vfs_bio_awrite() will attempt to locate clusterable areas
in both the forward and reverse direction relative to the
offset of the buffer passed to it. This will probably not
make much of a difference now, but I believe we will start
to rely on it heavily in the future if we decide to shift
some of the burden of the clustering closer to the actual
I/O initiation.
* Removal of the newbufcnt and lastnewbuf counters that Kirk
added. They do not fix any race conditions that haven't already
been fixed by the gbincore() test done after the only call
to getnewbuf(). getnewbuf() is a static, so there is no chance
of it being misused by other modules. ( Unless Kirk can think
of a specific thing that this code fixes. I went through it
very carefully and didn't see anything ).
* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not
think this check is necessary, the buffer should flush properly
whether the vnode is locked or not. ( yes? ).
* removal of extra arguments passed to getnewbuf() that are not
necessary.
* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
vfs_cluster.c
* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
which should greatly aid flushing operations in heavy load
situations - both the pageout and update daemons will be able
to operate more efficiently.
* removal of b_usecount. We may add it back in later but for now
it is useless. Prior implementations of the buffer cache never
had enough buffers for it to be useful, and current implementations
which make more buffers available might not benefit relative to
the amount of sophistication required to implement a b_usecount.
Straight LRU should work just as well, especially when most things
are VMIO backed. I expect that (even though John will not like
this assumption) directories will become VMIO backed some point soon.
Submitted by: Matthew Dillon <dillon@backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no privisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/