one go before returning. This avoids calling uiomove() while holding
allproc_lock.
Don't adjust uio->uio_offset manually, uiomove() does that for us.
Don't drop allproc_lock before calling panic().
Suggested by: alfred
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.
Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:
MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);
The code which takes vnodes off a mountpoint looks like this:
MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;
(Take a moment and try to spot the locking error before you read on.)
On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.
Fix:
Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.
Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.
Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
FAT32 filesystems to be mounted, subject to some fairly serious limitations.
This works by extending the internal pseudo-inode-numbers generated from
the file's starting cluster number to 64-bits, then creating a table
mapping these into arbitrary 32-bit inode numbers, which can fit in
struct dirent's d_fileno and struct vattr's va_fileid fields. The mappings
do not persist across unmounts or reboots, so it's not possible to export
these filesystems through NFS. The mapping table may grow to be rather
large, and may grow large enough to exhaust kernel memory on filesystems
with millions of files.
Don't enable this option unless you understand the consequences.
waiting for the socket to connect and use msleep() on the socket
mute rather than tsleep(). Acquire socket buffer mutexes around
read-modify-write of socket buffer flags.
depending on namespace pollution in <sys/vnode.h> for the definition
of mutex interfaces used in SOCKBUF_*LOCK().
Sorted includes.
Removed unused includes.
rwatson_netperf:
Introduce conditional locking of the socket buffer in fifofs kqueue
filters; KNOTE() will be called holding the socket buffer locks in
fifofs, but sometimes the kqueue() system call will poll using the
same entry point without holding the socket buffer lock.
Introduce conditional locking of the socket buffer in the socket
kqueue filters; KNOTE() will be called holding the socket buffer
locks in the socket code, but sometimes the kqueue() system call
will poll using the same entry points without holding the socket
buffer lock.
Simplify the logic in sodisconnect() since we no longer need spls.
NOTE: To remove conditional locking in the kqueue filters, it would
make sense to use a separate kqueue API entry into the socket/fifo
code when calling from the kqueue() system call.
- Lock down low hanging fruit use of sb_flags with socket buffer
lock.
- Lock down low hanging fruit use of so_state with socket lock.
- Lock down low hanging fruit use of so_options.
- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
socket buffer lock.
- Annotate situations in which we unlock the socket lock and then
grab the receive socket buffer lock, which are currently actually
the same lock. Depending on how we want to play our cards, we
may want to coallesce these lock uses to reduce overhead.
- Convert a if()->panic() into a KASSERT relating to so_state in
soaccept().
- Remove a number of splnet()/splx() references.
More complex merging of socket and socket buffer locking to
follow.
The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:
SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)
Rename respectively to:
SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)
This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
them to behave the same as if the SS_NBIO socket flag had been set
for this call. The SS_NBIO flag for ordinary sockets is set by
fcntl(fd, F_SETFL, O_NONBLOCK).
Pass the MSG_NBIO flag to the soreceive() and sosend() calls in
fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag
on the underlying socket for each I/O operation. The O_NONBLOCK
flag is a property of the descriptor, and unlike ordinary sockets,
fifos may be referenced by multiple descriptors.
to avoid lock order problems when manipulating the sockets associated
with the fifo.
Minor optimization of a couple of calls to fifo_cleanup() from
fifo_open().
allocation and deallocation. This flag's principal use is shortly after
allocation. For such cases, clearing the flag is pointless. The only
unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never
requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed
page.
Reviewed by: tegge@
1. This check if wrong, because it is true by default
(kern.ps_argsopen is 1 by default) (p_cansee() is not even checked).
2. Sysctl kern.ps_argsopen is going away.
and consume that interface in portalfs and fifofs instead. In the
new world order, unp_connect2() assumes that the unpcb mutex is
held, whereas uipc_connect2() validates that the passed sockets are
UNIX domain sockets, then grabs the mutex.
NB: the portalfs and fifofs code gets down and dirty with UNIX domain
sockets. Maybe this is a bad thing.
stuff was here (NFS) was fixed by Alfred in November. The only remaining
consumer of the stub functions was umapfs, which is horribly horribly
broken. It has missed out on about the last 5 years worth of maintenence
that was done on nullfs (from which umapfs is derived). It needs major
work to bring it up to date with the vnode locking protocol. umapfs really
needs to find a caretaker to bring it into the 21st century.
Functions GC'ed:
vop_noislocked, vop_nolock, vop_nounlock, vop_sharedlock.
255; USB keychains exist that use 256 as the number of heads. This
check has also been removed in Darwin (along with most of the other
head/sector sanity checks).
were a rather overwhelming task. I soon learned that if you don't know
where you're going to store something, at least try to pile it next to
something slightly related in the hope that a pattern emerges.
Apply the same principle to the ffs/snapshot/softupdates code which have
leaked into specfs: Add yet a buf-quasi-method and call it from the
only two places I can see it can make a difference and implement the
magic in ffs_softdep.c where it belongs.
It's not pretty, but at least it's one less layer violated.
functions in kern_socket.c.
Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".
Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.
Submitted by: sam
by 1 u_int if the number of clusters was 1 more than a multiple of
(8 * sizeof(u_int)). The bitmap is malloced and large (often huge), so
fatal overrun probably only occurred if the number of clusters was 1
more than 1 multiple of PAGE_SIZE/8.
This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:
Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.
Hold a dev_t reference while the device is open.
KASSERT that we do not enter the driver on a non-referenced dev_t.
Remove old D_NAG code, anonymous dev_t's are not a problem now.
When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.
Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.
Remove the unused second argument from udev2dev().
Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.
Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.
Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.
to size_t *, which is incorrect because they may have different widths.
This caused some subtle forms of corruption, the mostly frequently
reported one being that the last character of a filename was sometimes
duplicated on amd64.
it means that the correct value is unknown. Since this value is just
a hint to improve performance, initially assume that the first non-reserved
cluster is free, then correct this assumption if necessary before writing
the FSInfo block back to disk.
PR: 62826
MFC after: 2 weeks
- don't unlock the vnode after vinvalbuf() only to have to relock it
almost immediately.
- don't refer to devices classified by vn_isdisk() as block devices.
created with the same name, and vice versa:
- Immediately recycle vnodes of files & directories that have been deleted
or renamed.
- When looking an entry in the VFS name cache or smbfs's private
cache, make sure the vnode type is consistent with the type of file
the server thinks it is, and re-create the vnode if it isn't.
The alternative to this is to recycle vnodes unconditionally when their
use count drops to 0, but this would make all the caching we do
mostly useless.
PR: 62342
MFC after: 2 weeks
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
file has been removed, it should be purged from the cache, but it need
not be removed from the directory stack causing corruption; instead,
it will simply be removed once the last references and holds on it
are dropped at the end of the unlink/rmdir system calls, and the
normal !UN_CACHED VOP_INACTIVE() handler for unionfs finishes it off.
This is easily reproduced by repeated "echo >file; rm file" on a
unionfs mount. Strangely, "echo -n >file; rm file" didn't make
it happen.
(msdosfs uses normal 8-char indentation almost everywhere else),
too-long lines, and minor English usage errors. The verbose formal
comment before the new function is still abnormal.
(mainly unsorting). There were no changes related to the dirty flag
here. The reference NetBSD implementation put msdosfs_advlock() in a
different place. This commit only moves its declarations and changes
some of the function body to be like the NetBSD version.
disposing fifo resources in fifo_cleanup() instead using of
"vp->v_usecount == 1". There may be other references to the vnode, for
instance by nullfs, at the time fifo_open() or fifo_close() is called,
which could cause a resource leak.
Don't bother grabbing the vnode interlock in fifo_cleanup() since it no
longer accesses v_usecount.
vnode of the parent. However, this check should not be performed if
the lookup failed. This change should fix "union_lookup returning
. not same as startdir" panics people were seeing. The bug was
introduced by an incomplete import of a NetBSD delta in rev 1.38.
- Move the aforementioned check out from DIAGNOSTIC. Performance
is the least of our unionfs worries.
- Minor reorganization.
PR: 53004
MFC after: 1 week
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.
This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.
While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.
NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.
Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
a resource leak. Move the resource deallocation code from fifo_close()
to a new function, fifo_cleanup(), and call fifo_cleanup() from
fifo_close() and the appropriate places in fifo_open().
Tested by: Lukas Ertl
Pointy hat to: truckman
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().
- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.
- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.
Not objected in: -arch, -current
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.
Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.
Discussed with: jeff
wasn't curthread, i.e. when we receive a thread pointer to use
as a function argument. Use VOP_UNLOCK/vrele in these cases.
The only case there td != curthread known at the moment is
boot() calling sync with thread0 pointer.
This fixes the panic on shutdown people have reported.
pick up the DEVFS inode number from the dev_t and find our directory
entry from that, we don't need to scan the directory to find it.
This also solves an issue with on-demand devices in subdirectories.
Submitted by: cognet
passes the fdidx from VOP_OPEN down.
This is for all I know the final API for this functionality, but
the locking semantics for messing with the filedescriptor from
the device driver are not settled at this time.
stack trace supplied by phk, I now understand what's going on here. The
check for VI_XLOCK stops us from calling vinvalbuf once the vnode has been
partially torn down in vclean(). It is not clear that this would cause
a problem. Document this in nfs_bio.c, which is where the other two
filesystems copied this code from.
validating the offset within a given memory buffer before handing the
real work off to uiomove(9).
Use uiomove_frombuf in procfs to correct several issues with
integer arithmetic that could result in underflows/overflows. As a
side-effect, the code is significantly simplified.
Add additional sanity checks when computing a memory allocation size
in pfs_read.
Submitted by: rwatson (original uiomove_frombuf -- bugs are mine :-)
Reported by: Joost Pol <joost@pine.nl> (integer underflows/overflows)
file for vnode mappings. Note that this uses vn_fullpath() and may
be somewhat unreliable, although not too unreliable for shared
libraries. For non-vnode mappings, just print "-" for the field.
Obtained from: TrustedBSD Projects
Sponsored by: DARPA, AFRL, Network Associates Laboratories
struct msdosfsmount so that this file has the same prerequisites as
it used to. The new prerequistite was a meta-style bug. It required
many style bugs (unsorted includes ...) elsewhere.
Formatted prototypes in KNF. Resisted urge to sort all the prototypes,
to minimise differences with NetBSD. (NetBSD has reformatted the
prototypes but has not sorted them and still uses __P(()).)
are allowed by Windows (ref: MS KB article 120138).
XXX From my reading of the CIFS specification, it's not clear that
clients need to validate filenames at all.
PR: 57123
Submitted by: Paul Coucher
MFC after: 1 month
sufficient to guarantee that this race is not hit. The XLOCK will likely
have to be redesigned due to the way reference counting and mutexes work
in FreeBSD. We currently can not be guaranteed that xlock was not set
and cleared while we were blocked on the interlock while waiting to check
for XLOCK. This would lead us to reference a vnode which was not the
vnode we requested.
- Add a backtrace() call inside of INVARIANTS in the hopes of finding out if
this condition is ever hit. It should not, since we should be retaining
a reference to the vnode in these cases. The reference would be sufficient
to block recycling.
FIDs to be 128-bits wide and adds support for realms.
Add a new CODA_COMPAT_5 option, which requests support for the old
Coda 5.x interface instead of the new one.
Create a new coda5.ko module that supports the 5.x interface, and make
the existing coda.ko module use the new 6.x interface. These modules
cannot both be loaded at the same time.
Obtained from: Jan Harkes & the coda-6.0.2 distribution,
NetBSD (drochner) (CODA_COMPAT_5 option).
32K pages are selected. In spec_getpages() change the printf format
specifier and add an explicit cast so that we always print the field
as a long type.
also fixes pfs_access() since it relies on VOP_GETATTR() which will call
pfs_getattr(). This prevents jailed processes from discovering the
existence, start time and ownership of processes outside the jail.
PR: kern/48156
directories. Previously, pfs_iterate() would return -1 when it
reached the end of the process list while processing a process
directory node, even if the parent directory contained further nodes
(which is the case for the linprocfs root directory, where the process
directory node is actually first in the list). With this patch,
pfs_iterate() will continue to traverse the parent directory's node
list after exhausting the process list (as was the intention all
along). The code should hopefully be easier to read as well.
While I'm here, have pfs_iterate() assert that the allproc lock is
held.
masks for files and directories. This should make some
of the Midnight Commander users happy.
Remove an extra ')' in the manual page.
PR: 35699
Submitted by: Eugene Grosbein <eugen@grosbein.pp.ru> (original version)
Tested by: simon
contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.
in ntfs_writentvattr_plain and ntfs_readntvattr_plain, and purge the boot
block from the buffer cache if isn't exactly one cluster long. These two
changes work around the same buffer cache bug that ntfs_subr.c 1.30 tried
to, but in a different way. This may decrease throughput by reading smaller
amounts of data from the disk at a time, but may increase it by avoiding
bogus writes of clean buffers.
Problem (re)reported by Karel J. Bosschaart on -current.
the user requests a read-only mount. This is necessary because we
don't do the VOP_OPEN again if they upgrade a read-only mount to
read-write.
Fixes lockup when creating files on msdosfs mounts that have been
mounted read-only then upgraded to read-write. The exact cause of
the lockup is not known, but it is likely to be the kernel getting
stuck in an infinite loop trying to write dirty buffers to a device
without write permission.
Reported/tested by andreas, discussed with phk.
an MSDOSFS file system either failed, silently corrupted the file, or
sometimes corrupted the neighboring file.
PR: 53695
Submitted by: Ariff Abdullah <skywizard@MyBSD.org.my> (original version)
MFC: 3 days
Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.
By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.
At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.
- Avoid calling bread() with different sizes on the same blkno.
Although the buffer cache is designed to handle differing size
buffers, it erroneously tries to write the incorrectly-sized buffer
buffer back to disk before reading the correctly-sized one, even
when it's not dirty. This behaviour caused a panic for read-only
NTFS mounts when INVARIANTS was enabled ("bundirty: buffer x still
on queue y"), reported by NAKAJI Hiroyuki.
- Fix a bug in the code handling holes: a variable was incremented
instead of decremented, which could cause an infinite loop.
smbfs_close(). This fixes paging to and from mmap()'d regions of smbfs
files after the descriptor has been closed, and makes thttpd, GNU ld,
and perhaps more things work that depend on being able to do this.
PR: 48291
- Emulate lock draining (LK_DRAIN) in null_lock() to avoid deadlocks
when the vnode is being recycled.
- Don't allow null_nodeget() to return a nullfs vnode from the wrong
mount when multiple nullfs's are mounted. It's unclear why these checks
were removed in null_subr.c 1.35, but they are definitely necessary.
Without the checks, trying to unmount a nullfs mount will erroneously
return EBUSY, and forcibly unmounting with -f will cause a panic.
- Bump LOG2_SIZEVNODE up to 8, since vnodes are >256 bytes now. The old
value (7) didn't cause any problems, but made the hash algorithm
suboptimal.
These changes fix nullfs enough that a parallel buildworld succeeds.
Submitted by: tegge (partially; LK_DRAIN)
Tested by: kris
resource deallocation back to fifo_close(). This eliminates any
stale data that might be stuck in the socket buffers after all the
readers and writers have closed the fifo.
Tested by: Thorsten Schroeder <ths@katjusha.de>
directory vnodes use to refer to their constituent vnodes, into
union_dircache_free(). Also s/union_dircache/union_dircache_get/ and
tweak the structure of union_dircache_r().
MFC after: 3 days
in smb_fphelp(): the parent vnode may have already been recycled
since we don't hold a reference to it. Fixes a panic when rebooting
with mdconfig -t vnode devices referring to vnodes on a smbfs mount.
been tested extensively, but -CURRENT testing has been hampered by a
number of panics that also occur without the patch. Since the
destabilizing changes between 4.X and 5.X are external to unionfs,
I believe this patch applies equally well to both.
Thanks to scrappy for assistance testing these and other changes.
MFC after: 4 days
Restructure the error handling portion of the resource allocation
code to eliminate duplicated code.
Test for the O_NONBLOCK && fi_readers == 0 case before incrementing
fi_writers and modifying the the socket flag to avoid having to
undo these operations in this error case.
Restructure and simplify the code that handles blocking opens.
There should be no change to functionality.
Sleep on the vnode interlock while waiting for another
caller to increment fi_readers or fi_writers. Hold the
vnode interlock while incrementing fi_readers or fi_writers
to prevent a wakeup from being missed.
Only access fi_readers and fi_writers while holding the vnode
lock. Previously fifo_close() decremented their values without
holding a lock.
Move resource deallocation from fifo_close() to fifo_inactive(),
which allows the VOP_CLOSE() call in the error return path in
fifo_open() to be removed. Fifo_open() was calling VOP_CLOSE()
with the vnode lock held, in violation the current vnode locking
API. Also the way fifo_close() used vrefcnt() to decide whether
to deallocate resources was bogus according to comments in the
vrefcnt() implementation.
Reviewed by: bde
entering sys_process.c debugging primitives, or we violate assertions.
Also, be more careful about releasing the process lock around calls
to uiomove() which may sleep waiting for paging machinations or
related notions. We may want to defer the uiomove() in at least
one case, but jhb will look into that at a later date.
Reported by: Philippe Charnier <charnier@xp11.frmug.org>
Reviewed by: jhb
uptime. Where necessary, convert it back to Unix time by adding boottime
to it. This fixes a potential problem in the accounting code, which would
compute the elapsed time incorrectly if the Unix time was stepped during
the lifetime of the process.
race where a thread could assume that a process was swapped in by
PHOLD() when it actually wasn't fully swapped in yet.
- In faultin(), always msleep() if PS_SWAPPINGIN is set instead of doing
this check after bumping p_lock in the PS_INMEM == 0 case. Also,
sched_lock is only needed for setting and clearning swapping PS_*
flags and the swap thread inhibitor.
- Don't set and clear the thread swap inhibitor in the same loops as the
pmap_swapin/out_thread() since we have to do it under sched_lock.
Instead, mimic the treatment of the PS_INMEM flag and use separate loops
to set the inhibitors when clearing PS_INMEM and clear the inhibitors
when setting PS_INMEM.
- swapout() now returns with the proc lock held as it holds the lock
while adjusting the swapping-related PS_* flags so that the proc lock
can be used to test those flags.
- Only use the proc lock to check the swapping-related PS_* flags in
several places.
- faultin() no longer requires sched_lock to be held by callers.
- Rename PS_SWAPPING to PS_SWAPPINGOUT to be less ambiguous now that we
have PS_SWAPPINGIN.
printed out needs a prefix such as when a thread is blocked on a lock.
- Use another local variable to close another race for the td_wmesg and
td_wchan members of struct thread.
on my system where I preload msdosfs and have it in my kernel.
There's likely another bug that's causing msdosfs_init() to be called
multiple times, but this makes that harmless.
a follow on commit to kern_sig.c
- signotify() now operates on a thread since unmasked pending signals are
stored in the thread.
- PS_NEEDSIGCHK moves to TDF_NEEDSIGCHK.
flexible process_fork, process_exec, and process_exit eventhandlers. This
reduces code duplication and also means that I don't have to go duplicate
the eventhandler locking three more times for each of at_fork, at_exec, and
at_exit.
Reviewed by: phk, jake, almost complete silence on arch@
fifo_open() waiting for another reader or writer if one arrived and
departed while we were waiting (or a little earlier).
Rev.1.79 broke blocking opens of fifos by making them time out after 1
second. This was bad for at least apsfilter.
Tested by: "Simon 'corecode' Schubert" <corecode@corecode.ath.cx>,
Alexander Leidinger <Alexander@leidinger.net>,
phk
MFC after: 4 weeks
to avoid a "locking against myself" panic when udf_hashins() tries
to lock it again. Lock the vnode in udf_hashins() before adding it to
the hash bucket.
- Create a new function bdone() which sets B_DONE and calls wakup(bp). This
is suitable for use as b_iodone for buf consumers who are not going
through the buf cache.
- Create a new function bwait() which waits for the buf to be done at a set
priority and with a specific wmesg.
- Replace several cases where the above functionality was implemented
without locking with the new functions.
closely what function is really doing. Update all existing consumers
to use the new name.
Introduce a new vfs_stdsync function, which iterates over mount
point's vnodes and call FSYNC on each one of them in turn.
Make nwfs and smbfs use this new function instead of rolling their
own identical sync implementations.
Reviewed by: jeff
occurs when mounting the filesystem. The problem is that venus issues
the mount() syscall, which calls vfs_mount(), which calls coda_root()
which attempts to communicate with venus.
- Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT
flag to the initial BUF_LOCK(). This will eventually be used in cases
were we want to use a buffer only if it is not currently in use.
- Convert all consumers of the getblk() api to use this extra parameter.
Reviwed by: arch
Not objected to by: mckusick
Remove extraneous uses of vop_null, instead defering to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.
branches:
Initialize struct cdevsw using C99 sparse initializtion and remove
all initializations to default values.
This patch is automatically generated and has been tested by compiling
LINT with all the fields in struct cdevsw in reverse order on alpha,
sparc64 and i386.
Approved by: re(scottl)
One of the vnodes is on different mount and is possibly on a different
kind of filesystem; treating it as an smbfs vnode then writing to it
will probably corrupt it.
PR: 48381
MFC after: 1 month
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h
prevent the compiler from optimizing assignments into byte-copy
operations which might make access to the individual fields non-atomic.
Use the individual fields throughout, and don't bother locking them with
Giant: it is no longer needed.
Inspired by: tjr
kind of pseudofs-based filesystem. Fixes (at least) one problem where
when procfs is mounted mupltiple times, trying to unmount one will often
cause the wrong one to get unmounted, and other problem where mounting
one procfs on top of another caused the kernel to lock up.
Reviewed by: des
was used to control code which were conditional on DEVFS' precense
since this avoided the need for large-scale source pollution with
#include "opt_geom.h"
Now that we approach making DEVFS standard, replace these tests
with an #ifdef to facilitate mechanical removal once DEVFS becomes
non-optional.
No functional change by this commit.
NULL. union_whiteout() expects the componentname argument to be non-NULL.
Fixes a NULL dereference panic when an existing union mount becomes the
upper layer of a new union mount.
access its controlling terminal.
In essense, history dictates that any process is allowed to open
/dev/tty for RW, irrespective of credential, because by definition
it is it's own controlling terminal.
Before DEVFS we relied on a hacky half-device thing (kern/tty_tty.c)
which did the magic deep down at device level, which at best was
disgusting from an architectural point of view.
My first shot at this was to use the cloning mechanism to simply
give people the right tty when they ask for /dev/tty, that's why
you get this, slightly counter intuitive result:
syv# ls -l /dev/tty `tty`
crw--w---- 1 u1 tty 5, 0 Jan 13 22:14 /dev/tty
crw--w---- 1 u1 tty 5, 0 Jan 13 22:14 /dev/ttyp0
Trouble is, when user u1 su(1)'s to user u2, he cannot open
/dev/ttyp0 anymore because he doesn't have permission to do so.
The above fix allows him to do that.
The interesting side effect is that one was previously only able
to access the controlling tty by indirection:
date > /dev/tty
but not by name:
date > `tty`
This is now possible, and that feels a lot more like DTRT.
PR: 46635
MFC candidate: could be.
pointer types, and remove a huge number of casts from code using it.
Change struct xfile xf_data to xun_data (ABI is still compatible).
If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.
Previously all filesystems which relied on specfs to do devices
would have private overrides for vop_std*, so the vop_no* overrides
here had no effect. I overlooked the transitive nature of the vop
vectors when I removed the vop_std* in those filesystems.
Removing the override here restores device node locking to it's
previous modus operandi.
Spotted by: bde
to sort out disk-io from file-io in the vm/buffer/filesystem space.
The intent is to sort VOP_STRATEGY calls into those which operate
on "real" vnodes and those which operate on VCHR vnodes. For
the latter kind, the call will be changed to VOP_SPECSTRATEGY,
possibly conditionally for those places where dual-use happens.
Add a default VOP_SPECSTRATEGY method which will call the normal
VOP_STRATEGY. First time it is called it will print debugging
information. This will only happen if a normal vnode is passed
to VOP_SPECSTRATEGY by mistake.
Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY
does on a VCHR vnode today.
Add a new VOP_STRATEGY method in specfs to catch instances where
the conversion to VOP_SPECSTRATEGY has not yet happened. Handle
the request just like we always did, but first time called print
debugging information.
Apart up to two instances of console messages per boot, this amounts
to a glorified no-op commit.
If you get any of the messages on your console I would very much
like a copy of them mailed to phk@freebsd.org
kern/vfs_defaults.c it is wrong for the individual filesystems to use
the std* functions as that prevents override of the default.
Found by: src/tools/tools/vop_table
here. It manifests itself by sendmail hanging in "fifoow" during
boot on a diskless machine with sendmail disabled.
Giving the sleep a 1sec timout breaks the deadlock, but does not solve
the underlying problem.
XXX comment applied.
busy and we are making progress towards making them not busy. This is
needed because smbfs vnodes reference their parent directory but may
appear after their parent in the mount's vnode list; one pass over the
list is not sufficient in this case.
This stops attempts to unmount idle smbfs mounts failing with EBUSY.
not to the parent's smbnode, which may be freed during the lifetime
of the child if the mount is forcibly unmounted. umount -f should now
work properly (ie. not panic) on smbfs mounts.
unused. Replace it with a dm_mount back-pointer to the struct mount
that the devfs_mount is associated with. Export that pointer to MAC
Framework entry points, where all current policies don't use the
pointer. This permits the SEBSD port of SELinux's FLASK/TE to compile
out-of-the-box on 5.0-CURRENT with full file system labeling support.
Approved by: re (murray)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
checking for "path == NULL" (like ffs) rather than MNT_ROOT. Otherwise
when you try and do an update or mountd does an NFS export, the remount
fails because the code tries to mount a fresh rootfs and gets an EBUSY.
The same bug is in 4.x (which is where I found it).
Sanity check by: mux
has a valid b_iocmd. Valid is any one of BIO_{READ,WRITE,DELETE}.
I have seen at least one case where the bio_cmd field was zero once the
request made it into GEOM. Putting the KASSERT here allows us to spot
the culprit in the backtrace.
terminating zero (it was treated as length missmatch). The mtools create
such slots if the name len is the product of 13 (max number of unicode
chars fitting in directory slot).
MFC after: 1 week
"refreshing" the label on the vnode before use, just get the label
right from inception. For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system. With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance. This
also corrects sematics for shared vnode locks, which were not
previously present in the system. This chances the cache
coherrency properties WRT out-of-band access to label data, but in
an acceptable form. With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception. We'll introduce a work around for this shortly.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.
Sponsored by: DARPA & NAI Labs.
that works in the new threaded kernel. It was commented out of
the disksort routine earlier this year for the reasons given in
kern/subr_disklabel.c (which is where this code used to reside
before it moved to kern/subr_disk.c):
----------------------------
revision 1.65
date: 2002/04/22 06:53:20; author: phk; state: Exp; lines: +5 -0
Comment out Kirks io-request priority hack until we can do this in a
civilized way which doesn't cause grief.
The problem is that it is not generally safe to cast a "struct bio
*" to a "struct buf *". Things like ccd, vinum, ata-raid and GEOM
constructs bio's which are not entrails of a struct buf.
Also, curthread may or may not have anything to do with the I/O request
at hand.
The correct solution can either be to tag struct bio's with a
priority derived from the requesting threads nice and have disksort
act on this field, this wouldn't address the "silly-seek syndrome"
where two equal processes bang the diskheads from one edge to the
other of the disk repeatedly.
Alternatively, and probably better: a sleep should be introduced
either at the time the I/O is requested or at the time it is completed
where we can be sure to sleep in the right thread.
The sleep also needs to be in constant timeunits, 1/hz can be practicaly
any sub-second size, at high HZ the current code practically doesn't
do anything.
----------------------------
As suggested in this comment, it is no longer located in the disk sort
routine, but rather now resides in spec_strategy where the disk operations
are being queued by the thread that is associated with the process that
is really requesting the I/O. At that point, the disk queues are not
visible, so the I/O for positively niced processes is always slowed
down whether or not there is other activity on the disk.
On the issue of scaling HZ, I believe that the current scheme is
better than using a fixed quantum of time. As machines and I/O
subsystems get faster, the resolution on the clock also rises.
So, ten years from now we will be slowing things down for shorter
periods of time, but the proportional effect on the system will
be about the same as it is today. So, I view this as a feature
rather than a drawback. Hence this patch sticks with using HZ.
Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@critter.freebsd.dk>
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).
In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.
Sponsored by: DARPA & NAI Labs.
devfs VOP symlink creation by introducing a new entry point to determine
the label of the devfs_dirent prior to allocation of a vnode for the
symlink.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories