Commit Graph

720 Commits

Author SHA1 Message Date
mckusick
f01ec099d7 Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).
2001-03-07 07:09:55 +00:00
jhb
44b0453b59 Grab the process lock while calling psignal and before calling psignal. 2001-03-07 03:37:06 +00:00
jhb
0efe7f19be Protect SIGDELSET of p_siglist with the proc lock. 2001-03-07 03:34:55 +00:00
mckusick
1e6e2e153e Free lock before returning from process_worklist_item.
Obtained from:	Constantine Sapuntzakis <csapuntz@stanford.edu>
2001-03-01 21:43:46 +00:00
adrian
8650fa6cdb Reviewed by: jlemon
An initial tidyup of the mount() syscall and VFS mount code.

This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.

* the guts of the mount work has been moved into vfs_mount().

* move `type', `path' and `flags' from being userland variables into being
  kernel variables in vfs_mount(). `data' remains a pointer into
  userspace.

* Attempt to verify the `type' and `path' strings passed to vfs_mount()
  aren't too long.

* rework mount() and linux_mount() to take the userland parameters
  (besides data, as mentioned) and pass kernel variables to vfs_mount().
  (linux_mount() already did this, I've just tidied it up a little more.)

* remove the copyin*() stuff for `path'. `data' still requires copyin*()
  since its a pointer into userland.

* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
  filesystem.  This variable is generally initialised with `path', and
  each filesystem can override it if they want to.

* NOTE: f_mntonname is intiailised with "/" in the case of a root mount.
2001-03-01 21:00:17 +00:00
jlemon
498f8e2bd2 Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().
Use this to tell a filter attached to a vnode that the underlying vnode is
no longer valid, by returning EV_EOF.

PR: kern/25309, kern/25206
2001-02-23 20:06:01 +00:00
jlemon
c9d1bd1c2c Use correct list pointer when detaching knote from list. 2001-02-23 19:20:21 +00:00
mckusick
bc35986d9d Free lock before calling panic so that subsequent attempt to write out
buffers does not re-panic with `locking against myself'. This change
should not affect normal operations of soft updates in any way.
2001-02-23 09:01:31 +00:00
mckusick
3c351c2d02 When cleaning up excess inode dependencies, check for being done.
Reviewed by:	Jan Koum <jkb@yahoo-inc.com>
2001-02-22 10:17:57 +00:00
mckusick
b5b51d050c This patch corrects two problems with the rate limiting code
that was introduced in revision 1.80. The problem manifested
itself with a `locking against myself' panic and could also
result in soft updates inconsistences associated with inodedeps.
The two problems are:

1) One of the background operations could manipulate the bitmap
while holding it locked with intent to create. This held lock
results in a `locking against myself' panic, when the background
processing that we have been coopted to do tries to lock the bitmap
which we are already holding locked. To understand how to fix this
problem, first, observe that we can do the background cleanups in
inodedep_lookup only when allocating inodedeps (DEPALLOC is set in
the call to inodedep_lookup). Second observe that calls to
inodedep_lookup with DEPALLOC set can only happen from the following
calls into the softdep code:

        softdep_setup_inomapdep
        softdep_setup_allocdirect
        softdep_setup_remove
        softdep_setup_freeblocks
        softdep_setup_directory_change
        softdep_setup_directory_add
        softdep_change_linkcnt

Only the first two of these can come from ffs_alloc.c while holding
a bitmap locked. Thus, inodedep_lookup must not go off to do
request_cleanups when being called from these functions. This change
adds a flag, NODELAY, that can be passed to inodedep_lookup to let
it know that it should not do background processing in those cases.

2) The return value from request_cleanup when helping out with the
cleanup was 0 instead of 1. This meant that despite the fact that
we may have slept while doing the cleanups, the code did not recheck
for the appearance of an inodedep (e.g., goto top in inodedep_lookup).
This lead to the softdep inconsistency in which we ended up with
two inodedep's for the same inode.

Reviewed by:	Peter Wemm <peter@yahoo-inc.com>,
		Matt Dillon <dillon@earth.backplane.com>
2001-02-20 11:14:38 +00:00
asmodai
d8144cf92b Preceed/preceeding are not english words. Use precede and preceding. 2001-02-18 10:43:53 +00:00
jlemon
00c3549c08 Extend kqueue down to the device layer.
Backwards compatible approach suggested by: peter
2001-02-15 16:34:11 +00:00
jake
45e3d6a42c Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different
  scheduling classes using different portions of the array.  This
  allows user processes to have their priorities propogated up into
  interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
  32.  We used to have 4 separate arrays of 32 queues each, so this
  may not be optimal.  The new run queue code was written with this
  in mind; changing the number of run queues only requires changing
  constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter.  This
  is intended to be used to create per-cpu run queues.  Implement
  wrappers for compatibility with the old interface which pass in
  the global run queue structure.
- Group the priority level, user priority, native priority (before
  propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
  symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
  This was used to detect when a process' priority had lowered and
  it should yield.  We now effectively yield on every interrupt.
- Activate propogate_priority().  It should now have the desired
  effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
  idle loop.  It interfered with propogate_priority() because
  the idle process needed to do a non-blocking acquire of Giant
  and then other processes would try to propogate their priority
  onto it.  The idle process should not do anything except idle.
  vm_page_zero_idle() will return in the form of an idle priority
  kernel thread which is woken up at apprioriate times by the vm
  system.
- Update struct kinfo_proc to the new priority interface.  Deliberately
  change its size by adjusting the spare fields.  It remained the same
  size, but the layout has changed, so userland processes that use it
  would parse the data incorrectly.  The size constraint should really
  be changed to an arbitrary version number.  Also add a debug.sizeof
  sysctl node for struct kinfo_proc.
2001-02-12 00:20:08 +00:00
bmilekic
e67bcfcaf3 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
phk
780a12ba92 Another round of the <sys/queue.h> FOREACH transmogriffer.
Created with:   sed(1)
Reviewed by:    md5(1)
2001-02-04 16:08:18 +00:00
phk
8363061cc7 Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)
2001-02-04 13:13:25 +00:00
phk
5888a3757b Use <sys/queue.h> macro API. 2001-02-04 12:37:48 +00:00
phk
132ab803d0 Remove a DIAGNOSTIC check which belongs in <sys/queue.h> if anyplace at all. 2001-02-04 11:53:51 +00:00
iedowse
e105a1bb77 Extend the sanity checks in ufs_lookup to ensure that each directory
entry fits within its DIRBLKSIZ block. The surrounding code is
extremely fragile with respect to corruption of the directory entry
'd_reclen' field; if directory corruption occurs, it can blindly
scan forward beyond the end of the filesystem block. Usually this
results in a 'fault on nofault entry' panic.

Directory corruption is now much more likely to be detected, resulting
in a 'ufs_dirbad' panic. If the filesystem is read-only, it will
simply print a warning message, and skip the corrupted block.

Reviewed by:	mckusick
2001-02-04 01:52:11 +00:00
iedowse
d8f5df1f3b Use the correct flags field when checking for a read-only filesystem
in ufs_dirbad(). The mnt_stat.f_flags field is only updated by the
syscalls *statfs and getfsstat, so mnt_flag should be used instead.

This only affects whether or not a panic is generated on detection of
certain types of directory corruption.

Reviewed by:	mckusick
2001-02-03 21:25:32 +00:00
dillon
fa650a9da8 Fix a race between the syncer and umount. When you umount a softupdates
filesystem softdep_process_worklist() is called in a loop until it indicates
that no dependancies remain, but the determination of that fact depends on
there only being one softdep_process_worklist() instance running.  It was
possible for the syncer to also be running softdep_process_worklist()
and the pre-existing checks in the code to prevent this were not sufficient
to prevent the race.  This patch solves the problem.

Approved-by: mckusick
2001-01-30 06:31:59 +00:00
jasone
cd97c8f6f2 Convert all simplelocks to mutexes and remove the simplelock implementations. 2001-01-24 12:35:55 +00:00
iedowse
5c5ba0de81 The ffs superblock includes a 128-byte region for use by temporary
in-core pointers to summary information. An array in this region
(fs_csp) could overflow on filesystems with a very large number of
cylinder groups (~16000 on i386 with 8k blocks). When this happens,
other fields in the superblock get corrupted, and fsck refuses to
check the filesystem.

Solve this problem by replacing the fs_csp array in 'struct fs'
with a single pointer, and add padding to keep the length of the
128-byte region fixed. Update the kernel and userland utilities
to use just this single pointer.

With this change, the kernel no longer makes use of the superblock
fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c
to indicate that these fields must be calculated for compatibility
with older kernels.

Reviewed by:	mckusick
2001-01-15 18:30:40 +00:00
mckusick
a33f21633a Properly compute the size of the final block of superblock summary information.
Submitted by:	Ian Dowse <iedowse@maths.tcd.ie>
2001-01-12 21:56:55 +00:00
rwatson
dc70342cef o Commit reems of style(9) changes, whitespace improvements, and comment
cleanups.

Obtained from:	TrustedBSD Project
2001-01-07 23:45:56 +00:00
rwatson
6a380ba8c4 o Zero the ufs_extattr_header length field (not necessary, but not a bad
idea either) in ufs_extattr_rm.
o More completely fill out the local_aio structure when writing out the
  zero'd extended attribute in ufs_extattr_rm -- previoulsy, this worked
  fine, but probably should not have.  This corrects extraneous warnings
  about inconsistent inodes following file deletion.

Reviewed by:	jedgar
2001-01-07 23:31:51 +00:00
rwatson
679972ebf8 o Add an additional EA inconsistency reporting opportunity in
ufs_extattr_rm.
o Make both reporting locations report the function name where the
  inconsistency is discovered, as well as the inode number in question.

Reviewed by:	jedgar
2001-01-07 23:27:58 +00:00
rwatson
7c3cda3702 o Make call to ufs_extattr_rm() in ufs_extattr_vnode_inactive() use
NULL as the credential, not 0, so as to make it more clear what's
  going on.

Obtained from:	TrustedBSD Project
2001-01-07 21:38:26 +00:00
rwatson
e4f163935e o Remove unnecessary sanity check involving requested offset of extended
attribute read--the offset is required to be 0 by an earlier check,
  meaning that it will always be within the scope of the attribute data.
  This change should have no impact on executed code paths other than
  removing the unnecessary check: please report if any new failures
  start to occur as a result.

Obtained from:	TrustedBSD Project
2001-01-07 21:07:22 +00:00
dillon
00135ff519 This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution.  However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box).  The new
algorithm limits the number of pages laundered in the first pageout daemon
pass.  If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace.  This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads.  It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis.  At the moment it is globalized.
2000-12-26 19:41:38 +00:00
mckusick
89297e2abf Several small but important fixes for snapshots:
1) Be more tolerant of missing snapshot files by only trying to decrement
   their reference count if they are registered as active.

2) Fix for snapshots of filesystems with block sizes larger than 8K
   (from Ollivier Robert <roberto@eurocontrol.fr>).

3) Fix to avoid losing last block in snapshot file when calculating blocks
   that need to be copied (from Don Coleman <coleman@coleman.org>).
2000-12-19 04:41:09 +00:00
mckusick
d48434fdef Get rid of spurious check in ffs_truncate for i_size == length
which fails to set the modification time on the file. The same
check a few lines later takes the correct action.

Submitted by:	Ian Dowse <iedowse@maths.tcd.ie>
2000-12-19 04:20:13 +00:00
assar
b14e706f98 add a stub for softdep_slowdown so that it's possible to build the
kernel without SOFTUPDATES
2000-12-17 23:59:56 +00:00
dillon
82ce79abb0 Avoid a data-consistency race between write() and mmap()
by ensuring that newly allocated blocks are zerod.  The
race can occur even in the case where the write covers
the entire block.

Reported by: Sven Berkvens <sven@berkvens.net>, Marc Olzheim <zlo@zlo.nu>
2000-12-17 23:57:05 +00:00
tanimura
42f7f4567d - Move ifs_init() so that it can initialize ifs_inode_hash_mtx.
- s/ffs_inode_hash_lock/ifs_inode_hash_lock/
2000-12-14 09:15:27 +00:00
tanimura
35edfb12ca Do not race for the lock of an inode hash.
Reviewed by:	jhb
2000-12-13 10:04:01 +00:00
mckusick
79714fbd04 Preventing runaway kernel soft updates memory, take three.
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.

Submitted by:	Paul Saab <ps@yahoo-inc.com>
2000-12-13 08:30:35 +00:00
dwmalone
418e7e45d7 Convert more malloc+bzero to malloc+M_ZERO.
Submitted by:	josh@zipperup.org
Submitted by:	Robert Drehmel <robd@gmx.net>
2000-12-08 21:51:06 +00:00
phk
3daded5af8 Staticize some malloc M_ instances. 2000-12-08 20:09:00 +00:00
dillon
4578b60a8c Add necessary bwillwrite() in writev() entry point.
Deal with excessive dirty buffers when msync() syncs non-contiguous
dirty buffers by checking for the case in UFS *before* checking for
clusterability.
2000-12-06 20:55:09 +00:00
mckusick
1ac4366788 More aggressively rate limit the growth of soft dependency structures
in the face of multiple processes doing massive numbers of filesystem
operations. While this patch will work in nearly all situations, there
are still some perverse workloads that can overwhelm the system.
Detecting and handling these perverse workloads will be the subject
of another patch.

Reviewed by:	Paul Saab <ps@yahoo-inc.com>
Obtained from:	Ethan Solomita <ethan@geocast.com>
2000-11-20 06:22:39 +00:00
dillon
54d2fd2d6a Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
    situations prior to now.

    The new code is based on the concept that I/O must be able to function in
    a low memory situation.  All major modules related to I/O (except
    networking) have been adjusted to allow allocation out of the system
    reserve memory pool.  These modules now detect a low memory situation but
    rather then block they instead continue to operate, then return resources
    to the memory pool instead of cache them or leave them wired.

    Code has been added to stall in a low-memory situation prior to a vnode
    being locked.

    Thus situations where a process blocks in a low-memory condition while
    holding a locked vnode have been reduced to near nothing.  Not only will
    I/O continue to operate, but many prior deadlock conditions simply no
    longer exist.

Implement a number of VFS/BIO fixes

	(found by Ian): in biodone(), bogus-page replacement code, the loop
        was not properly incrementing loop variables prior to a continue
        statement.  We do not believe this code can be hit anyway but we
        aren't taking any chances.  We'll turn the whole section into a
        panic (as it already is in brelse()) after the release is rolled.

	In biodone(), the foff calculation was incorrectly
        clamped to the iosize, causing the wrong foff to be calculated
        for pages in the case of an I/O error or biodone() called without
        initiating I/O.  The problem always caused a panic before.  Now it
        doesn't.  The problem is mainly an issue with NFS.

	Fixed casts for ~PAGE_MASK.  This code worked properly before only
        because the calculations use signed arithmatic.  Better to properly
        extend PAGE_MASK first before inverting it for the 64 bit masking
        op.

	In brelse(), the bogus_page fixup code was improperly throwing
        away the original contents of 'm' when it did the j-loop to
        fix the bogus pages.  The result was that it would potentially
        invalidate parts of the *WRONG* page(!), leading to corruption.

	There may still be cases where a background bitmap write is
        being duplicated, causing potential corruption.  We have identified
        a potentially serious bug related to this but the fix is still TBD.
        So instead this patch contains a KASSERT to detect the problem
  	and panic the machine rather then continue to corrupt the filesystem.
	The problem does not occur very often..  it is very hard to
	reproduce, and it may or may not be the cause of the corruption
	people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
mckusick
1b34fd0bd9 When deleting a file, the ordering of events imposed by soft updates
is to first write the deleted directory entry to disk, second write
the zero'ed inode to disk, and finally to release the freed blocks
and the inode back to the cylinder-group map. As this ordering
requires two disk writes to occur which are normally spaced about
30 seconds apart (except when memory is under duress), it takes
about a minute from the time that a file is deleted until its inode
and data blocks show up in the cylinder-group map for reallocation.
If a file has had only a brief lifetime (less than 30 seconds from
creation to deletion), neither its inode nor its directory entry
may have been written to disk. If its directory entry has not been
written to disk, then we need not wait for that directory block to
be written as the on-disk directory block does not reference the
inode. Similarly, if the allocated inode has never been written to
disk, we do not have to wait for it to be written back either as
its on-disk representation is still zero'ed out. Thus, in the case
of a short lived file, we can simply release the blocks and inode
to the cylinder-group map immediately. As the inode and its blocks
are released immediately, they are immediately available for other
uses. If they are not released for a minute, then other inodes and
blocks must be allocated for short lived files, cluttering up the
vnode and buffer caches. The previous code was a bit too aggressive
in trying to release the blocks and inode back to the cylinder-group
map resulting in their being made available when in fact the inode
on disk had not yet been zero'ed. This patch takes a more conservative
approach to doing the release which avoids doing the release prematurely.
2000-11-14 09:00:25 +00:00
bde
6274dae459 Fixed breakage of mknod() in rev.1.48 of ext2_vnops.c and rev.1.126 of
ufs_vnops.c:

1) i_ino was confused with i_number, so the inode number passed to
   VFS_VGET() was usually wrong (usually 0U).
2) ip was dereferenced after vgone() freed it, so the inode number
   passed to VFS_VGET() was sometimes not even wrong.

Bug (1) was usually fatal in ext2_mknod(), since ext2fs doesn't have
space for inode 0 on the disk; ino_to_fsba() subtracts 1 from the
inode number, so inode number 0U gives a way out of bounds array
index.  Bug(1) was usually harmless in ufs_mknod(); ino_to_fsba()
doesn't subtract 1, and VFS_VGET() reads suitable garbage (all 0's?)
from the disk for the invalid inode number 0U; ufs_mknod() returns
a wrong vnode, but most callers just vput() it; the correct vnode is
eventually obtained by an implicit VFS_VGET() just like it used to be.

Bug (2) usually doesn't happen.
2000-11-04 08:10:56 +00:00
eivind
cdf6d89a60 Give vop_mmap an untimely death. The opportunity to give it a timely
death timed out in 1996.
2000-11-01 17:57:24 +00:00
phk
ea99aa3c7e Add a missing <sys/systm.h> 2000-10-30 20:37:19 +00:00
phk
00ff95c136 Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.
2000-10-29 16:06:56 +00:00
phk
10f128d440 Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ing
the offending inline function (BUF_KERNPROC) on it being #included
already.

I'm not sure BUF_KERNPROC() is even the right thing to do or in the
right place or implemented the right way (inline vs normal function).

Remove consequently unneeded #includes of <sys/proc.h>
2000-10-29 14:54:55 +00:00
phk
05fcfd3993 Remove unneeded #include <sys/proc.h> lines. 2000-10-29 13:57:19 +00:00
rwatson
be06217d29 o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks.  In most cases, the VADMIN test
  checks to make sure the credential effective uid is the same as the file
  owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
  appropriate.
o Modify references to uid-based access control operations such that they
  now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
  ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
  vnode: I believe that new invocations of VOP_ACCESS() are always called
  with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
  and SUIDDIR code.

Reviewed by:	eivind
Obtained from:	TrustedBSD Project
2000-10-19 07:53:59 +00:00