Commit Graph

1036 Commits

Author SHA1 Message Date
Robert Watson
aad65d6f79 o Correct an ACL implementation bug that could result in a system panic
under heavy use when default ACLs were bgin inherited by new files
  or directories.  This is done by removing a bug in default ACL
  reading, and improving error handling for this failure case:

    - Move the setting of the buffer length (len) variable to above the
      ACL type (ap->a_type) switch rather than having it only for
      ACL_TYPE_ACCESS.  Otherwise, the len variable is unitialized in
      the ACL_TYPE_DEFAULT case, which generally worked right, but could
      result in failure.

    - Add a check for a short/long read of the ACL_TYPE_DEFAULT type from
      the underlying EA, resulting in EPERM rather than passing a
      potentially corrupted ACL back to the caller (resulting "cleaner"
      failures if the EA is damaged: right now, the caller will almost
      always panic in the presence of a corrupted EA).  This code is similar
      to code in the ACL_TYPE_ACCESS handling in the previous switch case.

    - While I'm fixing this code, remove a redundant bzero() of the ACL
      reader buffer; it need only be initialized above the acl_type
      switch.

Obtained from:	TrustedBSD Project
2001-04-02 01:02:32 +00:00
Robert Watson
a70f27470f Introduce support for POSIX.1e ACLs on UFS-based file systems. This
implementation is still experimental, and while fairly broadly tested,
is not yet intended for production use.  Support for POSIX.1e ACLs on
UFS will not be MFC'd to RELENG_4.

This implementation works by providing implementations of VOP_[GS]ETACL()
for FFS, as well as modifying the appropriate access control and file
creation routines.  In this implementation, ACLs are backed into extended
attributes; the base ACL (owner, group, other) permissions remain in the
inode for performance and compatibility reasons, so only the extended and
default ACLs are placed in extended attributes.  The logic for ACL
evaluation is provided by the fs-independent kern/kern_acl.c.

o Introduce UFS_ACL, a compile-time configuration option that enables
  support for ACLs on FFS (and potentially other UFS-based file systems).
o Introduce ufs_getacl(), ufs_setacl(), ufs_aclcheck(), which
  respectively get, set, and check the ACLs on the passed vnode.
o Introduce ufs_sync_acl_from_inode(), ufs_sync_inode_from_acl() to
  maintain access control information between inode permissions and
  extended attribute data.
o Modify ufs_access() to load a file access ACL and invoke
  vaccess_acl_posix1e() if ACLs are available on the file system
o Modify ufs_mkdir() and ufs_makeinode() to associate ACLs with newly
  created directories and files, inheriting from the parent directory's
  default ACL.
o Enable these new vnode operations and conditionally compiled code
  paths if UFS_ACL is defined.

A few notes:

o This implementation is fairly widely tested, but still should be
  considered experimental.
o Currently, ACLs are not exported via NFS, instead, the summarizing
  file mode/etc from the inode is.  This results in conservative
  protection behavior, similar to the behavior of ACL-nonaware programs
  acting locally.
o It is possible that underlying binary data formats associated with
  this implementation may change.  Consumers of the implementation
  should expect to find their local configuration obsoleted in the
  next few months, resulting in possible loss of ACL data during an
  upgrade.
o The extended attributes interface and implementation is still
  undergoing modification to address portable interface concerns, as
  well as performance.
o Many applications do not yet correctly handle ACLs.  In general,
  due to the POSIX.1e ACL model, behavior of ACL-unaware applications
  will be conservative with respects to file protection; some caution
  is recommended.
o Instructions for configuring and maintaining ACLs on UFS will be
  committed in the near future; in the mean time it is possible to
  reference the README included in the last UFS ACL distribution
  placed in the TrustedBSD web site:

      http://www.TrustedBSD.org/downloads/

Substantial debugging, hardware, travel, or connectivity support for this
project was provided by: BSDi, Safeport Network Services, and NAI Labs.
Significant coding contributions were made by Chris Faulhaber.  Additional
support was provided by Brian Feldman, Thomas Moestl, and Ilmar Habibulin.

Reviewed by:	jedgar, keichii, mckusick, trustedbsd-discuss, freebsd-fs
Obtained from:	TrustedBSD Project
2001-03-26 17:53:19 +00:00
Poul-Henning Kamp
f83880518b Send the remains (such as I have located) of "block major numbers" to
the bit-bucket.
2001-03-26 12:41:29 +00:00
Jeroen Ruigrok van der Werven
5d0b660f2a Fix typo ); -> , 2001-03-24 15:25:04 +00:00
Kirk McKusick
fca26df055 Check that background fsck operation is being done on a ufs filesystem.
Obtained from:	Robert Watson <rwatson@FreeBSD.org>
2001-03-23 20:58:25 +00:00
Robert Watson
7b4dc13de4 o Remove an unnecessary debugging printf from ufs_extattr_lookup(),
which resulted in the output of warning messages at boot if
  UFS_EXTATTR_AUTOSTART was enabled but ".attribute" and possible
  sub-directories weren't in a mounted MFS or UFS file systems.

Pointed out by:	dcs
Obtained from:	TrustedBSD Project
2001-03-21 23:00:39 +00:00
Kirk McKusick
812b1d416c Add kernel support for running fsck on active filesystems. 2001-03-21 04:09:01 +00:00
Kirk McKusick
31c6ce0aed Clear the fs_clean flag only when the FS_UNCLEAN flag is not set
(as is done in unmount).

Remove a snapshot inode from the superblock list when its last
name goes away rather than when its last reference goes away.
That way it will be properly reclaimed by fsck after a crash
rather than reenabled when the filesystem is mounted.
2001-03-21 04:05:20 +00:00
Kirk McKusick
7e72e9918a Report the correct inode number when panicing with freeing free inode.
Report the correct block number when panicing with freeing free block.
2001-03-21 04:01:02 +00:00
Robert Watson
d4562167ac o Enable UFS-based extended attribute support on MFS. Note that this change
is under-tested, and that MFS appears to be in the process of being
  deprecated in favor of FFS over md.  Note also that UFS_EXTATTR_AUTOSTART
  doesn't make much sense on MFS unless the MFSROOT is compiled in, so
  manual configuration is generally required.

Obtained from:	TrustedBSD Project
2001-03-19 06:44:18 +00:00
Robert Watson
3063207147 o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by:	jkh
Obtained from:	TrustedBSD Project
2001-03-19 05:44:15 +00:00
Robert Watson
516081f288 o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to
options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively.  This change
  reflects the fact that our EA support is implemented entirely at the
  UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount
  events).  This also better reflects the fact that [shortly] MFS will also
  support EAs, as well as possibly IFS.

o Consumers of the EA support in FFS are reminded that as a result, they
  must change kernel config files to reflect the new option names.

Obtained from:	TrustedBSD Project
2001-03-19 04:35:40 +00:00
Robert Watson
3938ca1cc4 o Caused FFS_EXTATTR_AUTOSTART to scan two sub-directories of ".attribute"
off of the file system root: "user" for user attributes, and "system"
  for system attributes.  When the scan occurs, attribute backing files
  discovered in those directories will be started in the respective
  namespaces.  This re-introduces support for auto-starting of user
  attributes, which was removed when the "$" prefix for system attributes
  was replaced with explicit namespacing.

  For users of the TrustedBSD UFS POSIX.1e ACL code, you'll need to:
    mv ${FSROOT}/'$posix1e.acl_access' ${FSROOT}/system/posix1e.acl_access
    mv ${FSROOT}/'$posix1e.acl_default' ${FSROOT}/system/posix1e.acl_default

  For users of the TrustedBSD POSIX.1e Capability code, you'll need to:
    mv ${FSROOT}/'$posix1e.cap' ${FSROOT}/system/posix1e.cap

  For users of the TrustedBSD MAC code, you'll need to:
    mv ${FSROOT}/'$freebsd.mac' ${FSROOT}/system/freebsd.mac

  Updated versions of relevant patches will be released in the near
  future.

Obtained from:	TrustedBSD Project
2001-03-18 04:04:23 +00:00
Robert Watson
70f3685105 o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
  character namespace indicator.  This is in line with more recent
  thinking on EA interfaces on various mailing lists, including the
  posix1e, Linux acl-devel, and trustedbsd-discuss forums.  Two namespaces
  are defined by default, EXTATTR_NAMESPACE_SYSTEM and
  EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
  access control model: user EAs are accessible based on the normal
  MAC and DAC file/directory protections, and system attributes are
  limited to kernel-originated or appropriately privileged userland
  requests.

o These API changes occur at several levels: the namespace argument is
  introduced in the extattr_{get,set}_file() system call interfaces,
  at the vnode operation level in the vop_{get,set}extattr() interfaces,
  and in the UFS extended attribute implementation.  Changes are also
  introduced in the VFS extattrctl() interface (system call, VFS,
  and UFS implementation), where the arguments are modified to include
  a namespace field, as well as modified to advoid direct access to
  userspace variables from below the VFS layer (in the style of recent
  changes to mount by adrian@FreeBSD.org).  This required some cleanup
  and bug fixing regarding VFS locks and the VFS interface, as a vnode
  pointer may now be optionally submitted to the VFS_EXTATTRCTL()
  call.  Updated documentation for the VFS interface will be committed
  shortly.

o In the near future, the auto-starting feature will be updated to
  search two sub-directories to the ".attribute" directory in appropriate
  file systems: "user" and "system" to locate attributes intended for
  those namespaces, as the single filename is no longer sufficient
  to indicate what namespace the attribute is intended for.  Until this
  is committed, all attributes auto-started by UFS will be placed in
  the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
  been updated to no longer include the '$' in their filename.  As such,
  if you're using these features, you'll need to rename the attribute
  backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
  be committed shortly.  These include modifications to the extended
  attribute utilities, as well as to libutil for new namespace
  string conversion routines.  Once the matching userland changes are
  committed, a buildworld is recommended to update all the necessary
  include files and verify that the kernel and userland environments
  are in sync.  Note: If you do not use extended attributes (most people
  won't), upgrading is not imperative although since the system call
  API has changed, the new userland extended attribute code will no longer
  compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
  conditional on FFS_EXTATTR, which should recover a bit of space on
  kernels running without EA's, as well as update copyright dates.

Obtained from:	TrustedBSD Project
2001-03-15 02:54:29 +00:00
Robert Watson
7e9044a636 o In my merge, missed the one-line patch to ufs_vnops.c that removed
the static prototype for ufs_readdir().  Note that ufs_readdir() was
  actually already non-static, the prototype was incorrect.

Submitted by:	jedgar
2001-03-14 18:27:04 +00:00
Robert Watson
f5161237ad o Implement "options FFS_EXTATTR_AUTOSTART", which depends on
"options FFS_EXTATTR".  When extended attribute auto-starting
  is enabled, FFS will scan the .attribute directory off of the
  root of each file system, as it is mounted.  If .attribute
  exists, EA support will be started for the file system.  If
  there are files in the directory, FFS will attempt to start
  them as attribute backing files for attributes baring the same
  name.  All attributes are started before access to the file
  system is permitted, so this permits race-free enabling of
  attributes.  For attributes backing support for security
  features, such as ACLs, MAC, Capabilities, this is vital, as
  it prevents the file system attributes from getting out of
  sync as a result of file system operations between mount-time
  and the enabling of the extended attribute.  The userland
  extattrctl tool will still function exactly as previously.
  Files must be placed directly in .attribute, which must be
  directly off of the file system root: symbolic links are
  not permitted.  FFS_EXTATTR will continue to be able
  to function without FFS_EXTATTR_AUTOSTART for sites that do not
  want/require auto-starting.  If you're using the UFS_ACL code
  available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART
  is recommended.

o This support is implemented by adding an invocation of
  ufs_extattr_autostart() to ffs_mountfs().  In addition,
  several new supporting calls are introduced in
  ufs_extattr.c:

    ufs_extattr_autostart(): start EAs on the specified mount
    ufs_extattr_lookup(): given a directory and filename,
                          return the vnode for the file.
    ufs_extattr_enable_with_open(): invoke ufs_extattr_enable()
                          after doing the equililent of vn_open()
                          on the passed file.
    ufs_extattr_iterate_directory(): iterate over a directory,
                          invoking ufs_extattr_lookup() and
                          ufs_extattr_enable_with_open() on each
                          entry.

o This feature is not widely tested, and therefore may contain
  bugs, caution is advised.  Several changes are in the pipeline
  for this feature, including breaking out of EA namespaces into
  subdirectories of .attribute (this is waiting on the updated
  EA API), as well as a per-filesystem flag indicating whether
  or not EAs should be auto-started.  This is required because
  administrators may not want .attribute auto-started on all
  file systems, especially if non-administrators have write access
  to the root of a file system.

Obtained from:	TrustedBSD Project
2001-03-14 05:32:31 +00:00
Kirk McKusick
589c7af992 Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).
2001-03-07 07:09:55 +00:00
John Baldwin
19eb87d22a Grab the process lock while calling psignal and before calling psignal. 2001-03-07 03:37:06 +00:00
John Baldwin
412503ae46 Protect SIGDELSET of p_siglist with the proc lock. 2001-03-07 03:34:55 +00:00
Kirk McKusick
8775e64a5d Free lock before returning from process_worklist_item.
Obtained from:	Constantine Sapuntzakis <csapuntz@stanford.edu>
2001-03-01 21:43:46 +00:00
Adrian Chadd
f3a90da995 Reviewed by: jlemon
An initial tidyup of the mount() syscall and VFS mount code.

This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.

* the guts of the mount work has been moved into vfs_mount().

* move `type', `path' and `flags' from being userland variables into being
  kernel variables in vfs_mount(). `data' remains a pointer into
  userspace.

* Attempt to verify the `type' and `path' strings passed to vfs_mount()
  aren't too long.

* rework mount() and linux_mount() to take the userland parameters
  (besides data, as mentioned) and pass kernel variables to vfs_mount().
  (linux_mount() already did this, I've just tidied it up a little more.)

* remove the copyin*() stuff for `path'. `data' still requires copyin*()
  since its a pointer into userland.

* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
  filesystem.  This variable is generally initialised with `path', and
  each filesystem can override it if they want to.

* NOTE: f_mntonname is intiailised with "/" in the case of a root mount.
2001-03-01 21:00:17 +00:00
Jonathan Lemon
7df2842dee Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().
Use this to tell a filter attached to a vnode that the underlying vnode is
no longer valid, by returning EV_EOF.

PR: kern/25309, kern/25206
2001-02-23 20:06:01 +00:00
Jonathan Lemon
31e7122397 Use correct list pointer when detaching knote from list. 2001-02-23 19:20:21 +00:00
Kirk McKusick
a5a94e3936 Free lock before calling panic so that subsequent attempt to write out
buffers does not re-panic with `locking against myself'. This change
should not affect normal operations of soft updates in any way.
2001-02-23 09:01:31 +00:00
Kirk McKusick
cc686e21c0 When cleaning up excess inode dependencies, check for being done.
Reviewed by:	Jan Koum <jkb@yahoo-inc.com>
2001-02-22 10:17:57 +00:00
Kirk McKusick
2cf5d587a9 This patch corrects two problems with the rate limiting code
that was introduced in revision 1.80. The problem manifested
itself with a `locking against myself' panic and could also
result in soft updates inconsistences associated with inodedeps.
The two problems are:

1) One of the background operations could manipulate the bitmap
while holding it locked with intent to create. This held lock
results in a `locking against myself' panic, when the background
processing that we have been coopted to do tries to lock the bitmap
which we are already holding locked. To understand how to fix this
problem, first, observe that we can do the background cleanups in
inodedep_lookup only when allocating inodedeps (DEPALLOC is set in
the call to inodedep_lookup). Second observe that calls to
inodedep_lookup with DEPALLOC set can only happen from the following
calls into the softdep code:

        softdep_setup_inomapdep
        softdep_setup_allocdirect
        softdep_setup_remove
        softdep_setup_freeblocks
        softdep_setup_directory_change
        softdep_setup_directory_add
        softdep_change_linkcnt

Only the first two of these can come from ffs_alloc.c while holding
a bitmap locked. Thus, inodedep_lookup must not go off to do
request_cleanups when being called from these functions. This change
adds a flag, NODELAY, that can be passed to inodedep_lookup to let
it know that it should not do background processing in those cases.

2) The return value from request_cleanup when helping out with the
cleanup was 0 instead of 1. This meant that despite the fact that
we may have slept while doing the cleanups, the code did not recheck
for the appearance of an inodedep (e.g., goto top in inodedep_lookup).
This lead to the softdep inconsistency in which we ended up with
two inodedep's for the same inode.

Reviewed by:	Peter Wemm <peter@yahoo-inc.com>,
		Matt Dillon <dillon@earth.backplane.com>
2001-02-20 11:14:38 +00:00
Jeroen Ruigrok van der Werven
d7d97eb0aa Preceed/preceeding are not english words. Use precede and preceding. 2001-02-18 10:43:53 +00:00
Jonathan Lemon
608a3ce62a Extend kqueue down to the device layer.
Backwards compatible approach suggested by: peter
2001-02-15 16:34:11 +00:00
Jake Burkholder
d5a08a6065 Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different
  scheduling classes using different portions of the array.  This
  allows user processes to have their priorities propogated up into
  interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
  32.  We used to have 4 separate arrays of 32 queues each, so this
  may not be optimal.  The new run queue code was written with this
  in mind; changing the number of run queues only requires changing
  constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter.  This
  is intended to be used to create per-cpu run queues.  Implement
  wrappers for compatibility with the old interface which pass in
  the global run queue structure.
- Group the priority level, user priority, native priority (before
  propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
  symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
  This was used to detect when a process' priority had lowered and
  it should yield.  We now effectively yield on every interrupt.
- Activate propogate_priority().  It should now have the desired
  effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
  idle loop.  It interfered with propogate_priority() because
  the idle process needed to do a non-blocking acquire of Giant
  and then other processes would try to propogate their priority
  onto it.  The idle process should not do anything except idle.
  vm_page_zero_idle() will return in the form of an idle priority
  kernel thread which is woken up at apprioriate times by the vm
  system.
- Update struct kinfo_proc to the new priority interface.  Deliberately
  change its size by adjusting the spare fields.  It remained the same
  size, but the layout has changed, so userland processes that use it
  would parse the data incorrectly.  The size constraint should really
  be changed to an arbitrary version number.  Also add a debug.sizeof
  sysctl node for struct kinfo_proc.
2001-02-12 00:20:08 +00:00
Bosko Milekic
9ed346bab0 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
Poul-Henning Kamp
37d4006626 Another round of the <sys/queue.h> FOREACH transmogriffer.
Created with:   sed(1)
Reviewed by:    md5(1)
2001-02-04 16:08:18 +00:00
Poul-Henning Kamp
fc2ffbe604 Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)
2001-02-04 13:13:25 +00:00
Poul-Henning Kamp
ef9e85abba Use <sys/queue.h> macro API. 2001-02-04 12:37:48 +00:00
Poul-Henning Kamp
b99cfaf32c Remove a DIAGNOSTIC check which belongs in <sys/queue.h> if anyplace at all. 2001-02-04 11:53:51 +00:00
Ian Dowse
5d1731a783 Extend the sanity checks in ufs_lookup to ensure that each directory
entry fits within its DIRBLKSIZ block. The surrounding code is
extremely fragile with respect to corruption of the directory entry
'd_reclen' field; if directory corruption occurs, it can blindly
scan forward beyond the end of the filesystem block. Usually this
results in a 'fault on nofault entry' panic.

Directory corruption is now much more likely to be detected, resulting
in a 'ufs_dirbad' panic. If the filesystem is read-only, it will
simply print a warning message, and skip the corrupted block.

Reviewed by:	mckusick
2001-02-04 01:52:11 +00:00
Ian Dowse
f434e08437 Use the correct flags field when checking for a read-only filesystem
in ufs_dirbad(). The mnt_stat.f_flags field is only updated by the
syscalls *statfs and getfsstat, so mnt_flag should be used instead.

This only affects whether or not a panic is generated on detection of
certain types of directory corruption.

Reviewed by:	mckusick
2001-02-03 21:25:32 +00:00
Matthew Dillon
f8e071a1eb Fix a race between the syncer and umount. When you umount a softupdates
filesystem softdep_process_worklist() is called in a loop until it indicates
that no dependancies remain, but the determination of that fact depends on
there only being one softdep_process_worklist() instance running.  It was
possible for the syncer to also be running softdep_process_worklist()
and the pre-existing checks in the code to prevent this were not sufficient
to prevent the race.  This patch solves the problem.

Approved-by: mckusick
2001-01-30 06:31:59 +00:00
Jason Evans
1b367556b5 Convert all simplelocks to mutexes and remove the simplelock implementations. 2001-01-24 12:35:55 +00:00
Ian Dowse
f55ff3f3ef The ffs superblock includes a 128-byte region for use by temporary
in-core pointers to summary information. An array in this region
(fs_csp) could overflow on filesystems with a very large number of
cylinder groups (~16000 on i386 with 8k blocks). When this happens,
other fields in the superblock get corrupted, and fsck refuses to
check the filesystem.

Solve this problem by replacing the fs_csp array in 'struct fs'
with a single pointer, and add padding to keep the length of the
128-byte region fixed. Update the kernel and userland utilities
to use just this single pointer.

With this change, the kernel no longer makes use of the superblock
fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c
to indicate that these fields must be calculated for compatibility
with older kernels.

Reviewed by:	mckusick
2001-01-15 18:30:40 +00:00
Kirk McKusick
cb3ab5aaf7 Properly compute the size of the final block of superblock summary information.
Submitted by:	Ian Dowse <iedowse@maths.tcd.ie>
2001-01-12 21:56:55 +00:00
Robert Watson
7745909c22 o Commit reems of style(9) changes, whitespace improvements, and comment
cleanups.

Obtained from:	TrustedBSD Project
2001-01-07 23:45:56 +00:00
Robert Watson
4301368e49 o Zero the ufs_extattr_header length field (not necessary, but not a bad
idea either) in ufs_extattr_rm.
o More completely fill out the local_aio structure when writing out the
  zero'd extended attribute in ufs_extattr_rm -- previoulsy, this worked
  fine, but probably should not have.  This corrects extraneous warnings
  about inconsistent inodes following file deletion.

Reviewed by:	jedgar
2001-01-07 23:31:51 +00:00
Robert Watson
9d5703550d o Add an additional EA inconsistency reporting opportunity in
ufs_extattr_rm.
o Make both reporting locations report the function name where the
  inconsistency is discovered, as well as the inode number in question.

Reviewed by:	jedgar
2001-01-07 23:27:58 +00:00
Robert Watson
e33042af13 o Make call to ufs_extattr_rm() in ufs_extattr_vnode_inactive() use
NULL as the credential, not 0, so as to make it more clear what's
  going on.

Obtained from:	TrustedBSD Project
2001-01-07 21:38:26 +00:00
Robert Watson
32e278a63d o Remove unnecessary sanity check involving requested offset of extended
attribute read--the offset is required to be 0 by an earlier check,
  meaning that it will always be within the scope of the attribute data.
  This change should have no impact on executed code paths other than
  removing the unnecessary check: please report if any new failures
  start to occur as a result.

Obtained from:	TrustedBSD Project
2001-01-07 21:07:22 +00:00
Matthew Dillon
2b6b0df712 This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution.  However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box).  The new
algorithm limits the number of pages laundered in the first pageout daemon
pass.  If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace.  This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads.  It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis.  At the moment it is globalized.
2000-12-26 19:41:38 +00:00
Kirk McKusick
48d617487d Several small but important fixes for snapshots:
1) Be more tolerant of missing snapshot files by only trying to decrement
   their reference count if they are registered as active.

2) Fix for snapshots of filesystems with block sizes larger than 8K
   (from Ollivier Robert <roberto@eurocontrol.fr>).

3) Fix to avoid losing last block in snapshot file when calculating blocks
   that need to be copied (from Don Coleman <coleman@coleman.org>).
2000-12-19 04:41:09 +00:00
Kirk McKusick
6da443cb22 Get rid of spurious check in ffs_truncate for i_size == length
which fails to set the modification time on the file. The same
check a few lines later takes the correct action.

Submitted by:	Ian Dowse <iedowse@maths.tcd.ie>
2000-12-19 04:20:13 +00:00
Assar Westerlund
ca85ca6099 add a stub for softdep_slowdown so that it's possible to build the
kernel without SOFTUPDATES
2000-12-17 23:59:56 +00:00
Matthew Dillon
6ddaf0f45e Avoid a data-consistency race between write() and mmap()
by ensuring that newly allocated blocks are zerod.  The
race can occur even in the case where the write covers
the entire block.

Reported by: Sven Berkvens <sven@berkvens.net>, Marc Olzheim <zlo@zlo.nu>
2000-12-17 23:57:05 +00:00
Seigo Tanimura
0a439034dc - Move ifs_init() so that it can initialize ifs_inode_hash_mtx.
- s/ffs_inode_hash_lock/ifs_inode_hash_lock/
2000-12-14 09:15:27 +00:00
Seigo Tanimura
937c4dfa08 Do not race for the lock of an inode hash.
Reviewed by:	jhb
2000-12-13 10:04:01 +00:00
Kirk McKusick
1d733bbd10 Preventing runaway kernel soft updates memory, take three.
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.

Submitted by:	Paul Saab <ps@yahoo-inc.com>
2000-12-13 08:30:35 +00:00
David Malone
7cc0979fd6 Convert more malloc+bzero to malloc+M_ZERO.
Submitted by:	josh@zipperup.org
Submitted by:	Robert Drehmel <robd@gmx.net>
2000-12-08 21:51:06 +00:00
Poul-Henning Kamp
959b7375ed Staticize some malloc M_ instances. 2000-12-08 20:09:00 +00:00
Matthew Dillon
9440653d07 Add necessary bwillwrite() in writev() entry point.
Deal with excessive dirty buffers when msync() syncs non-contiguous
dirty buffers by checking for the case in UFS *before* checking for
clusterability.
2000-12-06 20:55:09 +00:00
Kirk McKusick
71868b020d More aggressively rate limit the growth of soft dependency structures
in the face of multiple processes doing massive numbers of filesystem
operations. While this patch will work in nearly all situations, there
are still some perverse workloads that can overwhelm the system.
Detecting and handling these perverse workloads will be the subject
of another patch.

Reviewed by:	Paul Saab <ps@yahoo-inc.com>
Obtained from:	Ethan Solomita <ethan@geocast.com>
2000-11-20 06:22:39 +00:00
Matthew Dillon
936524aa02 Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
    situations prior to now.

    The new code is based on the concept that I/O must be able to function in
    a low memory situation.  All major modules related to I/O (except
    networking) have been adjusted to allow allocation out of the system
    reserve memory pool.  These modules now detect a low memory situation but
    rather then block they instead continue to operate, then return resources
    to the memory pool instead of cache them or leave them wired.

    Code has been added to stall in a low-memory situation prior to a vnode
    being locked.

    Thus situations where a process blocks in a low-memory condition while
    holding a locked vnode have been reduced to near nothing.  Not only will
    I/O continue to operate, but many prior deadlock conditions simply no
    longer exist.

Implement a number of VFS/BIO fixes

	(found by Ian): in biodone(), bogus-page replacement code, the loop
        was not properly incrementing loop variables prior to a continue
        statement.  We do not believe this code can be hit anyway but we
        aren't taking any chances.  We'll turn the whole section into a
        panic (as it already is in brelse()) after the release is rolled.

	In biodone(), the foff calculation was incorrectly
        clamped to the iosize, causing the wrong foff to be calculated
        for pages in the case of an I/O error or biodone() called without
        initiating I/O.  The problem always caused a panic before.  Now it
        doesn't.  The problem is mainly an issue with NFS.

	Fixed casts for ~PAGE_MASK.  This code worked properly before only
        because the calculations use signed arithmatic.  Better to properly
        extend PAGE_MASK first before inverting it for the 64 bit masking
        op.

	In brelse(), the bogus_page fixup code was improperly throwing
        away the original contents of 'm' when it did the j-loop to
        fix the bogus pages.  The result was that it would potentially
        invalidate parts of the *WRONG* page(!), leading to corruption.

	There may still be cases where a background bitmap write is
        being duplicated, causing potential corruption.  We have identified
        a potentially serious bug related to this but the fix is still TBD.
        So instead this patch contains a KASSERT to detect the problem
  	and panic the machine rather then continue to corrupt the filesystem.
	The problem does not occur very often..  it is very hard to
	reproduce, and it may or may not be the cause of the corruption
	people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
Kirk McKusick
bd4bd019fb When deleting a file, the ordering of events imposed by soft updates
is to first write the deleted directory entry to disk, second write
the zero'ed inode to disk, and finally to release the freed blocks
and the inode back to the cylinder-group map. As this ordering
requires two disk writes to occur which are normally spaced about
30 seconds apart (except when memory is under duress), it takes
about a minute from the time that a file is deleted until its inode
and data blocks show up in the cylinder-group map for reallocation.
If a file has had only a brief lifetime (less than 30 seconds from
creation to deletion), neither its inode nor its directory entry
may have been written to disk. If its directory entry has not been
written to disk, then we need not wait for that directory block to
be written as the on-disk directory block does not reference the
inode. Similarly, if the allocated inode has never been written to
disk, we do not have to wait for it to be written back either as
its on-disk representation is still zero'ed out. Thus, in the case
of a short lived file, we can simply release the blocks and inode
to the cylinder-group map immediately. As the inode and its blocks
are released immediately, they are immediately available for other
uses. If they are not released for a minute, then other inodes and
blocks must be allocated for short lived files, cluttering up the
vnode and buffer caches. The previous code was a bit too aggressive
in trying to release the blocks and inode back to the cylinder-group
map resulting in their being made available when in fact the inode
on disk had not yet been zero'ed. This patch takes a more conservative
approach to doing the release which avoids doing the release prematurely.
2000-11-14 09:00:25 +00:00
Bruce Evans
1c1752872f Fixed breakage of mknod() in rev.1.48 of ext2_vnops.c and rev.1.126 of
ufs_vnops.c:

1) i_ino was confused with i_number, so the inode number passed to
   VFS_VGET() was usually wrong (usually 0U).
2) ip was dereferenced after vgone() freed it, so the inode number
   passed to VFS_VGET() was sometimes not even wrong.

Bug (1) was usually fatal in ext2_mknod(), since ext2fs doesn't have
space for inode 0 on the disk; ino_to_fsba() subtracts 1 from the
inode number, so inode number 0U gives a way out of bounds array
index.  Bug(1) was usually harmless in ufs_mknod(); ino_to_fsba()
doesn't subtract 1, and VFS_VGET() reads suitable garbage (all 0's?)
from the disk for the invalid inode number 0U; ufs_mknod() returns
a wrong vnode, but most callers just vput() it; the correct vnode is
eventually obtained by an implicit VFS_VGET() just like it used to be.

Bug (2) usually doesn't happen.
2000-11-04 08:10:56 +00:00
Eivind Eklund
e3c4036b18 Give vop_mmap an untimely death. The opportunity to give it a timely
death timed out in 1996.
2000-11-01 17:57:24 +00:00
Poul-Henning Kamp
ef10cd6a64 Add a missing <sys/systm.h> 2000-10-30 20:37:19 +00:00
Poul-Henning Kamp
cf9fa8e725 Move suser() and suser_xxx() prototypes and a related #define from
<sys/proc.h> to <sys/systm.h>.

Correctly document the #includes needed in the manpage.

Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.
2000-10-29 16:06:56 +00:00
Poul-Henning Kamp
9f69a4578a Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ing
the offending inline function (BUF_KERNPROC) on it being #included
already.

I'm not sure BUF_KERNPROC() is even the right thing to do or in the
right place or implemented the right way (inline vs normal function).

Remove consequently unneeded #includes of <sys/proc.h>
2000-10-29 14:54:55 +00:00
Poul-Henning Kamp
53ce36d17a Remove unneeded #include <sys/proc.h> lines. 2000-10-29 13:57:19 +00:00
Robert Watson
47460a23a0 o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks.  In most cases, the VADMIN test
  checks to make sure the credential effective uid is the same as the file
  owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
  appropriate.
o Modify references to uid-based access control operations such that they
  now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
  ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
  vnode: I believe that new invocations of VOP_ACCESS() are always called
  with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
  and SUIDDIR code.

Reviewed by:	eivind
Obtained from:	TrustedBSD Project
2000-10-19 07:53:59 +00:00
Adrian Chadd
0b0c10b48d Initial commit of IFS - a inode-namespaced FFS. Here is a short
description:

How it works:
--

Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.)
I didn't see the need in duplicating all of sys/ufs/ffs to get this
off the ground.

File creation is done through a special file - 'newfile' . When newfile
is called, the system allocates and returns an inode. Note that newfile
is done in a cloning fashion:

fd = open("newfile", O_CREAT|O_RDWR, 0644);
fstat(fd, &st);

printf("new file is %d\n", (int)st.st_ino);

Once you have created a file, you can open() and unlink() it by its returned
inode number retrieved from the stat call, ie:

fd = open("5", O_RDWR);

The creation permissions depend entirely if you have write access to the
root directory of the filesystem.

To get the list of currently allocated inodes, VOP_READDIR has been added
which returns a directory listing of those currently allocated.

--

What this entails:

* patching conf/files and conf/options to include IFS as a new compile
  option (and since ifs depends upon FFS, include the FFS routines)

* An entry in i386/conf/NOTES indicating IFS exists and where to go for
  an explanation

* Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS
  routines require (ffs_mount() and ffs_reload())

* a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS
  routines. IFS replaces some of the vfsops, and a handful of vnops -
  most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR().
  Any other directory operation is marked as invalid.

What this results in:

* an IFS partition's create permissions are controlled by the perm/ownership of
  the root mount point, just like a normal directory

* Each inode has perm and ownership too

* IFS does *NOT* mean an FFS partition can be opened per inode. This is a
  completely seperate filesystem here

* Softupdates doesn't work with IFS, and really I don't think it needs it.
  Besides, fsck's are FAST. (Try it :-)

* Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC).
  Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against
  this particular inode, and unravelling THAT code isn't trivial. Therefore,
  useful inodes start at 3.

Enjoy, and feedback is definitely appreciated!
2000-10-14 03:02:30 +00:00
Robert Watson
d62bd6076e o Sanity check was inverted, resulting in a possible spurious panic
during unmount if extended attributes were in use.  Correct by removing
  an unneeded (and undesirable) '!'.
2000-10-09 20:04:39 +00:00
Eivind Eklund
7eb9fca557 Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)
2000-10-09 17:31:39 +00:00
Robert Watson
ff435dcb91 o Move initialization of ump from mp to the top of the function so that
it is defined whenm used in ufs_extattr_uepm_destroy(), fixing a panic
  due to a NULL pointer dereference.

Submitted by:	Wesley Morgan <morganw@chemicals.tacorp.com>
2000-10-06 15:31:28 +00:00
Robert Watson
9de54ba513 o Add call to ufs_extattr_uepm_destroy() in ffs_unmount() so as to clean
up lock on extattrs.
o Get for free a comment indicating where auto-starting of extended
  attributes will eventually occur, as it was in my commit tree also.
  No implementation change here, only a comment.
2000-10-04 04:44:51 +00:00
Robert Watson
d32d56a07d o Correct use of lockdestroy() by adding a new ufs_extattr_uepm_destroy()
call, which should be the last thing down to a per-mount extattr
  management structure, after ufs_extattr_stop() on the file system.
  This currently has the effect only of destroying the per-mount lock
  on extended attributes, and clearing appropriate flags.
o Remove inappropriate invocation in ufs_extattr_vnode_inactive().
2000-10-04 04:41:33 +00:00
Jason Evans
a18b1f1d4d Convert lockmgr locks from using simple locks to using mutexes.
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
2000-10-04 01:29:17 +00:00
Boris Popov
67e871664b Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by:	mckusick
2000-09-25 15:24:04 +00:00
Robert Watson
907da7c385 o Permit UFS Extended Attributes to be associated with special devices
and FIFOs.

Obtained from:	TrustedBSD Project
2000-09-21 19:06:02 +00:00
Robert Watson
bec1333db4 o Disallow privileged processes in jail() from directly accessing
system namespace extended attributes.
o Document privilege/jail() interaction relating to extended
  attributes.

Obtained from:	TrustedBSD Project
2000-09-18 18:10:13 +00:00
Robert Watson
cf48f6e42c o Allow privileged processes in jail() to override sticky bit behavior
on directories.
o Allow privileged processes in jail() to create inodes with the
  setgid bit set even if they are not a member of the group denoted
  by the file creation gid.  This occurs due to inherited gid's from
  parent directories on file creation, allowing a user to create a
  file with a gid that is not in the creating process's credentials.

Obtained from:	TrustedBSD Project
2000-09-18 18:03:49 +00:00
Robert Watson
f5770bb46a o Add a comment clarifying interaction between jail(), privileged processes,
and UFS file flags.  Here's what the comment says, for reference:

	Privileged processes in jail() are permitted to modify
	arbitrary user flags on files, but are not permitted
	to modify system flags.

  In other words, privilege does allow a process in jail to modify user
  flags for objects that the process does not own, but privilege will
  not permit the setting of system flags on the file.

Obtained from:	TrustedBSD Project
2000-09-18 17:58:15 +00:00
Robert Watson
ea57890740 o Add missing PRISON_ROOT allowing a privileged process in a jail() to not
remove the setuid/setgid bits by virtue of a change to a file with those
  bits set, even if the process doesn't own the file, or isn't a group
  member of the file's gid.

Obtained from:	TrustedBSD Project
2000-09-18 17:53:22 +00:00
Robert Watson
4da6e3d109 o Substitute suser() calls for direct credential checks, which is now
safe as suser() no longer sets ASU.
o Note that in some cases, the PRISON_ROOT flag is used even though no
  process structure is passed, to indicate that if a process structure
  (and hence jail) was available, it would be ok.  In the long run,
  the jail identifier should probably be moved to ucred, as the uidinfo
  information was.
o Some uid 0 checks remain relating to the quota code, which I'll leave
  for another day.

Reviewed by:	phk, eivind
Obtained from:	TrustedBSD Project
2000-09-18 16:13:02 +00:00
Dag-Erling Smørgrav
8461bdba85 Silence a warning. 2000-09-17 19:41:26 +00:00
Boris Popov
3ff1a2f43e Add new flag PDIRUNLOCK to the component.cn_flags which should be set by
filesystem lookup() routine if it unlocks parent directory. This flag should
be carefully tracked by filesystems if they want to work properly with nullfs
and other stacked filesystems.

VFS takes advantage of this flag to perform symantically correct usage
of vrele() instead of vput() if parent directory already unlocked.

If filesystem fails to track this flag then previous codepath in VFS left
unchanged.

Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystmes will
be changed after some period of testing.

Reviewed in general by:	mckusick, dillon, adrian
Obtained from:	NetBSD
2000-09-17 07:26:42 +00:00
Poul-Henning Kamp
ae2276e657 Remove a pointless casting of a gid_t to a gid_t. 2000-09-16 18:20:27 +00:00
Boris Popov
e37acb62a0 Add VOP_*VOBJECT vops, because MFS requires explicit vop specification.
Noted by:	knu
2000-09-12 16:21:16 +00:00
Robert Watson
5ab404120f o Variety of extended attribute fixes
- In ufs_extattr_enable(), return EEXIST instead of EOPNOTSUPP
	  if the caller tries to configure an attribute name that is
	  already configured
	- Throughout, add IO_NODELOCKED to VOP_{READ,WRITE} calls to
	  indicate lock status of passed vnode.  Apparently not a
	  problem, but worth fixing.
	- For all writes, make use of IO_SYNC consistent.  Really,
	  IO_UNIT and combining of VOP_WRITE's should happen, but I
	  don't have that tested.  At least with this, it's
	  consistent usage.  (pointed out by: bde)
	- In ufs_extattr_get(), fixed nested locking of backing
	  vnode (fine due to recursive lock support, but make it
	  more consistent with other code)
	- In ufs_extattr_get(), clean up return code to set uio_resid
	  more consistently with other pieces of code (worked fine,
	  this is just a cleanup)
	- Fix ufs_extattr_rm(), which was broken--effectively a nop.
	- Minor comment and whitespace fixes.

Obtained from:	TrustedBSD Project
2000-09-12 05:35:47 +00:00
John Baldwin
38a6ecf4de Fix a 64-bitism. Use size_t instead of int for 4th argument to copyinstr.
Approved by:	rwatson
2000-09-11 05:43:02 +00:00
Kirk McKusick
52a3bfa2e7 Cannot do MALLOC with M_WAITOK while holding ACQUIRE_LOCK
Obtained from:	Ethan Solomita <ethan@geocast.com>
2000-09-07 23:02:55 +00:00
Jason Evans
0384fff8c5 Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*().  See mutex(9).  (Note: The
  alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
  preempted (i386 only).

Partially contributed by:	BSDi (BSD/OS)
Submissions by (at least):	cp, dfr, dillon, grog, jake, jhb, sheldonh
2000-09-07 01:33:02 +00:00
Robert Watson
bbf0607700 Modify extended attribute protection model to authorize based on
attribute namespace and DAC protection on file:
	- Attribute names beginning with '$' are in the system namespace
	- The attribute name "$" is reserved
	- System namespace attributes may only be read/set by suser()
	  or by kernel (cred == NULL)
	- Other attribute names are in the application namespace
	- The attribute name "" is reserved
	- Application namespace attributes are protected in the manner
	  of the target file permission

o Kernel changes
	- Add ufs_extattr_valid_attrname() to check whether the requested
	  attribute "set" or "enable" is appropriate (i.e., non-reserved)
	- Modify ufs_extattr_credcheck() to accept target file vnode, not
	  to take inode uid
	- Modify ufs_extattr_credcheck() to check namespace, then enforce
	  either kernel/suser for system namespace, or vaccess() for
	  application namespace
o EA backing file format changes
	- Remove permission fields from extended attribute backing file
	  header
	- Bump extended attribute backing file header version to 3
o Update extattrctl.c and extattrctl.8
	- Remove now deprecated -r and -w arguments to initattr, as
	  permissions are now implicit
	- (unrelated) fix error reporting and unlinking during failed
	  initattr to remove duplicate/inaccurate error messages, and to
	  only unlink if the failure wasn't in the backing file open()

Obtained from:	TrustedBSD Project
2000-09-02 20:31:26 +00:00
Robert Watson
012c643d3e o Restructure vaccess() so as to check for DAC permission to modify the
object before falling back on privilege.  Make vaccess() accept an
  additional optional argument, privused, to determine whether
  privilege was required for vaccess() to return 0.  Add commented
  out capability checks for reference.  Rename some variables to make
  it more clear which modes/uids/etc are associated with the object,
  and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
  privused argument.  Once additional patches are applied, suser()
  will no longer set ASU, so privused will permit passing of
  privilege information up the stack to the caller.

Reviewed by:	bde, green, phk, -security, others
Obtained from:	TrustedBSD Project
2000-08-29 14:45:49 +00:00
Robert Watson
877dd71fc6 o Correct spelling of ufs_exttatr_find_attr -> ufs_extattr_find_attr
o Add "const" qualifier to attrname argument of various calls to remove
  warnings

Obtained from:	TrustedBSD Project
2000-08-26 22:00:58 +00:00
Poul-Henning Kamp
3f54a085a6 Remove all traces of Julians DEVFS (incl from kern/subr_diskslice.c)
Remove old DEVFS support fields from dev_t.

  Make uid, gid & mode members of dev_t and set them in make_dev().

  Use correct uid, gid & mode in make_dev in disk minilayer.

  Add support for registering alias names for a dev_t using the
  new function make_dev_alias().  These will show up as symlinks
  in DEVFS.

  Use makedev() rather than make_dev() for MFSs magic devices to prevent
  DEVFS from noticing this abuse.

  Add a field for DEVFS inode number in dev_t.

  Add new DEVFS in fs/devfs.

  Add devfs cloning to:
        disk minilayer (ie: ad(4), sd(4), cd(4) etc etc)
        md(4), tun(4), bpf(4), fd(4)

  If DEVFS add -d flag to /sbin/inits args to make it mount devfs.

  Add commented out DEVFS to GENERIC
2000-08-20 21:34:39 +00:00
Poul-Henning Kamp
e39c53eda5 Centralize the canonical vop_access user/group/other check in vaccess().
Discussed with: bde
2000-08-20 08:36:26 +00:00
Tor Egge
b5ee7ec63a Initialize *countp to 0 in stub for softdep_flushworklist().
This allows ffs_fsync() to break out of a loop that might otherwise
be infinite on kernels compiled without the SOFTUPDATES option.
The observed symptom was a system hang at the first unmount attempt.
2000-08-09 00:41:54 +00:00
Ollivier Robert
8694d8e912 Fix the lockmgr panic everyone is seeing at shutdown time.
vput assumes curproc is the lock holder, but it's not true in this case.

Thanks a lot Luoqi !

Submitted by:	luoqi
Tested by:	phk
2000-08-01 14:15:07 +00:00
Peter Wemm
68e530258a Minor tweak - removed unused variable 'struct mount *mp'; 2000-07-28 22:28:05 +00:00
Peter Wemm
6ee6b42ef7 Minor change: fix warning - move a 'struct vnode *vp' declaration inside a
#ifdef DIAGNOSTIC to match its corresponding usage.
2000-07-28 22:27:00 +00:00
Kirk McKusick
3592b7155c Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by:	Robert Watson <rwatson@FreeBSD.org>
2000-07-26 23:07:01 +00:00
Poul-Henning Kamp
a0580699e1 Fix the "mfs_badop[vop_getwritemount] = 45" messages. 2000-07-26 17:53:04 +00:00
Kirk McKusick
55ba28c60a Add stub for softdep_flushworklist() so that kernels compiled
without the SOFTUPDATES option will load correctly.

Obtained from:	John Baldwin <jhb@bsdi.com>
2000-07-25 05:28:59 +00:00
Kirk McKusick
d56bdab31c Eliminate periodic 'mfs_badop[vop_getwritemount] = 45' messages.
Submitted by:	Sheldon Hearn <sheldonh@uunet.co.za>
2000-07-25 05:11:57 +00:00
Kirk McKusick
9b97113391 This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
2000-07-24 05:28:33 +00:00
Robert Watson
bc373480dc o Marius pointed out an unusually inconvenient upper bound on extended
attribute data size.
o Fortunately it turned out to be an unused constant left over from an
  earlier implementation, and is therefore being removed so as not to
  confuse casual observers.

Submitted by:	mbendiks@eunet.no
2000-07-14 03:30:52 +00:00
Boris Popov
3fbd97427e Prevent possible dereference of NULL pointer.
Submitted by:	Marius Bendiksen <mbendiks@eunet.no>
2000-07-13 02:17:14 +00:00
Kirk McKusick
d303f71fdc Brain fault, forgot to update ffs_snapshot.c with the new calling convention
for vn_start_write.
2000-07-12 00:27:27 +00:00
Kirk McKusick
f2a2857bb3 Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
2000-07-11 22:07:57 +00:00
Kirk McKusick
d4c1816924 Clean up warning about undeclared function by declaring softdep_fsync
in mount.h instead of ffs_extern.h. The correct solution is to use
an indirect function pointer so that the kernel does not have to be
built with options FFS, but that will be left for another day.
2000-07-11 19:28:26 +00:00
Poul-Henning Kamp
88bab4e40c Finish repo-copy:
Move ufs/ufs/ufs_disksubr.c to kern/subr_disklabel.c.

These functions are not UFS specific and are in fact used all over the place.
2000-07-10 13:48:06 +00:00
Kirk McKusick
cc3962a9cd Delete README as it is now obsolete. Relevant information is in
README.softupdates.
2000-07-08 02:32:49 +00:00
Kirk McKusick
876578906d Update to reflect current status. 2000-07-08 02:31:21 +00:00
Kirk McKusick
22e5a6234e Get userland visible flags added for snapshots to give a few days
advance preparation for them to get migrated into place so that
subsequent changes in utilities will not fail to compile for lack
of up-to-date header files in /usr/include.
2000-07-04 04:58:34 +00:00
Kirk McKusick
e6796b67d9 Move the truncation code out of vn_open and into the open system call
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.

Obtained from:	 BSD/OS
2000-07-04 03:34:11 +00:00
Poul-Henning Kamp
3275cf7379 Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES,
that is way cleaner than using the softupdates_stub stunt, which
should be killed when convenient.

Discussed with:	mckusick
2000-07-03 13:26:54 +00:00
Poul-Henning Kamp
a8b1f9d2c9 Move prtactive to vfs from ufs. It is used all over the place. 2000-06-27 07:46:22 +00:00
Andrey A. Chernov
2d90744fd8 Remove obsoleted info about linking from contrib 2000-06-24 13:29:25 +00:00
Kirk McKusick
858c16fab8 Update to new copyright. 2000-06-22 00:29:53 +00:00
Kirk McKusick
6019e6208f When running with quotas enabled on a filesystem using soft updates,
the system would panic when a user's inode quota was exceeded (see
PR 18959 for details). This fixes that problem.

PR:		18959
Submitted by:	Jason Godsey <jason@unixguy.fidalgo.net>
2000-06-18 22:14:28 +00:00
Kirk McKusick
d3abb52714 Some additional performance improvements. When freeing an inode
check to see if it has been committed to disk. If it has never
been written, it can be freed immediately. For short lived files
this change allows the same inode to be reused repeatedly.
Similarly, when upgrading a fragment to a larger size, if it
has never been claimed by an inode on disk, it too can be freed
immediately making it available for reuse often in the next slowly
growing block of the same file.
2000-06-18 22:05:57 +00:00
Poul-Henning Kamp
7c50d77218 Revert part of my bioops change which implemented panic(8). 2000-06-16 14:32:13 +00:00
Poul-Henning Kamp
7523681895 ARGH! I have too many source trees :-(
Fix prototype errors in last commit.
2000-06-16 13:00:33 +00:00
Poul-Henning Kamp
a2e7a027a7 Virtualizes & untangles the bioops operations vector.
Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
2000-06-16 08:48:51 +00:00
Poul-Henning Kamp
6ea6805f8c Remove a comment which should never have made it in. 2000-06-14 21:48:19 +00:00
Robert Watson
192851dbff o Remove unneeded off_t variable to clean up compile warning
Obtained from:	TrustedBSD Project
2000-06-05 14:22:51 +00:00
Robert Watson
b2b0497ab5 o If FFS_EXTATTR is defined, don't print out an error message on unmount
if an FFS partition returns EOPNOTSUPP, as it just means extended
  attributes weren't enabled on that partition.  Prevents spurious
  warning per-partition at shutdown.
2000-06-04 04:50:36 +00:00
Jake Burkholder
e39756439c Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by:		msmith and others
2000-05-26 02:09:24 +00:00
Jake Burkholder
740a1973a6 Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by:	phk
Reviewed by:	phk
Approved by:	mdodd
2000-05-23 20:41:01 +00:00
Robert Watson
f3706a0361 s/ffs_unmonut/ffs_unmount/ in a gratuitous ufs_extattr printf.
Reported by:	knu
2000-05-07 17:21:08 +00:00
Poul-Henning Kamp
9626b608de Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by:    peter
2000-05-05 09:59:14 +00:00
Robert Watson
a7e8b37043 Don't allow VOP_GETEXTATTR to set uio->uio_offset != 0, as we don't
provide locking over extended attribute operations, requiring that
individual operations be atomic.  Allowing non-zero starting offsets
permits applications/etc to put themselves at risk for inconsistent
behavior.  As VOP_SETEXTATTR already prohibited non-zero write offsets,
this makes sense.

Suggested by:	Andreas Gruenbacher <a.gruenbacher@bestbits.at>
2000-05-03 05:50:46 +00:00
Poul-Henning Kamp
2c9b67a8df Remove unneeded #include <vm/vm_zone.h>
Generated by:	src/tools/tools/kerninclude
2000-04-30 18:52:11 +00:00
Poul-Henning Kamp
87150cb06d s/biowait/bufwait/g
Prodded by: several.
2000-04-29 16:25:22 +00:00
Poul-Henning Kamp
eb95c536ad Remove unneeded #include <sys/kernel.h> 2000-04-29 15:36:14 +00:00
Kirk McKusick
0fa301ad86 When files are given to users by root, the quota system failed to
reset their grace timer as their ownership crossed the soft limit
threshhold. Thus if they had been over their limit in the past,
they were suddenly penalized as if they had been over their limit
ever since. The fix is to check when root gives away files, that
when the receiving user crosses their soft limit, their grace timer
is reset. See the PR report for a detailed method of reproducing
the bug.

PR:		kern/17128
Submitted by:	Andre Albsmeier <andre.albsmeier@mchp.siemens.de>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
2000-04-28 06:12:56 +00:00
Poul-Henning Kamp
a1da82154d Convert the magic MFS device to a VCHR.
Detected by:	obrien
2000-04-22 05:45:38 +00:00
Robert Watson
747b0fa36c o Introduce an extended attribute backing file header magic number
o Introduce an extended attribute backing file header version number
2000-04-19 20:12:41 +00:00
Poul-Henning Kamp
3389ae9350 Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>
2000-04-19 14:58:28 +00:00
Robert Watson
d90eec469a o Cause attribute data writes to use IO_SYNC since this improves the
chances of consistency with other file/directory meta-data in a
  write.  In the current set of extended attribute applications,
  this does not hurt much.  This should be discussed again later when
  it comes time to optimize performance of attributes.

o Include an inode generation number in the per-attribute header
  information.  This allows consistency verification to catch when
  a crash occurs, or an inode is recycled while attributes are not
  properly configured.  For now, an irritating error message is
  displayed when an inconsistency occurs.  At some point, may introduce
  an ``extattrctl check ...'' which catches these before attributes
  are enabled.  Not today.  If you get this message, it means you
  somehow managed to get your attribute backing file out of synch
  with the file system.  When this occurs, attribute not found is
  returned (== undefined).  Writes will overwrite the value there
  correcting the problem.  Might want to think about introducing
  a new errno or two to handle this kind of situation.

Discussed with:	kris
2000-04-19 07:38:20 +00:00
Poul-Henning Kamp
11f8a0ca77 Retire bufqdisksort(), all drivers use bioqdisksort now. 2000-04-18 13:25:19 +00:00
Jonathan Lemon
b314931dd0 Remove unneeded cast. 2000-04-17 03:37:13 +00:00
Jonathan Lemon
7dffee4116 Replace the POLLEXTEND extensions with the kqueue() mechanism. 2000-04-16 18:55:20 +00:00
Robert Watson
33e1f8e541 Fix two bugs in extended attribute support for UFS/FFS:
o Put back in {} removed during over-zealous cleanup of gratuitous
  debugging output during preparation for the commit.  Due to the
  missing {}, writes on extended attributes always silently failed.
  Doh.

o Don't unlock the target vnode if it's the backing vnode, as we
  don't lock the target vnode if it's the backing vnode.
2000-04-16 01:35:30 +00:00
Poul-Henning Kamp
8177437d85 Complete the bio/buf divorce for all code below devfs::strategy
Exceptions:
        Vinum untouched.  This means that it cannot be compiled.
        Greg Lehey is on the case.

        CCD not converted yet, casts to struct buf (still safe)

        atapi-cd casts to struct buf to examine B_PHYS
2000-04-15 05:54:02 +00:00
Robert Watson
a64ed08955 Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes.  This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS.  Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default).  Userland utilities and man pages will be
committed in the next batch.  VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS.  This will be fixed.

Reviewed by:	adrian, bp, freebsd-fs, other unthanked souls
Obtained from:	TrustedBSD Project
2000-04-15 03:34:27 +00:00
Poul-Henning Kamp
282ac69ede Clone bio versions of certain bits of infrastructure:
devstat_end_transaction_bio()
        bioq_* versions of bufq_* incl bioqdisksort()
the corresponding "buf" versions will disappear when no longer used.

Move b_offset, b_data and b_bcount to struct bio.

Add BIO_FORMAT as a hack for fd.c etc.

We are now largely ready to start converting drivers to use struct
bio instead of struct buf.
2000-04-02 19:08:05 +00:00
Poul-Henning Kamp
c244d2de43 Move B_ERROR flag to b_ioflags and call it BIO_ERROR.
(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.
2000-04-02 15:24:56 +00:00
Matthew Dillon
e4649cfac3 Change the write-behind code to take more care when starting
async I/O's.  The sequential read heuristic has been extended to
    cover writes as well.  We continue to call cluster_write() normally,
    thus blocks in the file will still be reallocated for large (but still
    random) I/O's, but I/O will only be initiated for truely sequential
    writes.

    This solves a number of annoying situations, especially with DBM (hash
    method) writes, and also has the side effect of fixing a number of
    (stupid) benchmarks.

Reviewed-by: mckusick
2000-04-02 00:55:28 +00:00
Poul-Henning Kamp
ce6acbb664 diff, patch and cvs didn't like these three last time around, try again. 2000-03-20 12:34:21 +00:00
Poul-Henning Kamp
b99c307a21 Rename the existing BUF_STRATEGY() to DEV_STRATEGY()
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.
2000-03-20 11:29:10 +00:00
Poul-Henning Kamp
21144e3bf1 Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd.  The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue.  It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users:  Greg has not had time to test this yet, be careful.
2000-03-20 10:44:49 +00:00
Kirk McKusick
584508a741 Use 64-bit math to calculate if we have hit our freespace limit.
Necessary for coherent results on filesystems bigger than 0.5Tb.
2000-03-17 03:44:47 +00:00
Kirk McKusick
15e549f668 Bug fixes for currently harmless bugs that could rise to bite
the unwary if the code were called in slightly different ways.

1) In ufs_bmaparray() the code for calculating 'runb' will stop one block
short of the first entry in an indirect block. i.e. if an indirect block
contains N block numbers b[0]..b[N-1] then the code will never check if
b[0] and b[1] are sequential. For reference, compare with the equivalent
code that deals with direct blocks.

2) In ufs_lookup() there is an off-by-one error in the test that checks
if dp->i_diroff is outside the range of the the current directory size.
This is completely harmless, since the following while-loop condition
'dp->i_offset < endsearch' is never met, so the code immediately
does a second pass starting at dp->i_offset = 0.

3) Again in ufs_lookup(), the condition in a sanity check is wrong
for directories that are longer than one block. This bug means that
the sanity check is only effective for small directories.

Submitted by:	Ian Dowse <iedowse@maths.tcd.ie>
2000-03-15 07:18:15 +00:00
Kirk McKusick
9f043878d0 Use 64-bit math to decide if optimization needs to be changed.
Necessary for coherent results on filesystems bigger than 0.5Tb.

Submitted by:	Paul Saab <ps@yahoo-inc.com>
2000-03-15 07:08:36 +00:00
Matthew Dillon
b39d4d2060 In the 'found' case for ufs_lookup() the underlying bp's data was
being accessed after the bp had been releaed.  A simple move of the
    brelse() solves the problem.

Approved by: jkh
Submitted by:  Ian Dowse <iedowse@maths.tcd.ie>
2000-03-09 18:54:59 +00:00
Matthew Dillon
f8fa53397f Fix a 'freeing free block' panic in UFS. The problem occurs when the
filesystem fills up.  If the first indirect block exists and FFS is able
    to allocate deeper indirect blocks, but is not able to allocate the
    data block, FFS improperly unwinds the indirect blocks and leaves a
    block pointer hanging to a freed block.  This will cause a panic later
    when the file is removed.  The solution is to properly account for the
    first block-pointer-to-an-indirect-block we had to create in a balloc
    operation and then unwind it if a failure occurs.

Detective work by: Ian Dowse <iedowse@maths.tcd.ie>
Reviewed by: mckusick, Ian Dowse <iedowse@maths.tcd.ie>
Approved by: jkh
2000-02-24 20:43:20 +00:00
Robert Watson
0f71afb31e After much consulting with bde, concluded that this fix was the best fix
to the current jail/chflags interactions.  This fix conditionalizes ``root
behavior'' in the chflags() case on not being in jail, so attempts to
perform a chflags in a jail are limited to what a normal user could do.
For example, this does allow setting of user flags as appropriate, but
prohibits changing of system flags.

Reviewed by:	bde
2000-02-22 03:56:58 +00:00
Robert Watson
41ecf3565f Disable chflags() from within jail() so that root within jail can't make
a mess in securelevel environments.  Results in one warning during
/etc/rc as it attempts to remove file flags, but this is harmless.

Approved by:	High Lord Hubbard
2000-02-20 01:10:36 +00:00
Kirk McKusick
4434ff1d38 When writing out bitmap buffers, need to skip over ones that already
have a write in progress. Otherwise one can get in an infinite loop
trying to get them all flushed.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
2000-01-30 20:32:59 +00:00
Kirk McKusick
57a91f6fb0 During fastpath processing for removal of a short-lived inode, the
set of restrictions for cancelling an inode dependency (inodedep)
is somewhat stronger than originally coded. Since this check appears
in two places, we codify it into the function check_inode_unwritten
which we then call from the two sites, one freeing blocks and the
other freeing directory entries.

Submitted by:	Steinar Haug via Matthew Dillon
2000-01-18 01:33:05 +00:00
Kirk McKusick
4c6adb0622 Need to reorganize the flushing of directory entry (pagedep) dependencies
so that they never try to lock an inode corresponding to ".." as this
can lead to deadlock. We observe that any inode with an updated link count
is always pushed into its buffer at the time of the link count change, so
we do not need to do a VOP_UPDATE, but merely find its buffer and write it.
The only time we need to get the inode itself is from the result of a
mkdir whose name will never be ".." and hence locking such an inode will
never request a lock above us in the filesystem tree. Thanks to Brian
Fundakowski Feldman for providing the test program that tickled soft updates
into hanging in "inode" sleep.

Submitted by:	Brian Fundakowski Feldman <green@FreeBSD.org>
2000-01-18 01:30:03 +00:00
Kirk McKusick
105ef72c55 Better bounding on softdep_flushfiles; other minor tweeks to checks. 2000-01-17 06:35:11 +00:00
Kirk McKusick
107d5039ef Must track multiple uncommitted renames until one ultimately gets
committed to disk or is removed.
2000-01-17 06:28:18 +00:00
Matthew Dillon
173cce7c8e Non-operational change, fix compiler warning.
Reviewed by:  mckusick
2000-01-14 04:39:28 +00:00
Kirk McKusick
d7127837a2 Confirming Peter's fix (locking 101: release the lock before you go
to sleep). Locking 101, part 2: do not look at buffer contents after
you have been asleep. There is no telling what wonderous changes may
have occurred.
2000-01-13 20:03:22 +00:00
Peter Wemm
7f473504e6 Free the global softupdates lock prior to tsleep() in getdirtybuf().
This seems to be responsible for a bunch of panics where the process
sleeps and something else finds softupdates "locked" when it shouldn't
be.  This commit is unreviewed, but has been a big help here.
Previously my boxes would panic pretty much on the first fsync() that
wrote something to disk.
2000-01-13 18:48:12 +00:00
Kirk McKusick
1c2ceb2880 Because cylinder group blocks are now written in background,
it is no longer sufficient to get a lock on a buffer to know
that its write has been completed. We have to first get the
lock on the buffer, then check to see if it is doing a
background write. If it is doing background write, we have
to wait for the background write to finish, then check to see
if that fullfilled our dependency, and if not to start another
write. Luckily the explanation is longer than the fix.
2000-01-13 07:20:01 +00:00
Kirk McKusick
94313add1f A panic occurs during an fsync when a dirty block associated with
a vnode has not been written (which would clear certain of its
dependencies). The problems arises because fsync with MNT_NOWAIT
no longer pushes all the dirty blocks associated with a vnode. It
skips those that require rollbacks, since they will just get instantly
dirty again. Such skipped blocks are marked so that they will not be
skipped a second time (otherwise circular dependencies would never
clear). So, we fsync twice to ensure that everything will be written
at least once.
2000-01-13 07:17:39 +00:00
Kirk McKusick
4ed62fbd7f The only known cause of this panic is running out of disk space.
The problem occurs when an indirect block and a data block are
being allocated at the same time. For example when the 13th block
of the file is written, the filesystem needs to allocate the first
indirect block and a data block. If the indirect block allocation
succeeds, but the data block allocation fails, the error code
dellocates the indirect block as it has nothing at which to point.
Unfortunately, it does not deallocate the indirect block's associated
dependencies which then fail when they find the block unexpectedly
gone (ptr == 0 instead of its expected value). The fix is to fsync
the file before doing the block rollback, as the fsync will flush
out all of the dependencies. Once the rollback is done the file
must be fsync'ed again so that the soft updates code does not find
unexpected changes. This approach is much slower than writing the
code to back out the extraneous dependencies, but running out of
disk space is not expected to be a common occurence, so just getting
it right is the main criterion.

PR:		kern/15063
Submitted by:	Assar Westerlund <assar@stacken.kth.se>
2000-01-11 08:27:00 +00:00
Kirk McKusick
10767f840b We cannot proceed to free the blocks of the file until the dependencies
have been cleaned up by deallocte_dependencies(). Once that is done, it
is safe to post the request to free the blocks. A similar change is also
needed for the freefile case.
2000-01-11 06:52:35 +00:00
Poul-Henning Kamp
ba4ad1fcea Give vn_isdisk() a second argument where it can return a suitable errno.
Suggested by:	bde
2000-01-10 12:04:27 +00:00
Kirk McKusick
26e5527c86 Missing FREE_LOCK call before handle_workitem_freeblocks.
Submitted by:	"Kenneth D. Merry" <ken@kdm.org>
2000-01-10 08:39:03 +00:00
Kirk McKusick
cf60e8e4bf Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
   was so recently created that its inode has not yet been written to
   disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
   bitmap is being written to disk. To avoid these stalls, the bitmap is
   copied to another buffer which is written thus leaving the original
   available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
   i_nlink so that inodes that have had no change other than i_effnlink
   need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
   daemon can choose to skip over them.
2000-01-10 00:24:24 +00:00
Kirk McKusick
f0f7d38386 Keep tighter control of removal dependencies by limiting the number
of dirrem structure rather than the collaterally created freeblks
and freefile structures. Limit the rate of buffer dirtying by the
syncer process during periods of intense file removal.
2000-01-09 23:35:38 +00:00
Kirk McKusick
3f5b28bc07 Reorganize softdep_fsync so that it only does the inode-is-flushed
check before the inode is unlocked while grabbing its parent directory.
Once it is unlocked, other operations may slip in that could make
the inode-is-flushed check fail. Allowing other writes to the inode
before returning from fsync does not break the semantics of fsync
since we have flushed everything that was dirty at the time of the
fsync call.
2000-01-09 23:14:57 +00:00
Kirk McKusick
e2dc60835d Get rid of unreferenced function. 2000-01-09 22:42:42 +00:00
Kirk McKusick
83aaf63ab2 Make static non-exported functions from soft updates. 2000-01-09 22:40:09 +00:00
Peter Wemm
c447342094 Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot).  This is consistant with the other
BSD's who made this change quite some time ago.  More commits to come.
1999-12-29 05:07:58 +00:00
Bruce Evans
7e58bfacbe Update the unclean flag for mount -u. I forgot to handle this case
when I made the absence of the clean flag sticky in rev.1.88.  This
was a problem main for "mount /".  There is no way to mount "/" for
writing without using mount -u (normally implicitly), so after
"mount -f /" of an unclean filesystem, the absence of the clean flag
was sticky forever.
1999-12-23 15:42:14 +00:00
Eivind Eklund
369dc8ceb8 Change incorrect NULLs to 0s 1999-12-21 11:14:12 +00:00
Robert Watson
91f37dcba1 Second pass commit to introduce new ACL and Extended Attribute system
calls, vnops, vfsops, both in /kern, and to individual file systems that
require a vfsop_ array entry.

Reviewed by:	eivind
1999-12-19 06:08:07 +00:00
Kirk McKusick
6a4152243f The function request_cleanup() had a tsleep() with PCATCH. It is
quite dangerous, since the process may hold locks at the point,
and if it is stopped in that tsleep the machine may hang. Because
the sleep is so short, the PCATCH is not required here, so it has
been removed. For the future, the FreeBSD team needs to decide
whether it is still reasonable to stop a process in tsleep, as that
may affect any other code that uses PCATCH while holding kernel locks.

Submitted by:	Dmitrij Tejblum <tejblum@arc.hq.cti.ru>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-12-16 22:02:09 +00:00
Eivind Eklund
762e6b856c Introduce NDFREE (and remove VOP_ABORTOP) 1999-12-15 23:02:35 +00:00
Eivind Eklund
6bdfe06ad9 Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
  return value: LK_EXCLOTHER, when the lock is held exclusively by another
  process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
  locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with:	grog, mch, peter, phk
Reviewed by:	peter
1999-12-11 16:13:02 +00:00
Bill Fumerola
43cd4e8815 Remove the 'alpha, use at your own risk' death-statement.
Reviewed by:	mckusick (verbally at FreeBSDcon)
1999-12-03 00:40:31 +00:00
Bill Fumerola
cfa5001489 Fix typo, add $FreeBSD$ 1999-12-03 00:34:26 +00:00
Kirk McKusick
9f54c05286 Preferentially allocate the first indirect block in the same
cylinder group as the inode. This makes a 15% difference in
read speed for files in the 96K to 500K size range.
1999-12-01 19:33:12 +00:00
Poul-Henning Kamp
71e4fff823 Retire MFS_ROOT and MFS_ROOT_SIZE options from the MFS implementation.
Add MD_ROOT and MD_ROOT_SIZE options to the md driver.

Make the md driver handle MFS_ROOT and MFS_ROOT_SIZE options for compatibility.

Add md driver to GENERIC, PCCARD and LINT.

This is a cleanup which removes the need for some of the worse hacks in
MFS:  We really want to have a rootvnode but MFS on a preloaded image
doesn't really have one.  md is a true device, so it is less trouble.

This has been tested with make release, and if people remember to add
the "md" pseudo-device to their kernels, PicoBSD should be just fine
as well.  If people have no other use for MFS, it can be removed from
the kernel.
1999-11-26 20:08:44 +00:00
Poul-Henning Kamp
38224dcd59 Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by:    sos
1999-11-22 10:33:55 +00:00
Eivind Eklund
b2f2b704d0 We do not have ffs_checkexp, so remove the prototype 1999-11-20 16:44:44 +00:00
Poul-Henning Kamp
0429e37ade struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ.  Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly  mp != (void*)&mountlist  comparisons.

Requested by:   phk
Submitted by:   Jake Burkholder jake@checker.org
PR:             14967
1999-11-20 10:00:46 +00:00
Peter Wemm
63034ded71 Fix a warning (unused static declaration without MFS_ROOT) 1999-11-18 08:49:40 +00:00
Eivind Eklund
dd8c04f4c7 Remove WILLRELE from VOP_SYMLINK
Note: Previous commit to these files (except coda_vnops and devfs_vnops)
that claimed to remove WILLRELE from VOP_RENAME actually removed it from
VOP_MKNOD.
1999-11-13 20:58:17 +00:00
Eivind Eklund
edfe736df9 Remove WILLRELE from VOP_RENAME 1999-11-12 03:34:28 +00:00
Poul-Henning Kamp
698f9cf828 Next step in the device cleanup process.
Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.

Unify spec_open() for bdev and cdev cases.

Remove the disabled bdev specific read/write code.
1999-11-09 14:15:33 +00:00
Bruce Evans
5bd5c8b9e5 Quick fix for breakage of ext2fs link counts as reported by stat(2) by
the soft updates changes: only report the link count to be i_effnlink
in ufs_getattr() for file systems that maintain i_effnlink.

Tested by:	Mike Dracopoulos <mdraco@math.uoa.gr>
1999-11-03 12:05:39 +00:00
Mike Smith
88d4183b84 Make MFS work with the new root filesystem search process.
In order to achieve this, root filesystem mount is moved from
SI_ORDER_FIRST to SI_ORDER_SECOND in the SI_SUB_MOUNT_ROOT sysinit
group.  Now, modules which wish to usurp the default root mount
can use SI_ORDER_FIRST.

A compiled-in or preloaded MFS filesystem will become the root
filesystem unless the vfs.root.mountfrom environment variable refers
to a valid bootable device.  This will normally only be the case when
the kernel and MFS image have been loaded from a disk which has a
valid /etc/fstab file.  In this case, the variable should be manually
overridden in the loader, or the kernel booted with -a.  In either
case "mfs:" should be supplied as the new value.

Also fix a typo in one DFLTROOT case that would not have compiled.
1999-11-03 11:02:47 +00:00
Mike Smith
6d14782861 Newline-terminate the complaint message about not being able to find
the root vnode pointer.
1999-11-01 23:57:28 +00:00
Matthew Dillon
f9eb66d73a Add sysctl debug.dircheck to allow directory sanity checking to be turned
on with a sysctl.

    Fix two bugs in ufs_lookup that can cause deadlocks due to out-of-order
    locking.  This fix was tested for a few days prior to commit.
1999-10-30 00:51:14 +00:00
Poul-Henning Kamp
923502ff91 useracc() the prequel:
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>.  This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
1999-10-29 18:09:36 +00:00
Poul-Henning Kamp
b89392e703 Remove the D_NOCLUSTER[RW] options which were added because vn had
problems.  Now that Matt has fixed vn, this can go.  The vn driver
should have used d_maxio (now si_iosize_max) anyway.
1999-09-30 07:11:30 +00:00
Poul-Henning Kamp
1b5464ef9d Remove v_maxio from struct vnode.
Replace it with mnt_iosize_max in struct mount.

Nits from:	bde
1999-09-29 20:05:33 +00:00
Marcel Moolenaar
2c42a14602 sigset_t change (part 2 of 5)
-----------------------------

The core of the signalling code has been rewritten to operate
on the new sigset_t. No methodological changes have been made.
Most references to a sigset_t object are through macros (see
signalvar.h) to create a level of abstraction and to provide
a basis for further improvements.

The NSIG constant has not been changed to reflect the maximum
number of signals possible. The reason is that it breaks
programs (especially shells) which assume that all signals
have a non-null name in sys_signame. See src/bin/sh/trap.c
for an example. Instead _SIG_MAXSIG has been introduced to
hold the maximum signal possible with the new sigset_t.

struct sigprop has been moved from signalvar.h to kern_sig.c
because a) it is only used there, and b) access must be done
though function sigprop(). The latter because the table doesn't
holds properties for all signals, but only for the first NSIG
signals.

signal.h has been reorganized to make reading easier and to
add the new and/or modified structures. The "old" structures
are moved to signalvar.h to prevent namespace polution.

Especially the coda filesystem suffers from the change, because
it contained lines like (p->p_sigmask == SIGIO), which is easy
to do for integral types, but not for compound types.

NOTE: kdump (and port linux_kdump) must be recompiled.

Thanks to Garrett Wollman and Daniel Eischen for pressing the
importance of changing sigreturn as well.
1999-09-29 15:03:48 +00:00
Poul-Henning Kamp
d6a0e38a1b Remove five now unused fields from struct cdevsw. They should never
have been there in the first place.  A GENERIC kernel shrinks almost 1k.

Add a slightly different safetybelt under nostop for tty drivers.

Add some missing FreeBSD tags
1999-09-25 18:24:47 +00:00
Matthew Dillon
67ddfcaf69 More removals of vnode->v_lastr, replaced by preexisting seqcount
heuristic to detect sequential operation.

    VM-related forced clustering code removed from ufs in preparation for a
    commit to vm/vm_fault.c that does it more generally.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
1999-09-20 23:27:58 +00:00
Poul-Henning Kamp
faad302913 Fix a harmless bug I introduced, simplify a bit more while here. 1999-09-20 21:14:43 +00:00
Poul-Henning Kamp
fae03f66d1 Step one of replacing devsw->d_maxio with si_bsize_max.
Rename dev->si_bsize_max to si_iosize_max and set it in spec_open
if the device didn't.

Set vp->v_maxio from dev->si_bsize_max in spec_open rather than
in ufs_bmap.c
1999-09-20 19:57:28 +00:00
Bruce Evans
887ba12fc5 Removed diskerr()'s unused d_name arg and updated callers. This fixes
warnings caused by the arg having the wrong type (not const enough).
The arg was also wrong (a full name instead of a short one) for calls
from from subr_diskmbr.c and pc98/diskslice_machdep.c.
1999-09-13 12:59:41 +00:00
Alfred Perlstein
c24fda81c9 Seperate the export check in VFS_FHTOVP, exports are now checked via
VFS_CHECKEXP.

Add fh(open|stat|stafs) syscalls to allow userland to query filesystems
based on (network) filehandle.

Obtained from:	NetBSD
1999-09-11 00:46:08 +00:00
Julian Elischer
85a219d201 Changes to centralise the default blocksize behaviour.
More likely to follow.

Submitted by: phk@freebsd.org
1999-09-09 19:08:44 +00:00
Julian Elischer
7012bab988 Revert a bunch of contraversial changes by PHK. After
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.

The reversion requested was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.
1999-09-03 05:16:59 +00:00
Poul-Henning Kamp
02e1576966 Make bdev userland access work like cdev userland access unless
the highly non-recommended option ALLOW_BDEV_ACCESS is used.

(bdev access is evil because you don't get write errors reported.)

Kill si_bsize_best before it kills Matt :-)

Use the specfs routines rather having cloned copies in devfs.
1999-08-30 07:56:23 +00:00
Poul-Henning Kamp
9626728875 remove unused variables. 1999-08-28 19:21:03 +00:00
Poul-Henning Kamp
10af1a2b5f We don't need to pass the diskname argument all over the diskslice/label
code, we can find the name from any convenient dev_t
1999-08-28 14:33:44 +00:00
Peter Wemm
280652828b $Id$ -> $FreeBSD$ 1999-08-28 02:16:32 +00:00
Peter Wemm
c3aac50f28 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
Poul-Henning Kamp
dbafb3660f Simplify the handling of VCHR and VBLK vnodes using the new dev_t:
Make the alias list a SLIST.

        Drop the "fast recycling" optimization of vnodes (including
        the returning of a prexisting but stale vnode from checkalias).
        It doesn't buy us anything now that we don't hardlimit
        vnodes anymore.

        Rename checkalias2() and checkalias() to addalias() and
        addaliasu() - which takes dev_t and udev_t arg respectively.

        Make the revoke syscalls use vcount() instead of VALIASED.

        Remove VALIASED flag, we don't need it now and it is faster
        to traverse the much shorter lists than to maintain the
        flag.

        vfs_mountedon() can check the dev_t directly, all the vnodes
        point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED;
1999-08-26 14:53:31 +00:00
Poul-Henning Kamp
41d2e3e09e Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness. 1999-08-25 12:24:39 +00:00
Poul-Henning Kamp
cb5eef8f2b Initialize the si_bsize fields for the MFS bogodevices.
(This broke MFS rootfs and thereby installation)
1999-08-24 18:35:33 +00:00
Sheldon Hearn
740e3a15f7 Fix bug introduced in rev 1.28, which causes kernel build to break for
the case where DEBUG is defined but not DIAGNOSTIC. ffs_checkblk is
declared conditionally on DIAGNOSTIC, not DEBUG.

PR:	13314
Reviewed by:	bde
1999-08-24 08:39:41 +00:00
Bruce Evans
d918320517 Use devtoname() to print dev_t's instead of casting them to long or u_long
for misprinting in %lx format.
1999-08-23 20:35:21 +00:00
John Polstra
a2801b7731 Support full-precision file timestamps. Until now, only the seconds
have been maintained, and that is still the default.  A new sysctl
variable "vfs.timestamp_precision" can be used to enable higher
levels of precision:

      0 = seconds only; nanoseconds zeroed (default).
      1 = seconds and nanoseconds, accurate within 1/HZ.
      2 = seconds and nanoseconds, truncated to microseconds.
    >=3 = seconds and nanoseconds, maximum precision.

Level 1 uses getnanotime(), which is fast but can be wrong by up
to 1/HZ.  Level 2 uses microtime().  It might be desirable for
consistency with utimes() and friends, which take timeval structures
rather than timespecs.  Level 3 uses nanotime() for the higest
precision.

I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with
"cpio -pdu".  There was almost negligible difference in the system
times -- much less than 1%, and less than the variation among
multiple runs at the same level.  Bruce Evans dreamed up a torture
test involving 1-byte reads with intervening fstat() calls, but
the cpio test seems more realistic to me.

This feature is currently implemented only for the UFS (FFS and
MFS) filesystems.  But I think it should be easy to support it in
the others as well.

An earlier version of this was reviewed by Bruce.  He's not to
blame for any breakage I've introduced since then.

Reviewed by:	bde (an earlier version of the code)
1999-08-22 00:15:16 +00:00
Alan Cox
2c28a10540 Add the (inline) function vm_page_undirty for clearing the dirty bitmask
of a vm_page.

Use it.

Submitted by:	dillon
1999-08-17 04:02:34 +00:00
Poul-Henning Kamp
49ff4debd3 Spring cleaning around strategy and disklabels/slices:
Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
please see comment in sys/conf.h about the flag argument.

Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.

Remove bogus and unused strategy1 routines.

Remove open/close arguments from dssize().  Pick them up from dev_t.

Remove unused and unfinished setgeom support from diskslice/label/bad144 code.
1999-08-14 11:40:51 +00:00
Poul-Henning Kamp
3a965c0db0 Move the special-casing of stat(2)->st_blksize for device files
from UFS to the generic level.  For chr/blk devices we don't care
about the blocksize of the filesystem, we want what the device
asked for.
1999-08-13 10:56:07 +00:00
Poul-Henning Kamp
7dc5cd047f The bdevsw() and cdevsw() are now identical, so kill the former. 1999-08-13 10:29:38 +00:00
Poul-Henning Kamp
4d4f932326 s/v_specinfo/v_rdev/ 1999-08-13 10:10:12 +00:00
Poul-Henning Kamp
0ef1c82630 Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.
1999-08-08 18:43:05 +00:00
Alan Cox
7f866e4b29 Move the memory access behavior information provided by madvise
from the vm_object to the vm_map.

Submitted by:	dillon
1999-08-01 06:05:09 +00:00
Bruce Evans
3dfdfdb27f Fixed access timestamp bugs:
Set IN_ACCESS for successful reads of 0 bytes (except for requests to
read 0 bytes).  This was broken in rev.1.42.
PR:		misc/10148

Don't set IN_ACCESS for requests to read 0 bytes.

Don't set IN_ACCESS for unsuccessful reads.
1999-07-25 02:07:16 +00:00
Poul-Henning Kamp
698bfad7f2 Now a dev_t is a pointer to struct specinfo which is shared by all specdev
vnodes referencing this device.

Details:
        cdevsw->d_parms has been removed, the specinfo is available
        now (== dev_t) and the driver should modify it directly
        when applicable, and the only driver doing so, does so:
        vn.c.  I am not sure the logic in checking for "<" was right
        before, and it looks even less so now.

        An intial pool of 50 struct specinfo are depleted during
        early boot, after that malloc had better work.  It is
        likely that fewer than 50 would do.

        Hashing is done from udev_t to dev_t with a prime number
        remainder hash, experiments show no better hash available
        for decent cost (MD5 is only marginally better)  The prime
        number used should not be close to a power of two, we use
        83 for now.

        Add new checkalias2() to get around the loss of info from
        dev2udev() in bdevvp();

        The aliased vnodes are hung on a list straight of the dev_t,
        and speclisth[SPECSZ] is unused.  The sharing of struct
        specinfo means that the v_specnext moves into the vnode
        which grows by 4 bytes.

        Don't use a VBLK dev_t which doesn't make sense in MFS, now
        we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.

	Storage overhead from all of this is O(50k).

        Bump __FreeBSD_version to 400009

The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo
1999-07-20 09:47:55 +00:00
Poul-Henning Kamp
f008cfcc1a I have not one single time remembered the name of this function correctly
so obviously I gave it the wrong name.  s/umakedev/makeudev/g
1999-07-17 18:43:50 +00:00
Kirk McKusick
4dc0c8f521 Create the macro DOINGASYNC to check whether the MNT_ASYNC flag has
been set for a mount point. Insert missing checks to ensure that all
write operations are done asynchronously when the MNT_ASYNC option
has been requested.

Submitted by:	Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-07-13 18:20:13 +00:00
Poul-Henning Kamp
68de329e34 Use the fsid from the superblock, unless it looks bogus or has already
been taken by some other filesystem.
1999-07-11 19:16:50 +00:00
Kirk McKusick
ad8ac923fa These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations.  I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

    * buffer cache hash table now dynamically allocated.  This will
      have no effect on memory consumption for smaller systems and
      will help scale the buffer cache for larger systems.

    * minor enhancement to pmap_clearbit().  I noticed that
      all the calls to it used constant arguments.  Making
      it an inline allows the constants to propogate to
      deeper inlines and should produce better code.

    * removal of inherent vfs_ioopt support through the emplacement
      of appropriate #ifdef's, with John's permission.  If we do not
      find a use for it by the end of the year we will remove it entirely.

    * removal of getnewbufloops* counters & sysctl's - no longer
      necessary for debugging, getnewbuf() is now optimal.

    * buffer hash table functions removed from sys/buf.h and localized
      to vfs_bio.c

    * VFS_BIO_NEED_DIRTYFLUSH flag and support code added
      ( bwillwrite() ), allowing processes to block when too many dirty
      buffers are present in the system.

    * removal of a softdep test in bdwrite() that is no longer necessary
      now that bdwrite() no longer attempts to flush dirty buffers.

    * slight optimization added to bqrelse() - there is no reason
      to test for available buffer space on B_DELWRI buffers.

    * addition of reverse-scanning code to vfs_bio_awrite().
      vfs_bio_awrite() will attempt to locate clusterable areas
      in both the forward and reverse direction relative to the
      offset of the buffer passed to it.  This will probably not
      make much of a difference now, but I believe we will start
      to rely on it heavily in the future if we decide to shift
      some of the burden of the clustering closer to the actual
      I/O initiation.

    * Removal of the newbufcnt and lastnewbuf counters that Kirk
      added.  They do not fix any race conditions that haven't already
      been fixed by the gbincore() test done after the only call
      to getnewbuf().  getnewbuf() is a static, so there is no chance
      of it being misused by other modules.  ( Unless Kirk can think
      of a specific thing that this code fixes.  I went through it
      very carefully and didn't see anything ).

    * removal of VOP_ISLOCKED() check in flushbufqueues().  I do not
      think this check is necessary, the buffer should flush properly
      whether the vnode is locked or not. ( yes? ).

    * removal of extra arguments passed to getnewbuf() that are not
      necessary.

    * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
      vfs_cluster.c

    * vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
      which should greatly aid flushing operations in heavy load
      situations - both the pageout and update daemons will be able
      to operate more efficiently.

    * removal of b_usecount.  We may add it back in later but for now
      it is useless.  Prior implementations of the buffer cache never
      had enough buffers for it to be useful, and current implementations
      which make more buffers available might not benefit relative to
      the amount of sophistication required to implement a b_usecount.
      Straight LRU should work just as well, especially when most things
      are VMIO backed.  I expect that (even though John will not like
      this assumption) directories will become VMIO backed some point soon.

Submitted by:	Matthew Dillon <dillon@backplane.com>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-07-08 06:06:00 +00:00
Ollivier Robert
7fe29b0aef Add $Id$
Approved by:	kirk
1999-07-07 07:51:04 +00:00
John Polstra
24755bdc25 Update pathnames for new location of soft-updates sources. 1999-07-03 21:34:05 +00:00
Kirk McKusick
48703fedf1 No longer need to set B_ASYNC flag since BUF_KERNPROC now
unconditionally sets the identity of the buffer.
1999-06-29 15:57:40 +00:00
Peter Wemm
a6451da76b Keep the inlines for <sys/buf.h> happy.. 1999-06-27 13:26:23 +00:00
Kirk McKusick
67812eacd7 Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
1999-06-26 02:47:16 +00:00
Kirk McKusick
7481264c1e On our final pass through ffs_fsync, do all I/O synchronously so that
we can find out if our flush is failing because of write errors. This
change avoids a "flush failed" panic during unrecoverable disk errors.
1999-06-18 05:49:46 +00:00
Kirk McKusick
f9c8cab591 Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.
1999-06-16 23:27:55 +00:00
Kirk McKusick
e4ab40bcb6 Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.
1999-06-15 23:37:29 +00:00
Poul-Henning Kamp
2447bec829 Simplify cdevsw registration.
The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it.  cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print an message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables.  Most places they were used
bogusly.  Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
        72 bogus makedev() calls
        26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed.  Patches emailed to authors.  LINT
probably broken until they catch up.
1999-05-31 11:29:30 +00:00
John Birrell
ed3a2fb7b3 - Back out Luoqi's cdevsw stuff. It panics on my system and is not required.
- Fix an error message.
- Do the MFS_ROOT setting of mountrootfsname in mfs_init() instead of
  cpu_rootconf().
- Set rootdev in mfs_init instead of later in mfs_mount() iff MFS_ROOT.
1999-05-24 00:27:12 +00:00
Julian Elischer
2e897e94b6 Cosmetic changes to make it compile without errors in gcc -Wall 1999-05-22 04:43:04 +00:00
Luoqi Chen
0ce54cbb0c Legally acquire a major number for mfs. 1999-05-14 20:40:23 +00:00
Kirk McKusick
c2606ec5c6 Add a hook to ffs_fsync to allow soft updates to get first chance at doing
a sync on the block device for the filesystem. That allows it to push the
bitmap blocks before the inode blocks which greatly reduces the number of
inode rollbacks that need to be done.
1999-05-14 01:26:46 +00:00
Peter Wemm
51b5226683 Try and fix a dev_t/major/minor etc nit. 1999-05-12 22:32:07 +00:00
Poul-Henning Kamp
bfbb9ce670 Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
        major()         umajor()
        minor()         uminor()
        makedev()       umakedev()
        dev2udev()      udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits.  This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference.  If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
1999-05-11 19:55:07 +00:00
Bruce Evans
6b88526425 Fixed disordering in previous 2 commits. 1999-05-11 03:11:09 +00:00
Peter Wemm
7f2d5fc4f2 Move the mfs_getimage() prototype to mfs_extern.h duplicating it
everywhere.
1999-05-10 17:12:45 +00:00
Kirk McKusick
71a0942aca Put back changes that might be causing trouble on Alpha. 1999-05-09 19:39:54 +00:00
Poul-Henning Kamp
4be2eb8c49 I got tired of seeing all the cdevsw[major(foo)] all over the place.
Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.
1999-05-08 06:40:31 +00:00
Poul-Henning Kamp
46eede0058 Continue where Julian left off in July 1998:
Virtualize bdevsw[] from cdevsw.  bdevsw() is now an (inline)
        function.

        Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
        to the order of the cmaj/bmaj arguments!)

        Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
        (ditto!)

(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
1999-05-07 10:11:40 +00:00
Kirk McKusick
36cfb417de Whitespace cleanup. 1999-05-07 05:21:16 +00:00
Kirk McKusick
7957996abd Get rid of random debugging cruft; sync up with latest version. 1999-05-07 05:11:31 +00:00
Kirk McKusick
224a6aa241 Severe slowdowns have been reported when creating or removing many
files at once on a filesystem running soft updates. The root of
the problem is that soft updates limits the amount of memory that
may be allocated to dependency structures so as to avoid hogging
kernel memory. The original algorithm just waited for the disk I/O
to catch up and reduce the number of dependencies. This new code
takes a much more aggressive approach. Basically there are two
resources that routinely hit the limit. Inode dependencies during
periods with a high file creation rate and file and block removal
dependencies during periods with a high file removal rate. I have
attacked these problems from two fronts. When the inode dependency
limits are reached, I pick a random inode dependency, UFS_UPDATE
it together with all the other dirty inodes contained within its
disk block and then write that disk block. This trick usually
clears 5-50 inode dependencies in a single disk I/O. For block and
file removal dependencies, I pick a random directory page that has
at least one remove pending and VOP_FSYNC its directory. That
releases all its removal dependencies to the work queue. To further
hasten things along, I also immediately start the work queue process
rather than waiting for its next one second scheduled run.
1999-05-07 02:26:47 +00:00
Peter Wemm
dfd5dee1b0 Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.
1999-05-06 18:13:11 +00:00
Alan Cox
4221e284a3 The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS.  These hacks have caused no
end of trouble, especially when combined with mmap().  I've removed
them.  Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write.  NFS does, however,
optimize piecemeal appends to files.  For most common file operations,
you will not notice the difference.  The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations.  NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write.  There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault.  This
is not correct operation.  The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid.  A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid.  This operation is
necessary to properly support mmap().  The zeroing occurs most often
when dealing with file-EOF situations.  Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten.  B_CACHE operation is now
formally defined in comments and more straightforward in
implementation.  B_CACHE for VMIO buffers is based on the validity of
the backing store.  B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa).  biodone() is now responsible for setting B_CACHE
when a successful read completes.  B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated.  VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE.  This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount.  These have been fixed.  getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made.  A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain.  The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
Poul-Henning Kamp
75c1354190 This Implements the mumbled about "Jail" feature.
This is a seriously beefed up chroot kind of thing.  The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact:  "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

   I have no scripts for setting up a jail, don't ask me for them.

   The IP number should be an alias on one of the interfaces.

   mount a /proc in each jail, it will make ps more useable.

   /proc/<pid>/status tells the hostname of the prison for
   jailed processes.

   Quotas are only sensible if you have a mountpoint per prison.

   There are no privisions for stopping resource-hogging.

   Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by:   http://www.rndassociates.com/
Run for almost a year by:       http://www.servetheweb.com/
1999-04-28 11:38:52 +00:00
Mike Smith
f4711b2df4 Simplify the tunefs example, since tunefs uses getfsfile(). Lots of
people complain about working out what device their filesystems are
mounted on.
1999-04-27 21:11:19 +00:00
Poul-Henning Kamp
f711d546d2 Suser() simplification:
1:
  s/suser/suser_xxx/

2:
  Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
  s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.
1999-04-27 11:18:52 +00:00
Dmitrij Tejblum
8d81b5d631 Change type of a variable from u_int to size_t, so that pointer to it may be
used as a last argument to copyinstr().
1999-04-21 09:41:07 +00:00
Eivind Eklund
ee45a71480 Correct typo in panic message 1999-04-11 02:28:32 +00:00
Peter Wemm
30c56d468c Hold the mfs process's upages in-core with PHOLD rather than P_NOSWAP. 1999-04-06 03:08:43 +00:00
Julian Elischer
8d17e69460 Catch a case spotted by Tor where files mmapped could leave garbage in the
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
    vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
    ufs/ufs/ufs_readwrite.c kern/vfs_bio.c

Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>
1999-04-05 19:38:30 +00:00
Peter Wemm
fa8e1794b5 There's not much point in the EXPORTMFS #ifdef. I've had this sitting
in my tree for 12+ months, and I just noticed that NetBSD have (I think,
I've just seen the commit, not the change) just zapped it there.
It wasn't in the options files or LINT either.
1999-04-05 06:39:10 +00:00
Julian Elischer
51df594922 Stop the mfs from trying to swap out crucial bits of the mfs
as this can lead to deadlock.
Submitted by: Mat dillon <dillon@freebsd.org>
1999-03-12 00:44:03 +00:00
Bruce Evans
44f332052d Don't depend on <ufs/ufs/quota.h> or another (old) prerequisite including
<sys/queue.h>.  This fixes my recent breakage of biosboot by unpolluting
<ufs/ufs/quota.h> in the !KERNEL case.
1999-03-06 05:21:09 +00:00
Bruce Evans
fdc79256c1 Moved kernel declarations inside the KERNEL ifdef, and removed
include of <sys/queue.h> in the !KERNEL case.  The prerequisites
for <ufs/ufs/quota.h> were broken in Lite2 by converting some of
the kernel declarations to use queue macros without including
<sys/queue.h>.  <sys/queue.h> was included in applications in
/usr/src instead.  We polluted this file instead of merging the
changes in the applications.

Include <sys/queue.h> in the KERNEL case, and forward-declare all
structs that are used in prototypes, so that this file is almost
self-sufficient even in the kernel.

Obtained from:	mostly from NetBSD
1999-03-05 11:25:31 +00:00
Bruce Evans
abd022381d Changed the type of quotactl()'s 4th arg from char *' to void *'
so that non-sloppy applications can call it without using disgusting
casts to avoid warnings.  The 4th arg is sort of varargs -- it must
sometimes represent a filename, sometimes a struct pointer, and is
sometimes unused.  The arg type is still caddr_t in the kernel.

Obtained from:	mostly from NetBSD
1999-03-05 09:28:33 +00:00
Kirk McKusick
38e28fd66b Reorganize locking to avoid holding the lock during calls to bdwrite
and brelse (which may sleep in some systems).

Obtained from:	Matthew Dillon <dillon@apollo.backplane.com>
1999-03-02 06:38:07 +00:00
Warner Losh
5369eb85ca Merge patch to ufs_vnops.c's ufs_rename to the copy of ufs_rename that
lives in ext2_vnops.c for ext2fs.  Also remove cast from comparision.
Bruce pointed out that it was bogus since we'd force a signed
comparision when we really wanted an unsigned comparison.
1999-03-02 05:31:47 +00:00
Kirk McKusick
eef33ce9bd When fsync'ing a file on a filesystem using soft updates, we first try
to write all the dirty blocks. If some of those blocks have dependencies,
they will be remarked dirty when the I/O completes. On systems with
really fast I/O systems, it is possible to get in an infinite loop trying
to flush the buffers, because the I/O finishes before we can get all the
dirty buffers off the v_dirtyblkhd list and into the I/O queue. (The
previous algorithm looped over the v_dirtyblkhd list writing out buffers
until the list emptied.) So, now we mark each buffer that we try to
write so that we can distinguish the ones that are being remarked dirty
from those that we have not yet tried to flush. Once we have tried to
push every buffer once, we then push any associated metadata that is
causing the remaining buffers to be redirtied.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-03-02 04:04:31 +00:00
Kirk McKusick
4cbb89d95d Ensure that softdep_sync_metadata can handle bmsafemap and mkdir entries
if they ever arise (which should not happen as softdep_sync_metadata is
currently used).
1999-03-02 00:19:47 +00:00
Warner Losh
00db131a60 Fix last commit based on feedback from Guido, Bruce and Terry.
Specifically, the test was in the wrong place, lacked a cast, didn't
unlock the node, and exited to bad rather than abortit.  Now we don't
allow renaming of a file with LINK_MAX references.  Move the test to
earlier in the code as it is closer to where ip is obtained, as that
is the style of the rest of the function.

Didn't fix the problems bruce pointed out in the rename man page to
include EMLINK, nor address his complaints about how the whole idea of
incrementing the link count during a rename is potentially asking for
trouble.

Also didn't try to correct potential problem Terry pointed out with
decrements not being similarly protected against underflow.
1999-02-26 05:34:16 +00:00
Warner Losh
b82396269c Add missing check for LINK_MAX in ufs_rename. Since ip->i_effnlink and
ip->nlink were different types, there was a masked overflow.

Reported by: Mark Slemco <marcs@znep.com>
1999-02-25 09:52:46 +00:00
Matthew Dillon
06f7b4ebd1 Update ufs_vnops code to use new specinfo fields rather then guess.
This is part of general specinfo / d_parms() commit.
1999-02-25 05:35:53 +00:00
Kirk McKusick
133ff2619a fix double LIST_REMOVE; other cosmetic changes to match version 9.32.
Obtained from: Jeffrey Hsu <hsu@FreeBSD.ORG>
1999-02-17 20:01:20 +00:00
Matthew Dillon
89a01116cf Remove XXX comment in regarsd to why NFS doesn't use VOP_ABORT(). NFS
is being fixed now.
1999-02-13 08:38:28 +00:00
Matthew Dillon
8aef171243 Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile
1999-01-28 00:57:57 +00:00
Matthew Dillon
5e24f1a2f6 Remove unintended trigraph sequences in comments for -Wall 1999-01-27 18:19:53 +00:00
David Greenman
8ab2fa0073 Gutted softdep_deallocate_dependencies and replaced it with a panic. It
turns out to not be useful to unwind the dependencies and continue in
the face of a fatal error.
Also changed the log() to a printf() in softdep_error() so that it will
be output in the case of a impending panic.
Submitted by:	Kirk McKusick <mckusick@mckusick.com>
1999-01-22 09:07:32 +00:00
Matthew Dillon
50c7d0d513 Added support for VOP_FREEBLKS(), reducing MFS's impact on swap and
increasing performance by deallocating at least some of the backing
    store when files are removed.

    Protect mfsp->buf_queue access at splbio().
1999-01-21 09:27:03 +00:00
Matthew Dillon
3fe2487dfc Access to mfsp->buf_queue must be protected at splbio(). Other minor
adjustments also made, such as passing mfsp to mfs_doio() directly.
1999-01-21 09:24:46 +00:00
Matthew Dillon
1c7c3c6a86 This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
    fixes, several VM optimizations, and some additional revamping of the
    VM code.  The specific bug fixes will be documented with additional
    forced commits.  This commit is somewhat rough in regards to code
    cleanup issues.

Reviewed by:	"John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
1999-01-21 08:29:12 +00:00
Eivind Eklund
5b1b6c5859 Silence warning about unused debug function. (I'll turn this function
into a DDB command in my next staticization sweep).
1999-01-12 11:42:41 +00:00
Eivind Eklund
a862221fa0 Add a warning about the copyright restraints. 1999-01-08 16:03:12 +00:00
Bruce Evans
de5d1ba57c Don't pass unused unused timestamp args to UFS_UPDATE() or waste
time initializing them.  This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().
1999-01-07 16:14:19 +00:00
Bruce Evans
4591d9bb7e UFS_UPDATE() takes a boolean `waitfor' arg, so don't pass it the value
MNT_WAIT when we mean boolean `true' or check for that value not being
passed.  There was no problem in practice because MNT_WAIT had the
magic value of 1.
1999-01-06 18:18:06 +00:00
Bruce Evans
d64dbc8719 Ifdefed the conditionally used variable `prtrealloc'. Declare it
as volatile so that there is no chance that the code that it controls
is optimised away.
1999-01-06 17:04:33 +00:00
Bruce Evans
5991fd0370 Backed out rev.1.47. It just broke my optimisations for lazy syncing
of timestamps in rev.1.45.  The soft updates bug was elsewhere.

Forgotten by:	luoqi
1999-01-06 16:52:38 +00:00
Eivind Eklund
fb1167777a Remove the 'waslocked' parameter to vfs_object_create(). 1999-01-05 18:50:03 +00:00
Bruce Evans
289bdf33d3 Ifdefed conditionally used simplock variables. 1999-01-02 11:34:57 +00:00
Eivind Eklund
a777e82019 Remove the last clients of vfs_object_create(..., waslocked=1);
waslocked will go away shortly.

Reviewed by:	dg
1999-01-02 01:32:36 +00:00
Matthew Dillon
2ae122f64a The mount_mfs process that stays in a supervisor context handling MFS
I/O requests must be marked P_SYSTEM because if it isn't and the system
    decides to swap it or (god forbid) kill it, the system stands a good
    chance of locking up.
1999-01-01 04:14:11 +00:00
Bruce Evans
d26105a9ce Fixed null pointer panics which I introduced in rev.1.86. Vnodes
may be revoked, so vnop routines must be careful about accessing
the vnode if they may have blocked.

Fixed marking for update after successfully reading or writing 0
bytes.  In this case, POSIX.1 specifies marking if and only if the
requested count is nonzero, but rev.1.86 never marked.
1998-12-24 09:45:10 +00:00
Bruce Evans
450fefa9f3 Remove unused file. It seems to have been a vestige of when mfs did its
own memory allocation.
1998-12-20 17:05:54 +00:00
Doug Rabson
6839466b30 In ufs_setattr(), if only one of va_atime or va_mtime are != VNOVAL, then
the code set the other field in the inode to VNOVAL.  This can happen
sometimes on an NFS server.
1998-12-20 12:36:01 +00:00
Julian Elischer
feb17f5acb Add comments to code that I was trying to understand.
Hopefully will save others time.

Someone who understands this better might check for correctness.
1998-12-15 03:29:52 +00:00
Matthew Dillon
f7bb75c92a Fix -Wuninitialized warning regarding zero-length var-args ctl element.
( this isn't really an error, but I think it is important to fix the
    warning ).
1998-12-14 05:37:37 +00:00
Julian Elischer
1f35e8c8da Remove some compiler warnings. 1998-12-10 20:11:47 +00:00
Eivind Eklund
bf51e54f46 Make compare correct with unsigned types. (Problem introduced by Lite/2). 1998-12-09 02:06:27 +00:00
Archie Cobbs
f1d19042b0 The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.
1998-12-07 21:58:50 +00:00
Bruce Evans
672be20b9f Don't use the strange null pointer constant `(ufs_daddr_t)0' in a call
to VOP_BMAP().  Don't use uncast NULLs in the same call.
1998-11-29 03:12:06 +00:00
David Greenman
1c680b45a2 Restored the "reallocblks" code to its former glory. What this does is
basically do a on-the-fly defragmentation of the FFS filesystem, changing
file block allocations to make them contiguous. Thanks to Kirk McKusick
for providing hints on what needed to be done to get this working.
1998-11-13 01:01:44 +00:00
Peter Wemm
1c5bb3eaa1 add #include <sys/kernel.h> where it's needed by MALLOC_DEFINE() 1998-11-10 09:16:29 +00:00
Peter Wemm
2ec07c6614 Change dirty block list handling to use TAILQ macros. 1998-10-31 15:33:32 +00:00
Peter Wemm
40c8cfe552 Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.
1998-10-31 15:31:29 +00:00
Jordan K. Hubbard
2dcc2f0693 Clarify a rather ambiguous debugging message. 1998-10-28 10:37:54 +00:00
Bruce Evans
b5ee16407f Oops, the redundant tests for major numbers weren't redundant here.
They checked for the magic major number for the "device" behind mfs
mount points.  Use a more obvious check for this device.

Debugged by:		Andrew Gallatin <gallatin@cs.duke.edu>
1998-10-27 11:47:08 +00:00
Bruce Evans
569555b969 Removed redundant bitrotted checks for major numbers instead of updating
them.
1998-10-26 08:53:13 +00:00
Bruce Evans
9c0619dace Don't follow null bdevsw pointers. The `major(dev) < nblkdev' test rotted
when bdevsw[] became sparse.  We still depend on magic to avoid having to
check that (v_rdev) device numbers in vnodes are not NODEV.

Removed redundant `major(dev) < nblkdev' tests instead of updating them.
1998-10-25 19:02:48 +00:00
Poul-Henning Kamp
f5ef029e92 Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.
1998-10-25 17:44:59 +00:00
Bruce Evans
e36b4f594a Use only the correct raw partition for writing labels. Don't use the
partition that the label ioctl is being done on just because it has
offset 0, since there is no guarantee that such a partition is large
enough to contain the label.  Don't use the wrong raw partition (0
instead of RAW_PART).

This fixes problems rewriting bizarre labels (with a nonzero offset
for the 'a' partition) in newfs(8).  Such labels shouldn't normally
be used, but creating them was allowed if the ioctl was done on the
raw partition, and sysinstall creates them if the root partition isn't
allocated first.

Note that allowing write access to a partition other than the one that
has been checked for write access doesn't increase security holes
significantly, since write access to any partition already allows
changing the in-core label.

This fix should be in 3.0R.  Rev.1.26 of newfs/newfs.c shouldn't be
in 3.0R.
1998-10-17 07:49:04 +00:00
Jordan K. Hubbard
908dcbd2a4 fixup for alpha. 1998-10-16 10:14:21 +00:00
Bruce Evans
d2165c2f7d Fixed bloatage of `struct inode'. We used 5 "spare" fields for ext2fs,
but when i_effnlink was added to support soft updates, there was only
room for 4 spares.  The number of spares was not reduced, so the inode
size became 260 (on i386's), or 512 after rounding up by malloc().
Use one spare field in `struct dinode' instead of the 5th spare field
in the inode and reduced to 4 spares in the inode so that the size is
256 again.

Changed the types of the spares in the inode from int to u_int32_t
so that the inode size has more chance of being <= 256 under other
arches, and downdated ext2fs to match (it was broken to use ints
before rev.1.1).
1998-10-13 15:45:43 +00:00
Peter Wemm
624b326270 "fix" a warning 1998-10-12 09:02:19 +00:00
Jordan K. Hubbard
a33b93ff31 Allow more flexible use of MFS root.
Submitted by:	peter
1998-10-10 08:12:24 +00:00
Peter Wemm
cfb55a60f9 MODINFO_ADDR has real addresses now, remove the manual relocation based
on cpu type.
1998-10-09 23:37:37 +00:00
Jordan K. Hubbard
b526345c9e Add some evil temporary phys-to-kern translation for mfs. 1998-10-09 06:21:12 +00:00
Jordan K. Hubbard
48c540f9a1 include proper header for Mike's new stuff. 1998-10-09 01:40:56 +00:00
Jordan K. Hubbard
f97ad428df Allow the module area to be used in order to find the MFS image
(in addition to allowing it to be compiled in) and stop overloading
the MFS_ROOT variable to store size information.
1998-10-08 23:34:44 +00:00
Luoqi Chen
523e3b0f7f Use vm_page_xxx() inline functions to manipulate vm_page::flags, vm_page::busy.
As a side effect, a few wakeup() calls are added, which might fix some of the
missing vm_page wakeups people have been seeing.

Reviewed by:	Doug Rabson	<dfr@nlsystems.com>
1998-10-07 13:59:26 +00:00
Nate Williams
ed8d80c2de Fix 'noatime' bug that was unrelated to use of noatime.
The problem is caused when a directory block is compacted.  When this
occurs, softdep_change_directoryentry_offset() is called to relocate each
directory entry and adjust its matching diradd structure, if any, to match
the new location of the entry.  The bug is that while
softdep_change_directoryentry_offset() correctly adjusts the offsets of
the diradd structures on the pd_diraddhd[] lists (which are not yet ready
to be committed to disk), it fails to adjust the offsets of the diradd
structures on the pd_pendinghd list (which are ready to be committed to
disk).  This causes the dependency structures to be inconsistent with
the buf contents.  Now, if the compaction has moved a directory entry to
the same offset as one of the diradd structures on the pd_pendinghd list
*and* a syscall is done that tries to remove this directory entry before
this directory block has been written to disk (which would empty
pd_pendinghd), a sanity check in newdirrem() will call panic() when it
notices that the inode number in the entry that it is to be removed doesn't
match the inode number in the diradd structure with that offset of that
entry.

Reviewed by:	Kirk McKusick <mckusick@McKusick.COM>
Submitted by:	Don Lewis <Don.Lewis@tsc.tdk.com>
1998-10-03 19:17:11 +00:00
Kirk McKusick
df077352a7 Do not allow a mounted on directory to be rmdir'ed. This removal can
happen when an NFS exported filesystem tries to remove a locally
mounted on directory.
PR:		kern/7272
Submitted by:	Andre Albsmeier <andre.albsmeier@mchp.siemens.de>
1998-09-30 00:53:40 +00:00
Bruce Evans
0922cce61c Fixed clean flag handling:
- don't set the clean flag on unmount of an unclean filesystem that was
  (forcibly) mounted rw.
- set the clean flag on rw -> ro update of a mounted initially-clean
  filesystem.
- fixed some style bugs (mostly long lines).

This uses the fs_flags field and FS_UNCLEAN state bit which were
introduced in the softdep changes.  NetBSD uses extra state bits in
fs_clean.

Reviewed by:	luoqui
1998-09-26 04:59:42 +00:00
Luoqi Chen
e266594c25 Eliminate a race in VOP_FSYNC() when softupdates is enabled.
Submitted by:	Kirk McKusick	<mckusick@McKusick.COM>
Two minor changes are also included,
1. Remove gratuitious checks for error return from vn_lock with LK_RETRY set,
   vn_lock should always succeed in these cases.
2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount
   a little more unstable. It also keeps us in sync with other BSDs.
Suggested by:	Bruce Evans	<bde@zeta.org.au>
1998-09-24 15:02:46 +00:00
Luoqi Chen
f9e84c2fee Restore pre-v1.44 behavior: always copy modified in-core inode to disk
buffer. Otherwise some in-core inode changes might be lost, including
important meta data (e.g. size) if softupdates is enabled.
1998-09-15 14:45:28 +00:00
Justin T. Gibbs
eda00cb5d2 When a buffer is removed from a buffer queue, remember it's block number
and use it as "the currently active" buffer in doing disk sort calculations.
1998-09-15 08:55:03 +00:00
Søren Schmidt
d024c95599 Remove the SLICE code.
This clearly needs alot more thought, and we dont need this to hunt
us down in 3.0-RELEASE.
1998-09-14 19:56:42 +00:00
Bruce Evans
9164000766 Don't dereference an uninitialized pointer in dead code. The dead
code gets executed if it is compiled without optimization.
1998-09-12 14:46:15 +00:00
Bruce Evans
8994ca3ce9 Removed statically configured mount type numbers (MOUNT_*) and all
references to them.

The change a couple of days ago to ignore these numbers in statically
configured vfsconf structs was slightly premature because the cd9660,
cfs, devfs, ext2fs, nfs vfs's still used MOUNT_* instead of the number
in their vfsconf struct.
1998-09-07 13:17:06 +00:00
Bruce Evans
ff261f16f6 Put the zombie ffs sysctl node in "notyet" state together with its few
remaining children.  Prepare it for MOUNT_UFS going away.
1998-09-07 11:50:19 +00:00
Poul-Henning Kamp
21aa768ea1 Make MFS do the default on VOP_FREEBLKS().
XXX: we could deallocate the storage, but somebody else will
have to pick up that task.
1998-09-07 06:52:01 +00:00
Poul-Henning Kamp
0375c9f2b8 Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform
device drivers about sectors no longer in use.

Device-drivers receive the call through d_strategy, if they have
D_CANFREE in d_flags.

This allows flash based devices to erase the sectors and avoid
pointlessly carrying them around in compactions.

Reviewed by:	Kirk Mckusick, bde
Sponsored by:	M-Systems (www.m-sys.com)
1998-09-05 14:13:12 +00:00
Bruce Evans
1874ef935c Quick fix for breakage of read clustering on non-IDE drives. Read
clustering is obsolescent technology so hardly anyone noticed.  On
a DORS 32160 SCSI drive with 4 tags, read clustering makes very
little difference even for huge sequential reads.  However, on a
ZIP SCSI drive with 0 tags, the minimum overhead per block is about
40 msec, so very large clusters must be used to get anywhere near
the maximum transfer rate.  Using clusters consisting of 1 8K block
reduces the transfer rate to about 250K/sec.  Under msdosfs, missing
read clustering is normal and a cluster size of 1 512 byte block
reduces the transfer rate to about 25K/sec.

Broken in:	rev.1.18
1998-08-18 03:54:39 +00:00
Bruce Evans
0492d857d1 Removed unused includes. 1998-08-17 19:09:36 +00:00
Mike Smith
f01beb610a "The releaseing of the reference and lock is not temporary and belongs
where it is.  The reference and lock(s) are acquired just above the
 code in VREF() and relookup()."

Submitted by:	Michael Hancock <michaelh@cet.co.jp>
1998-08-12 21:42:54 +00:00
Julian Elischer
55d80b2df1 Handle the case of moving a directory onto the top of a sibling's
child of the same name.

Submitted by:	Kirk Mckusick with fixes from luoqi Chen
Obtained from:   Whistle test tree.
1998-08-12 20:46:47 +00:00
Bruce Evans
aa6db4230d Used daddr_t's, not ints, to store disk block numbers. Updated printf
formats and args to match.  Fixed old printf format errors (all related;
most were hidden by calling printf indirectly).

This change somehow avoids compiler bugs for 64-bit longs on i386's,
although it increases the number of 64-bit calculations.
1998-07-28 18:25:51 +00:00
Bruce Evans
15b29aabe4 Made lazy syncing of timestamps for special files non-optional. 1998-07-27 15:37:00 +00:00
Bruce Evans
a23d65bfc8 Cast pointers to uintptr_t/intptr_t instead of to u_long/long,
respectively.  Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
1998-07-15 02:32:35 +00:00
Bruce Evans
ac1e407b32 Fixed printf format errors. 1998-07-11 07:46:16 +00:00
Julian Elischer
f763857cff Add code missed in the initial Soft updates integration.
Make the unallocated parts of a directry have a know state
in case we need it later.
1998-07-10 00:10:20 +00:00
Julian Elischer
bcbd6c6fdd Don't update superblock if mounted readonly,
also fixes some problems with softupdates on root.
More cleanups are needed here..
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
1998-07-08 23:52:27 +00:00
Julian Elischer
6deaf84b1f Catch a few corner cases where FreeBSD differs enough from BSD 4.4 to
confuse Soft updates..
Should solve several "dangling deps" panics.
1998-07-08 01:04:33 +00:00
Julian Elischer
fd5d1124e2 VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
1998-07-04 20:45:42 +00:00
Bruce Evans
99977261a1 Restored revs.1.89-1.90 which I somehow clobbered in rev.1.91. 1998-07-03 22:37:43 +00:00
Bruce Evans
3055187290 Sync timestamp changes for inodes of special files to disk as late
as possible (when the inode is reclaimed).  Temporarily only do
this if option UFS_LAZYMOD configured and softupdates aren't enabled.
UFS_LAZYMOD is intentionally left out of /sys/conf/options.

This is mainly to avoid almost useless disk i/o on battery powered
machines.  It's silly to write to disk (on the next sync or when the
inode becomes inactive) just because someone hit a key or something
wrote to the screen or /dev/null.

PR:		5577
Previous version reviewed by:	phk
1998-07-03 22:17:03 +00:00
Bruce Evans
33cc029eab Centralized in-core inode update. Update the in-core inode directly
in ufs_setattr() so that there is no need to pass timestamps to
UFS_UPDATE() (everything else just needs the current time).  Ignore
the passed-in timestamps in UFS_UPDATE() and always call ufs_itimes()
(was: itimes()) to do the update.  The timestamps are still passed
so that all the callers don't need to be changed yet.
1998-07-03 18:46:52 +00:00