1018 Commits

Author SHA1 Message Date
kib
dadf5cd065 The softdep_setup_freeblocks() adds worklist items before
deallocate_dependencies() is done. This opens a race between softdep
thread and the thread that does the truncation:
  A write of the indirect block causes the freeblks to become
  ALLCOMPLETE while softdep_setup_freeblocks() dropped softdep lock. And
  then, softdep_disk_write_complete() would reassign the workitem to the
  mount point worklist, causing premature processing of the workitem, or
  journal write exhaust the fb_jfreeblkhd and handle_written_jfreeblk does
  the same reassign.
indir_trunc() then would find the indirect block that is locked (with lock
owned by kernel) but without any dependencies, causing it to hang in
getblk() waiting for buffer lock.

Do not mark freeblks as DEPCOMPLETE until deallocate_dependencies()
finished.

Analyzed, suggested and reviewed by:	jeff
Tested by:	pho
2010-11-11 11:54:01 +00:00
kib
ac0984a653 Change #ifdef INVARIANTS panic into KASSERT, and print some useful
information to diagnose the issue, in handle_complete_freeblocks().

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:41:52 +00:00
kib
90d9ba4d7c In journal_mount(), only set MNTK_SUJ flag after the jblocks are mapped.
I believe there is a window otherwise where jblocks can be accessed
without proper initialization.

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:38:57 +00:00
kib
a16cecf311 Add function lbn_offset to calculate offset of the indirect block of
given level.

Reviewed by:	jeff
Tested by:	pho
2010-11-11 11:35:42 +00:00
kib
57488f7bec Fix typo. Function is called ffs_blkfree. 2010-11-11 11:26:59 +00:00
kib
4036cd070d The r184588 changed the layout of struct export_args, causing an ABI
breakage for old mount(2) syscall, since most struct <filesystem>_args
embed export_args. The mount(2) is supposed to provide ABI
compatibility for pre-nmount mount(8) binaries, so restore ABI to
pre-r184588.

Requested and reviewed by:	bde
MFC after:    2 weeks
2010-10-10 07:05:47 +00:00
alc
e9dc33bfce M_USE_RESERVE has been deprecated for a decade. Eliminate any uses that
have no run-time effect.
2010-10-02 17:58:57 +00:00
mckusick
bca797c285 Since local variable 'i' is used only in a KASSERT, declare and
initialize it only if INVARIANTS is defined to avoid a declared
but unused warning.

Suggested by: Brian Somers <brian@FreeBSD.org>
2010-09-29 14:46:57 +00:00
kib
bcef52d3bf Fix typo in comment. 2010-09-29 07:40:11 +00:00
obrien
55122bfc2d Correct some non-code typos. 2010-09-17 09:14:40 +00:00
mckusick
dd70ac636a Update comments in soft updates code to more fully describe
the addition of journalling. Only functional change is to
tighten a KASSERT.

Reviewed by:	jeff Roberson
2010-09-14 18:04:05 +00:00
jhb
d4890c88b0 Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created.  Use them to implement macros that
otherwise manipulated the flags directly.  Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe.  This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by:	kib
MFC after:	3 days
2010-08-20 19:46:50 +00:00
kib
60a46e5ff9 Softdep_process_worklist() should unsuspend not only before processing
the worklist (in softdep_process_journal), but also after flushing the
workitems. Might be, we should even do this before bwillwrite() too, but
this seems to be not needed for now.

Fs might be suspended during processing the queue, and then there is
nobody around to unsuspend.

In collaboration with:	pho
Tested by:	bz
Reviewed by:	jeff
2010-08-12 08:35:24 +00:00
jhb
7f218ea7f4 Revert the previous commit. The race is not applicable to the lockmgr
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().

Discussed with:	attilio
2010-07-16 19:52:03 +00:00
jhb
ea417bf09a When the MNTK_EXTENDED_SHARED mount option was added, some filesystems were
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO.  This occurs after the vnode has been inserted a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock.  If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered.  As a result
the thread blocked on the vnode lock may never get woken up.  Fix this by
holding the vnode interlock while modifying the lock flags in this case.

MFC after:	3 days
2010-07-16 19:20:20 +00:00
jeff
285c3f355c - Handle the truncation of an inode with an effective link count of 0 in
the context of the process that reduced the effective count.  Previously
   all truncation as a result of unlink happened in the softdep flush
   thread.  This had the effect of being impossible to rate limit properly
   with the journal code.  Now the process issuing unlinks is suspended
   when the journal files.  This has a side-effect of improving rm
   performance by allowing more concurrent work.
 - Handle two cases in inactive, one for effnlink == 0 and another when
   nlink finally reaches 0.
 - Eliminate the SPACECOUNTED related code since the truncation is no
   longer delayed.

Discussed with:	mckusick
2010-07-06 07:11:04 +00:00
avg
13985611dd ffs_softdep: change K&R in function defintions to ANSI prototypes
Apparently it's bad when we first have an ANSI prototype in function
declaration, but then use K&R in its defintion.

Complaint from:	clang
MFC after:	2 weeks
2010-06-11 18:26:53 +00:00
avg
4e8fc6f387 ffs_mount: accept and drop userland-only options that can be passed from
loader(8)

In r193192 loader(8) has grown an ability to pass root mount options
from fstab via vfs.root.mountfrom.options.  Unfortunately, some options
that can be present in fstab are for userland only and lead to root
mounting failure when seen by kernel.
Rather than teaching loader about FFS-specific options that should be
filtered out, ffs_mount recognizes those options as valid, but ignores
and deletes[1] them.

[1] is suggested by jh.

PR:		kern/141050
Reported by:	many
Reviewed by:	jh, bde
MFC after:	4 days
2010-05-19 09:32:11 +00:00
jeff
ebb7d74dae - Don't immediately re-run softdepflush if we didn't make any progress
on the last iteration.  This can lead to a deadlock when we have
   worklist items that cannot be immediately satisfied.

Reported by:	uqs, Dimitry Andric <dimitry@andric.com>

 - Remove some unnecessary debugging code and place some other under
   SUJ_DEBUG.
 - Examine the journal state in softdep_slowdown().
 - Re-format some comments so I may more easily add flag descriptions.
2010-05-19 06:18:01 +00:00
jeff
8fb90eedbc - Call softdep_prealloc() before any of the balloc routines in the
snapshot code.
 - Don't fsync() vnodes in prealloc if copy on write is in progress.  It
   is not safe to recurse back into the write path here.

Reported by:	Vladimir Grebenschikov <vova@fbsd.ru>
2010-05-07 08:45:21 +00:00
jeff
0b3e023908 - Use the correct flag mask when determining whether an inode has
successfully made it to the free list yet or not.  This fixes
   a deadlock that can occur with unlinked but referenced files.
   Journal space and inodedeps were not correctly reclaimed because
   the inode block was not left dirty.

Tested/Reported by:	lwindschuh@googlemail.com
2010-05-07 08:20:56 +00:00
alc
fecc56fac1 Eliminate page queues locking around most calls to vm_page_free(). 2010-05-06 18:58:32 +00:00
alc
5c7ca3ee73 Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held.  (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements.  It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib
2010-05-05 18:16:06 +00:00
trasz
402e3baade Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().
Reviewed by:	kib
2010-05-05 16:44:25 +00:00
avg
043deeb564 ffs_vfsops: restore alphabetic order of options in ffs_opts
The order was not correct only for nfsv4acls.
("no" prefix is ignored)

MFC after:	1 week
2010-04-29 10:04:00 +00:00
jeff
47b1b89a95 - When canceling jaddrefs they may not yet be in the journal if this is via
a revert call.  In this case don't attempt to remove something that
   has not yet been added.  Otherwise this jaddref must hang around
   to prevent the bitmap write as normal.
2010-04-28 07:57:37 +00:00
jeff
564f436237 - Fix builds without SOFTUPDATES defined in the kernel config. 2010-04-28 07:26:41 +00:00
pjd
57ec1f2624 Fix build for UFS without SOFTUPDATES. 2010-04-24 07:36:33 +00:00
jeff
a574495410 - Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
   for background fsck on unclean shutdown.

Sponsored by:   iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
2010-04-24 07:05:35 +00:00
avg
d488f0b549 ffs_mount: remove redundant assignment of geom consumer to devvp.v_bufobj
The assignment is already done in g_vfs_open.
Redundant assignment is harmless, but can become a problem if g_vfs_open
logic is changed.

MFC after:	1 week
2010-04-03 08:25:04 +00:00
kib
d068222571 When ffs_realloccg() failed to allocate bigger fragment and, because
pending blocks are scheduled for removal, goes to retry the (re)allocation,
clear the bp pointer. It might happen that meantime free space is really
exhausted and we are entering nospace: label without bread()ing buffer,
causing stale bp value to be brelse()d again.

Tested by:	pho
    (Producing a scenario to reliably reproduce the
     race appeared to be much harder then fixing the bug)
MFC after:	1 week
2010-02-13 10:34:50 +00:00
mckusick
e7471d443b One last pass to get all the unsigned comparisons correct. 2010-02-11 18:14:53 +00:00
mckusick
d533f2ac8c This fix corrects a problem in the file system that treats large
inode numbers as negative rather than unsigned. For a default
(16K block) file system, this bug began to show up at a file system
size above about 16Tb.

To fully handle this problem, newfs must be updated to ensure that
it will never create a filesystem with more than 2^32 inodes. That
patch will be forthcoming soon.

Reported by: Scott Burns, John Kilburg, Bruce Evans
Followup by: Jeff Roberson
PR:          133980
MFC after:   2 weeks
2010-02-10 20:10:35 +00:00
trasz
bf6995c4bb Remove unused variable. 2010-02-10 18:56:49 +00:00
mckusick
94b44c0969 Cast 64-bit quantity to intptr_t rather than int so as to work properly
with 64-bit architectures (such as amd64).

Reported by:	bz
2010-01-11 22:42:06 +00:00
mckusick
0cddeb2cb4 Background:
When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
    filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
    in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
    with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate priviledge.

Reported by:    	jeff
Security issues:	rwatson
2010-01-11 20:44:05 +00:00
mbr
7450f52a57 Remove extraneous semicolons, no functional changes.
Submitted by:	Marc Balmer <marc@msys.ch>
MFC after:	1 week
2010-01-07 21:01:37 +00:00
trasz
f04a989f2d Implement NFSv4 ACL support for UFS.
Reviewed by:	rwatson
2009-12-21 19:39:10 +00:00
kib
b79e14054c VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by:	alc
Tested by:	pho
MFC after:	3 weeks
2009-12-21 12:29:38 +00:00
kib
30f476628e insmntque_stddtr() clears vp->v_data and resets vp->v_op to
dead_vnodeops before calling vgone(). Revert r189706 and corresponding
part of the r186560.

Noted and reviewed by:	tegge
Approved by:	des (pseudofs part)
MFC after:	3 days
2009-09-07 11:55:34 +00:00
kib
2e1ddcb566 The clear_remove() and clear_inodedeps() call vn_start_write(NULL, &mp,
V_NOWAIT) on the non-busied mount point. Unmount might free ufs-specific
mp data, causing ffs_vgetf() to access freed memory.

Busy mountpoint before dropping softdep lk.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	1 week
2009-09-06 11:46:51 +00:00
kib
3658df033e When a UFS node is truncated to the zero length, e.g. by explicit
truncate(2) call, or by being removed or truncated on open, either
new softupdate freeblks structure is allocated to track the freed
blocks of the node, or truncation is done syncronously when too many SU
dependencies are accumulated. The decision does not take into account
the allocated freeblks dependencies, allowing workloads that do huge
amount of truncations to exhaust the kernel memory.

Take the number of allocated freeblks into consideration for
softdep_slowdown().

Reported by:	pluknet gmail com
Diagnosed and tested by:	pho
Approved by:	re (rwatson)
MFC after:	1 month
2009-08-14 11:00:38 +00:00
kib
350f96b4bf In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by:	pho
Reviewed by:	jeff
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-02 18:02:55 +00:00
trasz
dcdba7b2e3 Don't panic on attempt to set ACL on a block device file.
This is just a part of kern/125613.

PR:		kern/125613
Submitted by:	Jaakko Heinonen <jh at saunalahti dot fi>
Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-01 22:30:36 +00:00
kib
4cf230ed17 For SU mounts, softdep_fsync() might drop vnode lock, allowing other
threads to put dirty buffers on the vnode bufobj list. For regular files
and synchronous fsync requests, check for the condition and restart the
fsync vop if a new dirty buffer arrived.

Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 month
2009-06-30 10:07:33 +00:00
kib
c424611d79 Softdep_fsync() may need to lock parent directory of the synced vnode.
Use inlined (due to FFSV_FORCEINSMQ) version of vn_vget_ino() to prevent
mountpoint from being unmounted and freed while no vnodes are locked.

Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 month
2009-06-30 10:07:00 +00:00
rwatson
f4934662e5 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
attilio
44c490ae17 Handle lock recursion differenty by always checking against LO_RECURSABLE
instead the lock own flag itself.

Tested by:	pho
2009-06-02 13:03:35 +00:00
alc
dc942dabcf Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This
eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg().

In collaboration with:	tegge
2009-05-17 20:26:00 +00:00
attilio
1dcb84131b Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled.  Bump __FreeBSD_version in order to signal such
situation.
2009-05-11 15:33:26 +00:00