Commit Graph

1075 Commits

Author SHA1 Message Date
bde
ec6a3fc0c3 Fixed some printf format errors (4 new ones reported by gcc and 5 nearby
old ones not reported by gcc).  This helps unbreak LINT.
2002-07-08 12:42:29 +00:00
iedowse
4416f82706 Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by:	mckusick
2002-07-01 17:59:40 +00:00
iedowse
603a270887 Add the ffs bits necessary to support unloading of the ufs kernel
module. This adds an ffs_uninit() function that calls ufs_uninit()
and also calls a new softdep_uninitialize() function. Add a stub
for softdep_uninitialize() to cover the non-SOFTUPDATES case.

Reviewed by:	mckusick
2002-07-01 11:00:47 +00:00
iedowse
948e368072 Remove the bogus SYSINIT from ufs_dirhash.c and instead add a call
to ufsdirhash_init() from ufs_init(). Add uninit() functions
corresponding the ufs, dirhash, quota and ihash init() functions.
2002-06-30 02:49:39 +00:00
iedowse
c0323a07fb Remove the kernel file-size limit for UFS2, so that only the limit
imposed by the filesystem structure itself remains. With 16k blocks,
the maximum file size is now just over 128TB.

For now, the UFS1 file size limit is left unchanged so as to remain
consistent with RELENG_4, but it too could be removed in the future.

Reviewed by:	mckusick
2002-06-26 18:34:51 +00:00
ken
0d3a835f3f At long last, commit the zero copy sockets code.
MAKEDEV:	Add MAKEDEV glue for the ti(4) device nodes.

ti.4:		Update the ti(4) man page to include information on the
		TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
		and also include information about the new character
		device interface and the associated ioctls.

man9/Makefile:	Add jumbo.9 and zero_copy.9 man pages and associated
		links.

jumbo.9:	New man page describing the jumbo buffer allocator
		interface and operation.

zero_copy.9:	New man page describing the general characteristics of
		the zero copy send and receive code, and what an
		application author should do to take advantage of the
		zero copy functionality.

NOTES:		Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
		TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files:	Add uipc_jumbo.c and uipc_cow.c.

conf/options:	Add the 5 options mentioned above.

kern_subr.c:	Receive side zero copy implementation.  This takes
		"disposable" pages attached to an mbuf, gives them to
		a user process, and then recycles the user's page.
		This is only active when ZERO_COPY_SOCKETS is turned on
		and the kern.ipc.zero_copy.receive sysctl variable is
		set to 1.

uipc_cow.c:	Send side zero copy functions.  Takes a page written
		by the user and maps it copy on write and assigns it
		kernel virtual address space.  Removes copy on write
		mapping once the buffer has been freed by the network
		stack.

uipc_jumbo.c:	Jumbo disposable page allocator code.  This allocates
		(optionally) disposable pages for network drivers that
		want to give the user the option of doing zero copy
		receive.

uipc_socket.c:	Add kern.ipc.zero_copy.{send,receive} sysctls that are
		enabled if ZERO_COPY_SOCKETS is turned on.

		Add zero copy send support to sosend() -- pages get
		mapped into the kernel instead of getting copied if
		they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
		can be used elsewhere.  (uipc_cow.c)

if_media.c:	In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
		calling malloc() with M_WAITOK.  Return an error if
		the M_NOWAIT malloc fails.

		The ti(4) driver and the wi(4) driver, at least, call
		this with a mutex held.  This causes witness warnings
		for 'ifconfig -a' with a wi(4) or ti(4) board in the
		system.  (I've only verified for ti(4)).

ip_output.c:	Fragment large datagrams so that each segment contains
		a multiple of PAGE_SIZE amount of data plus headers.
		This allows the receiver to potentially do page
		flipping on receives.

if_ti.c:	Add zero copy receive support to the ti(4) driver.  If
		TI_PRIVATE_JUMBOS is not defined, it now uses the
		jumbo(9) buffer allocator for jumbo receive buffers.

		Add a new character device interface for the ti(4)
		driver for the new debugging interface.  This allows
		(a patched version of) gdb to talk to the Tigon board
		and debug the firmware.  There are also a few additional
		debugging ioctls available through this interface.

		Add header splitting support to the ti(4) driver.

		Tweak some of the default interrupt coalescing
		parameters to more useful defaults.

		Add hooks for supporting transmit flow control, but
		leave it turned off with a comment describing why it
		is turned off.

if_tireg.h:	Change the firmware rev to 12.4.11, since we're really
		at 12.4.11 plus fixes from 12.4.13.

		Add defines needed for debugging.

		Remove the ti_stats structure, it is now defined in
		sys/tiio.h.

ti_fw.h:	12.4.11 firmware.

ti_fw2.h:	12.4.11 firmware, plus selected fixes from 12.4.13,
		and my header splitting patches.  Revision 12.4.13
		doesn't handle 10/100 negotiation properly.  (This
		firmware is the same as what was in the tree previously,
		with the addition of header splitting support.)

sys/jumbo.h:	Jumbo buffer allocator interface.

sys/mbuf.h:	Add a new external mbuf type, EXT_DISPOSABLE, to
		indicate that the payload buffer can be thrown away /
		flipped to a userland process.

socketvar.h:	Add prototype for socow_setup.

tiio.h:		ioctl interface to the character portion of the ti(4)
		driver, plus associated structure/type definitions.

uio.h:		Change prototype for uiomoveco() so that we'll know
		whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c:	In vm_fault(), check to see whether we need to do a page
		based copy on write fault.

vm_object.c:	Add a new function, vm_object_allocate_wait().  This
		does the same thing that vm_object allocate does, except
		that it gives the caller the opportunity to specify whether
		it should wait on the uma_zalloc() of the object structre.

		This allows vm objects to be allocated while holding a
		mutex.  (Without generating WITNESS warnings.)

		vm_object_allocate() is implemented as a call to
		vm_object_allocate_wait() with the malloc flag set to
		M_WAITOK.

vm_object.h:	Add prototype for vm_object_allocate_wait().

vm_page.c:	Add page-based copy on write setup, clear and fault
		routines.

vm_page.h:	Add page based COW function prototypes and variable in
		the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
mckusick
ea2f279985 Force the quota update to be done when an inode is released in
ufs_inactive. This avoid a panic when checking a NULL credential
in suser_cred().
2002-06-25 01:02:28 +00:00
jlemon
fe48a4f5d5 Prototype fixes (long newinum --> ino_t newinum). 2002-06-24 17:20:19 +00:00
mux
0a55bc4167 Warning fixes for 64 bits platforms. This eliminates all the
warnings I have had in the FFS code on sparc64.

Reviewed by:	mckusick
2002-06-23 18:17:27 +00:00
dillon
b0b3177408 Rename the BALLOC flags from B_* to BA_* to avoid confusion with the
struct buf B_ flags.

Approved by:	mckusick
2002-06-23 06:12:22 +00:00
mckusick
2730f1975d This patch fixes a problem whereby filesystems that ran
out of inodes in a cylinder group would fail to check for
free inodes in other cylinder groups. This bug was introduced
in the UFS2 code merge two days ago.

An inode is allocated by calling ffs_valloc which calls
ffs_hashalloc to do the filesystem scan. Ffs_hashalloc
walks around the cylinder groups calling its passed allocator
(ffs_nodealloccg in this case) until the allocator returns a
non-zero result. The bug is that ffs_hashalloc expects the
passed allocator function to return a 64-bit ufs2_daddr_t.
When allocating inodes, it calls ffs_nodealloccg which was
returning a 32-bit ino_t. The ffs_hashalloc code checked
a 64-bit return value and usually found random non-zero bits in
the high 32-bits so decided that the allocation had succeeded
(in this case in the only cylinder group that it checked).
When the result was passed back to ffs_valloc it looked at
only the bottom 32-bits, saw zero and declared the system
out of inodes. But ffs_hashalloc had really only checked
one cylinder group.

The fix is to change ffs_nodealloccg to return 64-bit results.

Sponsored by:	DARPA & NAI Labs.
Submitted by:	Poul-Henning Kamp <phk@critter.freebsd.dk>
Reviewed by:	Maxime Henrion <mux@freebsd.org>
2002-06-22 21:24:58 +00:00
mckusick
88d85c15ef This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by:	Poul-Henning Kamp <phk@freebsd.org>
2002-06-21 06:18:05 +00:00
dillon
7aa1bb9d05 In rev 1.72 a situation related to write/mmap was fixed which could result
in a user process gaining visibility into the 'old' contents of a filesystem
block.  There were two cases:  (1) when uiomove() fails (user process issues
illegal write), and (2) when uiomove() overlaps a mmap() of the same file at
the same offset (fault -> recursive buffer I/O reads contents of old block).

Unfortunately 1.72 also had the unintended effect of forcing the filesystem
to do a read-before-write in the case of a full-block-write (non append case),
e.g. 'dd if=/dev/zero of=test.dat bs=1m count=256 conv=notrunc'.  This
destroys performance.. not only is a read forced for every write, but
clustering breaks as well.

The solution is to clear the buffer manually in the full-block case rather
then asking BALLOC to do it (BALLOC issues the read-before-write).  In the
partial-block case we want BALLOC to do it because the read-before-write
is necessary.  This patch should greatly improve database and news-feed
server performance.

Found by: MKI <mki@mozone.net>
MFC after:	3 days
2002-06-19 09:39:41 +00:00
semenu
f48bc32790 Fix a typo in my recently added comment: s/beleived/believed/
Submitted by:	keramida
2002-06-06 20:43:03 +00:00
alfred
28ce0e9ae5 Backout/modify previous revision:
"empty default cases shouldn't be removed, they should have a break;
  statement added to them."

Requested by: billf
2002-06-01 20:54:21 +00:00
alfred
04d357d9e8 Silence warnings, remove some empty 'default' switch cases. 2002-06-01 20:40:42 +00:00
semenu
b9960bece1 Remove lock from ffs_vget introduced by v1.24. Instead of locking the
vnode creation globaly, we allow processes to create vnodes concurently.
In case of concurent creation of vnode for the one ino, we allow processes
to race and then check who wins.

Assuming that concurent creation of vnode for same ino is really rare case,
this is belived to be an improvement, as it just allows concurent creation
of vnodes.

Idea by:	bp
Reviewed by:	dillon
MFC after:	1 month
2002-05-30 22:04:17 +00:00
rwatson
930f7599ed Remove IFS from 5.0-CURRENT. This facilitates introducing UFS2 as
IFS had its fingers deep in the belly of the UFS/FFS split.  IFS
will be reimplemented by the maintainer at a later date.

Requested by:	adrian (maintainer)
2002-05-19 00:11:08 +00:00
iedowse
caf2c5a16f Fix two casts to "daddr_t *" that should have been "ufs_daddr_t *". 2002-05-18 19:03:00 +00:00
iedowse
cc9d214396 Fix a typo where sizeof(daddr_t) was specified instead of sizeof(doff_t).
Now that daddr_t is 64-bit, this caused hash blocks to be allocated
twice as large as they need to be.
2002-05-18 18:58:27 +00:00
iedowse
794dddf9bf Remove um_i_effnlink_valid, i_spare[] and the ufsmount_u and inode_u
unions, since these were only necessary when ext2fs used ufs code.

Reviewed by:	mckusick
2002-05-18 18:51:14 +00:00
phk
e127001407 Fix ufs_daddr_t/daddr_t type problems.
Sponsored by:	DARPA & NAI labs.
2002-05-17 18:59:53 +00:00
phk
152c7cfbca Call ufs_bmaparray() with right parameter type.
Sponsored by: DARPA & NAI Labs.
2002-05-17 18:53:29 +00:00
trhodes
28d42899b7 More s/file system/filesystem/g 2002-05-16 21:28:32 +00:00
phk
8536ea3cdb Make daddr_t and u_daddr_t 64bits wide.
Retire daddr64_t and use daddr_t instead.

Sponsored by:	DARPA & NAI Labs.
2002-05-14 11:09:43 +00:00
phk
b591373208 Remove register keyword.
Sponsored by:	DARPA & NAI Labs.
Submitted by:	mckusick
2002-05-13 09:22:31 +00:00
phk
e76663bf87 Remove two "register" and a blank line.
Submitted by:	mckusick
Sponsored by:	DARPA & NAI Labs.
2002-05-12 22:54:48 +00:00
phk
e27fed0950 ARGH! SBLOCK is not unused. Try to get this right.
BBSIZE belongs in <sys/disklabel.h> (but shouldn't be a constant).

Define SBLOCK again, using the right math.

Sponsored by: DARPA & NAI Labs.
2002-05-12 20:21:40 +00:00
phk
4aa59982cb Remove #define for BBOFF, it is assumed == 0 so many places that we might
as well forget about it.  In fact the only thing which used it was the
SBOFF macro.

Sponsored by: DARPA & NAI Labs.
2002-05-12 20:00:21 +00:00
phk
57e4ad9a8b Remove unused BBLOCK and SBLOCK #defines.
Sponsored by: DARPA & NAI Labs.
2002-05-12 19:56:31 +00:00
alc
d60a525036 o Condition the compilation and use of vm_freeze_copyopts()
on ENABLE_VFS_IOOPT.
2002-05-06 05:45:57 +00:00
phk
f075f5e39a Move some UFS related stuff home where it belongs. 2002-05-05 20:04:33 +00:00
jeff
86e123b512 Include systm.h so panic(9) is defined when doing DEBUG_ALL_VFS_LOCKS. 2002-05-04 02:40:37 +00:00
phk
a1f99d3509 Name ufs_vop_[gs]etextattr() consistently with the rest of our VOPs and
put then in the ufs_vnops where they belong, rather than in the ffs_vnops.

Ok'ed by:	rwatson
Sponsored by:	DARPA & NAI Labs.
2002-05-03 08:40:33 +00:00
phk
860ddc59bc Use vop_panic() instead of our home-rolled version. 2002-05-02 19:15:52 +00:00
alfred
d1526107c3 Remove support for using soon to be retired "special" poll(2) ops.
Replace with kevent(2) ops.

This is untested, but the code would rot even further if this wasn't
applied.  I've chosen to apply this to prompt some cleanup.

Submitted by: bde
2002-04-18 14:52:28 +00:00
jeff
b972756970 Don't peak into the malloc_type structure for limits. The desired vnodes
check should be sufficient.  This is required for the pending removal of
malloc_type limits.
2002-04-15 03:35:35 +00:00
phk
33405073ec Move generic disk ioctls from <sys/disklabel.h> to <sys/disk.h>.
Sponsored by:	DARPA & NAI Labs
2002-04-08 09:20:07 +00:00
jhb
db9aa81e23 Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on:	i386, alpha, sparc64
2002-04-04 21:03:38 +00:00
phk
65c4a4cb9d Move the FFS parameter MAXFRAG from <sys/param.h> to <ufs/ffs/fs.h>
Sponsored by:	DARPA & NAI Labs.
2002-04-03 20:39:27 +00:00
phk
79a894e10e Use DIOCGSECTORSIZE instead of the bogus DIOCGPART ioctl. 2002-04-02 11:23:14 +00:00
jhb
dc2e474f79 Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API.  The entire API now consists of two functions
similar to the pre-KSE API.  The suser() function takes a thread pointer
as its only argument.  The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0.  The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on:	smp@
2002-04-01 21:31:13 +00:00
bde
a9b7c63b29 In ffs_mountffs(), set mnt_iosize_max to si_iosize_max unconditionally
provided the latter is nonzero.  At this point, the former is a fairly
arbitrary default value (DFTPHYS), so changing it to any reasonable
value specified by the device driver is safe.  Using the maximum of
these limits broke ffs clustered i/o for devices whose si_iosize_max
is < DFLTPHYS.  Using the minimum would break device drivers' ability
to increase the active limit from DFTLPHYS up to MAXPHYS.

Copied the code for this and the associated (unnecessary?) fixup of
mp_iosize_max to all other filesystems that use clustering (ext2fs and
msdosfs).  It was completely missing.

PR:		36309
MFC-after:	1 week
2002-03-30 15:12:57 +00:00
dwmalone
246869b676 Two minor changes to dirhash, which result in some marginal benchmark
improvements.

1) If deleting an entry results in a chain of deleted slots ending in an
   empty slot, then we can be a bit more aggressive about marking slots as
   empty.

2) The last stage of the FNV hash is to xor the last byte of data
   into the hash. This means that filenames which differ only in
   the last byte will be placed close to one another in the hash
   table, which forms longer chains. To work around this common
   case, we also hash in the address of the dirhash structure.

     news/cancel = news/articles/control/cancel for a tradspool inn server
     squid2 = squid level 2 directory (dirs called 00->FF)
     squid3 = squid level 3 directory (files called 00001F00->00001FFF)

                             mean #probes for
                  home dir  mh inbox  news/cancel  tmp    squid2  squid3
old   successful  1.02      3.19      4.07         1.10    7.85   2.06
new   successful  1.04      1.32      1.27         1.04    1.93   1.17

old unsuccessful  1.08      4.50      5.37         1.17   10.76   2.69
new unsuccessful  1.08      1.73      1.64         1.17    2.89   1.37

Reviewed by:	iedowse
MFC after:	2 weeks
2002-03-20 17:58:02 +00:00
jeff
7bd172afc2 Remove references to vm_zone.h and switch over to the new uma API. 2002-03-20 08:48:07 +00:00
alfred
79061a9306 Remove __P. 2002-03-19 22:40:48 +00:00
bde
df8144b98e Fixed some printf format errors (hopefully all of the remaining daddr64_t
ones for GENERIC, and all others on the same line as those).  Reformat
the printfs if necessary to avoid new long lones or old format printf
errors.
2002-03-19 04:09:21 +00:00
mckusick
14dd08fd15 Add a flags parameter to VFS_VGET to pass through the desired
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems.
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.
2002-03-17 01:25:47 +00:00
mckusick
e929f2e4f0 Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.
2002-03-15 18:49:47 +00:00
obrien
b4c5b60736 Quiet a warning on the Alpha. 2002-03-15 04:06:10 +00:00
mckusick
670fad8a9e This corrects the first of two known deadlock conditions that
come from the presence of a snapshot file.
2002-03-14 01:21:13 +00:00
iedowse
cf446bea56 Fix a bug in ufsdirhash_adjfree() that caused it to incorrectly
update the free-space statistics in some cases. The problem affected
directory blocks when the free space dropped below the size of the
maximum allowed entry size. When this happened, the free-space
summary information could claim that there are no further blocks
that can fit a maximum-size entry, even if there are.

The effect of this bug is that the directory may be enlarged even
though there is space within the directory for the new entry. This
wastes disk space and has a negative impact on performance.

Fix it by correctly computing the dh_firstfree array index, adding
a helper macro for clarity. Put an extra sanity check into
ufsdirhash_checkblock() to detect the situation in future.

Found by:	dwmalone
Reviewed by:	dwmalone
MFC after:	1 week
2002-03-11 19:13:22 +00:00
phk
2098322771 I missed one VOP_CLOSE in the previous commit.
Pointed out by:	bde
2002-03-11 16:27:04 +00:00
phk
c68bd5ed71 As a XXX bandaid open the mounted device READ/WRITE even if we only mount
read-only.

The trouble here is that we don't reopen the device in read/write mode
when we remount in read/write mode resulting in a filesystem sending
write requests to a device which was only opened read/only.

I'm not quite sure how such a reopen would best be done and defer
the problem to more agile hackers.
2002-03-11 13:53:00 +00:00
rwatson
1a2700eed9 Update DBA for NAI. We have several. We used the wrong one. :-) 2002-03-07 17:49:06 +00:00
green
965906f60d Add new errno ``ENOATTR''. 2002-03-07 15:13:44 +00:00
dillon
e05fb41e7e cleanup readability syntax prior to ongoing b_resid work commits.
MFC after:	1 day
2002-03-06 00:44:30 +00:00
jhb
b8b3ac8816 Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic
and isn't strictly required.  However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.
2002-02-27 19:18:10 +00:00
jhb
3706cd3509 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
phk
62d248fb9e Replace bowrite() with BUF_WRITE in ufs.
Remove bowrite(), it is now unused.

This is the first step in getting entirely rid of BIO_ORDERED which is
a generally accepted evil thing.

Approved by:	mckusick
2002-02-22 09:03:00 +00:00
rwatson
ad827f33f7 o Minor style fix on #endif, missing '_' in comment. 2002-02-20 15:44:43 +00:00
phk
68389bd8ba Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by:	Andrew Gallatin <gallatin@cs.duke.edu>
2002-02-18 16:18:02 +00:00
phk
c2a47cdbe8 Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.
2002-02-17 21:15:36 +00:00
phk
3348974369 Collect the VN_KNOTE() macro definitions on vnode.h 2002-02-17 21:07:57 +00:00
julian
37369620df In a threaded world, differnt priorirites become properties of
different entities.  Make it so.

Reviewed by:	jhb@freebsd.org (john baldwin)
2002-02-11 20:37:54 +00:00
rwatson
c00ccc43a6 Minor style tweaks.
Remove an unneeded comment and commented out code that won't be
needed.
2002-02-10 04:57:08 +00:00
rwatson
ce1e1df4d1 Copyright + license update. 2002-02-10 04:50:24 +00:00
rwatson
6ca91a055c Part I: Update extended attribute API and ABI:
o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
  as not to use the scatter gather API (which appeared not to be used
  by any consumers, and be less portable), rather, accepts 'data'
  and 'nbytes' in the style of other simple read/write interfaces.
  This changes the API and ABI.

o Modify system call semantics so that extattr_get_{fd,file}() return
  a size_t.  When performing a read, the number of bytes read will
  be returned, unless the data pointer is NULL, in which case the
  number of bytes of data are returned.  This changes the API only.

o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
  argument so as to return the size, if desirable.  If set to NULL,
  the size will not be returned.

o Update various filesystems (pseodofs, ufs) to DTRT.

These changes should make extended attributes more useful and more
portable.  More commits to rebuild the system call files, as well
as update userland utilities to follow.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-02-10 04:43:22 +00:00
phk
d2872f5d10 Remove di_inumber since LFS is long gone. 2002-02-10 00:55:49 +00:00
mckusick
88b4f0b921 Occationally background fsck would cause a spurious ``freeing free
inode'' panic. This change corrects that problem by setting the
fs_active flag when the inode map changes to notify the snapshot
code that the cylinder group must be rescanned.

Submitted by:	Robert Watson <rwatson@FreeBSD.org>
2002-02-07 22:13:56 +00:00
mckusick
9d51e5cc24 Occationally deleted files would hang around for hours or days
without being reclaimed. This bug was introduced in revision 1.95
dealing with filenames placed in newly allocated directory blocks,
thus is not present in 4.X systems. The bug is triggered when a
new entry is made in a directory after the data block containing
the original new entry has been written, but before the inode
that references the data block has been written.

Submitted by:	Bill Fenner <fenner@research.att.com>
2002-02-07 00:54:32 +00:00
mckusick
c142ab455c When taking a snapshot, we must check for active files that have
been unlinked (e.g., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.
2002-02-02 01:42:44 +00:00
mckusick
35edc11259 Add a stub for softdep_request_cleanup() so that compilation without
SOFTUPDATES option works properly.

Submitted by:	Benno Rice <benno@jeamland.net>
2002-01-23 02:18:56 +00:00
mckusick
4f050b1765 This patch fixes a long standing complaint with soft updates in
which small and/or nearly full filesystems would fail with `file
system full' messages when trying to replace a number of existing
files (for example during a system installation). When the allocation
routines are about to fail with a file system full condition, they
make a call to softdep_request_cleanup() which attempts to accelerate
the flushing of pending deletion requests in an effort to free up
space. In the face of filesystem I/O requests that exceed the
available disk transfer capacity, the cleanup request could take
an unbounded amount of time. Thus, the softdep_request_cleanup()
routine will only try for tickdelay seconds (default 2 seconds)
before giving up and returning a filesystem full error. Under typical
conditions, the softdep_request_cleanup() routine is able to free
up space in under fifty milliseconds.
2002-01-22 06:17:22 +00:00
mckusick
4e7dcb216b Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26
which caused incomplete snapshots to be taken. When background
fsck would run on these snapshots, the result would be files
being incorrectly released which would subsequently panic the
kernel with ``handle_workitem_freefile: inodedep survived'',
``handle_written_inodeblock: live inodedep'', and
``handle_workitem_remove: lost inodedep'' errors.
2002-01-17 08:33:32 +00:00
mckusick
0ed7ba2c74 Put write on read-only filesystem panic after we have weeded out
block and character devices, fifo's, etc.

Submitted by:	Bruce Evans <bde@zeta.org.au>
2002-01-16 04:59:09 +00:00
mckusick
b8d6599e4c When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck.  With this change, such files are found at the time of the
down-grade.  Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by:	"Sam Leffler" <sam@errno.com>
2002-01-15 07:17:12 +00:00
alfred
844237b396 SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
   protects all the fields.
   protects "struct file" initialization, while a struct file
     is being changed from &badfileops -> &pipeops or something
     the filedesc should be locked.

1 mutex in each struct file
   protects the refcount fields.
   doesn't protect anything else.
   the flags used for garbage collection have been moved to
     f_gcflag which was the FILLER short, this doesn't need
     locking because the garbage collection is a single threaded
     container.
  could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file *	fhold(struct file *fp);
        /* increments reference count on a file */

struct file *	fhold_locked(struct file *fp);
        /* like fhold but expects file to locked */

struct file *	ffind_hold(struct thread *, int fd);
        /* finds the struct file in thread, adds one reference and
                returns it unlocked */

struct file *	ffind_lock(struct thread *, int fd);
        /* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.
2002-01-13 11:58:06 +00:00
mckusick
c6e526cffc When going to sleep, we must save our SPL so that it does not get
lost if some other process uses the lock while we are sleeping. We
restore it after we have slept. This functionality is provided by
a new routine interlocked_sleep() that wraps the interlocking with
functions that sleep. This function is then used in place of the
old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros.

Submitted by:	Debbie Chu <dchu@juniper.net>
2002-01-12 20:57:36 +00:00
mckusick
b86f75b78b Must call drain_output() before checking the dirty block list
in softdep_sync_metadata(). Otherwise we may miss dependencies
that need to be flushed which will result in a later panic
with the message ``vinvalbuf: dirty bufs''.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
MFC after:	1 week
2002-01-11 19:59:27 +00:00
phk
64c925b060 Do not pull quota entries of the cache-list if they have already
been removed from the cache-list as part of a previous unmount.

This would result in panics (page fault in dqflush()) during subsequent
umounts provided that enough distinct UID's to actually make the
hash do something are active.

This can probably explain a number of weird quota related behaviours.

PR:		32331 maybe more.
Reproduced by:	Søren Schrørder <sch@cybercity.dk>
2002-01-10 15:02:57 +00:00
msmith
6fa204fd80 Initialise the bioops vector hack at runtime rather than at link time. This
avoids the use of common variables.

Reviewed by:	mckusick
2002-01-08 19:32:18 +00:00
dillon
ac9876d609 Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code.  Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release
2001-12-20 22:42:27 +00:00
mckusick
eeb2a6d271 Change the atomic_set_char to atomic_set_int and atomic_clear_char
to atomic_clear_int to ease the implementation for the sparc64.

Requested by:	Jake Burkholder <jake@locore.ca>
2001-12-18 18:05:17 +00:00
iedowse
5c873cad57 Make sure we ignore the value of `fs_active' when reloading the
superblock, and move the initialisation of it to beside where other
pointer fields are initialised.
2001-12-16 18:54:09 +00:00
iedowse
64972486c2 Move the new superblock field `fs_active' into the region of the
superblock that is already set up to handle pointer types. This
fixes an accidental change in the superblock size on 64-bit platforms
caused by revision 1.24.
2001-12-16 18:51:11 +00:00
mckusick
735db38be3 Minimize the time necessary to suspend operations on a filesystem
when taking a snapshot. The two time consuming operations are
scanning all the filesystem bitmaps to determine which blocks
are in use and scanning all the other snapshots so as to be able
to expunge their blocks from the view of the current snapshot.
The bitmap scanning is broken into two passes. Before suspending
the filesystem all bitmaps are scanned. After the suspension,
those bitmaps that changed after being scanned the first time
are rescanned. Typically there are few bitmaps that need to be
rescanned. The expunging of other snapshots is now done after
the suspension is released by observing that we can easily
identify any blocks that were allocated to them after the
suspension (they will be maked as `not needing to be copied'
in the just created snapshot). For all the gory details, see
the ``Running fsck in the Background'' paper in the Usenix
BSDCon 2002 Conference Proceedings, pages 55-64.
2001-12-14 00:15:06 +00:00
mckusick
d375501d04 When a file is partially truncated, we first check to see if the
new file end will land in the middle of a file hole. Since the last
block of a file must always be allocated, the hole is filled by
allocating a block at that location. If the hole being filled is
a direct block, then the truncation may eventually reduce the
full sized block down to a fragment. When running with soft
updates, it is necessary to FSYNC the file after allocating the
block and before creating the fragment to avoid triggering a
soft updates inconsistency when the block unexpectedly shrinks.

Found by:	Matthew Dillon <dillon@apollo.backplane.com>
MFC after:	1 week
2001-12-13 05:07:48 +00:00
rwatson
523aaa204d Use 'mkdir -p /.attribute/system' instead of breaking it into
two seperate mkdir targets.

Submitted by:	jedgar
2001-11-30 15:32:07 +00:00
rwatson
df873c5111 Use 'mkdir -p /.attribute/system' instead of breaking it into
two seperate mkdir targets.
2001-11-30 15:21:20 +00:00
rwatson
3014016ccd README.extattr incorrectly specified sample command lines for
UFS_EXTATTR_AUTOSTART.  Insert the missing 'initattr' arguments
to extattrctl.

Noticed by:	green
2001-11-30 15:15:27 +00:00
guido
6ece6a6a1b When mkdir()-ing, the parent dir gets is linkcount increased.
Fix VN_KNOTE to reflect that.

Found by: tobez@freebsd.org
MFC after:	2 days
2001-11-22 15:33:12 +00:00
iedowse
61d89377de Oops, when trying the dirhash sequential-access optimisation,
compare the slot offset against the predicted offset, not a boolean
flag. This typo effectively disabled the sequential optimisation,
but was otherwise harmless.

Not surprisingly, fixing this improves performance in the sequential
access case. I am seeing a 7% speedup on one machine here; using
dirhash when sequentially looking up directory entries is now about
5% faster instead of 2% slower than the non-dirhash case.

Submitted by:	KOIE Hidetaka <koie@suri.co.jp>
MFC after:	1 week
2001-11-14 15:08:07 +00:00
dillon
1147eaf58a Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write.  This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use.  The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after:	1 week
2001-11-05 18:48:54 +00:00
rwatson
58469c01a1 o Update copyright dates.
o Add reference to TrustedBSD Project in license header.
o Update dated comments, including comment in extattr.h claiming that
  no file systems support extended attributes.
o Improve comment consistency.
2001-11-01 21:37:07 +00:00
rwatson
5e557739dc o Althought this is not specified in POSIX.1e, the UFS ACL implementation
coerces the deletion of a default ACL on a directory when no default
  ACL EA is present to success.  Because the UFS EA implementation doesn't
  disinguish the EA failure modes "that EA name has not been
  administratively enabled" from "that EA name has no defined data",
  there's a potential conflict in error return values.  Normally, the
  lack of administratively configured EA support is coerced to
  EOPNOTSUPP to indicate that ACLs are not available; in this case,
  it is possible to get a successful return, even if ACLs are not
  available because EA support for them has not been enabled.

  Expand the comment in ufs_setacl() to identify this case.

Obtained from:	TrustedBSD Project
2001-10-27 05:39:17 +00:00
rwatson
82286fef85 o Clarify a comment about the locking condition of the vnode upon exit
from ufs_extattr_enable_with_open().
o Print auto-start notifications if (bootverbose).  This was previously
  commented out since it didn't know how to check for bootverbose.
o Drop in comments throughout indicating where ENOENT should be replaced
  with ENOATTR once that is available.

Obtained from:	TrustedBSD Project
2001-10-27 05:19:14 +00:00
rwatson
b985eb702b o The comment about ordering the destruction of the lock and the removal of
the flag indicating that the structure was initialized didn't need
  an XXX, since it didn't need fixing.

Obtained from:	TrustedBSD Project
2001-10-27 05:05:39 +00:00
rwatson
1d903fb701 o Wrap a number of long lines of code, many of which were introduced
due to KSE-related (p) expansions.

Obtained from:	TrustedBSD Project
2001-10-27 05:03:05 +00:00
rwatson
25c2879cc9 Since namespace support was added to the UFS extended attribute
implementation to replace single-character namespace prefixes, '$' is no
longer an invalid attribute name, and the namespace is relevant to
validity determination.

o Remove '$' case from ufs_extattr_valid_attrname()
o Add attrnamespace argument to ufs_extattr_valid_attrname(), and
  fill out appropriately.

Currently no decisions are made based on the namespace argument, but
may be in the future.

Obtained from:	TrustedBSD Project
2001-10-27 04:58:28 +00:00
dillon
f883ef447a Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  Improves looping case by 500%.

Optimize ffs_sync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held.  Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after:	1 week
2001-10-26 00:08:05 +00:00
iedowse
6c4623e897 Default to not performing ufs_dirhash's extensive directory-block
sanity check after every directory modification. This check can be
re-enabled at any time by setting the sysctl "vfs.ufs.dirhash_docheck"
to 1.

This group of sanity tests was there to ensure that any UFS_DIRHASH
bugs could be caught by a panic before a potentially corrupted
directory block would be written to disk. It has served its main
purpose now, so disable it in the interest of performance.

MFC after:	1 week
2001-10-25 22:55:59 +00:00
dillon
45a6fabe87 Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after:	3 days
2001-10-23 01:21:29 +00:00
jhb
4806d88677 Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
  another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
  refcount is > 1 and false (0) otherwise.
2001-10-11 23:38:17 +00:00
jhb
4c62ba7c58 Add missing includes of sys/lock.h. 2001-10-11 17:52:20 +00:00
dillon
dbebfe18a1 Remove panics for rename() race conditions. The panics are inappropriate
because the IN_RENAME flag only fixes a few of the huge number of race
conditions that can result in the source path becoming invalid even
prior to the VOP_RENAME() call.  The panics created a serious security
issue whereby an attacker could fairly easily cause the panic to
occur, crashing the machine.

The correct solution requires a great deal of work in the namei
path cache code.

MFC after:	0 days
2001-10-08 00:37:54 +00:00
rwatson
63b7da3bb4 o Replace two direct uid!=0 comparisons with suser_xxx() calls.
Obtained from:	TrustedBSD Project
2001-10-02 14:41:43 +00:00
rwatson
bf46cc0b03 o Replace two direct uid!=0 comparisons with suser_td() calls.
Obtained from:	TrustedBSD Project
2001-10-02 14:34:22 +00:00
dillon
c998ec6a97 Backout the last commit. The problem is actually much worse then I
first thought and may require serious work to the VOP_RENAME() api itself.
Basically, by the time the VOP_RENAME() function is called, it's already
too late.
2001-10-02 04:26:58 +00:00
dillon
5eaf310a14 IN_RENAME should only be cleared by the routine that set it. This fixes
a rename/rmdir race that has been shown to cause a panic.

Bug reported by: Yevgeniy Aleynikov <eugenea@infospace.com>
MFC after:	3 days
2001-10-02 02:58:48 +00:00
jhb
fe3fc4701b - Fix some minor whitespace nits.
- Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a
  little nit that caused it to be defined as -(sizeof (struct thread) + 1)
  instead of -2.
2001-09-27 21:04:13 +00:00
rwatson
9eed33b643 o Re-enable support of system file flags in jail() by adding back the
PRISON_ROOT to the suser_xxx() check.  Since securelevels may now
  be raised in specific jails, use of system flags can still be
  restricted in jail(), but in a more configurable way.
o Users of jail() expecting system flags (such as schg) to restrict
  jail()'s should be sure to set the securelevel appropriately in
  jail()'s.
o This fixes activities involving automated system flag removal in
  jail(), including installkernel and friends.

Obtained from:	TrustedBSD Project
2001-09-26 20:44:41 +00:00
rwatson
4e4d85b5d1 o Modify ufs_setattr() so that it uses securelevel_gt() instead of
direct variable access.

Obtained from:	TrustedBSD Project
2001-09-26 20:31:37 +00:00
rwatson
56db2a389f o Further clarify comment: ad Udo's request, re-insert the 'if'
refering to securelevels; also, update the unprivileged process text
  to better indicate the scope of actions permittable when any system
  flags are already set (limited).

Submitted by:	Udo Schweigert <udo.schweigert@siemens.com>
2001-09-25 12:02:44 +00:00
rwatson
de37df3afa o Parallelize the comment on the relationship between privileged un-jailed
processes and the actual securelevel check: make the comment use '> 0'
  instead of inverted '<= 0'.
2001-09-25 02:26:10 +00:00
iedowse
af5c7958aa The addition of i_dirhash to struct inode pushed RELENG_4's
sizeof(struct inode) into a new malloc bucket on the i386. This
didn't happen in -current due to the removal of i_lock, but it does
no harm to apply the workaround to -current first.

Reduce the size of the i_spare[] array in struct inode from 4 to
3 entries, and change ext2fs to use i_din.di_spare[1] so that it
does not need i_spare[3].

Reviewed by:	bde
MFC after:	3 days
2001-09-24 18:29:20 +00:00
julian
5596676e6c KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
iedowse
ddaced1703 The "dirpref" directory layout preference improvements make use of
an array "fs_contigdirs[]" to avoid too many directories getting
created in each cylinder group. The memory required for this and
two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with
a single malloc() call, and divided up afterwards.  However, the
'space' pointer is not advanced correctly, so fs_contigdirs and
fs_maxcluster end up pointing to the same address.

Add the missing code to advance the 'space' pointer, and remove
an unnecessary update of the pointer that follows.

This is likely to fix the "ffs_clusteralloc: map mismatch" panics
that have been reported recently.

Submitted by:		Luke Mewburn <lukem@wasabisystems.com>
2001-09-09 23:48:28 +00:00
jedgar
5b9926665b Use ACL_PERM_NONE instead of hardcoding 0 when initializing
ACL entry permissions.

Reviewed by:	rwatson
2001-09-01 23:18:15 +00:00
rwatson
00ec5e8482 o At some point, unmounting a non-EA file system with EA's compiled
in got a bit broken, when ufs_extattr_stop() was called and failed,
  ufs_extattr_destroy() would panic.  This makes the call to destroy()
  conditional on the success of stop().

Submitted by:		Christian Carstensen <cc@devcon.net>
Obtained from:	TrustedBSD Project
2001-09-01 20:11:05 +00:00
peter
4b437abe78 If a file has been completely unlinked, stop automatically syncing the
file.  ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them.  This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.
2001-08-27 06:09:56 +00:00
iedowse
bff1ce4c20 Stop using dirhash when a directory is removed, and ensure that we
never attempt to hash directories once they are deleted. This fixes
a problem where operations on a deleted directory could trigger
dirhash sanity panics.
2001-08-26 20:47:19 +00:00
iedowse
c8ef91ce6c When compacting directories, ufs_direnter() always trusted DIRSIZ()
to supply the number of bytes to be bcopy()'d to move an entry. If
d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible
length, so ufs_direnter could end up corrupting a directory during
compaction. In practice I believe this can only happen after fsck_ffs
has fixed a previously-corrupted directory.

We now deal with any mid-block unused entries specially to avoid
using DIRSIZ() or bcopy() on such entries. We also ensure that the
variables 'dsize' and 'spacefree' contain meaningful values at all
times. Add a few comments to describe better this intricate piece
of code.

The special handling of mid-block unused entries makes the dirhash-
specific bugfix in the previous revision (1.53) now uncecessary,
so this change removes it.

Reviewed by:	mckusick
2001-08-26 01:25:12 +00:00
iedowse
c37a1a0228 When compressing directory blocks, the dirhash code didn't check
that the directory entry was in use before attempting to find it
in the hash structures to change its offset. Normally, unused
entries do not need to be moved, but fsck can leave behind some
unused entries that do. A dirhash sanity panic resulted when the
entry to be moved was not found. Add a check that stops entries
with d_ino == 0 from being passed to ufsdirhash_move().
2001-08-22 01:35:17 +00:00
peter
886fd9ad64 Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS'
without 'options FFS' would fail to link.
2001-08-18 03:08:48 +00:00
iedowse
c768482a0f Two recent commits in sys/ufs/ufs interacted badly with ext2fs
because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink
is invalid because ext2fs does not maintain this field. In ufs_close(),
i_effnlink is also tested, to determines whether or not to call
vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of
ext2fs filesystems; I believe the other is harmless.

Fix both cases by checking um_i_effnlink_valid in the ufsmount
struct, and use i_nlink if necessary.

Noticed by:	bde
Reviewed by:	mckusick, bde
2001-07-29 22:26:01 +00:00
iedowse
5857132545 Disable the dirhash sanity check that panics if an unused directory
entry (d_ino == 0) is found in a position that is not the start of
a DIRBLKSIZ block.

While such entries cannot occur normally (ufs always extends the
previous entry to cover the free space instead), they do not cause
problems and fsck does not fix them, so panicking is bad.
2001-07-27 18:45:41 +00:00
peter
57a6887663 Use a fixed type for times in on-disk structures for ufs rather than
something that could potentially change like time_t.
2001-07-16 00:55:27 +00:00
iedowse
991ee42593 Return a locked struct buf from ufsdirhash_lookup() to avoid one
extra getblk/brelse sequence for each lookup. We already had this
buf in ufsdirhash_lookup(), so there was no point in brelse'ing it
only to have the caller immediately reaquire the same buffer.

This should make the case of sequential lookups marginally faster;
in my tests, sequential lookups with dirhash enabled are now only
around 1% slower than without dirhash.
2001-07-13 20:50:38 +00:00
iedowse
ecbac42d61 Bring in dirhash, a simple hash-based lookup optimisation for large
directories. When enabled via "options UFS_DIRHASH", in-core hash
arrays are maintained for large directories. These allow all
directory operations to take place quickly instead of requiring
long linear searches. For now anyway, dirhash is not enabled by
default.

The in-core hash arrays have a memory requirement that is approximately
half the size of the size of the on-disk directory file. A number
of new sysctl variables allow control over which directories get
hashed and over the maximum amount of memory that dirhash will use:

  vfs.ufs.dirhash_minsize
    The minimum on-disk directory size for which hashing should be
    used. The default is 2560 (2.5k).

  vfs.ufs.dirhash_maxmem
    The system-wide maximum total memory to be used by dirhash data
    structures. The default is 2097152 (2MB).

The current amount of memory being used by dirhash is visible
through the read-only sysctl variable vfs.ufs.dirhash_maxmem.
Finally, some extra sanity checks that are enabled by default, but
which may have an impact on performance, can be disabled by setting
vfs.ufs.dirhash_docheck to 0.

Discussed on: -fs, -hackers
2001-07-10 21:21:29 +00:00
dillon
e028603b7e With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
jhb
1bc2ddffa0 Fix more mntvnode and vnode interlock order reversals. 2001-06-28 22:21:33 +00:00
jhb
822af43e20 - Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.
2001-06-28 04:12:56 +00:00
peter
efc05ef868 Fix warning:
1973: warning: int format, long int arg (arg 5)
2001-06-15 07:44:39 +00:00
mckusick
2f9709c0f1 Build on the change in revision 1.98 by Tor.Egge@fast.no.
The symptom being treated in 1.98 was to avoid freeing a
pagedep dependency if there was still a newdirblk dependency
referencing it. That change is correct and no longer prints
a warning message when it occurs. The other part of revision
1.98 was to panic when a newdirblk dependency was encountered
during a file truncation. This fix removes that panic and
replaces it with code to find and delete the newdirblk
dependency so that the truncation can succeed.
2001-06-13 23:13:13 +00:00
tmm
126158251a Call vn_close on the backing file vnode if ufs_extattr_enable failed to
avoid leaking it.

Reviewed by:	rwatson
2001-06-07 00:11:32 +00:00
jlemon
bf3a0c1bb1 Add a wrapper for the fifo kqfilter which falls through to the ufs routine.
This permits the fifo to inherit the ufs VNODE kqfilter.
2001-06-06 17:40:57 +00:00
jlemon
404f6dd2da Add a kqueue filter for writing to ufs filesystems which always returns
true.  This permits better interoperability with programs which register
filters on their stdin/stdout handles.

Submitted by: Niels Provos <provos@citi.umich.edu>
2001-06-05 13:52:37 +00:00
obrien
c3d076fe28 There seems to be a problem that the order of disk write operation being
incorrect due to a missing check for some dependency.  This change
avoids the freelist corruption (but not the temporarily inconsistent
state of the file system).

A message is printed as a reminder of the under lying problem when a
pagedep structure is not freed due to the NEWBLOCK flag being set.

Submitted by:	Tor.Egge@fast.no
2001-06-05 01:49:37 +00:00
jhb
ff6bb62be3 Revert the previous commit in favor of the fix in rev 1.42 of
ufs/ffs/ffs_extern.h instead.

Requested by:	bde
2001-05-30 23:09:19 +00:00
jhb
b033de0fdc Forward declare struct cg to quiet a warning.
Submitted by:	bde
2001-05-30 23:08:40 +00:00
jhb
3d4bb0e38d Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a
warning.
2001-05-29 23:53:16 +00:00
phk
d442761285 Remove last vestiges of MFS. 2001-05-29 21:21:53 +00:00
phk
785e1e1e17 Remove MFS from the kernel. 2001-05-29 18:50:30 +00:00
tmm
76ca05f26f Add a check to determine whether extended attributes have been
initialized on the file system before trying to grab the lock of the
per-mount extattr structure, as this lock is unitialized in that case.
This is needed because ufs_extattr_vnode_inactive is called from
ufs_inactive, which is also used by EA-unaware file systems such as
ext2fs.

Reviewed by:	rwatson
2001-05-25 18:24:52 +00:00
rwatson
f504530d9f o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
  pcred->pc_uidinfo, which was associated with the real uid, only rename
  it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
  corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
  original macro that pointed.
  p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
  p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
  cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
  cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
  we figure out locking and optimizations; generally speaking, this
  means moving to a structure like this:
        newcred = crdup(oldcred);
        ...
        p->p_ucred = newcred;
        crfree(oldcred);
  It's not race-free, but better than nothing.  There are also races
  in sys_process.c, all inter-process authorization, fork, exec, and
  exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
  remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
  use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
  pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
  allocation.
o Clean up ktrcanset() to take into account changes, and move to using
  suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
  calls to better document current behavior.  In a couple of places,
  current behavior is a little questionable and we need to check
  POSIX.1 to make sure it's "right".  More commenting work still
  remains to be done.
o Update credential management calls, such as crfree(), to take into
  account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
      change_euid()
      change_egid()
      change_ruid()
      change_rgid()
      change_svuid()
      change_svgid()
  In each case, the call now acts on a credential not a process, and as
  such no longer requires more complicated process locking/etc.  They
  now assume the caller will do any necessary allocation of an
  exclusive credential reference.  Each is commented to document its
  reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
  and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
  questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
  processes and pcreds.  Note that this authorization, as well as
  CANSIGIO(), needs to be updated to use the p_cansignal() and
  p_cansched() centralized authorization routines, as they currently
  do not take into account some desirable restrictions that are handled
  by the centralized routines, as well as being inconsistent with other
  similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from:	TrustedBSD Project
Reviewed by:	green, bde, jhb, freebsd-arch, freebsd-audit
2001-05-25 16:59:11 +00:00
dillon
a179ee09ab This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%.  For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by:	tegge, dillon
2001-05-24 07:22:27 +00:00
alfred
edd79c680a ufs_bmaparray() may block on IO, drop vm mutex and aquire Giant when
calling it from the pager routine
2001-05-23 10:30:25 +00:00
ru
35437d86aa - FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file
systems were repo-copied from sys/miscfs to sys/fs.

- Renamed the following file systems and their modules:
  fdesc -> fdescfs, portal -> portalfs, union -> unionfs.

- Renamed corresponding kernel options:
  FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.

- Install header files for the above file systems.

- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
  Makefiles.
2001-05-23 09:42:29 +00:00
mckusick
c9a89dbc03 Update softdep_setup_directory_add prototype to reflect changes in
actual function.

Obtained from:	Jim Bloom <bloom@jbloom.jbloom.org>
2001-05-20 15:59:55 +00:00
mckusick
aac9daff0f Must ensure that all the entries on the pd_pendinghd list have been
committed to disk before clearing them. More specifically, when
free_newdirblk is called, we know that the inode claims the new
directory block. However, if the associated pagedep is still linked
onto the directory buffer dependency chain, then some of the entries
on the pd_pendinghd list may not be committed to disk yet. In this
case, we will simply note that the inode claims the block and let
the pd_pendinghd list be processed when the pagedep is next written.
If the pagedep is no longer on the buffer dependency chain, then
all the entries on the pd_pending list are committed to disk and
we can free them in free_newdirblk. This corrects a window of
vulnerability introduced in the code added in version 1.95.
2001-05-19 19:24:26 +00:00
alfred
a3f0842419 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
mckusick
c8947d0f03 Must be a bit less aggressive about freeing pagedep structures.
Obtained from:	Robert Watson <rwatson@FreeBSD.org> and
		Matthew Jacob <mjacob@feral.com>
2001-05-18 22:16:28 +00:00
mckusick
5411edc0fb When a new block is allocated to a directory, an fsync of a file
whose name is within that block must ensure not only that the block
containing the file name has been written, but also that the on-disk
directory inode references that block. When a new directory block
is created, we allocate a newdirblk structure which is linked to
the associated allocdirect (on its ad_newdirblk list). When the
allocdirect has been satisfied, the newdirblk structure is moved
to the inodedep id_bufwait list of its directory to await the inode
being written.  When the inode is written, the directory entries
are fully committed and can be deleted from their pagedep->id_pendinghd
and inodedep->id_pendinghd lists.
2001-05-17 07:24:03 +00:00
iedowse
dafd513732 Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by:	phk, bp
2001-05-16 18:04:37 +00:00
mckusick
dca0cbadc3 Further fixes for deadlock in the presence of multiple snapshots.
There are still more to find, but this fix should cover the
common cases that folks are hitting.
2001-05-14 17:16:49 +00:00
mckusick
0c59ac9908 If the effective link count is zero when an NFS file handle request
comes in for it, the file is really gone, so return ESTALE.

The problem arises when the last reference to an FFS file is
released because soft-updates may delay the actual freeing of the
inode for some time. Since there are no filesystem links or open
file descriptors referencing the inode, from the point of view of
the system, the file is inaccessible. However, if the filesystem
is NFS exported, then the remote client can still access the inode
via ufs_fhtovp() until the inode really goes away. To prevent this
anomoly, it is necessary to begin returning ESTALE at the same time
that the file ceases to be accessible to the local filesystem.

Obtained from:	Ian Dowse <iedowse@maths.tcd.ie>
2001-05-13 23:30:45 +00:00
mckusick
8b011b16b3 Remove yet another deadlock case. 2001-05-11 07:12:03 +00:00
mckusick
d8ef9cc3b5 When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.
2001-05-08 07:42:20 +00:00
mckusick
1be6156a3d Several fixes for units errors:
1) Do not assume that the superblock will be of size fs->fs_bsize.
   This fixes a panic when taking a snapshot on a filesystem with
   a block size bigger than 8K.
2) Properly calculate the number of fragments that follow the
   superblock summary information. This fixes a bug with inconsistent
   snapshots.
3) When cleaning up a snapshot that is about to be removed, properly
   calculate the number of blocks that need to be checked. This fixes
   a bug that created partially allocated inodes.
4) When moving blocks from a snapshot that is about to be removed
   to another snapshot, properly account for the reduced number of
   blocks in the snapshot from which they are taken. This fixes a
   bug in which the number of blocks released from a snapshot did not
   match the number that it claimed to have.
2001-05-08 07:29:03 +00:00
mckusick
36984adaef When syncing out snapshot metadata, we must temporarily allow recursive
buffer locking so as to avoid locking against ourselves if we need to
write filesystem metadata.
2001-05-08 07:13:00 +00:00
mckusick
547825394d Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce
the amount of time that the filesystem must be suspended. The
current snapshot is elided as well as the earlier snapshots.
2001-05-04 05:49:28 +00:00
phk
c41ac84ca2 Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes. 2001-05-01 09:12:39 +00:00
phk
279d435d13 Remove blatantly pointless call to VOP_BMAP().
Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.
2001-05-01 09:12:31 +00:00
phk
5948c9ed5b Implement vop_std{get|put}pages() and add them to the default vop[].
Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but
the default.
2001-05-01 08:34:45 +00:00
markm
bcca5847d5 Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by:	bde (with reservations)
2001-05-01 08:13:21 +00:00
phk
8e3fa89968 VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".
2001-04-29 12:36:52 +00:00
phk
608c1caf3b Add a vop_stdbmap(), and make it part of the default vop vector.
Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.
2001-04-29 11:48:41 +00:00
phk
a86fd16038 Call ufs_bmaparray() directly instead of indirectly via VOP_BMAP(). 2001-04-29 10:25:30 +00:00
phk
cc0f43bb51 Remove two unused arguments from ufs_bmaparray(). 2001-04-29 10:24:58 +00:00
phk
cc62125a67 Remove faint traces of blind copy&paste. 2001-04-29 10:23:50 +00:00
phk
d2b1930b66 Remove faint traces of non-existant ffs_bmap(). 2001-04-29 10:23:32 +00:00
grog
4b9d9cbaac Revert consequences of changes to mount.h, part 2.
Requested by:	bde
2001-04-29 02:45:39 +00:00
mckusick
ed86738068 Rather than copying all the indirect blocks of the snapshot,
simply mark them as BLK_NOCOPY. This trick cuts the initial
size of the snapshot in half and cuts the time to take a
snapshot by a third.
2001-04-26 00:50:53 +00:00
mckusick
f863141979 When closing the last reference to an unlinked file, it is freed
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.

For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.

Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.
2001-04-25 08:11:18 +00:00
phk
cdc83afc7f Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
2001-04-25 07:07:52 +00:00
iedowse
383dd0a265 Pre-dirpref versions of fsck may zero out the new superblock fields
fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause
panics if these fields were zeroed while a filesystem was mounted
read-only, and then remounted read-write.

Add code to ffs_reload() which copies the fs_contigdirs pointer
from the previous superblock, and reinitialises fs_avgf* if necessary.

Reviewed by:	mckusick
2001-04-24 00:37:16 +00:00
grog
1f5de30718 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
phk
378e561228 This patch removes the VOP_BWRITE() vector.
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine.  For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
2001-04-17 08:56:39 +00:00
mckusick
ba66879022 Add debugging option to always read/write cylinder groups as full
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicing
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:

	ffs_clusteralloc: map mismatch
	ffs_nodealloccg: map corrupted
	ffs_nodealloccg: block not in map
	ffs_alloccg: map corrupted
	ffs_alloccg: block not in map
	ffs_alloccgblk: cyl groups corrupted
	ffs_alloccgblk: can't find blk in cyl
	ffs_checkblk: partially free fragment

The following panics are less likely to be related to this problem,
but might be helped by these debugging options:

	ffs_valloc: dup alloc
	ffs_blkfree: freeing free block
	ffs_blkfree: freeing free frag
	ffs_vfree: freeing free inode

If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.
2001-04-17 05:37:51 +00:00
mckusick
6ea67910b6 Background fsck sysctl operations must use vn_start_write and
vn_finished_write so that they do not attempt to modify a
suspended filesystem.
2001-04-17 05:06:37 +00:00
rwatson
678b28a532 In my first reading of POSIX.1e, I misinterpreted handling of the
ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the
access ACL could be used by privileged processes to change file/directory
ownership.  In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and
ACL_OTHER) should have undefined ae_id fields; this commit attempts
to correct that misunderstanding.

o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid
  associated with the vnode, as those can no longer be extracted from
  the ACL passed as an argument.  Perform all comparisons against
  the passed arguments.  This actually has the effect of simplifying
  a number of components of this call, as well as reducing the indent
  level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP.

o Modify acl_posix1e_check() to return EINVAL if the ae_id field of
  any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value
  other than ACL_UNDEFINED_ID.  As a temporary work-around to allow
  clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before
  each check so that this cannot cause a failure in the short term
  (this work-around will be removed when the userland libraries and
  utilities are updated to take this change into account).

o Modify ufs_sync_acl_from_inode() so that it forces
  ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID
  when synchronizing the ACL from the inode.

o Modify ufs_sync_inode_from_acl to not propagate uid and gid
  information to the inode from the ACL during ACL update.  Also
  modify the masking of permission bits that may be set from
  ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not
  carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT).

o Modify ufs_getacl() so that when it emulates an access ACL from
  the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.

o Clean up ufs_setacl() substantially since it is no longer possible
  to perform chown/chgrp operations using vop_setacl(), so all the
  access control for that can be eliminated.

o Modify ufs_access() so that it passes owner uid and gid information
  into vaccess_acl_posix1e().

Pointed out by:	jedger
Obtained from:	TrustedBSD Project
2001-04-17 04:33:34 +00:00
mckusick
27094e6d21 Update to describe use of mdconfig instead of deprecated vnconfig.
Submitted by:	Steve Ames <steve@virtual-voodoo.com>
2001-04-14 18:32:09 +00:00
mckusick
6a7a6ab20d This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag.
It is described in ufs/ffs/fs.h as follows:

/*
 * Filesystem flags.
 *
 * Note that the FS_NEEDSFSCK flag is set and cleared only by the
 * fsck utility. It is set when background fsck finds an unexpected
 * inconsistency which requires a traditional foreground fsck to be
 * run. Such inconsistencies should only be found after an uncorrectable
 * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
 * it has successfully cleaned up the filesystem. The kernel uses this
 * flag to enforce that inconsistent filesystems be mounted read-only.
 */
#define FS_UNCLEAN    0x01	/* filesystem not clean at mount */
#define FS_DOSOFTDEP  0x02	/* filesystem using soft dependencies */
#define FS_NEEDSFSCK  0x04	/* filesystem needs sync fsck before mount */
2001-04-14 05:26:28 +00:00
mckusick
3931e94b1f Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>.
His description of the problem and solution follow. My own tests show
speedups on typical filesystem intensive workloads of 5% to 12% which
is very impressive considering the small amount of code change involved.

------

  One day I noticed that some file operations run much faster on
small file systems then on big ones. I've looked at the ffs
algorithms, thought about them, and redesigned the dirpref algorithm.

  First I want to describe the results of my tests. These results are old
and I have improved the algorithm after these tests were done. Nevertheless
they show how big the perfomance speedup may be. I have done two file/directory
intensive tests on a two OpenBSD systems with old and new dirpref algorithm.
The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports".
The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release.
It contains 6596 directories and 13868 files. The test systems are:

1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for
   test is at wd1. Size of test file system is 8 Gb, number of cg=991,
   size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current
   from Dec 2000 with BUFCACHEPERCENT=35

2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system
   at wd0, file system for test is at wd1. Size of test file system is 40 Gb,
   number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k
   OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50

You can get more info about the test systems and methods at:
http://www.ptci.ru/gluk/dirpref/old/dirpref.html

                              Test Results

             tar -xzf ports.tar.gz               rm -rf ports
  mode  old dirpref new dirpref speedup old dirprefnew dirpref speedup
                             First system
 normal     667         472      1.41       477        331       1.44
 async      285         144      1.98       130         14       9.29
 sync       768         616      1.25       477        334       1.43
 softdep    413         252      1.64       241         38       6.34
                             Second system
 normal     329         81       4.06       263.5       93.5     2.81
 async      302         25.7    11.75       112          2.26   49.56
 sync       281         57.0     4.93       263         90.5     2.9
 softdep    341         40.6     8.4        284          4.76   59.66

"old dirpref" and "new dirpref" columns give a test time in seconds.
speedup - speed increasement in times, ie. old dirpref / new dirpref.

------

Algorithm description

The old dirpref algorithm is described in comments:

/*
 * Find a cylinder to place a directory.
 *
 * The policy implemented by this algorithm is to select from
 * among those cylinder groups with above the average number of
 * free inodes, the one with the smallest number of directories.
 */

A new directory is allocated in a different cylinder groups than its
parent directory resulting in a directory tree that is spreaded across
all the cylinder groups. This spreading out results in a non-optimal
access to the directories and files. When we have a small filesystem
it is not a problem but when the filesystem is big then perfomance
degradation becomes very apparent.

What I mean by a big file system ?

  1. A big filesystem is a filesystem which occupy 20-30 or more percent
     of total drive space, i.e. first and last cylinder are physically
     located relatively far from each other.
  2. It has a relatively large number of cylinder groups, for example
     more cylinder groups than 50% of the buffers in the buffer cache.

The first results in long access times, while the second results in
many buffers being used by metadata operations. Such operations use
cylinder group blocks and on-disk inode blocks. The cylinder group
block (fs->fs_cblkno) contains struct cg, inode and block bit maps.
It is 2k in size for the default filesystem parameters. If new and
parent directories are located in different cylinder groups then the
system performs more input/output operations and uses more buffers.
On filesystems with many cylinder groups, lots of cache buffers are
used for metadata operations.

My solution for this problem is very simple. I allocate many directories
in one cylinder group. I also do some things, so that the new allocation
method does not cause excessive fragmentation and all directory inodes
will not be located at a location far from its file's inodes and data.
The algorithm is:
/*
 * Find a cylinder group to place a directory.
 *
 * The policy implemented by this algorithm is to allocate a
 * directory inode in the same cylinder group as its parent
 * directory, but also to reserve space for its files inodes
 * and data. Restrict the number of directories which may be
 * allocated one after another in the same cylinder group
 * without intervening allocation of files.
 *
 * If we allocate a first level directory then force allocation
 * in another cylinder group.
 */

  My early versions of dirpref give me a good results for a wide range of
file operations and different filesystem capacities except one case:
those applications that create their entire directory structure first
and only later fill this structure with files.

  My solution for such and similar cases is to limit a number of
directories which may be created one after another in the same cylinder
group without intervening file creations. For this purpose, I allocate
an array of counters at mount time. This array is linked to the superblock
fs->fs_contigdirs[cg]. Each time a directory is created the counter
increases and each time a file is created the counter decreases. A 60Gb
filesystem with 8mb/cg requires 10kb of memory for the counters array.

  The maxcontigdirs is a maximum number of directories which may be created
without an intervening file creation. I found in my tests that the best
performance occurs when I restrict the number of directories in one cylinder
group such that all its files may be located in the same cylinder group.
There may be some deterioration in performance if all the file inodes
are in the same cylinder group as its containing directory, but their
data partially resides in a different cylinder group. The maxcontigdirs
value is calculated to try to prevent this condition. Since there is
no way to know how many files and directories will be allocated later
I added two optimization parameters in superblock/tunefs. They are:

        int32_t  fs_avgfilesize;   /* expected average file size */
        int32_t  fs_avgfpdir;      /* expected # of files per directory */

These parameters have reasonable defaults but may be tweeked for special
uses of a filesystem. They are only necessary in rare cases like better
tuning a filesystem being used to store a squid cache.

I have been using this algorithm for about 3 months. I have done
a lot of testing on filesystems with different capacities, average
filesize, average number of files per directory, and so on. I think
this algorithm has no negative impact on filesystem perfomance. It
works better than the default one in all cases. The new dirpref
will greatly improve untarring/removing/coping of big directories,
decrease load on cvs servers and much more. The new dirpref doesn't
speedup a compilation process, but also doesn't slow it down.

Obtained from:	Grigoriy Orlov <gluk@ptci.ru>
2001-04-10 08:38:59 +00:00
rwatson
2208cab11f o Indent sub-section headings to be consistent with README.extattr.
Obtained from:	TrustedBSD Project
2001-04-03 18:05:03 +00:00
rwatson
f39773137b o Introduce a README file describing briefly how to use access control
lists, in the style of FFS README files for soft updates and snapshots.

Obtained from:        TrustedBSD Project
2001-04-03 17:58:25 +00:00
rwatson
d43ef707ba o Introduce a README file describing briefly how to use extended
attributes, in the style of FFS README files for soft updates and
  snapshots.

Obtained from:	TrustedBSD Project
2001-04-03 17:31:36 +00:00
rwatson
6805eb2bf4 o Change the default from using IO_SYNC on EA set and delete operations
to not using IO_SYNC.  Expose a sysctl (debug.ufs_extattr_sync) for
  enabling the use of IO_SYNC.

    - Use of IO_SYNC substantially degrades ACL performance when a
      default ACL is set on a directory, as there are four synchronous
      writes initiated to define both supporting EAs for new
      sub-directories, and to set the data; two for new files.  Later, this
      may be optimized to two writes for sub-directories, one for new
      files.

    - IO_SYNC does not substantially improve consistency properties due
      to the poor consistency properties of existing permissions (which
      ACLs are a superset of), due to interaction with soft updates,
      and due to differences in handling consistency for data and file
      system meta-data.

    - In macro-benchmarks, this reduces the overhead of setting default
      ACLs down to the same overhead as enabling ACLs on a file system
      and not using them.  Enabling ACLs still introduces a small
      overhead (I measure 7% on a -j 2 buildworld with pre-allocated
      EA backing store, but this is not rigorous testing, nor in any way
      optimized).

    - The sysctl will probably change to another administration method
      (or at least, a better name) in the near future, but consistency
      properties of EAs are still being worked out.  The toggle is defined
      right now to allow easier performance analysis and exploration
      of possible guarantees.

Obtained from:	TrustedBSD Project
2001-04-03 04:09:53 +00:00
rwatson
3100bf9079 o Correct an ACL implementation bug that could result in a system panic
under heavy use when default ACLs were bgin inherited by new files
  or directories.  This is done by removing a bug in default ACL
  reading, and improving error handling for this failure case:

    - Move the setting of the buffer length (len) variable to above the
      ACL type (ap->a_type) switch rather than having it only for
      ACL_TYPE_ACCESS.  Otherwise, the len variable is unitialized in
      the ACL_TYPE_DEFAULT case, which generally worked right, but could
      result in failure.

    - Add a check for a short/long read of the ACL_TYPE_DEFAULT type from
      the underlying EA, resulting in EPERM rather than passing a
      potentially corrupted ACL back to the caller (resulting "cleaner"
      failures if the EA is damaged: right now, the caller will almost
      always panic in the presence of a corrupted EA).  This code is similar
      to code in the ACL_TYPE_ACCESS handling in the previous switch case.

    - While I'm fixing this code, remove a redundant bzero() of the ACL
      reader buffer; it need only be initialized above the acl_type
      switch.

Obtained from:	TrustedBSD Project
2001-04-02 01:02:32 +00:00
rwatson
737ae0941e Introduce support for POSIX.1e ACLs on UFS-based file systems. This
implementation is still experimental, and while fairly broadly tested,
is not yet intended for production use.  Support for POSIX.1e ACLs on
UFS will not be MFC'd to RELENG_4.

This implementation works by providing implementations of VOP_[GS]ETACL()
for FFS, as well as modifying the appropriate access control and file
creation routines.  In this implementation, ACLs are backed into extended
attributes; the base ACL (owner, group, other) permissions remain in the
inode for performance and compatibility reasons, so only the extended and
default ACLs are placed in extended attributes.  The logic for ACL
evaluation is provided by the fs-independent kern/kern_acl.c.

o Introduce UFS_ACL, a compile-time configuration option that enables
  support for ACLs on FFS (and potentially other UFS-based file systems).
o Introduce ufs_getacl(), ufs_setacl(), ufs_aclcheck(), which
  respectively get, set, and check the ACLs on the passed vnode.
o Introduce ufs_sync_acl_from_inode(), ufs_sync_inode_from_acl() to
  maintain access control information between inode permissions and
  extended attribute data.
o Modify ufs_access() to load a file access ACL and invoke
  vaccess_acl_posix1e() if ACLs are available on the file system
o Modify ufs_mkdir() and ufs_makeinode() to associate ACLs with newly
  created directories and files, inheriting from the parent directory's
  default ACL.
o Enable these new vnode operations and conditionally compiled code
  paths if UFS_ACL is defined.

A few notes:

o This implementation is fairly widely tested, but still should be
  considered experimental.
o Currently, ACLs are not exported via NFS, instead, the summarizing
  file mode/etc from the inode is.  This results in conservative
  protection behavior, similar to the behavior of ACL-nonaware programs
  acting locally.
o It is possible that underlying binary data formats associated with
  this implementation may change.  Consumers of the implementation
  should expect to find their local configuration obsoleted in the
  next few months, resulting in possible loss of ACL data during an
  upgrade.
o The extended attributes interface and implementation is still
  undergoing modification to address portable interface concerns, as
  well as performance.
o Many applications do not yet correctly handle ACLs.  In general,
  due to the POSIX.1e ACL model, behavior of ACL-unaware applications
  will be conservative with respects to file protection; some caution
  is recommended.
o Instructions for configuring and maintaining ACLs on UFS will be
  committed in the near future; in the mean time it is possible to
  reference the README included in the last UFS ACL distribution
  placed in the TrustedBSD web site:

      http://www.TrustedBSD.org/downloads/

Substantial debugging, hardware, travel, or connectivity support for this
project was provided by: BSDi, Safeport Network Services, and NAI Labs.
Significant coding contributions were made by Chris Faulhaber.  Additional
support was provided by Brian Feldman, Thomas Moestl, and Ilmar Habibulin.

Reviewed by:	jedgar, keichii, mckusick, trustedbsd-discuss, freebsd-fs
Obtained from:	TrustedBSD Project
2001-03-26 17:53:19 +00:00
phk
c47745e977 Send the remains (such as I have located) of "block major numbers" to
the bit-bucket.
2001-03-26 12:41:29 +00:00
asmodai
05c87e82c2 Fix typo ); -> , 2001-03-24 15:25:04 +00:00
mckusick
c6fdb61aa7 Check that background fsck operation is being done on a ufs filesystem.
Obtained from:	Robert Watson <rwatson@FreeBSD.org>
2001-03-23 20:58:25 +00:00
rwatson
cae18aa0fd o Remove an unnecessary debugging printf from ufs_extattr_lookup(),
which resulted in the output of warning messages at boot if
  UFS_EXTATTR_AUTOSTART was enabled but ".attribute" and possible
  sub-directories weren't in a mounted MFS or UFS file systems.

Pointed out by:	dcs
Obtained from:	TrustedBSD Project
2001-03-21 23:00:39 +00:00
mckusick
69603157de Add kernel support for running fsck on active filesystems. 2001-03-21 04:09:01 +00:00
mckusick
39275d892c Clear the fs_clean flag only when the FS_UNCLEAN flag is not set
(as is done in unmount).

Remove a snapshot inode from the superblock list when its last
name goes away rather than when its last reference goes away.
That way it will be properly reclaimed by fsck after a crash
rather than reenabled when the filesystem is mounted.
2001-03-21 04:05:20 +00:00
mckusick
d22815bec3 Report the correct inode number when panicing with freeing free inode.
Report the correct block number when panicing with freeing free block.
2001-03-21 04:01:02 +00:00
rwatson
b777dece07 o Enable UFS-based extended attribute support on MFS. Note that this change
is under-tested, and that MFS appears to be in the process of being
  deprecated in favor of FFS over md.  Note also that UFS_EXTATTR_AUTOSTART
  doesn't make much sense on MFS unless the MFSROOT is compiled in, so
  manual configuration is generally required.

Obtained from:	TrustedBSD Project
2001-03-19 06:44:18 +00:00
rwatson
0012887962 o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by:	jkh
Obtained from:	TrustedBSD Project
2001-03-19 05:44:15 +00:00