bypass some extra anti-foot-shooting measures. Currently, its only
effect is to allow detaching a device while it's still open (e.g.,
mounted). This is useful for testing how the system reacts to a disk
suddenly going away, which can happen with some removeable media.
At this point, the force option is only checked on detach, so it
would've been possible to allow the option to be passed with the
MDIOCDETACH operation. This was not done to allow the possibility of
having the force flag influence other tests in the future, which may
not necessarily deal with detaching the device.
Reviewed by: sobomax
Approved by: phk
Avoid using parenthesis enclosure macros (.Pq and .Po/.Pc) with plain text.
Not only this slows down the mdoc(7) processing significantly, but it also
has an undesired (in this case) effect of disabling hyphenation within the
entire enclosed block.
into sadb_x_sa2_sequence from sadb_x_sa2_reserved3 in the sadb_x_sa2
structure. Also the output of setkey is changed. sequence number
of the sadb is replaced to the end of the output.
Obtained from: KAME
pointed out by bde:
- Ask for user confirmation before adjusting to a head/cylinder
boundary (only when running interactively), and separate this
adjustment from the automatic calculation of c/h/s parameters.
- In sanitize_partition, don't change any values in the slice until
we know that the automatic adjustment will succeed.
- When auto-adjusting, ignore unused slices and give an appropriate
error for other zero-size slices depending on the cause.
- Change dos() to do all of the c/h/s calculations for a whole slice;
this fixes a bug where the ending c/h/s of an unused slice was set
incorrectly.
- When changing the active slice, detect the currently active slice
number instead of always defaulting to slice 4.
- Call fflush(stdout) before calling fgets().
- Test for fgets() returning NULL so we don't loop on EOF.
Reviewed by: bde
1.) prefix all functions in the library with devstat_ (compatability
functions are available for all functions that were chaned in an
incompatible way, but are deprecated).
2.) Add a pointer to a kvm_t as the first argument to functions that
used to get their information via sysctl; they behave the same
as before when NULL is passed as this argument, otherwise, the
information is obtained via libkvm using the supplied handle.
3.) Add a new function, devstat_compute_statistics(), that is intended
to replace the old compute_stats() function. It offers more
statistics data, and has a more flexible interface.
libdevstat does now require libkvm; a library depedency is added, so
that libkvm only needs to be explicitely specified for statically linked
programs.
The library major version number is bumped.
Submitted by: Sergey A. Osokin <osa@freebsd.org.ru>, ken (3)
Reviewed by: ken
- Declare mtabhead as an extern in mounttab.h and define it only in
mounttab.c.
- Remove shared global `verbose' and instead pass it as a parameter.
- Remove the `mtabp' argument to read_mtab(). It served no purpose
whatsoever, although read_mtab() did use it as a temporary local
variable.
- Don't check for impossible conditions when parsing mounttab, and
do detect zero-length fields.
- Correctly test for strtoul() failures - just testing ERANGE is wrong.
- Include a field name in syslog errors, and avoid passing NULL to
a syslog %s field.
- Don't test if arrays are NULL.
- If there are duplicates when writing out mounttab, keep the last
entry instead of the first, as it will have a later timestamp.
- Fix a few formatting issues.
Update rpc.umntall and umount to match the mounttab interface changes.
information for any command line error, the actual error message
almost always (and sometimes irretrievably) lost scrolling off the top
of the screen. Now just print the error. Give ipfw(8) no arguments for
the old usage summary.
Thanks to Lyndon Nerenberg <lyndon@orthanc.ab.ca> for the patch and
PR, but I had already done this when ru pointed out the PR.
PR: bin/28729
Approved by: ru
MFC after: 1 week
immediately if a host specified by the -h flag cannot be parsed
instead of attempting to unmount all NFS filesystems, which was
bad.
Add a missing return statement at the end of checkname(); this
could result in a non-zero exit status in some cases even if the
unmount succeeded.
Group two separate NFS-related operations into one block to make
it more obvious that a variable (hostp) is not dereferenced when
uninitialised. Initialise it to NULL anyway to avoid a warning.
Pass in the read_mtab()'s bogus argument as NULL instead of messing
with a local variable to achieve the same effect. A later commit
will clean up this mounttab interface.
forever by default. This matches what mount_nfs did before revision
1.40, and it is the generally expected behaviour for NFS mounts.
Document the current defaults near the start of the man page and
mention the options that can be used to change them.
Discussed on: -hackers
to give up after one attempt unless a background mount is requested.
Background mounts would retry 10000 times (at least 7 days) before
giving up.
For some situations such as diskless terminals, an NFS filesystem
may be critical to the boot process, so neither the "try once" nor
background mounts are appropiate. To cater for this situation,
unbreak the -R (retry count) parameter so that it also works in
the non-background case. Interpret a zero retry count as "retry
forever".
The defaults are now "try once" for non-background mounts and "retry
forever" for background mounts; both can be overridden via -R.
Add a description of this behaviour to the manpage.
device search code i introduce nearly six years ago in rev 1.8. Bruce
suggested to rather use the device name of the root filesystem instead
which is certainly the most sensible default. Since there are many
possible cases for a root filesystem name (device with and without
slices, consider /dev/vinum/root even though it currently could not
work as such), there's some heuristic using a RE in order to find out
the canonical device name from the mounted name. This probably won't
quite fit for a NFS root (can't test that right now), but then,
there's hard to find a good default for those machines anyway. ;-)
This unbreaks the functionality of rev 1.2 i once broke in 1.8. :)
to use 0xffffffff (INADDR_NONE) as a netmask value. The fix
is to use inet_addr(3) which doesn't suffer from this problem.
PR: bin/28873
Also, while here, fixed the bug when netmask value was ignored
(RTF_HOST flag was set) if the "destination gateway netmask"
syntax is used, e.g. ``route add 1.2.3.4 127.1 255.255.255.255''.
The original code was certainly broken; it knows that whereto is
to be used for a sockaddr_in, so it should be declared as such.
To support multiple protocols, there is also a sockaddr_storage
struct that can be used; I don't think struct sockaddr is supposed
to be used anywhere other than for casts and pointers.
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
MFC after: 3 weeks
This one is strange and goes against my rusty compiler knowledge.
The global declaration
struct sockaddr whereto;
produces for both i386 && alpha:
.comm whereto,16,1
which means common storage, byte aligned. Ahem. I though structs
were supposed to be ALDOUBLE always? I mean, w/o pragma packed?
Later on, this address is coerced to:
to = (struct sockaddr_in *)&whereto;
Up until now, we've been fine on alpha because the address
just ended up aligned to a 4 byte boundary. Lately, though,
it end up as:
0000000120027b0f B whereto
And, tra la, you get unaligned access faults. The solution I picked, in
lieu of understanding what the compiler was doing, is to put whereto
as a union of a sockaddr and sockaddr_in. That's more formally correct
if somewhat awkward looking.
prematurely terminate the search for a usable disk. ENOENT is quite
normal in particulare now with the advent of devfs.
While being here, also remove /dev/wd0 and /dev/od0 from the list of
disks to search since we don't have them anymore.
MFC after: 1 week
backslash as nothing, treat it like a space so that adjacent lines
aren't glued together.
PR: 8479
Submitted by: Adrian Filipi-Martin <adrian@ubergeeks.com>
user runs with privilege, allowing the sending of icmp packets with
larger size (up to 48k, the default receive buffer size in ping),
which is useful for network driver development testing, as well
as experimentation with fragmentation.
Reviewed by: wpaul
ensure that we never proceed with the mount() syscall if the server
is replying from the wrong source address. Previously the userland
RPC call to the remote nfsd would succeed, but the kernel uses
connect() so it would not see the replies, resulting in a hung
mount.
NQNFS code is ancient, bug-ridden, and should probably be removed).
The wording here was very confusing; it was easy to get the impression
that NQNFS is an extension to NFSv3 when in fact it just uses some
NFSv3-like extensions on top of NFSv2. As witnessed by the mailing
lists and PRs, some people were reading the description and deciding
that NQNFS was what they wanted to use.
MFC after: 1 week
driver itself obviously won't configure such a disk, but the error
returned (EDOM) is more cryptic to the average user than it should be.
Also assert that the argument to -u is in fact a valid unit; don't
just accept any string to mean 0.
Approved by: phk
in revision 1.48. It is pretty valid and often feasible to use
a non-point-to-point interface as the gateway. One might, for
example, use this to route some hosts through an ARP on a local
interface, without having to assign an additional IP address:
Script started on Tue Jun 12 16:16:09 2001
# ifconfig rl0 inet
rl0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet 192.168.4.115 netmask 0xffffff00 broadcast 192.168.4.255
# netstat -arn -finet | grep -w rl0
192.168.4 link#1 UC 3 0 rl0 =>
192.168.4.65 0:d0:b7:16:9c:c6 UHLW 1 0 rl0 1197
# route add -net 192.168.100 -iface rl0
add net 192.168.100: gateway rl0
# ping 192.168.100.1
PING 192.168.100.1 (192.168.100.1): 56 data bytes
64 bytes from 192.168.100.1: icmp_seq=0 ttl=255 time=0.551 ms
64 bytes from 192.168.100.1: icmp_seq=1 ttl=255 time=0.268 ms
^C
--- 192.168.100.1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.268/0.410/0.551/0.142 ms
# netstat -arn -finet | grep -w rl0
192.168.4 link#1 UC 3 0 rl0 =>
192.168.4.65 0:d0:b7:16:9c:c6 UHLW 1 0 rl0 1165
192.168.100 link#1 UCSc 1 0 rl0 =>
192.168.100.1 0:d0:b7:16:9c:c6 UHLW 1 4 rl0 1192
Script done on Tue Jun 12 16:17:12 2001
This is needed to pick up the right headers. Wrong headers from
src/contrib/ipfilter are used otherwise.
The right fix would be to fix contrib/ipfilter C sources to pick up
headers from <sys/netinet>.
Noticed by: peter
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problem after the snap was out were fixed.
There are many many changes since last KAME merge.
TODO:
- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.
Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks
the individual options to increment argv and decrement argc. This
caused the -T option to swallow an extra argument.
PR: 27982
Submitted by: Samuel Greear <sgreear@vsni.com>
a route to the gateway and caches it in the route structure.
It may happen (if the routing table is screwed) that the gateway
route is the same route as the one being modified, in which case
a kernel reports EDQUOT. Be more verbose about this:
# route add -net 10 192.168.4.65
add net 10: gateway 192.168.4.65
# netstat -rn -finet
Routing tables
Internet:
Destination Gateway Flags Refs Use Netif Expire
default 192.168.4.65 UGSc 1 7 rl0
10 192.168.4.65 UGSc 0 0 rl0
127.0.0.1 127.0.0.1 UH 0 178 lo0
192.168.4 link#1 UC 2 0 rl0 =>
192.168.4.65 0:d0:b7:16:9c:c6 UHLW 2 0 rl0 1123
Before:
# route change -net 10 10.0.0.1
route: writing to routing socket: Disc quota exceeded
change net 10: gateway 10.0.0.1: Disc quota exceeded
After:
# ./route change -net 10 10.0.0.1
route: writing to routing socket: Disc quota exceeded
change net 10: gateway 10.0.0.1: gateway uses the same route
PR: bin/1093, misc/26833
blackhole(4), except that blackhole(4) uses sysctl's. This xref
obviously isn't appropriate unless we want to xref all the other man
pages which mention sysctls, which we obviously don't (we may want to
list those sysctls, but that's another story).
PR: 27937
Submitted by: yar
PR: bin/12489
- Use inet_ntoa(3) where it should have been used. This
part of code simply wasn't converted to the "new" style
after the routename() function was converted from the
protocol-generic version to protocol-specific version
in CSRG revision 5.6.
MFC after: 1 week
but list them if -d was specified).
Avoid listing expired dynamic rules unless the (new) -e option was specified.
If specific rule numbers were listed on the command line, and the -d flag was
specified, only list dynamic rules that match the specified rule numbers.
Try to partly clean up the bleeding mess this file has become. If there is
any justice in this world, the responsible parties (you know who you are!)
should expect to wake up one morning with a horse's head in their bed. The
code still looks like spaghetti, but at least now it's *properly intented*
spaghetti (hmm? did somebody say "tagliatelle"?).
when comparing with the alternate superblock. These fields are used
for temporary in-core information only. This should fix the "VALUES
IN SUPER BLOCK DISAGREE WITH THOSE IN FIRST ALTERNATE" error from
fsck_ffs that has been seen a lot recently.
attempting to remove nonexistant exports with MNT_DELEXPORT returns
an error; before this change it always succeeded. This caused
mountd(8) to log "can't delete exports for /whatever" warnings.
Change the error code from EINVAL to a more specific ENOENT, and
make mountd ignore this error when deleting the export list. I
could have just restored the previous behaviour of returning success,
but I think an error return is a useful diagnostic.
Reviewed by: phk
printed on a single, very long, and generally unreadable line. This
isn't very useful. It's also really ugly and most of the time you don't
care what media is supported anyway.
PR: 27701
Submitted by: Brooks Davis <brooks@one-eyed-alien.net>
- introduce a -o option that displays opaque variables.
- introduce a -x option that displays opaque variables in full.
- deprecate -A in favor of -ao and -X in favor of -ax.
- remove -A and -X from usage() and SYNOPSIS (but not from DESCRIPTION).
- ignore -a if one or more variables were listed on the command line.
- deprecate -w, it is not needed to determine the user's intentions.
- some language and style cleanup in the man page.
This commit should not break any existing scripts.
MFC after: 4 weeks
despite the fact that most people want to set exactly the same settings
regardless of which card they have. It has been repeatidly suggested
that this configuration should be done via ifconfig. This patch
implements the required functionality in ifconfig and add support to the
wi and an drivers. It also provides partial, untested support for the
awi driver.
PR: 25577
Submitted by: Brooks Davis <brooks@one-eyed-alien.net>
systems were repo-copied from sys/miscfs to sys/fs.
- Renamed the following file systems and their modules:
fdesc -> fdescfs, portal -> portalfs, union -> unionfs.
- Renamed corresponding kernel options:
FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.
- Install header files for the above file systems.
- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
Makefiles.
if the kernel module is built that way.
Remove the gross debug device/non-debug device hack used to recognize
whether the kernel module was in sync with the userland module.
vinum_mirror, vinum_raid4, vinum_raid5.
Correct typos.
Show new output of the 'list' and 'ls' commands.
Update examples to use 279 kB stripe sizes instead of 256 kB.
Clarify some text.
Remove the description of the 'invalid ioctl' messages which now no
longer occur.
Add a description of the 'retryerrors' keyword.
to avoid including the kernel headers.
Move a number of definitions of userland functions from
dev/vinum/vinumext.h.
Desired by: bde
This commit is the first of a general cleanup of the header files..
It won't be enough to make bde happy.
Remove vinum_perror and associated DEVBUG definition.
Use userland expurgated versions of kernel structures, since that's
what the ioctls return now.
Remove vinum_perror.
main: Check kernel version with userland version in _vinum_conf. This
field is a constant which gets incremented every time the
kernel-userland interface changes. This enables vinum(8) to
check for the correct kernel version and to produce a useful
message if it doesn't match. For previous versions, which don't
have a version number, the length of the structure is different,
so we can recognize it via the EINVAL return from ioctl.
Supply count parameter to tokenize().
Change method of recognizing active devfs: replace devfs_is_active
with (complemented) no_devfs.
make_devices: remove references to devfs. If we're running devfs, we
don't need to call make_devices at all.
vinum_makedev (user command 'makedev'): Print a warning message if
devfs is running and don't do anything else.
Remove vinum_perror.
Modify 'list' brief printout to fit in 80 columns.
Modify 'ls' brief printout to show the drive to which the subdisk
before instead of the plex offset, which is usually less interesting.
The verbose printout remains unchanged.
Use userland expurgated versions of kernel structures, since that's
what the ioctls return now.
Move checkupdates here to simplify header file mess.
Remove 'vinum_perror'.
Only call make_devices if we're not running devfs.
Use userland expurgated versions of kernel structures, since that's
what the ioctls return now.
Update help list, which was lagging behind reality.
checkupdates: move to list.c to simplify header file mess.
vinum_stripe, vinum_mirror, vinum_raid4, vinum_raid5: change the
default stripe size from 256 k to 279 k, thus hopefully spreading
superblocks more evenly.
rules. Also, don't show dynamic rules if you only asked to see a
certain rule number.
PR: 18550
Submitted by: Lyndon Nerenberg <lyndon@orthanc.ab.ca>
Approved by: luigi
MFC after: 2 weeks
page with *all* the permissible values.
This should really be spelt ipencap (as /etc/protocols does),
but a precedent has already been set by the ipproto array in
setkey.c.
It would be nice if /etc/protocols was parsed for the upperspec
field, but I don't do yacc/lex...
This change allows policies that only encrypt the encapsulated
packets passing between the endpoints of a gif tunnel. Setting
such a policy means that you can still talk directly (and
unencrypted) between the public IP numbers with (say) ssh.
MFC after: 1 week
function; we now handle unknown protocols more gracefully.
- Cache the return from getnetconfigent() so that we don't have to
remember to call freenetconfigent() each time. This fixes a memory
leak that would cause retrying background mount_nfs processes to
slowly increase their memory usage.
longer includes machine/elf.h.
* consumers of elf.h now use the minimalist elf header possible.
This change is motivated by Binutils 2.11.0 and too much clashing over
our base elf headers and the Binutils elf headers.
least in -w's case, simply unsetting the correct bit in init_flags was not
enough. The bit may be reset later if, say, the filesystem is marked `ro'
in fstab. The command line option should override the fstab setting, but
did not. The implementation of -r was changed for consistency.
PR: 26886
Reviewed by: archie
Traditionally, fsck is invoked before the filesystems are mounted
and all checks are done to completion at that time. If background
checking is available, fsck is invoked twice. It is first invoked
at the traditional time, before the filesystems are mounted, with
the -F flag to do checking on all the filesystems that cannot do
background checking. It is then invoked a second time, after the
system has completed going multiuser, with the -B flag to do checking
on all the filesystems that can do background checking. Unlike
the foreground checking, the background checking is started
asynchonously so that other system activity can proceed even on
the filesystems that are being checked.
At the moment, only the fast filesystem supports background checking.
To be able to do background checking, a filesystem must have been
running with soft updates, not have been marked as needing a
foreground check, and be mounted and writable when the background
check is to be done (i.e., not listed as `noauto' in /etc/fstab).
These changes are the final piece needed to support background
filesystem checking. They will not have any effect until you update
your /etc/rc to invoke fsck in its new mode of operation. I am
still playing around with exactly what those changes should be
and should be committing them later this week.
filesystem needs foreground checking (usually at boot time) or
can defer to background checking (after the system is up and running).
See the manual page, fsck_ffs(8), for details on the -F and -B options.
These options are primarily intended for use by the fsck front end.
All output is directed to stdout so that the output is coherent
when redirected to a file or a pipe. Unify the code with the fsck
front end that allows either a device or a mount point to be
specified as the argument to be checked.
always look up -network and -mask addresses numerically before
trying getnetbyname(). Without this, we may end up attempting DNS
queries on silly names such as "127.0.0.0.my-domain.com". See the
commit log from revisions 1.21 and 1.20 for further details.
removes the last path component until the mount() succeeds. However,
the code never checks if it has passed the mountpoint, so in some
cases where the mount() never succeeds, it can end up applying the
flags from a mounted filesystem to the underlying one.
Add a sanity check to the code which removes the last path component:
test that the fsid associated with the new path is the same as that
of the old one.
PR: bin/7872
a number of assumptions related to the parsing of options in
/etc/exports, and missed a few necessary new error checks.
The main problems related to netmasks: an IPv6 network address
missing a netmask would result in the filesystem being exported to
the whole IPv6 world, non-continuous netmasks would be made continuous
without any warnings, and nothing prevented you specifying an IPv4
mask with an IPv6 address.
This change addresses these issues. As a side-effect we now store
netmasks in sockaddr structs (this matches the kernel interface,
and is closer to the way it used to be). Add a flag OP_HAVEMASK to
keep track of whether or not we have successfully got a mask from
any source. Replace some mask-related helper functions with versions
that use the sockaddr-based masks.
Also tidy up get_net() and fix the code that interprets IPv4 partial
networks such as "127.1" as network rather than host addresses.
Properly zero out some structures that were ending up partially
containing junk from the stack, fix a few formatting issues, and
add a comment noting some assumptions about export arguments.
would call malloc, stdio and other library functions from the signal
handler which is not safe due to reentrancy problems.
Instead, add a simple handler that just sets a flag, and call the
more complex function from main() when necessary. Unfortunately to
be able to check this flag, we must expand the svc_run() call, but
the RPC library makes that relatively easy to do.
- Remove some horrible code that faked a "struct addrinfo" to be
later passed to freeaddrinfo(). Instead, add a new group type
"GT_DEFAULT" used to denote that the filesystem is exported to the
world, and treat this case separately.
- Don't clear the AI_CANONNAME flag in a struct addrinfo returned
by getaddrinfo. There's still a bit more struct addrinfo abuse
left in here.
- Simplify do_mount() slightly by using an addrinfo pointer to keep
track of the current address.
- Revert del_mlist() to its pre-tirpc prototype. Unlike NetBSD's version,
ours lets the caller generate any syslog() messages, so that it
can include the service name in the message.
- Initialise a few local variables to clarify the logic and avoid some
compiler warnings.
- Remove a few unused functions and local variables, and fix some
whitespace issues.
- Reinstate the logic for avoiding duplicate host entries that got
removed accidentally in revision 1.41 (added in r1.5). This bit
was submitted in a slightly different form by Thomas Quinot.
Submitted by: Martin Blapp <mb@imp.ch>,
Thomas Quinot <quinot@inf.enst.fr>
PR: bin/26148
1) Set the FS_NEEDSFSCK flag when unexpected problems are encountered.
2) Clear the FS_NEEDSFSCK flag after a successful foreground cleanup.
3) Refuse to run in background when the FS_NEEDSFSCK flag is set.
4) Avoid taking and removing a snapshot when the filesystem is already clean.
5) Properly implement the force cleaning (-f) flag when in preen mode.
Note that you need to have revision 1.21 (date: 2001/04/14 05:26:28) of
fs.h installed in <ufs/ffs/fs.h> defining FS_NEEDSFSCK for this to compile.
Because the kernel will allow the mounting of unclean filesystems when
the soft updates flag is set, it is important that only soft updates
style inconsistencies (missing blocks and inodes) be present. Otherwise
a panic may ensue. It is also important that the filesystem be in a clean
state when the soft updates flag is set because the background fsck uses
the fact that the flag is set to indicate that it is safe to run. If
background fsck encounters non-soft updates style inconsistencies, it
will exit with unexpected inconsistencies.
not -tag. Instead, put a period after the error messages to aide
those using dumb terminals not capable of properly displaying markup.
Requested by: ru
the ability to use a preprocessor, use the -q (quiet) flag when reading
from a file). The source used is from ipfw.
Clean up exit codes while I am here.
KAME has been informed and plans on integrating these patches into their
own source as well.
number of issues:
- Fix background mounts; these were broken in revision 1.40.
- Don't give up before trying all addresses returned by getaddrinfo().
- Use protocol-independent routines where possible.
- Improve error reporting for RPC errors.
- In non-background mode, give up after trying all protocols once.
- Use daemon(3) instead of rolling our own version.
- Never go ahead with the mount() syscall until we have received
a reply from the remote nfsd; this is especially important with
non-interruptible mounts, as otherwise a mistyped command might
require a reboot to correct.
Reviewed by: alfred, Martin Blapp <mb@imp.ch>
His description of the problem and solution follow. My own tests show
speedups on typical filesystem intensive workloads of 5% to 12% which
is very impressive considering the small amount of code change involved.
------
One day I noticed that some file operations run much faster on
small file systems then on big ones. I've looked at the ffs
algorithms, thought about them, and redesigned the dirpref algorithm.
First I want to describe the results of my tests. These results are old
and I have improved the algorithm after these tests were done. Nevertheless
they show how big the perfomance speedup may be. I have done two file/directory
intensive tests on a two OpenBSD systems with old and new dirpref algorithm.
The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports".
The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release.
It contains 6596 directories and 13868 files. The test systems are:
1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for
test is at wd1. Size of test file system is 8 Gb, number of cg=991,
size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current
from Dec 2000 with BUFCACHEPERCENT=35
2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system
at wd0, file system for test is at wd1. Size of test file system is 40 Gb,
number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k
OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50
You can get more info about the test systems and methods at:
http://www.ptci.ru/gluk/dirpref/old/dirpref.html
Test Results
tar -xzf ports.tar.gz rm -rf ports
mode old dirpref new dirpref speedup old dirprefnew dirpref speedup
First system
normal 667 472 1.41 477 331 1.44
async 285 144 1.98 130 14 9.29
sync 768 616 1.25 477 334 1.43
softdep 413 252 1.64 241 38 6.34
Second system
normal 329 81 4.06 263.5 93.5 2.81
async 302 25.7 11.75 112 2.26 49.56
sync 281 57.0 4.93 263 90.5 2.9
softdep 341 40.6 8.4 284 4.76 59.66
"old dirpref" and "new dirpref" columns give a test time in seconds.
speedup - speed increasement in times, ie. old dirpref / new dirpref.
------
Algorithm description
The old dirpref algorithm is described in comments:
/*
* Find a cylinder to place a directory.
*
* The policy implemented by this algorithm is to select from
* among those cylinder groups with above the average number of
* free inodes, the one with the smallest number of directories.
*/
A new directory is allocated in a different cylinder groups than its
parent directory resulting in a directory tree that is spreaded across
all the cylinder groups. This spreading out results in a non-optimal
access to the directories and files. When we have a small filesystem
it is not a problem but when the filesystem is big then perfomance
degradation becomes very apparent.
What I mean by a big file system ?
1. A big filesystem is a filesystem which occupy 20-30 or more percent
of total drive space, i.e. first and last cylinder are physically
located relatively far from each other.
2. It has a relatively large number of cylinder groups, for example
more cylinder groups than 50% of the buffers in the buffer cache.
The first results in long access times, while the second results in
many buffers being used by metadata operations. Such operations use
cylinder group blocks and on-disk inode blocks. The cylinder group
block (fs->fs_cblkno) contains struct cg, inode and block bit maps.
It is 2k in size for the default filesystem parameters. If new and
parent directories are located in different cylinder groups then the
system performs more input/output operations and uses more buffers.
On filesystems with many cylinder groups, lots of cache buffers are
used for metadata operations.
My solution for this problem is very simple. I allocate many directories
in one cylinder group. I also do some things, so that the new allocation
method does not cause excessive fragmentation and all directory inodes
will not be located at a location far from its file's inodes and data.
The algorithm is:
/*
* Find a cylinder group to place a directory.
*
* The policy implemented by this algorithm is to allocate a
* directory inode in the same cylinder group as its parent
* directory, but also to reserve space for its files inodes
* and data. Restrict the number of directories which may be
* allocated one after another in the same cylinder group
* without intervening allocation of files.
*
* If we allocate a first level directory then force allocation
* in another cylinder group.
*/
My early versions of dirpref give me a good results for a wide range of
file operations and different filesystem capacities except one case:
those applications that create their entire directory structure first
and only later fill this structure with files.
My solution for such and similar cases is to limit a number of
directories which may be created one after another in the same cylinder
group without intervening file creations. For this purpose, I allocate
an array of counters at mount time. This array is linked to the superblock
fs->fs_contigdirs[cg]. Each time a directory is created the counter
increases and each time a file is created the counter decreases. A 60Gb
filesystem with 8mb/cg requires 10kb of memory for the counters array.
The maxcontigdirs is a maximum number of directories which may be created
without an intervening file creation. I found in my tests that the best
performance occurs when I restrict the number of directories in one cylinder
group such that all its files may be located in the same cylinder group.
There may be some deterioration in performance if all the file inodes
are in the same cylinder group as its containing directory, but their
data partially resides in a different cylinder group. The maxcontigdirs
value is calculated to try to prevent this condition. Since there is
no way to know how many files and directories will be allocated later
I added two optimization parameters in superblock/tunefs. They are:
int32_t fs_avgfilesize; /* expected average file size */
int32_t fs_avgfpdir; /* expected # of files per directory */
These parameters have reasonable defaults but may be tweeked for special
uses of a filesystem. They are only necessary in rare cases like better
tuning a filesystem being used to store a squid cache.
I have been using this algorithm for about 3 months. I have done
a lot of testing on filesystems with different capacities, average
filesize, average number of files per directory, and so on. I think
this algorithm has no negative impact on filesystem perfomance. It
works better than the default one in all cases. The new dirpref
will greatly improve untarring/removing/coping of big directories,
decrease load on cvs servers and much more. The new dirpref doesn't
speedup a compilation process, but also doesn't slow it down.
Obtained from: Grigoriy Orlov <gluk@ptci.ru>
[I first added this functionality, and thought to check prior art. Seeing
OpenBSD had already done this, I changed my addition to reduce the diffs
between the two and went with their option letter.]
Obtained from: OpenBSD