drops the counting in bwrite and puts it all in spec_strategy.
I did some tests and verified that the counts collected for writes
in spec_strategy is identical to the counts that we previously
collected in bwrite. We now also get read counts (async reads
come from requests for read-ahead blocks). Note that you need
to compile a new version of mount to get the read counts printed
out. The old mount binary is completely compatible, the only
reason to install a new mount is to get the read counts printed.
Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.
Unify spec_open() for bdev and cdev cases.
Remove the disabled bdev specific read/write code.
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.
This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
to remove 'b'lock devices. The agreement is, essentially, that
block devices will be collapsed into character devices as a first
step (though I don't particularly agree), and raw device names 'rxxx'
will become simply 'xxx' in devfs in the second step (i.e. no 'rxxx'
names will exist). The renaming will not effect the original /dev
and the expectation is that devfs will eventually (but not immediately)
become the standard way to access devices in the system.
If it is determined that a reimplementation of block device access
characteristics is beneficial, a number of alternatives will
be possible that do not involve resurrecting the 'b'lock device class.
For example, an ioctl() that might be made on an open character device
descriptor or a generic buffered overlay device.
This commit removes the blockdev disablement sysctl which does not
apply to the solution that was reached.
This means that access to block devices nodes will act the
same as char device nodes for disk-like devices.
If you encounter problems after this, where programs accessing
disks directly fail to operate, please use the following command
to revert to previous behaviour:
sysctl -w vfs.bdev_buffered=1
And verify that this was indeed the cause of your trouble.
See the mail-archives of the arch@FreeBSD.org list for background.
two new functions spec_buf{read|write}.
Add sysctl vfs.bdev_buffered which defaults to 1 == true. This
sysctl can be used to experimentally turn buffered behaviour for
bdevs off. I should not be changed while any blockdevices are
open. Remove the misplaced sysctl vfs.enable_userblk_io.
No other changes in behaviour.
have been there in the first place. A GENERIC kernel shrinks almost 1k.
Add a slightly different safetybelt under nostop for tty drivers.
Add some missing FreeBSD tags
fields in struct cdevsw:
d_stop moved to struct tty.
d_reset already unused.
d_devtotty linkage now provided by dev_t->si_tty.
These fields will be removed from struct cdevsw together with
d_params and d_maxio Real Soon Now.
The changes in this patch consist of:
initialize dev->si_tty in *_open()
initialize tty->t_stop
remove devtotty functions
rename ttpoll to ttypoll
a few adjustments to these changes in the generic code
a bump of __FreeBSD_version
add a couple of FreeBSD tags
d_maxio is replaced by the dev->si_iosize_max field which the driver
should be set in all calls to cdevsw->d_open if it has a better
idea than the system wide default.
The field is a generic dev_t field (ie: not disk specific) so that
tapes and other devices can use physio as well.
transfer size calculation was incorrect resulting in the last read being
potentially larger then the actual extent of the device.
EOF and write handling has not yet been fixed.
Reviewed by: Tor.Egge@fast.no
Rename dev->si_bsize_max to si_iosize_max and set it in spec_open
if the device didn't.
Set vp->v_maxio from dev->si_bsize_max in spec_open rather than
in ufs_bmap.c
to buffered block devices are allowed. The default is to be backwards
compatible, i.e. reads and writes are allowed.
The idea is for a larger crowd to start running with this disabled and
see what problems, if any, crop up, and then to change the default to
off and see if any problems crop up in the next 6 months prior to
potentially removing support entirely. There are still a few people,
Julian and myself included, who believe the buffered block device
access from usermode to be useful.
Remove use of vnode->v_lastr from buffered block device I/O in
preparation for removal of vnode->v_lastr field, replacing it with
the already existing seqcount metric to detect sequential operation.
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.
The reversion requested was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.
the highly non-recommended option ALLOW_BDEV_ACCESS is used.
(bdev access is evil because you don't get write errors reported.)
Kill si_bsize_best before it kills Matt :-)
Use the specfs routines rather having cloned copies in devfs.
Make the alias list a SLIST.
Drop the "fast recycling" optimization of vnodes (including
the returning of a prexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.
Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.
Make the revoke syscalls use vcount() instead of VALIASED.
Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.
vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.
Print the devicename in specfs/vprint().
Remove a couple of stale LFS vnode flags.
Remove unimplemented/unused LK_DRAINED;
Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
please see comment in sys/conf.h about the flag argument.
Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.
Remove bogus and unused strategy1 routines.
Remove open/close arguments from dssize(). Pick them up from dev_t.
Remove unused and unfinished setgeom support from diskslice/label/bad144 code.
operations. This allows a device driver better insight into
what is going on that the current:
proc1: open /dev/foo R/O
devsw->open( R/O, proc1 )
proc2: open /dev/foo R/W
devsw->open( R/W, proc2 )
proc2: close
/* nothing, but device is
really only R/O open */
proc1: close
devsw->close( R/O, proc1 )
vnodes referencing this device.
Details:
cdevsw->d_parms has been removed, the specinfo is available
now (== dev_t) and the driver should modify it directly
when applicable, and the only driver doing so, does so:
vn.c. I am not sure the logic in checking for "<" was right
before, and it looks even less so now.
An intial pool of 50 struct specinfo are depleted during
early boot, after that malloc had better work. It is
likely that fewer than 50 would do.
Hashing is done from udev_t to dev_t with a prime number
remainder hash, experiments show no better hash available
for decent cost (MD5 is only marginally better) The prime
number used should not be close to a power of two, we use
83 for now.
Add new checkalias2() to get around the loss of info from
dev2udev() in bdevvp();
The aliased vnodes are hung on a list straight of the dev_t,
and speclisth[SPECSZ] is unused. The sharing of struct
specinfo means that the v_specnext moves into the vnode
which grows by 4 bytes.
Don't use a VBLK dev_t which doesn't make sense in MFS, now
we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.
Storage overhead from all of this is O(50k).
Bump __FreeBSD_version to 400009
The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.
cdevsw_add() will print an message if the d_maj field looks bogus.
Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.
Move bdevsw() and devsw() functions to kern/kern_conf.c
Bump __FreeBSD_version to 400006
This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions
if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.
I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.
udev_t in the kernel but still called dev_t in userland.
Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()
For now they're functions, they will become in-line functions
after one of the next two steps in this process.
Return major/minor/makedev to macro-hood for userland.
Register a name in cdevsw[] for the "filedescriptor" driver.
In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.
In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).
A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.
Without DEVT_FASCIST I belive this patch is a no-op.
Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.
Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
Made a new (inline) function devsw(dev_t dev) and substituted it.
Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)
DEVFS will eventually benefit from this change too.
Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.
Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)
Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)
(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c
Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>
Add d_parms() to {c,b}devsw[]. If non-NULL this function points to
a device routine that will properly fill in the specinfo structure.
vfs_subr.c's checkalias() supplies appropriate defaults. This change
should be fully backwards compatible with existing devices.
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
device drivers about sectors no longer in use.
Device-drivers receive the call through d_strategy, if they have
D_CANFREE in d_flags.
This allows flash based devices to erase the sectors and avoid
pointlessly carrying them around in compactions.
Reviewed by: Kirk Mckusick, bde
Sponsored by: M-Systems (www.m-sys.com)
respectively. Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
There is only cdevsw (which should be renamed in a later edit to deventry
or something). cdevsw contains the union of what were in both bdevsw an
cdevsw entries. The bdevsw[] table stiff exists and is a second pointer
to the cdevsw entry of the device. it's major is in d_bmaj rather than
d_maj. some cleanup still to happen (e.g. dsopen now gets two pointers
to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).
rawread()/rawwrite() went away as part of this though it's not strictly
the same patch, just that it involves all the same lines in the drivers.
cdroms no longer have write() entries (they did have rawwrite (?)).
tapes no longer have support for bdev operations.
Reviewed by: Eivind Eklund and Mike Smith
Changes suggested by eivind.
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
unexpectedly do not complete writes even with sync I/O requests.
This should help the behavior of mmaped files when using
softupdates (and perhaps in other circumstances also.)
This code will be turned on with the TWO options
DEVFS and SLICE. (see LINT)
Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will deliniate these changes.
/dev will be automatically mounted by init (thanks phk)
on bootup. See /sys/dev/slice/slice.4 for more info.
All code should act the same without these options enabled.
Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others
This code does not support the following:
bad144 handling.
Persistance. (My head is still hurting from the last time we discussed this)
ATAPI flopies are not handled by the SLICE code yet.
When this code is running, all major numbers are arbitrary and COULD
be dynamically assigned. (this is not done, for POLA only)
Minor numbers for disk slices ARE arbitray and dynamically assigned.
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:
1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.
Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.
1. Add defaults for more VOPs
VOP_LOCK vop_nolock
VOP_ISLOCKED vop_noislocked
VOP_UNLOCK vop_nounlock
and remove direct reference in filesystems.
2. Rename the nfsv2 vnop tables to improve sorting order.
1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.
2. Remove VOP_SEEK, it was unused.
3. Add mode default vops:
VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp
And remove identical functionality from filesystems
4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)
5. Try to make system wide VOP functions have vop_* names.
6. Initialize the um_* vectors in LFS.
(Recompile your LKMS!!!)
1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.
2. Change VOP_BLKATOFF to a normal function in cd9660.
3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.
4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.
5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
1. Use the default function to access all the specfs operations.
2. Use the default function to access all the fifofs operations.
3. Use the default function to access all the ufs operations.
4. Fix VCALL usage in vfs_cache.c
5. Use VOCALL to access specfs functions in devfs_vnops.c
6. Staticize most of the spec and fifofs vnops functions.
7. Make UFS panic if it lacks bits of the underlying storage handling.
1. Remove comment stating the blatantly obvious.
2. Align in two columns.
3. Sort all but the default element alphabetically.
4. Remove XXX comments pointing out entries not needed.
1. Clustered I/O is switched by the MNT_NOCLUSTERR and MNT_NOCLUSTERW
bits of the mnt_flag. The sysctl variables, vfs.foo.doclusterread
and vfs.foo.doclusterwrite are deleted. Only mount option can
control clustered I/O from userland.
2. When foofs_mount mounts block device, foofs_mount checks D_CLUSTERR
and D_CLUSTERW bits of the d_flags member in the block device switch
table. If D_NOCLUSTERR / D_NOCLUSTERW are set, MNT_NOCLUSTERR /
MNT_NOCLUSTERW bits will be set. In this case, MNT_NOCLUSTERR and
MNT_NOCLUSTERW cannot be cleared from userland.
3. Vnode driver disables both clustered read and write.
4. Union filesystem disables clutered write.
Reviewed by: bde
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.
Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
fcntl() and EOPNOTSUPP for flock(). POSIX specifies the weaker EINVAL
errno and the man page agrees.
Not fixed:
deadfs: always returns wrong EBADF
devfs, msdosfs: always return sometimes-wrong EINVAL
cd9660, fdesc, kernfs, portal: always return sometimes-wrong EOPNOTSUPP
procfs: always returns wrong EIO
mfs: panic?!
nfs: fudged
NetBSD uses a generic file system genfs to do return the sometimes-wrong
EOPNOTSUPP more consistently :-)(.
Found by: NIST-PCTS