heuristic to detect sequential operation.
VM-related forced clustering code removed from ufs in preparation for a
commit to vm/vm_fault.c that does it more generally.
Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
Translated from: a similar fix in ufs_readwrite.c rev.1.61.
Don't forget to set DE_ACCESS for short reads.
Check for invalid (negative) offsets before checking for reads of
0 bytes, as in ufs, although checking for invalid offsets at all
is probably a bug.
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
udev_t in the kernel but still called dev_t in userland.
Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()
For now they're functions, they will become in-line functions
after one of the next two steps in this process.
Return major/minor/makedev to macro-hood for userland.
Register a name in cdevsw[] for the "filedescriptor" driver.
In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.
In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).
A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.
Without DEVT_FASCIST I belive this patch is a no-op.
Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.
Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no privisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
1:
s/suser/suser_xxx/
2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.
3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/
The remaining suser_xxx() calls will be scrutinized and dealt with
later.
There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.
More changes to the suser() API will come along with the "jail" code.
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.
Quality testing was done with testvn, and lat_fs from the lmbench suite.
Some NFS client testing courtesy of Patrik Kudo.
vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------
Submitted by: Michael Hancock <michaelh@cet.co.jp>
They are atomic, but return in essence what is in the "time" variable.
gettime() is now a macro front for getmicrotime().
Various patches to use the two new functions instead of the various
hacks used in their absence.
Some puntuation and grammer patches from Bruce.
A couple of XXX comments.
|In the process of evaluating the getpages/putpages issues I discovered
|that mmap on MSDOSFS does not work. This is because I blindly merged
|NetBSD changes in msdosfs_bmap and msdosfs_strategy. Apparently, their
|blocksize is always DEV_BSIZE (even in files), while in FreeBSD
|blocksize in files is v_mount->mnt_stat.f_iosize (i.e. clustersize in
|MSDOSFS case). The patch is below.
Submitted by: Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>
- 'mv longnamedfile1 longnamedfile2' would cause longnamedfile2 to lose its
long name.
- Long names have trailing spaces/dots stripped for lookup as well as
assignment.
- A lockup when the mdsosfs was accessed from within the Linux emulator is fixed.
- A bug whereby long filenames were recognised by Microsoft operating systems but
not FreeBSD is fixed.
Submitted by: Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>
FAT32 partitions. Unfortunately, we looked around here at
Walnut Creek CDROM for any newer FAT32-supporting versions
of Win95 and we were unsuccessful; only the older stuff here.
So this is untested beyond simply making sure it compiles and
someone with access to an actual FAT32 fs will have
to let us know how well it actually works.
Submitted by: Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>
Obtained from: NetBSD
Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.
1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.
2. Remove VOP_SEEK, it was unused.
3. Add mode default vops:
VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp
And remove identical functionality from filesystems
4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)
5. Try to make system wide VOP functions have vop_* names.
6. Initialize the um_* vectors in LFS.
(Recompile your LKMS!!!)
1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.
2. Change VOP_BLKATOFF to a normal function in cd9660.
3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.
4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.
5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
1. Remove comment stating the blatantly obvious.
2. Align in two columns.
3. Sort all but the default element alphabetically.
4. Remove XXX comments pointing out entries not needed.
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.
This unifies several times in theory indentical 50 lines of code.
The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.
It's still the task of the individual filesystems to populate the
namecache with cache_enter().
Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.
- *fs_init routines now take a "struct vfsconf * vfsp" pointer
as an argument.
- Use the correct type for cookies.
- Update function prototypes.
Submitted by: bde
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.
Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
successful write. Only do it for the IO_SYNC case (like ufs). On
one of my systems, this speeds up `iozone 24 512' from 32K/sec
(1/128 as fast as ufs) to 2.8MB/sec (7/10 as fast as ufs).
Obtained from: partly from NetBSD
certain error conditions, it is possible for pages to be left allocated
in the object beyond it's end. It is generally bad practice to allocate
pages beyond the end of an object.
/*
* Structure defined by POSIX.4 to be like a timeval.
*/
struct timespec {
time_t ts_sec; /* seconds */
long ts_nsec; /* and nanoseconds */
};
The correct names of the fields are tv_sec and tv_nsec.
Reminded by: James Drobina <jdrobina@infinet.com>
All new code is "#ifdef PC98"ed so this should make no difference to
PC/AT (and its clones) users.
Ok'd by: core
Submitted by: FreeBSD(98) development team
for certain common combinations of directory sizes, cluster sizes, and i/o
sizes (e.g., 4K, 4K, and 4K). The fix in rev. 1.21 was incomplete.
Reviewed by: dfr
Obtained from: party from NetBSD
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.
The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.
filesystem layer, as was done in lite-2. Merged in some other cosmetic
changes while I was at it. Rewrote most of msdosfs_access() to be more
like ufs_access() and to include the FS read-only check.
Obtained from: partially from 4.4BSD-lite2
wrong vp's ops vector being used by changing the VOP_LINK's argument order.
The special-case hack doesn't go far enough and breaks the generic
bypass routine used in some non-leaf filesystems. Pointed out by Kirk
McKusick.
msdosfs_lookup() did no validation to see if the caller was validated
to delete/rename/create files. msdosfs_setattr() did no validation
to see if the caller was allowed to change the file permissions (turn
on/off the write bit) or update the file modification time (utimes).
The routines were fixed to validate the calls just like ufs does.
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
of attributes so that `cp -p' to an msdos file system can succeed under
favourable circumstances (no uid or gid changes and no nonzero flags
except SF_ARCHIVED).
msdosfs_vnops.c:
The in-core inode flags were confused with the on-disk inode flags, so
chflags() clobbered the lock flag and caused a panic.
denode.h, msdosfs_denode.c, msdosfs_vnops.c:
Support the msdosfs archive attibute (ATTR_ARCHIVE) by mapping it to
the complement of the SF_ARCHIVED flag and setting the ATTR_ARCHIVE
bit when a file's modification time is set (but not when a file's
permissions are set; this is the standard wrong DOS behaviour).
denode.h, msdosfs_denode.c:
Remove the DE_UPDAT() macro. It was only used once, and the corresponding
macro in ufs has already been removed.
denode.h:
Don't change the timestamp for directories in DE_TIMES() (be consistent
with deupdat()).
msdosfs_vnops.c:
Handle chown() better: return EPERM instead of EINVAL if there are
insufficient permissions; otherwise, allow null changes.
DE_UPDATE was confused with DE_MODIFIED in some places (they do have
confusing names). Handle them exactly the same as IN_UPDATE and
IN_MODIFIED. This fixes chmod() and chown() clobbering the mtime
and other bugs.
DE_MODIFIED was set but not used.
Parenthesize macro args.
DE_TIMES() now takes a timeval arg instead of a timespec arg. It was
stupid to use a macro for speed and do unused conversions to prepare
for the macro.
Restore the left shifting of the DOS seconds count by 1. It got
lost among the shifts for the bitfields, so DOS seconds counts
appeared to range from 0 to 29 seconds (step 1) instead of 0 to 58
seconds (step 2).
Actually use the passed-in mtime in deupdat() as documented so that
utimes() works.
Change `extern __inline's to `static inline's so that msdosfs_fat.o
can be linked when it is compiled without -O.
Remove faking of directory mtimes to always be the current time. It's
more surprising for directory mtimes to change when you read the
directories than for them not to change when you write the directories.
This should be controlled by a mount-time option if at all.
I'm not sure if this is just masking another problem (like, should
ap->a_eofflag EVER be NULL?), but if it prevents a panic for now then
it may save an ALPHA customer.
Submitted by: jhay