Fixed remaining known bugs in the buffer IO and VM system.
vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).
(various)
Replaced calls to clrbuf() with calls to an optimized routine
call vfs_bio_clrbuf().
(various FS sync)
Sync out modified vnode_pager backed pages.
ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.
vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.
vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().
vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.
vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.
drivers to protect DDB from being invoked while the console is in
process-controlled (i.e., graphics) mode.
Implement the logic to use this hook from within pcvt. (I'm sure
Søren will do the syscons part RSN).
I've still got one occasion where the system stalled, but my attempts
to trigger the situation artificially resulted int the expected
behaviour. It's hard to track bugs without the console and DDB
available. :-/
Added a new type to uiomove - "UIO_NOCOPY" which causes it to update
pointers and counts, but doesn't do any data copying. This is needed
for upcoming changes to the way that the vnode pager does its page
outs.
Added a new hash init function call "phashinit" that allocates and
initializes a prime number sized hash table.
vfs_cache.c:
Changed hashing algorithm to use the remainder of dividing by a prime
number to improve the distribution characteristcs. Uses new phashinit
function in kern_subr.c.
#179). The fix implements a ttyhalfclose() (sort of), resetting the
session and pgrp pointers when the physical device is about to be
closed.
Suggested by: bde
1) Preserve old buffer contents when input buffer overflows.
Old code clear buffer and rewrite it again, if !MAXBEL
(for MAXBEL it does right thing :-).
F.e. if you type too long string, last chars passed,
not first ones as expected.
Moreover, it flush output queue too in this case without any needs.
2) Don't do IXOFF, if IGNCR and c==\r, ignore completely.
3) If PARMRK is active and !ISTRIP and char == 0377
put yet one 0377 to distinguish it from parity mark sequence.
POSIX standard (thanx Bruce).
Reviewed by:
Submitted by:
Obtained from:
CVS:
merged cache changes, and figure it out based on the B_VMIO buffer flag.
Fixes a problem where delayed write VMIO buffers would sometimes get
recopied into kernel-alloced memory.
Submitted by: John Dyson
- ignore the partition table if it is identical with the bogus one in
/usr/src/sys/i386/boot/biosboot/start.S. Honoring the bogus size
field was fatal. The error is detected but other compatibilty
cruft weakens the error handling too much for this case.
- weaken the partition entry checking to allow the following treatments
of C/H/S addresses when C should be >= 1024:
(1) allow C = 1023, H = max, S = max.
(2) allow C to be correct modulo 1024.
Other compatibilty cruft weakens the error handling to allow all
C/H/S addresses, but there too many errors were reported.
Improve error messages:
- print C/H/S addresses if relevant.
- distinguish primary partition table from extended partition tables.
- don't use diskerr() except for i/o errors.
to the user address space unless pcb_onfault is set. The code is currently
commented out because iBCS2 and process debugging parts of the kernel
need to be changed/fixed first.
It was previously after the VOP_RENAME and the reference and lock on
the vnode had already been lost, allowing interesting internel
inconsistencies. This is one of the two reasons why freefall was crashing
every hour or two (the other being nullfs bugs).
Don't call vnode_pager_uncache in revoke(). revoke() is only allowed on
VCHR and VBLK vnodes.
now returns NULL and sets a global 'mb_map_full' when the map is full.
m_clalloc() has further been taught to expect this and do the right thing.
This should fix the "mb_map full" panics that several people have reported.
1. We always keep one 16th of the vnodes on the freelist, so that the
namecache doesn't get trashed. It used to be that it wasn't a problem, but
the only vnodes getting released these days are directories and things which
Clean up and improve the namecache.
1. We always keep one 16th of the vnodes on the freelist, so that the
namecache doesn't get trashed. It used to be that it wasn't a problem, but
the only vnodes getting released these days are directories and things which
gets forced out of the VM/cache. The latter is not numerous enough to keep
the pool of vnodes needed for the namecache sufficiently big.
2. Purge invalid entries in the namecache as soon as we notice them. This
avoids a stale entry pushing out a valid entry on the LRU list.
3. Speed up the lookup in the namecache by avoid a special case branch.
4. Make the cache purge routines do the thing they're supposed to, and in
a decently efficient manner.
5. Make the size of the namecache follow the number of vnodes, so that we
can always point to all the vnodes we have in core.
6. Readability has gone way up.
7. Added a "options NCH_STATISTICS" feature that will gather more
detailed statistics on the performance of the namecache.
Reviewed by: davidg
(cvs is dumping core on me :-( )
1. We always keep one 16th of the vnodes on the freelist, so that the
namecache doesn't get trashed. It used to be that it wasn't a problem, but
the only vnodes getting released these days are directories and things which
gets forced out of the VM/cache. The latter is not numerous enough to keep
the pool of vnodes needed for the namecache sufficiently big.
2. Purge invalid entries in the namecache as soon as we notice them. This
avoids a stale entry pushing out a valid entry on the LRU list.
3. Speed up the lookup in the namecache by avoid a special case branch.
4. Make the cache purge routines do the thing they're supposed to, and in
a decently efficient manner.
5. Make the size of the namecache follow the number of vnodes, so that we
can always point to all the vnodes we have in core.
6. Readability has gone way up.
7. Added a "options NCH_STATISTICS" feature that will gather more
detailed statistics on the performance of the namecache.
Reviewed by: davidg
Don't print debugging messages by default.
Initialize the compatibility slice here and not in the machine-dependent
code.
Fix initialization of the label for the whole disk slice.
Make it clear that write protection of labels doesn't apply when there is
no label.
New functions create - vm_object_pip_wakeup and pagedaemon_wakeup that
are used to reduce the actual number of wakeups.
New function vm_page_protect which is used in conjuction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.
(a) bring back ttselect, now that we have xxxdevtotty() it isn't dangerous.
(b) remove all of the wrappers that have been replaced by ttselect
(c) fix formatting in syscons.c and definition in syscons.h
(d) add cxdevtotty
NOT DONE:
(e) make pcvt work... it was already broken...when someone fixes pcvt to
link properly, just rename get_pccons to xxxdevtotty and we're done
may not properly initialize this field in all cases, and this would
result in very anti-social behavior (overwriting on some other random
device/location).
Submitted by: John Dyson
(b) add a function callback vector to tty drivers that will return a pointer
to a valid tty structure based upon a dev_t
(c) make syscons structures the same size whether or not APM is enabled so
utilities don't crash if NAPM changes (and make the damn kernel compile!)
(d) rewrite /dev/snp ioctl interface so that it is device driver and i386
independant
Fix the sign of the adjustment after writing a label.
Writing of labels should work now.
Merge adjust_label() into fixlabel(). Detect more errors and don't
write if there is an error. Adjust sectors/unit and total sectors
to the numbers on the slice.
Add a function dsname() to print slice device names consistently, and
use it.
Various more tweaks from John Dyson to improve read ahead calculations.
vfs_subr.c:
Only wakeup if numoutput is 0 in vwakeup().
Submitted by: John Dyson
metadata aren't thrashed by regular file I/O.
Added mechanism to limit the amount of outstanding I/O on a given vnode.
Pagedaemon wakeup policy changed to skew priority a little in favor of
file caching.
Slight code reorganization to improve clarity.
Added a few more comments.
Submitted by: John Dyson
the same as when initializing the in-core copies. Adjust checksums in
labels after adjusting labels. This finishes fudging the on-disk label to
make it coherent with the in-core label.
Handle EIO during initialization better.
Initialize the compatibility slice to the whole disk If there are no real
slices.
Don't warn about adjusting offsets in the label to make the 'c' partition
start at 0. The 'c' offset is now always absolute on-disk and 0 in-core
so an adjustment is usually required.
Don't confuse LABEL_PART with RAW_PART so much.
Check for partitions being within slices differently.
via sysctl(8). The initial value of maxprocperuid is maxproc-1,
that of maxfilesperproc is maxfiles (untill maxfile will disappear)
Now it is at least possible to prohibit one user opening maxfiles
-Guido
Submitted by:
Obtained from:
requires complications to adjust the offsets to relative when a block
containing the label is read and back to absolute when such a block is
written. The adjustment is not made on the whole disk slice.
Don't allow setting the offset of partition C to nonzero in in-core labels.
This will cause some (nonstandard) disktab entries to fail. They will
need to be changed to have relative offsets (and no partitions outside
of the slice).
Don't write protect the (nonexistent) label on the whole disk slice.
Writing labels and bootstraps should work right now (except if there is
no DOSpartition table).
Slice 0 is now for the first BSD slice. The first BSD slice is
the first DOSpartition with id 0xa5 or the whole disk if their
are no DOSpartitions (except the latter is not yet implemented).
Existing partitions on it work the same as in 2.0 except the
'd' partition is no longer special and partitions are relative
to the skice.
Slice 1 is now for the whole disk and gets a read-only label
describing the disk. Previously, slice 0 was for the whole disk
and there was no label on it.
Slices 2-31 are for DOSpartitions. Slice 0 is an alias for one
of these if there is a BSD slice. Previously, slices 1-31 were
for DOSpartitions.
diskslice_machdep.c:
Expand whole disk slice to include all DOSpartitions. More work
is required for >1024 cylinders and to rewrite the label iff the
driver is unsure about the geometry.
subr_diskslice.c:
New function dsisopen() to help handle media changes.
mapping from numbers to names is messy for backwards compatibility.
E.g., for driver "sd", unit "0":
slice 0: omit the slice number for compatibility; names are sd0[a-h].
slice 1: omit the partition letter 'c' because the whole disk device
shouldn't have anything to do with partitions; sd0 is the
only name.
slices 2-31: subtract 1 from slice number to compensate for the
compatibility slice 0; names are sd0s[1-30][a-h].
- Overflow now calculated right
- Close works ok,does not looses tty
- Better overflow handling now the snooping stops
on overflow,but programm notified and can reconnect if
it want to..Default maximal buffer set to 664 K and this
is probably too much..:)))
Utility still to come
Restore fixes to flushing that were lost in the previous commit.
Clean up snoop changes.
Add my TODO list from 1.1.5. The improvements in 1.1.5 should be "obtained"
first.
Users-beware..
It is tested and working for me but probably have some bugs i
didn't noticed so test it and reply...
It can:
look at what's sent to the user from tty device
snoop on pty's,vty's and serial tty's
It (still) can't:
write to tty
see what user types in local echo mode
It is probably bad styled and
very dependant on tty_pty.c,sio.c and syscons.c
I would be really happy if another ppl would make their
changes because i am not sure this is the best snoop
we can have..but it is good..:)))))
Bruce finally caught this bogon for me, Thank you Bruce !
Due to some part of the VM/buffer/pmap magic doing clustering, this bogon
managed to work better than 99.9% of the time. Amazing.
If You ever again see a weird message from the gzip code, please tell me.
TS_WOPEN state when CLOCAL is toggled from on to off while there
is no carrier. There is no way back, and with sio there is no way
forward either (TS_ISOPEN will never be set again for the current
open). This bug was observed in 1.1 and was fixed in 1.1.5.
argument is now more than just a single flag. (kern_malloc.c)
Used new M_KERNEL value for socket allocations that previous were
"M_NOWAIT". Note that this will change when we clean up the M_ namespace
mess.
Submitted by: John Dyson
Now it matches the man page and also the only other commercial implementation
i have found so far ( Solaris 2.x).
Changed the name from ss_base to ss_sp.
implemented the ability to limit bufferspace by memory consumed. (vfs_bio.c)
Fixed recently introduced bugs that caused extra I/O to happen in some
cases. (vfs_cluster.c)
Submitted by: John Dyson
Moved various pmap 'bit' test/set functions back into real functions; gcc
generates better code at the expense of more of it. (pmap.c)
Fixed a deadlock problem with pv entry allocations (pmap.c)
Added a new, optional function 'pmap_prefault' that does clustered page
table preloading (pmap.c)
Changed the way that page tables are held onto (trap.c).
Submitted by: John Dyson
the physical device is closed. Previously only the reverse case was handled.
Abuse the cdevsw interface instead of the vfs interface to do this.
Remove unnecessary #includes.
When using cp to copy a file under the following circumstanes:
- original file in on an NFS filesystem
- destination file is on the same NFS filesystem
- the file is less than 8Mbytes in size
- the file is larger than 65536 bytes in size
the cp process can get frozen in device-wait and never wake up (cp uses
mmap() in this case).
A small change to allocbuf() fixes this.
attempted to check for insecure and fatal eflags and segment
selectors, but missed many cases and got the IOPL check back to
front. The other syscalls didn't check at all.
sys_process.c, machdep.c:
Only allow PT_WRITE_U to write to the registers (ordinary and FP).
psl.h, locore.s, machdep.c:
Eliminate PSL_MBZ, PSL_MBO and PSL_USERCLR. We are not supposed
to assume anything about the reserved bits. Use PSL_USERCHANGE
and PSL_KERNEL instead. Rename PSL_USERSET to PSL_USER.
exception.s:
Define a private label for use by doreti when returning to user
mode fails.
machdep.c:
In syscalls, allow changing only the eflags that can be changed on
486's in user mode (no longer attempt to allow benign IOPL changes;
allow changing the nasty PSL_NT; don't allow changing the i586
bits).
Don't attempt to check all the cases involving invalid selectors
and %eip's. Just check for privilege violations and let the invalid
things cause a trap.
procfs_machdep.c:
Call the ptrace register functions to do all the work for reading
and writing ordinary registers and for single stepping.
trap.c:
Ignore traps caused by PSL_NT being set. Previously, users could
cause a fatal trap in user mode by setting PSL_NT and executing an
iret, and a fatal trap in kernel mode by setting PSL_NT and making
a syscall. PSL_NT was cleared too late and not in enough modes to
fix the problem.
Make all traps in user mode (except T_NMI) nonfatal.
Recover from traps caused by attempting to load invalid user
registers in doreti by restarting the traps so that they appear to
occur in user mode.
---
Fix bogons that I noticed while fixing the above:
psl.h:
Fix some comments.
Uniformize idempotency ifdef.
exception.s, machdep.c:
Remove rsvd[0-14]. rsvd0 hasn't been reserved since the 486 came
out. Replace rsvd0 by `align'. rsvd[0-11] used wrong (magic
non-unique) trap numbers. Replace rsvd[1-14] by rsvd.
locore.s:
Enable alignment check flag on 486's and 586's.
machdep.c:
Use a better type for kstack[].
Use TFREGP() to find the registers.
Reformat ptrace functions from SEF to something closer to KNF.
procfs_machdep.c:
The wrong pointer to the registers got fixed as a side effect.
Implement reading and writing of FP registers.
/proc/*/*regs now work (only) for processes that are in memory.
Clean up comments.
trap.c, trap.h:
Remove unused trap types.
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.
The majority of the merged VM/cache work is by John Dyson.
The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.
vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.
vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.
vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.
vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.
vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.
pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.
vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.
proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.
swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.
machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.
machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.
ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.
Submitted by: John Dyson and David Greenman
Fix single-stepping of emulated FPU instructions.
Don't panic if an FPU instruction is attempted but there is no FPU
and no FPU emulator is configured.
short, it gets filled uop to its length. This matches the getdomainname
and gethostname manual pages.
(getbootfile also uses this function and I think it should have the same
behaviour)
This also fixes a bug with keyinit where the seed was not saved in
/etc/skeykeys. So S/Key should be fully functional again.
Reviewed by:
Submitted by:
Obtained from:
Improve hzto():
Round up instead of down and then add 1 tick. This fixes sleep(1)
sometimes sleeping for < 1 second and usleep(10000) sometimes sleeping
for as little as 1 usec + syscall time.
Don't do all the calculations at splhigh().
Don't depend on `tick' being a multiple of 1000.
Don't lose accuracy for `sec' between 0x7fffffff / 1000 - 1000 and
0x7fffffff / hz.
Don't assume that longs are 32 bits or that ints have the same size as
longs.
of returning EINVAL since something may depend on them being broken.
Allowing negative limits caused bugs almost everywhere. The recent
fixes for MAXSSIZ checked the limits too late to stop anyone defeating
limits set by root...
used as an address value. Then all comparisons should be done unsigned
and not signed. Fix it with a typecast of u_quad_t.
Error can be demonstrated with the current bash in port, do a
ulimit -s unlimited and the machine hangs. bash delivers through
an internal error a large negative value for the stacksize, the
comparison saw this smaller than MAXSSIZ and then tried to expand
the stack to this size.
operation of each clist. Limit the growth of each clist. Clists
can only grow larger than the reserved minimum if there are free
cblocks in a shared pool. The size of this pool is now fixed
(this could be improved). The reserved and maximum sizes are more
carefully allocated for slip and ppp, depending on the mtu. A maximum
MTU of 16384 is now enforced for ppp.
1. The pageout daemon used to block under certain
circumstances, and we needed to add new functionality
that would cause the pageout daemon to block more often.
Now, the pageout daemon mostly just gets rid of pages
and kills processes when the system is out of swap.
The swapping, rss limiting and object cache trimming
have been folded into a new daemon called "vmdaemon".
This new daemon does things that need to be done for
the VM system, but can block. For example, if the
vmdaemon blocks for memory, the pageout daemon
can take care of it. If the pageout daemon had
blocked for memory, it was difficult to handle
the situation correctly (and in some cases, was
impossible).
2. The collapse problem has now been entirely fixed.
It now appears to be impossible to accumulate unnecessary
vm objects. The object collapsing now occurs when ref counts
drop to one (where it is more likely to be more simple anyway
because less pages would be out on disk.) The original
fixes were incomplete in that pathological circumstances
could still be contrived to cause uncontrolled growth
of swap. Also, the old code still, under steady state
conditions, used more swap space than necessary. When
using the new code, users will generally notice a
significant decrease in swap space usage, and theoretically,
the system should be leaving fewer unused pages around
competing for memory.
Submitted by: John Dyson
Find enclosed a short bugfix to get the union filesystem up and running
in FreeBSD-current. We don't think we've got all the problems yet but
these fixes sort out the major ones (which mostly concert bad locking
of vnodes), no doubt we'll post others as necessary. Known problems
include the inability of the umount command (not the system call) to unmount
unions in certain circumstances (this is due the way "realpath" works),
and the failure of direntries to always get all available files in
unioned subdirectories. We are, as they say, working on it.
Submitted by: tim@cs.city.ac.uk (Tim Wilkinson)
a clist return with an error. There are some clist starvation/deadlock
bugs elsewhere and killing clist hogs didn't help because the breaks
only exited from the inner loops.
Cosmetic.
Return from trap() if trap_fatal() returns. trap_fatal() isn't
fatal if you have ddb. Returning from trap() is usually the right
thing to do and much better than falling through.
and all SCSI devices (except that it's not done quite the way I want). New
information added includes:
- A text description of the device
- A ``state''---unknown, unconfigured, idle, or busy
- A generic parent device (with support in the m.i. code)
- An interrupt mask type field (which will hopefully go away) so that
. ``doconfig'' can be written
This requires a new version of the `lsdev' program as well (next commit).
A word of wisdom, don't do this:
| cd /usr/bin
| for i in *
| do
| cp $i /tmp/a
| gzip -9 < /tmp/a > $i
| done
It will compress files with multiple links several times. do it this way:
| cd /usr/bin
| for i in *
| do
| gunzip -f < $i > /tmp/a
| gzip -9 < /tmp/a > $i
| done
For it to be useful, you must stick your disklabel on the partition which
starts where the MBR says FreeBSD lives. If you don't do that, you might
get a bad day.
Oh, that probably also means that putting swap there is a bad idea...
- excise some unused code (#if 0'd out - don't want to nuke it yet)
- fix problems with "make depend" - some macros were screwing it up
- get rid of some static local variables
There still seems to be a small reentrancy problem somewhere.
pmap.c: tons of unused vars zapped, various other warnings silenced.
trap.c: unused vars zapped.
vm_machdep.c: A wrong argument, which by chance did the right thing, was
corrected.
1) cut this up into /sys/sys/inflate.h, sys/kern/inflate.c
sys/kern/ingact_gzip.c
2) make a lot more things static
3) make a lot of globals const
4) make some args const
5) first stage of making globals into a struct (not used yet)
The vm_allocate() call which was introduced between revisions 1.4 and
1.5 of imagact_gzip.c broke things. I have backed that out for the time
being. (Davidg: help please)
WARNING: if you have gzip enabled in your kernel, you must now run
config again, as another source file has been added. Otherwise your
kernel compile will fall over.
This is all still WIP. More commits to come.
Suggestions from: phk.
more weird kinds of a.out than anyone can argue for. This code failed to
load the first 28K of the text-segment, in the case where the first page
of the a.out contains only the a.out-header, and the text is still at 0x0.
Thanks Steven !