- moved definition of MACHINE_ARCH from cpu.h to parm.h as alpha.
- Added definitions of _MACHINE and _MACHINE_ARCH.
- Added hw.ispc98. The hw.ispc98 is 1 in PC98 kernel and is 0 in
IBM-PC kernel.
Discussed with: John Birrell <jb@FreeBSD.ORG>
Add some overflow checks to read/write (from bde).
Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.
Reviewed by: Bruce Evans <bde@zeta.org.au>
for the Lite2 fix for always returning EIO in dead_read().
Cleaned up the cdevswitch initializers for all tty drivers.
Removed explicit calls to ttsetwater() from all (tty) drivers. ttsetwater()
is now called centrally for opens, not just for parameter changes.
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.
They shouldn't be used there either. They should have gone away
about 3 years ago when the statically initialized devswitches went
away, but su.c unfortunately still frobs the cdevswitch in the old
way.
since (hardware) ttys have too low a bandwidth to benefit significantly
from large buffers. Use twice the old limit for the new-default case
and 8 times the old limit for the driver-specifies-watermark case.
Nothing uses these cases yet.
Removed related debugging code.
in a SMP system. Unexpected things could happen if each cpu
has a different ldt setting and one cpu tries to use value
of currentldt set by another cpu.
The fix is to move currentldt to the per-cpu area. It includes
patches I filed in PR i386/6219 which are also user ldt related.
PR: i386/7591, i386/6219
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
Cast pointers to (vm_offset_t) instead of to (u_long) (as before) or to
(uintptr_t)(void *) (as would be more correct). Don't cast vm_offset_t's
to (u_long) just to do arithmetic on them.
mp_machdep.c:
Cast pointers to (uintptr_t) instead of to (u_long). Don't forget
to cast pointers to (void *) first or to recover from integral
possible integral promotions, although this is too much work for
machine-dependent code.
vm code generally avoids warnings for pointer vs long size mismatches
by using vm_offset_t to represent pointers; pmap.c often uses plain
`unsigned int' instead of vm_offset_t and didn't use u_long elsewhere,
but this style was messed up by code apparently imported from mp_machdep.c.
in ddb) which I broke by changing %8[l]x to %8p. Hacked the central
printf routine to not add an "0x" prefix for %p formats if the field
width is nonzero. The tables are still horribly misformatted on
64-bit machines.
Use %p instead of %8p to print pointers when the field width isn't
important.
DOS partition type 15 (Extended DOS, LBA) as a container for
DOS logical volumes, so the appropriate slices (e.g. sd1s5)
are not initialized.
PR: 7549
PR: 4120
Reviewed by: phk
Submitted by: Jim Mattson <jmattson@sonic.net>
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptable) but other
architectures cannot.
With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)
and DSO_NOLABELS flags prevent searching for slices and labels
respectively. Current drivers don't set these flags. When
DSO_NOLABELS is set, the in-core label for the whole disk is cloned
to create an in-core label for each slice. This gives the correct
result (a good in-core label for the compatibility slice) if
DSO_ONESLICE is set or only one slice is found, but usually gives
broken labels otherwise, so DSO_ONESLICE should be set if DSO_NOLABELS
is set.
protection checks. Using the partition-relative blkno in some
parts broke the write protection for partitions at unusual
offsets (only for partitions at offset 1 on i386's).
best place to set it, and the wd and wfd strategy routines don't
set it (for failed transfers) because they expect dscheck() to
initialize everything necessary. dscheck() has always set B_ERROR,
but this is not quite sufficient, because b_resid is used by physio()
to decide how much of a B_ERROR'ed i/o was done.
formats and args to match. Fixed old printf format errors (all related;
most were hidden by calling printf indirectly).
This change somehow avoids compiler bugs for 64-bit longs on i386's,
although it increases the number of 64-bit calculations.
timeslice of the exiting process was counted for both the exiting
process and the next process to run if the next process runs
immediately.
Broken in: mostly in kern_clock.c rev.1.70 (1998/05/28)
Callers only need to initialize d_secperunit now, but should
initialize d_type (to reduce the IDE/SCSI confusion), d_typename
(put the disk model in it) and geometry info (if it isn't completely
ficticious). Callers will soon need to initialize d_secsize.
normally only defined in opt_devfs.h, so testing it before including
anything is normally a no-op. Undef'ing DEVFS before including
opt_devfs.h is similarly useless. OTOH, DEVFS support for sliced
but not SLICEd (despite defined(SLICE)) devices is either harmless
(if there are no such devices, then nothing in this file is used)
or necessary (otherwise). It even seems to work for sliced cd
devices.
nearly so many casts here. Casting an pointer that was an integer
back to an integer just to compare it with -1 is bad, and casting
it back just to compare it with NULL is just wrong.
Access the entry address as a uintfptr_t, not as a long, and not
necessarily as what modload(8) passes (it takes a u_long from the
exec header and passes a u_int).
Fixed bitrot in K&R support (suword() now takes a long word).
Didn't fix corresponding bitrot in store.9 and fetch.9.
The correct types for the store and fetch families are problematic.
The `word' functions are unfortunately named and need to be split
to handle ints/longs/object pointers/function pointers. Storing
argv[] as longs is quite broken when longs are longer than pointers,
but usually works because it clobbers variables that will soon be
reinitialized.
Hopefully caddr_t is large enough to hold function pointers.
Cast object pointers to uintptr_t before casting them to u_long.
Types are wronger than usual for the PT_READ_U case. ptrace() can
only return ints, but longs are accessed.
respectively. Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
suitable for holding object pointers (ptrint_t -> uintptr_t).
Added corresponding signed type (intptr_t). Changed/added
corresponding non-C9x types for function pointers to match. Don't
use nonstandard types to implement these types, and don't comment
on them in <machine/types.h>.
(long)(u_long)(u_int)-4 = 0x00000000fffffffc on machines with 32-bit
ints and 64-bit longs.
Restored %z format for printing signed hex. %+x shouldn't have been
used since it is an error in userland.
Prepared to nuke %n format by cloning it to %r. %n shouldn't have
been used because it means something completely different in
userland. Now %+r is equivalent to ddb's original %r, and %r is
equivalent to ddb's original %n.
Ignore '+' flag in combination with unsigned formats %{o,p,u,x}.
you can specify the corefile name by using:
sysctl -w kern.corefile="format"
where format is a pathname (relative or absolute -- default is "%N.core"),
with "%N" (process name), "%P" (process ID), and "%U" (user ID) formats.
Reviewed by: Mike Smith, with strong requests by Julian :)
writes of size (100,208]+N*MCLBYTES.
The bug:
sosend() hands each mbuf off to the protocol output routine as soon as it
has copied it, in the hopes of increasing parallelism (see
http://www.kohala.com/~rstevens/vanj.88jul20.txt ). This works well for
TCP as long as the first mbuf handed off is at least the MSS. However,
when doing small writes (between MHLEN and MINCLSIZE), the transaction is
split into 2 small MBUF's and each is individually handed off to TCP.
TCP assumes that the first small mbuf is the whole transaction, so sends
a small packet. When the second small mbuf arrives, Nagle prevents TCP
from sending it so it must wait for a (potentially delayed) ACK. This
sends throughput down the toilet.
The workaround:
Set the "atomic" flag when we're doing small writes. The "atomic" flag
has two meanings:
1. Copy all of the data into a chain of mbufs before handing off to the
protocol.
2. Leave room for a datagram header in said mbuf chain.
TCP wants the first but doesn't want the second. However, the second
simply results in some memory wastage (but is why the workaround is a
hack and not a fix).
The real fix:
The real fix for this problem is to introduce something like a "requested
transfer size" variable in the socket->protocol interface. sosend()
would then accumulate an mbuf chain until it exceeded the "requested
transfer size". TCP could set it to the TCP MSS (note that the
current interface causes strange TCP behaviors when the MSS > MCLBYTES;
nobody notices because MCLBYTES > ethernet's MTU).
Not sure of the result of it..
(may or may not effect anything) but it's fixed now.
(found by: comparing what cvsup sent back to me with what I tested..)
There is only cdevsw (which should be renamed in a later edit to deventry
or something). cdevsw contains the union of what were in both bdevsw an
cdevsw entries. The bdevsw[] table stiff exists and is a second pointer
to the cdevsw entry of the device. it's major is in d_bmaj rather than
d_maj. some cleanup still to happen (e.g. dsopen now gets two pointers
to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).
rawread()/rawwrite() went away as part of this though it's not strictly
the same patch, just that it involves all the same lines in the drivers.
cdroms no longer have write() entries (they did have rawwrite (?)).
tapes no longer have support for bdev operations.
Reviewed by: Eivind Eklund and Mike Smith
Changes suggested by eivind.
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
which makes adjtime(2) useless and confuses xntpd(8) into refusing
to start even when it would use the kernel PLL instead of adjtime().
The result is the same as recommended by tickadj(8), at least when
HZ divides 10^6. Of course, you wouldn't want to actually use
adjtime() when HZ is large. In the silly boundary case of HZ == 10^6,
tickadj == tick == 1 so the clock stops while adjtime() is active.
mmioctl() to fix hundreds of style bugs and a few error handling bugs
(don't check for superuser privilege for inappropriate ioctls, don't
check the input arg for the output-only MEM_RETURNIRQ ioctl, and don't
return EPERM for null changes).
`void *' arg. Fixed or hid most of the resulting type mismatches.
Handlers can now be updated locally (except for reworking their
global declarations in isa_device.h).
Major changes to the generic device framework for FreeBSD/alpha:
* Eliminate bus_t and make it possible for all devices to have
attached children.
* Support dynamically extendable interfaces for drivers to replace
both the function pointers in driver_t and bus_ops_t (which has been
removed entirely. Two system defined interfaces have been defined,
'device' which is mandatory for all devices and 'bus' which is
recommended for all devices which support attached children.
* In addition, the alpha port defines two simple interfaces 'clock'
for attaching various real time clocks to the system and 'mcclock'
for the many different variations of mc146818 clocks which can be
attached to different alpha platforms. This eliminates two more
function pointer tables in favour of the generic method dispatch
system provided by the device framework.
Future device interfaces may include:
* cdev and bdev interfaces for devfs to use in replacement for specfs
and the fixed interfaces bdevsw and cdevsw.
* scsi interface to replace struct scsi_adapter (not sure how this
works in CAM but I imagine there is something similar there).
* various tailored interfaces for different bus types such as pci,
isa, pccard etc.
* Eliminate bus_t and make it possible for all devices to have
attached children.
* Support dynamically extendable interfaces for drivers to replace
both the function pointers in driver_t and bus_ops_t (which has been
removed entirely. Two system defined interfaces have been defined,
'device' which is mandatory for all devices and 'bus' which is
recommended for all devices which support attached children.
* In addition, the alpha port defines two simple interfaces 'clock'
for attaching various real time clocks to the system and 'mcclock'
for the many different variations of mc146818 clocks which can be
attached to different alpha platforms. This eliminates two more
function pointer tables in favour of the generic method dispatch
system provided by the device framework.
Future device interfaces may include:
* cdev and bdev interfaces for devfs to use in replacement for specfs
and the fixed interfaces bdevsw and cdevsw.
* scsi interface to replace struct scsi_adapter (not sure how this
works in CAM but I imagine there is something similar there).
* various tailored interfaces for different bus types such as pci,
isa, pccard etc.
Fix for potential hang when trying to reboot the system or
to forcibly unmount a soft update enabled filesystem.
FreeBSD already handled the reboot case differently, this is however a better
fix.
work in progress and has never booted a real machine. Initial
development and testing was done using SimOS (see
http://simos.stanford.edu for details). On the SimOS simulator, this
port successfully reaches single-user mode and has been tested with
loads as high as one copy of /bin/ls :-).
Obtained from: partly from NetBSD/alpha
machine-independent code and try mounting the devices in the
lists instead of guessing alternative root devices in a machine-
dependent way.
autoconf.c:
Reject preposterous slice numbers instead of silently converting
them to COMPATIBILITY_SLICE.
Don't forget to force slice = COMPATIBILITY_SLICE in the floppy
device name.
Eliminated most magic numbers and magic device names in setroot().
Fixed dozens of style bugs.
vfs_conf.c:
Put the actual root device name instead of "root_device" in the
mount struct if the actual name is available. This is useful after
booting with -s. If it were set in all cases then it could be used
to do mount(8)'s ROOTSLICE_HUNT and fsck(8)'s hotroot guess better.
In particular, don't generate an include of "opt_compat.h" if it
wouldn't affect anything we create. This will fix recent breakage
of the ibcs2 LKM. The ibcs2 syscall files were not regenerated
properly, so the LKM didn't break immediately when we started
generating this extraneous include.
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
The slices "struct" isn't really a struct; we allocate only part of
it in the fully dangerously dedicated case. Since the "struct" is
malloced, the page beyond it may not be mapped, so attempts to copy
it would crash. This problem became larger when the full struct was
bloated from < 1K to > 3K by the addition of (mostly unused) DEVFS
tokens some time before 2.2.0 was released.
the alloc is not M_DONTWAIT, then panic with "Out of mbuf clusters".
Callers that specify M_WAIT can't deal with getting a NULL buffer, so this
is a more graceful failure than randomly page faulting in the socket code
or elsewhere.
Clean up (or if antipodic: down) some of the msgbuf stuff.
Use an inline function rather than a macro for timecounter delta.
Maintain process "on-cpu" time as 64 bits of microseconds to avoid
needless second rollover overhead.
Avoid calling microuptime the second time in mi_switch() if we do
not pass through _idle in cpu_switch()
This should reduce our context-switch overhead a bit, in particular
on pre-P5 and SMP systems.
WARNING: Programs which muck about with struct proc in userland
will have to be fixed.
Reviewed, but found imperfect by: bde
unexpectedly do not complete writes even with sync I/O requests.
This should help the behavior of mmaped files when using
softupdates (and perhaps in other circumstances also.)
end up being the same, but it doesn't look like you're comparing
apples and oranges.
2. Use need_resched instead of reset_priority. This isn't right
either, since for example you'll round-robin against equal priority FIFO
processes when lowering the priority of another process,
but this works better and a real fix needs to be in kern_synch and
not out here.
3. This is not a device driver: copyin/copyout the structure.
update of cpu usage as shown by top when one process is cpu bound
(no system calls) while the system is otherwise idle (except for top).
Don't attempt to switch to the BSP in boot(). If the system was idle when
an interrupt caused a panic, this won't work. Instead, switch to the BSP
in cpu_reset.
Remove some spurious forward_statclock/forward_hardclock warnings.
Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.
Convert PF_LOCAL PCB structures to be type-stable and add a version number.
Define an external format for infomation about socket structures and use
it in several places.
Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.
It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
signanosleep() did not deal with signal masks properly. This change was
based on a discussion with bde some time ago (at least 6 months or more).
signanosleep() should probably go away since it was never really used for
more than a few weeks and doesn't appear in released code. It should
probably be killed before somebody uses it and it becomes a gratuitous
nonstandard feature.
possibly non-open devices, and we don't want to restrict dumping
to swap devices anwyay. It is especially invalid to call d_ioctl()
in non-process context for panics. d_psize() can be called on
non-open devices, at least on non-SLICED ones that support d_dump(),
and setdumpdev() has depended on this for a long time although it
is probably wrong, but even d_psize() can't be called in non-process
context - that's why dumpsys() depends on previously computed values
although these values may be stale. The historical restriction to
devices with dkpart(dev) == SWAP_PART should go away.
the only common usage of utrace (the possible problem with this
commit) is with malloc, so this should be a real problem. Add
the various NetBSD syscalls that allow full emulation of their
development environment.
---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.
Quality testing was done with testvn, and lat_fs from the lmbench suite.
Some NFS client testing courtesy of Patrik Kudo.
vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------
Submitted by: Michael Hancock <michaelh@cet.co.jp>
Reverse the VFS_VRELE patch. Reference counting of vnodes does not need
to be done per-fs. I noticed this while fixing vfs layering violations.
Doing reference counting in generic code is also the preference cited by
John Heidemann in recent discussions with him.
The implementation of alternative vnode management per-fs is still a valid
requirement for some filesystems but will be revisited sometime later,
most likely using a different framework.
Submitted by: Michael Hancock <michaelh@cet.co.jp>
called from vfs_bio_awrite() without going through cluster_write()
or ufs_bmaparray(), in particular for all writes to block disk devices.
Only ufs_bmaparray() sets vp->v_maxio in a correct way, and it doesn't
seem to be called early enough even for regular files.
expecting a sub-page offset. We were passing the file position,
and vm_page_bits() could do some interesting things when base was
larger PAGE_SIZE.
if (size > PAGE_SIZE - base)
size = PAGE_SIZE - base;
is interesting when (PAGE_SIZE - base) is negative. I could imagine that
this could have interesting consequences for memory page -> device block
bit validation.
Linux emulation. This make Allegro Common Lisp 4.3 work under
FreeBSD!
Submitted by: Fred Gilham <gilham@csl.sri.com>
Commented on by: bde, dg, msmith, tg
Hoping he got everything right: eivind
This code will be turned on with the TWO options
DEVFS and SLICE. (see LINT)
Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will deliniate these changes.
/dev will be automatically mounted by init (thanks phk)
on bootup. See /sys/dev/slice/slice.4 for more info.
All code should act the same without these options enabled.
Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others
This code does not support the following:
bad144 handling.
Persistance. (My head is still hurting from the last time we discussed this)
ATAPI flopies are not handled by the SLICE code yet.
When this code is running, all major numbers are arbitrary and COULD
be dynamically assigned. (this is not done, for POLA only)
Minor numbers for disk slices ARE arbitray and dynamically assigned.
(ie: it has a vm_object attached and is marked as OBJ_MIGHTBEDIRTY) before
attempting to lock it. This should reduce the cpu hit that is incurred
when doing a sync(2) and when the syncer process is doing the 30-second
writeback of dirty mmap() data to disk. Skip this speedup if we are
doing an unmount() to be sure to get everything - we can afford to
occasionally miss a msync while the system is running, but not at unmount.
I'm not sure about the VXLOCK and MNT_WAIT case, it seems a bit odd to skip
doing a page_clean at unmount time just because a vnode is VXLOCKed, but
that's what was being done before...
update got lost. This is responsible for ensuring that dirty mmap() pages
get periodically written to disk. Without it, long time mmap's might not
have their dirty pages written out at all of the system crashes or isn't
cleanly shut down. This could be nasty if you've got a long-running
writing via mmap(), dirty pages used to get written to disk within 30
seconds or so.
data is greater than MLEN, setsockopt is unable to pass it onto
the protocol handler. Allocate a cluster in such case.
PR: 2575
Reviewed by: phk
Submitted by: Julian Assange proff@iq.org
kernal page table may need to be extended. But while growing the
kernel page table (pmap_growkernel()), newly allocated kernel page
table pages are entered into every process' page directory. For
proc0, the page directory is not allocated yet, and results in a
page fault. Eventually, the machine panics with "lockmgr: not
holding exclusive lock".
PR: 5458
Reviewed by: phk
Submitted by: Luoqi Chen <luoqi@luoqi.watermarkgroup.com>
an error if it gets a link (like it does if it gets a socket). The
implications of letting users try and do file operations on symlinks
themselves were too worrying.
to not follow symlinks, but to open a handle on the link itself(!).
As strange as this might sound, it has several useful applications
safe race-free ways of opening files in hostile areas (eg: /tmp, a mode
1777 /var/mail, etc). It also would allow things like fchown() to work
on the link rather than having to implement a new syscall specifically for
that task.
Reviewed by: phk
ints. Remove some no longer needed casts. Initialize the per-cpu
global data area using the structs rather than knowing too much about
layout, alignment, etc.
* Figure out UTC relative to boottime. Four new functions provide
time relative to boottime.
* move "runtime" into struct proc. This helps fix the calcru()
problem in SMP.
* kill mono_time.
* add timespec{add|sub|cmp} macros to time.h. (XXX: These may change!)
* nanosleep, select & poll takes long sleeps one day at a time
Reviewed by: bde
Tested by: ache and others
- Attempt to handle PCI devices where the interrupt is
an ISA/EISA interrupt according to the mp table.
- Attempt to handle multiple IO APIC pins connected to
the same PCI or ISA/EISA interrupt source. Print a
warning if this happens, since performance is suboptimal.
This workaround is only used for PCI devices.
With these two workarounds, the -SMP kernel is capable of running on
my Asus P/I-P65UP5 motherboard when version 1.4 of the MP table is disabled.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.
Most uses of time.tv_sec now uses the new variable time_second instead.
gettime() changed to getmicrotime(0.
Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).
A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.
Add a new nfs_curusec() function.
Mark a couple of bogosities involving the now disappeard time variable.
Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.
Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.
Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.
Reviewed by: bde
_KPOSIX_PRIORITY_SCHEDULING options to work. Changes:
Change all "posix4" to "p1003_1b". Misnamed files are left
as "posix4" until I'm told if I can simply delete them and add
new ones;
Add _POSIX_PRIORITY_SCHEDULING system calls for FreeBSD and Linux;
Add man pages for _POSIX_PRIORITY_SCHEDULING system calls;
Add options to LINT;
Minor fixes to P1003_1B code during testing.
They are atomic, but return in essence what is in the "time" variable.
gettime() is now a macro front for getmicrotime().
Various patches to use the two new functions instead of the various
hacks used in their absence.
Some puntuation and grammer patches from Bruce.
A couple of XXX comments.
everything is contained inside #ifdef VM86, so this option must be
present in the config file to use this functionality.
Thanks to Tor Egge, these changes should work on SMP machines. However,
it may not be throughly SMP-safe.
Currently, the only BIOS calls made are memory-sizing routines at bootup,
these replace reading the RTC values.
In vfs_bio.c, remove b_generation count usage,
remove redundant reassignbuf,
remove redundant spl(s),
manage page PG_ZERO flags more correctly,
utilize in invalid value for b_offset until it
is properly initialized. Add asserts
for #ifdef DIAGNOSTIC, when b_offset is
improperly used.
when a process is not performing I/O, and just waiting
on a buffer generally, make the sleep priority
low.
only check page validity in getblk for B_VMIO buffers.
In vfs_cluster, add b_offset asserts, correct pointer calculation
for clustered reads. Improve readability of certain parts of
the code. Remove redundant spl(s).
In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin
<gallatin@cs.duke.edu>). More vtruncbuf problems fixed.
me any problems until after the previous commit. This problem then
caused a severe case of creeping crud on my diskdrive, and hosed
my system so bad, that I needed to do a complete reinstall. Sorry!!!
I assume that others have manifest this bug.
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!
pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)
pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.
vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)
vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.
vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.
vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)
vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.
vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)
vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)
10) Modify FFS to use vtruncbuf.
vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)
vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.
vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)
vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.
of a disk, because that slice does not exist, try again mounting from the
compatability slice.
This handles the case where a disk has been initialised by 'disklabel
auto', which places a bogus and invalid slice entry on the disk.
The bootstrap is not smart enough to reject this slice, and pretends to
boot from it. Believing the the bootstrap at this point is unwise.
Booting from non-'wd' disks thus prepared is still broken, as
'disklabel -rwB xdN auto' does not initialise the disk type field, and
the bootstrap mistakenly claims that the disk is handled by 'wd'.
Behaviour is now consistent with DEVFS expected characteristics.
during the libc/libc_r to automatically pick up syscall names on
the assumption that default asm code needs to generated for them.
In the up-coming changes to the libc makefiles, there is the option
to provide a machine dependent asm source file which will turn off
the automatic generation of the default. There is also an option
to just stop code being generated for a syscall. In most cases,
though, the default asm code is all that is required, so this
change makes that the most convenient was to do business.
Idea suggested by: bde
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
f00f_hack has run.
Use the global r_idt descriptor in f00f_hack when in SMP mode,
so the APs find the relocated interrupt descriptor table.
Submitted by: Partially from David A Adkins <adkin003@tc.umn.edu>
dynamically depending on the line speed(s). This should give the old
sizes and watermarks until drivers are changed.
Display the input watermarks in pstat and sicontrol.
Fix for RTPRIO scheduler to eliminate invalid context switches.
POSIX.4 headers and sysctl variables. Nothing should change
unless POSIX4 is defined or _POSIX_VERSION is set to 199309.
interrupts are masked, and EOI is sent iff the corresponding ISR bit
is set in the local apic. If the CPU cannot obtain the interrupt
service lock (currently the global kernel lock) the interrupt is
forwarded to the CPU holding that lock.
Clock interrupts now have higher priority than other slow interrupts.
the signal handling latency for cpu-bound processes that performs very
few system calls.
The IPI for forcing an additional software trap is no longer dependent upon
BETTER_CLOCK being defined.
than rolling it's own. This means that it now uses the "safe"
exec_map_first_page() to get the ld.so headers rather than risking a panic
on a page fault failure (eg: NFS server goes down).
Since all the ELF tools go to a lot of trouble to make sure everything
lives in the first page for executables, this is a win. I have not seen
any ELF executable on any system where all the headers didn't fit in the
first page with lots of room to spare.
I have been running variations of this code for some time on my pure ELF
systems.
a complement to all ops that return a vpp, VFS_VRELE. This is
initially only for file systems that implement the following ops
that do a WILLRELE:
vop_create, vop_whiteout, vop_mknod, vop_remove, vop_link,
vop_rename, vop_mkdir, vop_rmdir, vop_symlink
This is initial DNA that doesn't do anything yet. VFS_VRELE is
implemented but not called.
A default vfs_vrele was created for fs implementations that use the
standard vnode management routines.
VFS_VRELE implementations were made for the following file systems:
Standard (vfs_vrele)
ffs mfs nfs msdosfs devfs ext2fs
Custom
union umapfs
Just EOPNOTSUPP
fdesc procfs kernfs portal cd9660
These implementations may change as VOP changes are implemented.
In the next phase, in the vop implementations calls to vrele and the vrele
part of vput will be moved to the top layer vfs_vnops and made visible
to all layers. vput will be replaced by unlock in these cases. Unlocking
will still be done in the per fs layer but the refcount decrement will be
triggered at the top because it doesn't hurt to hold a vnode reference a
little longer. This will have minimal impact on the structure of the
existing code.
This will only be done for vnode arguments that are released by the various
fs vop implementations.
Wider use of VFS_VRELE will likely require restructuring of the code.
Reviewed by: phk, dyson, terry et. al.
Submitted by: Michael Hancock <michaelh@cet.co.jp>
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.
All in all, SMP systems especially should show a significant
improvement in "snappyness."
times consistently wrong (up to 1 tick too late), but recent changes
fixed the setting of the main clock, making other times inconsistent.
The inconsistencies tended to show up as a negative resource usage
for the process that set the time.
Fixed the check for setting the clock backwards. A stale timestamp
(`time') was checked, so it was possible to set the clock backwards
by up to almost 1 tick. Until recently, this bug was compensated
for by setting the clock consistently wrong.
Merged the comment about setting the clock backwards from Lite2.
Removed latency micro-optimizations/speed pessimizations in settime().
microtime() and set_timecounter() are relatively expensive, and
they must be called together with clock updates blocked to get a
consistent `delta', so significant latency optimizations are not
possible.
Removed some stale comments.
in a way identically as before.) I had problems with the system properly
handling the number of vnodes when there is alot of system memory, and the
default VM_KMEM_SIZE. Two new options "VM_KMEM_SIZE_SCALE" and
"VM_KMEM_SIZE_MAX" have been added to support better auto-sizing for systems
with greater than 128MB.
vnodes, therefore vget doesn't need to do so anymore. Other minor
improvements include the temp free vnode queue obeying the VAGE
flag and a printf that warns of to-be-removed code being executed.
Highlights:
* Simple model for underlying hardware.
* Hardware basis for timekeeping can be changed on the fly.
* Only one hardware clock responsible for TOD keeping.
* Provides a real nanotime() function.
* Time granularity: .232E-18 seconds.
* Frequency granularity: .238E-12 s/s
* Frequency adjustment is continuous in time.
* Less overhead for frequency adjustment.
* Improves xntpd performance.
Reviewed by: bde, bde, bde
so_error is set, clear it before returning it. The behavior
introduced in 4.3-Reno (to not clear so_error) causes potentially
transient errors (e.g. ECONNREFUSED if the other end hasn't opened
its socket yet) to be permanent on connected datagram sockets that
are only used for writing.
(soreceive() clears so_error before returning it, as does
getsockopt(...,SO_ERROR,...).)
Submitted by: Van Jacobson <van@ee.lbl.gov>, via a comment in the vat sources.
or shrinking an open partition (by changing the label for a compatibility
slice while partitions on the corresponding real slice are open, or vice
versa).
waslocked = TRUE. This change may fix lockmgr panic in umapfs/nullfs.
PR: 5634
Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>
Suggested by: Bruce Evans <bde@zeta.org.au>
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.
Realtime priority has to be restricted for reasons which should be
obvious. However, for idle priority, there is a potential for
system deadlock if an idleprio process gains a lock on a resource
that other processes need (and the idleprio process can't run
due to a CPU-bound normal process). Fix me! XXX
PR: 5639
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtile "holes."
Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistant scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.
If you want to play with it, you can find the final version of the
code in the repository the tag LFS_RETIREMENT.
If somebody makes LFS work again, adding it back is certainly
desireable, but as it is now nobody seems to care much about it,
and it has suffered considerable bitrot since its somewhat haphazard
integration.
R.I.P
Make vfs_bio buffer mgmt work better.
Buffers were being used after brelse.
Make nfs_getpages work independently of other NFS
interfaces. This eliminates some difficult
recursion problems and decreases pagefault
overhead.
Remove an erroneous vfs_unbusy_pages.
Fix a reentrancy problem, with nfs_vinvalbuf when
vnode is already being rundown.
Reassignbuf wasn't being called when needed under
certain circumstances.
(Thanks to Bill Paul for help.)
This introduce an xxxFS_BOOT for each of the rootable filesystems.
(Presently not required, but encouraged to allow a smooth move of option *FS
to opt_dontuse.h later.)
LFS is temporarily disabled, and will be re-enabled tomorrow.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
a null pointer panic when the pointer for the incorrect process is
NULL. getpriority() was broken in rev.1.27. Rev.1.28 broke the
warning instead of fixing the problem.
PR: 5495
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.
The system should be much more stable, but all subtile bugs aren't fixed yet.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.
PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:
1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.
for field widths being 2 larger than specified for "%<number>p". Only
printing of null pointers is "wrong" now (it is actually "right", but
inconsistent with printf(3)).
here, but kmem_malloc() is used and it takes the same "flags" as
malloc().
Use the mbuf allocation "flags" M_WAIT and M_DONTWAIT consistently.
There is really only one boolean flag, M_DONTWAIT, but the "flags"
were always treated as enum-like values, except in some places here
where the values are tacitly converted to boolean flags. Treat
them as enum-like values everywhere, except where we tacitly assume
that there are only two values in order to convert them to the
corresponding two kmem_malloc() "flags".
of time that the laptop was suspending. Thus, select() calls that might have
suspended rather than firing at 1hr + "time suspended" since the timer was
posted.
Adding:
options APM_FIXUP_CALLTODO
to the kernel config enables the patch.
[
This patch was slightly modified to use a consistant indent style and
I removed some unused local variables. After this has been tested a
few weeks we'll make the options the default, so for now I'm now
documenting it in LINT. Mike can later if he wants.
]
Reviewed by: Mike Smith <msmith@freebsd.org>
Submitted by: Ken Key <key@cs.utk.edu>
flag is set in the p_pfsflags field. This, essentially, prevents an SUID
proram from hanging after being traced. (E.g., "truss /usr/bin/rlogin" would
fail, but leave rlogin in a stopevent state.) Yet another case where procctl
is (hopefully ;)) no longer needed in the general case.
Reviewed by: bde (thanks bruce :))
Quite amazing that the system runs at all with this bug. Also present in
2.2.5. The bug appears to have come in with changes in rev 1.53.
PR: might fix PR#5313
Submitted by: bde
if one of the new poll types is requested; hopefully this will not break
any existing code. (This is done so that programs have a dependable
way of determining whether a filesystem supports the extended poll types
or not.)
The new poll types added are:
POLLWRITE - file contents may have been modified
POLLNLINK - file was linked, unlinked, or renamed
POLLATTRIB - file's attributes may have been changed
POLLEXTEND - file was extended
Note that the internal operation of poll() means that it is impossible
for two processes to reliably poll for the same event (this could
be fixed but may not be worth it), so it is not possible to rewrite
`tail -f' to use poll at this time.
- A nonprofiling version of s_lock (called s_lock_np) is used
by mcount.
- When profiling is active, more registers are clobbered in
seemingly simple assembly routines. This means that some
callers needed to save/restore extra registers.
- The stack pointer must have space for a 'fake' return address
in idle, to avoid stack underflow.
... fix a bug with orecvfrom() or recvfrom() called with
the MSG_COMPAT flag on kernels compiled with the COMPAT_43 option.
The symptom is that the fromaddr is not correctly returned.
This affects the Linux emulator.
Submitted by: pb@fasterix.freenix.org (Pierre Beyssac)
noticed some major enhancements available for UP situations. The number
of UP TLB flushes is decreased much more than significantly with these
changes. Since a TLB flush appears to cost minimally approx 80 cycles,
this is a "nice" enhancement, equiv to eliminating between 40 and 160
instructions per TLB flush.
Changes include making sure that kernel threads all use the same PTD,
and eliminate unneeded PTD switches at context switch time.
quite a while, but forgot to do so. For now, this code supports
most daemons running as kernel threads in UP kernels, and as
full processes in SMP. We will soon be able to run them as
threads in SMP, but not yet.
Note that an unload facility should be used to call rm_at_exit() (if
procfs is being loaded as an LKM and is subsequently removed), but it
was non-obvious how to do this in the VFS framework.
Reviewed by: Julian Elischer
surprise, procfs actually is optional, and some people truly do generate
kernels without it. Wow. I built a kernel without 'options PROCFS' and
it compiled and linked.
1) Fix the initialization of malloc structure that changed
due to perf opt.
2) Remove unneeded include.
3) An initialization assert added to malloc.
Submitted by: John Hood <cgull@smoke.marlboro.vt.us>
workaround. Note that this currently eats up two pages extra in the system;
this could be alleviated by aligning idt correctly, and then only dealing with
that (as opposed to the current method of allocated two pages and copying the
IDT table to that, and then setting that to be the IDT table).
or aio_write can return the pid of the new thread. This is due to the
way that return values from system calls being passed by side-effect in
the proc structure now. This commit fixes the problem with aio_read and
aio_write.
remove alot of overly verbose debugging statements.
ioproclist {
int aioprocflags; /* AIO proc flags */
TAILQ_ENTRY(aioproclist) list; /* List of processes */
struct proc *aioproc; /* The AIO thread */
TAILQ_HEAD (,aiocblist) jobtorun; /* suggested job to run */
};
/*
* data-structure for lio signal management
*/
struct aio_liojob {
int lioj_flags;
int lioj_buffer_count;
int lioj_buffer_finished_count;
int lioj_queue_count;
int lioj_queue_finished_count;
struct sigevent lioj_signal; /* signal on all I/O done */
TAILQ_ENTRY (aio_liojob) lioj_list;
struct kaioinfo *lioj_ki;
};
#define LIOJ_SIGNAL 0x1 /* signal on all done (lio) */
#define LIOJ_SIGNAL_POSTED 0x2 /* signal has been posted */
/*
* per process aio data structure
*/
struct kaioinfo {
int kaio_flags; /* per process kaio flags */
int kaio_maxactive_count; /* maximum number of AIOs */
int kaio_active_count; /* number of currently used AIOs */
int kaio_qallowed_count; /* maxiumu size of AIO queue */
int kaio_queue_count; /* size of AIO queue */
int kaio_ballowed_count; /* maximum number of buffers */
int kaio_queue_finished_count; /* number of daemon jobs finished */
int kaio_buffer_count; /* number of physio buffers */
int kaio_buffer_finished_count; /* count of I/O done */
struct proc *kaio_p; /* process that uses this kaio block */
TAILQ_HEAD (,aio_liojob) kaio_liojoblist; /* list of lio jobs */
TAILQ_HEAD (,aiocblist) kaio_jobqueue; /* job queue for process */
TAILQ_HEAD (,aiocblist) kaio_jobdone; /* done queue for process */
TAILQ_HEAD (,aiocblist) kaio_bufqueue; /* buffer job queue for process */
TAILQ_HEAD (,aiocblist) kaio_bufdone; /* buffer done queue for process */
};
#define KAIO_RUNDOWN 0x1 /* process is being run down */
#define KAIO_WAKEUP 0x2 /* wakeup process when there is a significant
event */
TAILQ_HEAD (,aioproclist) aio_freeproc, aio_activeproc;
TAILQ_HEAD(,aiocblist) aio_jobs; /* Async job list */
TAILQ_HEAD(,aiocblist) aio_bufjobs; /* Phys I/O job list */
TAILQ_HEAD(,aiocblist) aio_freejobs; /* Pool of free jobs */
static void aio_init_aioinfo(struct proc *p) ;
static void aio_onceonly(void *) ;
static int aio_free_entry(struct aiocblist *aiocbe);
static void aio_process(struct aiocblist *aiocbe);
static int aio_newproc(void) ;
static int aio_aqueue(struct proc *p, struct aiocb *job, int type) ;
static void aio_physwakeup(struct buf *bp);
static int aio_fphysio(struct proc *p, struct aiocblist *aiocbe, int type);
static int aio_qphysio(struct proc *p, struct aiocblist *iocb);
static void aio_daemon(void *uproc);
SYSINIT(aio, SI_SUB_VFS, SI_ORDER_ANY, aio_onceonly, NULL);
static vm_zone_t kaio_zone=0, aiop_zone=0,
aiocb_zone=0, aiol_zone=0, aiolio_zone=0;
/*
* Single AIOD vmspace shared amongst all of them
*/
static struct vmspace *aiovmspace = NULL;
/*
* Startup initialization
*/
void
aio_onceonly(void *na)
{
TAILQ_INIT(&aio_freeproc);
TAILQ_INIT(&aio_activeproc);
TAILQ_INIT(&aio_jobs);
TAILQ_INIT(&aio_bufjobs);
TAILQ_INIT(&aio_freejobs);
kaio_zone = zinit("AIO", sizeof (struct kaioinfo), 0, 0, 1);
aiop_zone = zinit("AIOP", sizeof (struct aioproclist), 0, 0, 1);
aiocb_zone = zinit("AIOCB", sizeof (struct aiocblist), 0, 0, 1);
aiol_zone = zinit("AIOL", AIO_LISTIO_MAX * sizeof (int), 0, 0, 1);
aiolio_zone = zinit("AIOLIO",
AIO_LISTIO_MAX * sizeof (struct aio_liojob), 0, 0, 1);
aiod_timeout = AIOD_TIMEOUT_DEFAULT;
aiod_lifetime = AIOD_LIFETIME_DEFAULT;
jobrefid = 1;
}
/*
* Init the per-process aioinfo structure.
* The aioinfo limits are set per-process for user limit (resource) management.
*/
void
aio_init_aioinfo(struct proc *p)
{
struct kaioinfo *ki;
if (p->p_aioinfo == NULL) {
ki = zalloc(kaio_zone);
p->p_aioinfo = ki
support was missing in the previous version of the AIO code. More
tunables added, and very efficient support for VCHR files has been added.
Kernel threads are not used for VCHR files, all work for such files is
done for the requesting process directly. Some attempt has been made to
charge the requesting process for resource utilization, but more work
is needed. aio_fsync is still missing (but the original fsync system
call can be used for now.) aio_cancel is essentially a noop, but that
is okay per POSIX. More aio_cancel functionality can be added later,
if it is found to be needed.
The functions implemented include:
aio_read, aio_write, lio_listio, aio_error, aio_return,
aio_cancel, aio_suspend.
The code has been implemented to support the POSIX spec 1003.1b
(formerly known as POSIX 1003.4 spec) features of the above. The
async I/O features are truly async, with the VCHR mode of operation
being essentially the same as physio (for appropriate files) for
maximum efficiency. This code also supports the signal capability,
is highly tunable, allowing management of resource usage, and
has been written to allow a per process usage quota.
Both the O'Reilly POSIX.4 book and the actual POSIX 1003.1b document
were the reference specs used. Any filedescriptor can be used with
these new system calls. I know of no exceptions where these
system calls will not work. (TTY's will also probably work.)
this results in a few functions becoming static, and
the SYSINITs being close to the code they are related to.
setting up the dump device is with dumpsys() and
kicking off the scheduler is with the scheduler.
Mounting root is with the code that does it.
Reviewed by: phk
b_flags, and this patch removes unneeded modifications. Only the needed b_flags
bits are modified now. (Specifically, it is usually wrong to zero b_flags.)
Submitted by: bde@freebsd.org
Fixed overflow of FFLAGS() in fcntl(F_SETFL, ...). This was not
a security hole, but gave wrong results for silly flags values.
E.g., it make fcntl(F_SETFL, -1) equivalent to fcntl(F_SETFL, 0).
POSIX requires ignoring the open mode bits in fcntl() (even if
they would be invalid for open()).
Use OID_AUTO instead of a magic number for the debug.syncprt sysctl.
(This sysctl doesn't actually work. FreeBSD nuked it, but parts
of it were mismerged from Lite2. It is not very good, but better
than nothing.)
`mount -u'. This only matters for `mount -u' competing with unmounts.
If I understand the locking correctly: if mount() blocks, then unmount()
may run and set mp->kern_flag for the same mp. Then unmount() blocks
waiting for mount() to finish. When unmount() continues, its MNTK flags
(MNTK_UNMOUNT and MNTK_MWAIT) may have been clobbered.
Didn't fix old bugs:
- restoring mp->mnt_kern_flag is wrong for the same reasons in the error
case.
- the error case of unmount() seems to be broken too:
(a) MNTK_UNMOUNT gets clobbered, although another unmount() may have
set it. Perhaps it shouldn't be set until after the full lock is
aquired.
(b) MNTK_MWAIT isn't honoured.
Fixed a nearby style bug.
Disallow wait options that are not a combination of the standard POSIX
options WUNTRACED and WNOHANG, as is required by POSIX. BSD doesn't
have any extensions here, but the code was `#ifdef notyet' for some
reason.
Fixed nonblocking mode. It was per-device instead of per-file.
Don't depend on gcc's misfeature of rewriting char args in old-style
function definitions to match wrong prototypes. Break K&R1 support
to fix this quickly.
Obtained from: Whistle Communications tree
Add an option to the way UFS works dependent on the SUID bit of directories
This changes makes things a whole lot simpler on systems running as
fileservers for PCs and MACS. to enable the new code you must
1/ enable option SUIDDIR on the kernel.
2/ mount the filesystem with option suiddir.
hopefully this makes it difficult enough for people to
do this accidentally.
see the new chmod(2) man page for detailed info.
Ever since I first say the way the mount flags were used I've hated the
fact that modes, and events, internal and exported, and short-term
and long term flags are all thrown together. Finally it's annoyed me enough..
This patch to the entire FreeBSD tree adds a second mount flag word
to the mount struct. it is not exported to userspace. I have moved
some of the non exported flags over to this word. this means that we now
have 8 free bits in the mount flags. There are another two that might
well move over, but which I'm not sure about.
The only user visible change would have been in pstat -v, except
that davidg has disabled it anyhow.
I'd still like to move the state flags and the 'command' flags
apart from each other.. e.g. MNT_FORCE really doesn't have the
same semantics as MNT_RDONLY, but that's left for another day.