and therefore we need a way for ioctl handlers to run in that thread
in GEOM. Rather than invent a complicated registration system to
recognize which ioctl handler to use for a given ioctl, we still
schedule all ioctls down the tree as bio transactions but add a
special return code that means "call me directly" and have the
geom_dev layer do that.
Use this for all ioctls that make it as far as a diskdriver to
avoid any backwards compatibility problems.
Requested by: scottl
Sponsored by: DARPA & NAI Labs
NB: But it will enable it in all kernels not having options "NO_GEOM"
Put the GEOM related options into the intended order.
Add "options NO_GEOM" to all kernel configs apart from NOTES.
In some order of controlled fashion, the NO_GEOM options will be
removed, architecture by architecture in the coming days.
There are currently three known issues which may force people to
need the NO_GEOM option:
boot0cfg/fdisk:
Tries to update the MBR while it is being used to control
slices. GEOM does not allow this as a direct operation.
SCSI floppy drives:
Appearantly the scsi-da driver return "EBUSY" if no media
is inserted. This is wrong, it should return ENXIO.
PC98:
It is unclear if GEOM correctly recognizes all variants of
PC98 disklabels. (Help Wanted! I have neither docs nor HW)
These issues are all being worked.
Sponsored by: DARPA & NAI Labs.
that this will make people use this for their future copy&paste operations.
Rework the detection of raw-disk offsets in disklabels. This actually
unearthed a number of bugs in the (now) previous version.
Also accept labels which don't have a magic RAW_PART, provided they don't
confuse us too much.
Change the order of our sanity-checks on labels found on disks to be more
robust.
Check against MAXPARTITIONS in our sanity-check and reject disklabels
we cannot cope with.
Create new g_bsd_modify() function to implment disklabel modifying
ioctls.
Implement DIOCSDINFO and DIOCWDINFO with the provision that the latter
still not writes your change back to disk. I didn't have the nerves
for that yet.
In the start routine, use g_call_me() for complex ioctls to prevent
sleeping.
Sponsored by: DARPA & NAI Labs.
with support for trying, doing and forcing.
This will eventually replace g_slice_addslice() which gets changed from
grabbing topology to requing it in this commit as well.
Sponsored by: DARPA & NAI Labs.
work.
This prevents people from sleeping in the UP/DOWN I/O path by mistake
or design (doing so almost invariably result in deadlocks since it
stalls all I/O processing in the given direction.
Sponsored by: DARPA & NAI Labs.
a disklabel modification tries to change an open device, and no
counter-examples exists.
Be less facist about when we can do Setattr, the openmodes of devices
are so loosely managed that the "exclusive" count is almost useless.
Sponsored by: DARPA & NAI Labs.
Add a __unused.
Make the 2byte decoder functions return 16 bits for the benefits
of picky lints.
No need to grab giant around a tsleep() when we have a timeout.
Sponsored by: DARPA & NAI Labs.
to be performed in the event-thread.
To do this, we need to lock the eventlist with g_eventlock (nee g_doorlock),
since g_call_me() being called from the UP/DOWN paths will not be able to
aquire g_topology_lock.
This also means that for now these events are not referenced on any
particular consumer/provider/geom.
For UP/DOWN path use, this will not become a problem since the access()
function will make sure we drain any bio's before we dismantle.
Sponsored by: DARPA & NAI Labs.
and predictable way, and I apologize if I have gotten it wrong anywhere,
getting prior review on a patch like this is not feasible, considering
the number of people involved and hardware availability etc.)
If struct disklabel is the messenger: kill the messenger.
Inside struct disk we had a struct disklabel which disk drivers used to
communicate certain metrics to the disklayer above (GEOM or the disk
mini-layer). This commit changes this communication to use four
explicit fields instead.
Amongst the benefits is that the fields do not get overwritten by
wrong or bogus on-disk disklabels.
Once that is clear, <sys/disk.h> which is included in the drivers
no longer need to pull <sys/disklabel.h> and <sys/diskslice.h> in,
the few places that needs them, have gotten explicit #includes for
them.
The disklabel inside struct disk is now only for internal use in
the disk mini-layer, so instead of embedding it, we malloc it as
we need it.
This concludes (modulus any mistakes) the series of disklabel related
commits.
I belive it all amounts to a NOP for all the rest of you :-)
Sponsored by: DARPA & NAI Labs.
them visible from userland, if need be.
I wish that the C language contained this as part of struct definintions,
but failing that, I would settle for an agreed upon set of functions for
packing/unpacking integers in various sizes from byte-streams which may
have unfriendly alignment.
This really belongs in <sys/endian.h> I guess.
is currently conditional on both the GEOM and GEOM_GPT options to
avoid getting GPT by default and having the MBR and GPT classes
clash.
The correct behaviour of the MBR class would be to back-off (reject)
a MBR if it's a Protective MBR (a MBR with a single partition of type
0xEE that spans the whole disk (as far as the MBR is concerned).
The correct behaviour if the GPT class would be to back-off (reject)
a GPT if there's a MBR that's not a Protective MBR.
At this stage it's inconvenient to destroy a good MBR when working
with GPTs that it's more convenient to have the MBR class back-off
when it detects the GPT signature on disk and have the GPT class
ignore the MBR.
In sys/gpt.h UUIDs (GUIDs) for the following FreeBSD partitions
have been defined:
GPT_ENT_TYPE_FREEBSD
FreeBSD slice with disklabel. This is the equivalent of
the well-known FreeBSD MBR partition type.
GPT_ENT_TYPE_FREEBSD_{SWAP|UFS|UFS2|VINUM}
FreeBSD partitions in the context of disklabel. This is
speculating on the idea to use the GPT to hold partitions
instead if slices and removing the fixed (and low) limits
we have on the number of partitions.
This commit lacks a GPT image for the regression suite.
"The only hard problem in cryptography is key-management."
All sectors are encrypted with AES in CBC mode using a constant key,
currently compiled in and all zero.
To activate this module, write the magic header on the partition:
echo "<<FreeBSD-GEOM-AES>>" | dd conv=sync of=/dev/md98
The encrypted device will be one sector shorter and have ".aes"
appended to its name.
Sponsored by: DARPA & NAI Labs.
Printing daddr_t's using %d format was always an error, but gcc's
warning about it was ignored for supported 64-bit arches and not printed
for supported 32-bit arches. Hundreds if not thousands thousands of
previously "fixed" daddr_t printings are now broken on 32-bit machines
by casting daddr_t's to longs. daddr_t's should be printed using %jd
format, but this fix uses %lld since %j is not implemented in the
kernel yet.
Fixed some nearby format printf errors (style bugs).
the relevant classes.
Some methods may implement various "magic spaces", this is reserved
or magic areas on the disk, set a side for various and sundry purposes.
A good example is the BSD disklabel and boot code on i386 which occupies
a total of four magic spaces: boot1, the disklabel, the padding behind
the disklabel and boot2. The reason we don't simply tell people to
write the appropriate stuff on the underlying device is that (some of)
the magic spaces might be real-time modifiable. It is for instance
possible to change a disklabel while partitions are open, provided
the open partitions do not get trampled in the process.
Sponsored by: DARPA & NAI Labs.
Notice that if the device on which the dump is set is destroyed for
any reason, the dump setting is lost. This in particular will
happen in the case of spoilage. For instance if you set dump on
ad0s1b and open ad0 for writing, ad0s* will be spoilt and the dump
setting lost. See geom(4) for more about spoiling.
Sponsored by: DARPA & NAI Labs.
3.The only thing worse than generalizing from one example
is generalizing from no examples at all.
Remove the fwcylinders attribute before anybody gets the idea that we
alone have squared the circle.
Sponsored by: DARPA & NAI Labs.
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
Caveats:
The new savecore program is not complete in the sense that it emulates
enough of the old savecores features to do the job, but implements none
of the options yet.
I would appreciate if a userland hacker could help me out getting savecore
to do what we want it to do from a users point of view, compression,
email-notification, space reservation etc etc. (send me email if
you are interested).
Currently, savecore will scan all devices marked as "swap" or "dump" in
/etc/fstab _or_ any devices specified on the command-line.
All architectures but i386 lack an implementation of dumpsys(), but
looking at the i386 version it should be trivial for anybody familiar
with the platform(s) to provide this function.
Documentation is quite sparse at this time, more to come.
Details:
ATA and SCSI drivers should work as the dump formatting code has been
removed. The IDA, TWE and AAC have not yet been converted.
Dumpon now opens the device and uses ioctl(DIOCGKERNELDUMP) to set
the device as dumpdev. To implement the "off" argument, /dev/null
is used as the device.
Savecore will fail if handed any options since they are not (yet)
implemented. All devices marked "dump" or "swap" in /etc/fstab
will be scanned and dumps found will be saved to diskfiles
named from the MD5 hash of the header record. The header record
is dumped in readable format in the .info file. The kernel
is not saved. Only complete dumps will be saved.
All maintainer rights for this code are disclaimed: feel free to
improve and extend.
Sponsored by: DARPA, NAI Labs
I have not been able to find very much information about the PC98
extended partition layout so this is gleaned from the source in
our pc98 architecture. Corrections and patched very welcome.
Sponsored by: DARPA and NAI Labs.
The detection code in this method is written so that it should work on
all architectures which means that you can plug a Sun disk into a i386
now and access the partitions.
We still need an endian-agnostic ufs/ffs before this is really
interresting, but the main focus was to get sparc64 onto the GEOM
trail.
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.
test and play with this.
This is not yet production quality and should be run only on dedicated
test boxes.
For people who want to develop transformations for GEOM there exist a
set of shims to run geom in userland (ask phk@freebsd.org).
Reports of all kinds to: phk@freebsd.org
Please include in report:
dmesg
sysctl debug.geomdot
sysctl debug.geomconf
Known significant limitations:
no kernel dump facility.
ioctls severely restricted.
Sponsored by: DARPA, NAI Labs
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
cloning infrastructure standard in kern_conf. Modules are now
the same with or without devfs support.
If you need to detect if devfs is present, in modules or elsewhere,
check the integer variable "devfs_present".
This happily removes an ugly hack from kern/vfs_conf.c.
This forces a rename of the eventhandler and the standard clone
helper function.
Include <sys/eventhandler.h> in <sys/conf.h>: it's a helper #include
like <sys/queue.h>
Remove all #includes of opt_devfs.h they no longer matter.
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.
Obtained from: BSD/OS
<sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.
CCD not converted yet, casts to struct buf (still safe)
atapi-cd casts to struct buf to examine B_PHYS
(Much of this done by script)
Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.
Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.
Add bio_queue field for struct bio aware disksort.
Address a lot of stylistic issues brought up by bde.
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)
substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)
This patch is machine generated except for the ccd.c and buf.h parts.
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.
B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.
Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.
Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.
This change is a step in the direction towards a stackable BIO capability.
A lot of this patch were machine generated (Thanks to style(9) compliance!)
Vinum users: Greg has not had time to test this yet, be careful.
The same goes for CD drivers and tape drivers. In systems with mixed IDE
and SCSI, devices in the same priority class will be sorted in attach
order.
Also, the 'CCD' priority is now the 'ARRAY' priority, and a number of
drivers have been modified to use that priority.
This includes the necessary changes to all drivers, except the ATA drivers.
Soren will modify those separately.
This does not include and does not require any change in the devstat
version number, since no known userland applications use the priority
enumerations.
Reviewed by: msmith, sos, phk, jlemon, mjacob, bde
- Move intrhook stuff into kernel.h
- Remove all occurrences of #device <device.h>
- Add kernel.h were necessary (nowhere)
- delete device.h
This file contained the structures for cfdata (old style config) and is no
longer used. It was included by most drivers.
It confuses the remote debugger as the definition of 'struct device' in
device.h is found before the one in bus_private.h.
BUF_LOCKFREE a buffer prior to physically freeing it. While these
bugs did not cause a crash, they might in the future.
Added eof handling for unlabeled partitions.
Submitted by: Tor.Egge@fast.no
have been there in the first place. A GENERIC kernel shrinks almost 1k.
Add a slightly different safetybelt under nostop for tty drivers.
Add some missing FreeBSD tags
Enhance MIRROR code. Add a few more sanity checks and implement
a zone-based disk selector to make use of both disks when reading.
Also implement a read fail-over. If a read error occurs on one
disk, the I/O is retried on the other.
NOTE: CCD's mirroring support cannot deal with write errors properly
in regards to recovery, meaning that 'old' data under a write error may
be read non-deterministically if you reboot after a write error, and CCD
certainly cannot deal with a disk changeout. And it still can't. Use
vinum if you are really serious about mirroring. CCD basically just
implements a poor-man's mirror.
sum the total amount of I/O issued to determine when all the I/O
has completed. This fails when the EOF boundry occurs in the middle
of an I/O. Using cbp->cb_buf.b_bufsize works better.
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.
The reversion requested was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.
the highly non-recommended option ALLOW_BDEV_ACCESS is used.
(bdev access is evil because you don't get write errors reported.)
Kill si_bsize_best before it kills Matt :-)
Use the specfs routines rather having cloned copies in devfs.
Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
please see comment in sys/conf.h about the flag argument.
Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.
Remove bogus and unused strategy1 routines.
Remove open/close arguments from dssize(). Pick them up from dev_t.
Remove unused and unfinished setgeom support from diskslice/label/bad144 code.
Reformat and initialize correctly all "struct cdevsw".
Initialize the d_maj and d_bmaj fields.
The d_reset field was not removed, although it is never used.
I used a program to do most of this, so all the files now use the
same consistent format. Please keep it that way.
Vinum and i4b not modified, patches emailed to respective authors.
udev_t in the kernel but still called dev_t in userland.
Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()
For now they're functions, they will become in-line functions
after one of the next two steps in this process.
Return major/minor/makedev to macro-hood for userland.
Register a name in cdevsw[] for the "filedescriptor" driver.
In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.
In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).
A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.
Without DEVT_FASCIST I belive this patch is a no-op.
Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.
Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.
Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)
Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)
(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
peripheral drivers can determine where in the devstat(9) list they are
inserted.
This requires recompilation of libdevstat, systat, vmstat, rpc.rstatd, and
any ports that depend on the devstat code, since the size of the devstat
structure has changed. The devstat version number has been incremented as
well to reflect the change.
This sorts devices in the devstat list in "more interesting" to "less
interesting" order. So, for instance, da devices are now more important
than floppy drives, and so will appear before floppy drives in the default
output from systat, iostat, vmstat, etc.
The order of devices is, for now, kept in a central table in devicestat.h.
If individual drivers were able to make a meaningful decision on what
priority they should be at attach time, we could consider splitting the
priority information out into the various drivers. For now, though, they
have no way of knowing that, so it's easier to put them in an easy to find
table.
Also, move the checkversion() call in vmstat(8) to a more logical place.
Thanks to Bruce and David O'Brien for suggestions, for reviewing this, and
for putting up with the long time it has taken me to commit it. Bruce did
object somewhat to the central priority table (he would rather the
priorities be distributed in each driver), so his objection is duly noted
here.
Reviewed by: bde, obrien
There is only cdevsw (which should be renamed in a later edit to deventry
or something). cdevsw contains the union of what were in both bdevsw an
cdevsw entries. The bdevsw[] table stiff exists and is a second pointer
to the cdevsw entry of the device. it's major is in d_bmaj rather than
d_maj. some cleanup still to happen (e.g. dsopen now gets two pointers
to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).
rawread()/rawwrite() went away as part of this though it's not strictly
the same patch, just that it involves all the same lines in the drivers.
cdroms no longer have write() entries (they did have rawwrite (?)).
tapes no longer have support for bdev operations.
Reviewed by: Eivind Eklund and Mike Smith
Changes suggested by eivind.
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
Saves about 280 butes of source per driver, 56 bytes in object size
and another 56 bytes moves from data to bss.
No functional change intended nor expected.
GENERIC should be about one k smaller now :-)
use sd87a or sd237e even if they start at the beginning of the slice.
You can also use sd85c if you prefer, although you need to change the
type field in the disklabel to "4.2BSD".
mailing list.
When initiating a write, ccdbuffer() returns two "struct ccdbuf *"s
linked together by the cb_mirror field. "cb_pflags &
CCDPF_MIRROR_DONE" is set to 0 on both of them.
When a component returns to ccdiodone(), it checks if "cb_pflags &
CCDPF_MIRROR_DONE" is set or not. If not, it sets the partner's
flag and returns. If it is, it means its partner has already
returned, so it will go to the regular cleanup (which is in the
fallthrough code).
There should be no performance or functionality changes unless the
higher-level scsi driver does something with the resid value. The change
is purely aesthetical and prepares us for the parity implementation.
caused by a different reason):
. #ifndef __FreeBSD__ around check for negative size, FreeBSD size_t is
unsigned
. Disable mirror/parity if interleave size is 0 (i.e., serial concatenation).
(1) The reads are always done from the first n/2 disks.
(2) Each write is done twice, to the "data" disk (in the first half) and
the "mirror" disk (in the second half).
ccdbuffer() now takes an extra argument (struct ccdbuf **) and stores
the pointer to ccdbuf in there. In case of a mirrored write, it
allocates and stores two pointers. The "residual" is also doubled
for mirrored writes so that ccdiodone() can correctly tell when all
the writes are done.