via %s
Most of the cases looked harmless, but this is done for the sake of
correctness. In one case it even allowed to drop an intermediate buffer.
Found by: clang
MFC after: 2 week
deadlock fixed in r207671.
- Wait for worker process to exit at class unload. The worker process
was not guaranteed to exit before the linker unloaded the module.
- Use 0 as the worker process exit status instead of ENXIO and style
the NOTREACHED comment.
Reviewed by: lulf
X-MFC after: r207671
proceed while g_unload_class() blocks the event thread. Fix this by not
running g_unload_class() as a GEOM event and dropping the topology lock
when withering needs to proceed.
PR: kern/139847
Silence on: freebsd-geom
do not constitute user-visible or active partitions and as such should
not prevent undoing pending operations.
While here, initialize the last usable sector for the placeholder geom
based on the null scheme, created to allow undoing the destruction of
a scheme. This gives consistent output with "gpart show".
Based on a patch from: "Andrey V. Elsukov" <bu7cher@yandex.ru>
Previsouly this condition was reported with EIO by bio_offset > mediasize
check.
Perhaps that check should be extended to bio_offset+bio_length > mediasize.
MFC after: 1 week
in a device independent manner. Also include an example anticipatory
scheduler, gsched_rr, which gives very nice performance improvements
in presence of competing random access patterns.
This is joint work with Fabio Checconi, developed last year
and presented at BSDCan 2009. You can find details in the
README file or at
http://info.iet.unipi.it/~luigi/geom_sched/
In other words, deny multiple read-only mounts of the same device.
Shared read-only mounts should theoretically be possible, but,
unfortunately, can not be implemented correctly using current
buffer cache code/interface and results in an eventual system crash.
Also, using nullfs seems to be a more efficient way to achieve the same
goal.
This gets us back to where we were before GEOM and where other BSDs are.
Submitted by: pjd (idea for checking for shared mounting)
Discussed with: phk, pjd
Silence from: fs@, geom@
MFC after: 2 weeks
In r205860 I missed the fact that there is code that strongly assumes
that devvp bo_bsize is equal to underlying provider's sectorsize.
In those places it is hard to obtain the sectorsize in an alternative
way if devvp bo_bsize is set to something else.
So, I am reverting bo_bsize assigment in g_vfs_open.
Instead, in getblk I use DEV_BSIZE block size for b_offset calculation
if vp is a disk vp as reported by vn_isdisk. This should coinside with
vp being a devvp.
Reported by: Mykola Dzham <i@levsha.me>
Tested by: Mykola Dzham <i@levsha.me>
Pointyhat to: avg
MFC after: 2 weeks
X-ToDo: convert bread(devvp) in all fs to use bo_bsize-d blocks
Because of how breadn -> bufstrategy -> g_vfs_strategy are currently
implemented, bread on devvp always expects DEV_BSIZE block size.
Thus, devvp bo_bsize must always be DEV_BSIZE irrespective of media
properties or filesystem implementation details.
Reviewed by: mckusick
MFC after: 2 weeks
to support various storage boxes which really aren't active-active.
We only write the label on the *first* provider. For all other providers
we just "add" the disk. This also allows for an "add" verb.
A usage implication is that you should specificy the currently active
storage path as the first provider.
Note that this does not add RDAC-like functionality, but better allows for
autovolumefailover configurations (additional checkins elsewhere will support
this).
Sponsored by: Panasas
MFC after: 1 month
provider names.
- Characters in range 0x01-0x1f except '\t', '\n', and '\r' are replaced
with '?'. Those characters are disallowed in XML.
- '&', '<', '>', '\'', '"' and characters in range 0x7f-0xff are
replaced with XML numeric character reference.
If the kern.geom.confxml sysctl provides invalid XML, libgeom
geom_xml2tree() fails and utilities using it do not work. Unsafe
characters are common in msdosfs and cd9660 labels.
PR: kern/104389
Submitted by: Doug Steinwand (original version)
Reviewed by: pjd
Discussed on: freebsd-geom
MFC after: 3 weeks
HAST allows to transparently store data on two physically separated machines
connected over the TCP/IP network. HAST works in Primary-Secondary
(Master-Backup, Master-Slave) configuration, which means that only one of the
cluster nodes can be active at any given time. Only Primary node is able to
handle I/O requests to HAST-managed devices. Currently HAST is limited to two
cluster nodes in total.
HAST operates on block level - it provides disk-like devices in /dev/hast/
directory for use by file systems and/or applications. Working on block level
makes it transparent for file systems and applications. There in no difference
between using HAST-provided device and raw disk, partition, etc. All of them
are just regular GEOM providers in FreeBSD.
For more information please consult hastd(8), hastctl(8) and hast.conf(5)
manual pages, as well as http://wiki.FreeBSD.org/HAST.
Sponsored by: FreeBSD Foundation
Sponsored by: OMCnet Internet Service GmbH
Sponsored by: TransIP BV
Note that due to e.g. write throttling ('wdrain'), it can stall all the disk
I/O instead of just the device it's configured for. Using it for removable
media is therefore not a good idea.
Reviewed by: pjd (earlier version)
zero stripeoffset in such case (as if device has no stripes), report offset
from the beginning of the media (as if device has single infinite stripe).
This gives partitioning tools information, required to guess better
partition alignment, in case if hardware doesn't report it's stripe size.
For example, it should give disklabel info about odd offset made by fdisk.
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.
PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month
- For SSDs use TRIM feature of DATA SET MANAGEMENT command, as defined by
ACS-2 specification working draft.
- For CompactFlash use CFA ERASE command, same as ad(4) does.
With this patch, `newfs -E /dev/ada1` was able to restore write speed of
my heavily weared OCZ Vertex SSD (firmware 1.4) up to the initial level
for the most part of it's capacity. Previous 1.3 firmware, even reportiong
TRIM capabilty bit set, was not working, reporting ABORT error for every
DSM command.
I have no idea whether it is normal, but for some reason it takes 200ms
to handle any TRIM command on this drive, that was making delete extremely
slow. But TRIM command is able to accept long list of LBAs and the length of
that list seems doesn't affect it's execution time. Implemented request
clusting algorithm allowed me to rise delete rate up to reasonable numbers,
when many parallel DELETE requests running.
- Instead of measuring last request execution time for each drive and
choosing one with smallest time, use averaged number of requests, running
on each drive. This information is more accurate and timely. It allows to
distribute load between drives in more even and predictable way.
- For each drive track offset of the last submitted request. If new request
offset matches previous one or close for some drive, prefer that drive.
It allows to significantly speedup simultaneous sequential reads.
PR: kern/113885
Reviewed by: sobomax
for specific "kinds" of disk labels - for example, GPT UUIDs. Reason
for this is that sometimes, other GEOM classes attach to these device
nodes instead of the proper ones - e.g. they attach to /dev/gptid/XXX
instead of /dev/ada0p2, which is annoying.
Reviewed by: pjd (earlier version)
MFC after: 1 month
This fixes a null pointer dereference with "gpart create -s GPT" after
the previous commit.
Reported by: Yuri Pankov
Pointyhat to: me
MFC after: 1 week
It is valid for an on-disk GPT header to report a header size which is
greater than 92 bytes. Previously, we would read in the sector and copy
only the 92 bytes that we know how to deal with before calculating the
checksum for comparison. This meant that when we did the checksum, we
overshot the buffer and took in random memory, so the checksum would fail.
We now determine the size of the header and allocate enough space to
preserve the entire on-disk contents. This allows us to be correctly
calculate the checksum and be able to modify and write the header back
to the disk, while preserving data that we might not understand.
Reported by: Kris Weston
Approved by: marcel@
MFC after: 2 weeks
depend on on-disk metadata. This was we won't attach to providers that are used
by other classes. For example we don't want to configure partitions on da0 if
it is part of gmirror, what we really want is partitions on mirror/foo.
During regular work it works like this: if provider is open for writing a class
receives the spoiled event from GEOM and detaches, once provider is closed the
taste event is send again and class can rediscover its metadata if it is still
there. This doesn't work that way when new class arrives, because GEOM gives
all existing providers for it to taste, also those open for writing. Classes
have to decided on their own if they want to deal with such providers (eg.
geom_dev) or not (classes modified by this commit).
Reported by: des, Oliver Lehmann <lehmann@ans-netz.de>
Tested by: des, Oliver Lehmann <lehmann@ans-netz.de>
Discussed with: phk, marcel
Reviewed by: marcel
MFC after: 3 days
code that merely emits an error and waits for a key press before
rebooting. The error being that extended partitions are not
bootable. The origin is presumed to be Windows 2000; Windows XP
does not do this...
For now, ignore the first 96 bytes when checking that the EBR is
(for the most part) all zeroes.
Tested by: Mario Lobo <mlobo@digiart.art.br>
MFC after: 1 week
It will be checked any way later by g_io_check() in g_io_schedule_down().
It is only needed here to not trigger panic from additional check, when
INVARIANTS enabled. So cover it with #ifdef INVARIANTS. It saves two
64bit divisions per request.
Remove msleep() timeout from g_io_schedule_up/down(). It works fine
without it, saving few percents of CPU on high request rates without
need to rearm callout twice per request.
by CHS addressing. Don't define these fields as 0xff, but rather define
them correctly. This prevents boot problems on PCs where GPT is being
used.
PR: 115406
Submitted by: Kent Hauser <kent@khauser.net>
Approved by: re (kib)
If the access counts were not increased and decreased in equal numbers by
gvinum consumers, the read access count would be inconsistent with the write
access count. Instead, modify the read access count with the write access
count directly to prevent any inconsistencies.
Approved by: re (kib)
is invalid because the ioctl happens without prior open. The ioctl
got introduced to provide backward compatibility for extended
partitions, but it ended up not being used because it didn't work
as expected. Since there are no consumers of the ioctl and the
implementation is broken, the best fix is to remove the code
entirely.
Spotted by: phk
Approved by: re (kensmith)
with I/O requests in flight on kernels compiled with "options INVARIANTS".
Also, make it obvious it's not right to call g_valid_obj() (and macros
using it, e.g. G_VALID_CONSUMER()) without topology lock held.
Approved by: re (kib)
Reported by: pho
struct bio to store classification information, and a hook
for classifier functions that can be called by g_io_request().
This code is from Fabio Checconi as part of his GSOC work.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex. Jails may
have their own host information, or they may inherit it from the
parent/system. The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL. The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.
The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.
Approved by: bz (mentor)
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.
In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.
While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.
VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.
This is necessary for two reasons:
1) In order to avoid collisions with the use of a BIOs flags set by a consumer
or a provider
2) Because GV_BIO_DONE was used to mark a BIO as done, not enough flags was
available, so the consumer flags of a BIO had to be misused in order to
support enough flags. The new queue makes it possible to recycle the
GV_BIO_DONE flag into GV_BIO_GROW.
As a consequence, gvinum will now work with any other GEOM class under it or
on top of it.
- Use bio_pflags for storing internal flags on downgoing BIOs, as the requests
appear to come from a consumer of a gvinum volume. Use bio_cflags only for
cloned BIOs.
- Move gv_post_bio to be used internally for maintenance requests.
- Remove some cases where flags where set without need.
PR: kern/133604
as they do not really relate and to prepare for an additional queue to be
covered by the BIO queue mutex.
- Implement wrappers for fetching the next element from the event queue as well
as for putting a new element into the BIO queue.
as the renaming only changes internal gvinum names and will not alter the geom
topology.
- The topology lock was not held when calling g_wither_geom after renaming.
naming of the partitions (GEOM_PART_EBR_COMPAT). When
compatibility is enabled, changes to the partitioning are
disallowed.
Remove the device name aliasing added previously to provide
backward compatibility, but which in practice doesn't give
us anything.
Enable compatibility on amd64 and i386.
o PC98 uses 32-bit block numbers. Limit the scheme to 2^32-1
blocks when the media is larger. The 32-bit block numbers
are implicit (16-bit cylinder * 8-bit head * 8-bit sector).
o EBR uses 32-bit block numbers. Limit the scheme to 2^32-1
blocks when the media is larger.
o Calculate the number of entries based on the rounded media
size, rather than the raw media size.
The work have been under testing and fixing since then, and it is mature enough
to be put into HEAD for further testing.
A lot have changed in this time, and here are the most important:
- Gvinum now uses one single workerthread instead of one thread for each
volume and each plex. The reason for this is that the previous scheme was
very complex, and was the cause of many of the bugs discovered in gvinum.
Instead, gvinum now uses one worker thread with an event queue, quite
similar to what used in gmirror.
- The rebuild/grow/initialize/parity check routines no longer runs in
separate threads, but are run as regular I/O requests with special flags.
This made it easier to support mounted growing and parity rebuild.
- Support for growing striped and raid5-plexes, meaning that one can extend the
volumes for these plex types in addition to the concat type. Also works while
the volume is mounted.
- Implementation of many of the missing commands from the old vinum:
attach/detach, start (was partially implemented), stop (was partially
implemented), concat, mirror, stripe, raid5 (shortcuts for creating volumes
with one plex of these organizations).
- The parity check and rebuild no longer goes between userland/kernel, meaning
that the gvinum command will not stay and wait forever for the rebuild to
finish. You can instead watch the status with the list command.
- Many problems with gvinum have been reported since 5.x, and some has been hard
to fix due to the complicated architecture. Hopefully, it should be more
stable and better handle edge cases that previously made gvinum crash.
- Failed drives no longer disappears entirely, but now leave behind a dummy
drive that makes sure the original state is not forgotten in case the system
is rebooted between drive failures/swaps.
- Update manpage to reflect new commands and extend it with some examples.
Sponsored by: Google Summer of Code 2007
Mentored by: le
Tested by: Rick C. Petty <rick-freebsd2008 -at- kiwi-computer.com>
o Don't create an APM scheme underneath another scheme when
the probe doesn't allow it.
o APM uses 32-bit block numbers. Limit the scheme to 2^32-1
blocks when the media is larger.
"raw" names. While there, change the formatting of extended MSDOS partitions
so that the dot (".") is not used to separate two numbers (which kind of
looks like the whole is a decimal number). Use "+" instead, which also
hints that the second part of the name is the offset from the start of
the partition in the first part of the name. Also change the offset from
decimal to hexadecimal notation, simply for aesthetic reasons and future
compatibility.
GEOM_PART is the default in 8-CURRENT but not yet in 7-STABLE so this
changeset can be MFC-ed without causing major problems from the second
part.
Reviewed by: marcel
Approved by: gnn (mentor)
MFC after: 2 weeks
to resurrect (maybe honor foot shooting bit in kern.geom_debugflags)
o fix match macro so we now recognize we want to merge FIS dir with RedBoot
config parameters even if we don't actually do it
properly. Otherwise the minimum of 1 is used and you can
only insert a single partition/slice and only at sector
0 (index 1).
o When adding a partition/slice, recalculate the index after
the start and size of the partition/slice are adjusted to
make them a multiple of the track size. Since the precheck
method sets the index based on the start of the partition
as provided by the user, we know that we're off by at most
1 and adjusting the index is safe.
1. Extend geom_dev by having it create the symlink (i.e. call
make_dev_alias) based on the DIOCGPROVIDERALIAS ioctl.
In this way the functionaility is generic and thus usable
by any geom/provider.
2. Have g_part handle said ioctl through the devalias method,
so that it's under control of the scheme itself. By design
the alias will not be created for newly added partitions.
method allows schemes to reject the ctl request, pre-check
the parameters and/or modify/set parameters. There are 2
use cases that triggered the addition:
1. When implementing a R/O scheme, deletes will still
happen to the in-memory representation. The scheme is
not involved in that operation. The pre-check method
can be used to fail the delete up-front. Without this
the write to disk will typically fail, but at that
time the delete already happened.
2. The EBR scheme uses a linked list to record slices.
There's no index. The EBR scheme defines the index
as a function of the start LBA of the partition. The
add verb picks an index for the range and then invokes
the add method of the scheme to fill in the blanks. It
is too late for the add method to change the index.
The pre-check is used to set the index up-front. This
also (silently) overrides/nullifies any (pointless)
user-specified index value.
found inside extended partitions and used to create logical partitions.
At this time write/modify support is not (yet) present.
The EBR and MBR schemes both check the parent scheme. The MBR will
back-off when nested under another MBR, whereas the EBR only nests
under a MBR.
underlying partitioning scheme.
o Put the start and end of the partition in the XML configuration.
The start and end are the LBAs of the first and last sector
(resp.) of the partition. They are currently identical to the
offset and size attributes, which describe the partition as an
offset and size in bytes, but may not in the future. The start
and end will be used for the logical partition boundaries and
may include metadata. The offset and size will always represent
the useful storage space within the partition. Typically these
two notions are the same, but for logical partitions in an
extended partition, the EBR is more naturally treated as being
part of the partition.
Now that make_dev() doesn't require unit numbers to be unique, there is
no need to use an unrhdr here to generate the numbers. Remove the entire
init-routine, because it is optional.
allowing the full 16-bit width of the corresponding fields in the
VTOC8 label to be used. The removed limits basically only held
true for providers labeled using the synthetic geometry provided
by cam_calc_geometry(9) but neither SCSI disks labeled with Solaris
nor sufficiently large ATA disks.
- Given that providers (originally) labeled with Solaris typically
use the native geometry as reported by the target while FreeBSD
typically uses a synthetic one put the message complaining about
mismatching geometries between what the label indicates and what
GEOM thinks the provider has, which we generally can't help,
under bootverbose in order to not unnecessarily scare users.
- For informational purposes add the non-matching values to the
message complaining about them, similar to what r186501 did for
g_part_bsd_read() except also indicating the origin of the
values.
- Make it clear that the messages emitted by this code refer to
the VTOC8 support rather than to another existing scheme or to
VTOC32.
o Don't check the dummy fields.
o The entry is unused if either dp_mid is 0 or dp_sid is 0.
o The start or end cylinder cannot be 0.
o The start CHS cannot be equal to the end CHS.
Submitted by: nyan
plex. If the plex is a raid5 plex, and is being written to, parity data might
have to be read from the underlying disks, requiring them to be opened for
reading as well as writing.
MFC after: 1 week
the device, which means refcount on periph drivers never drops,
which means cam_sim_free() never returns, which results in umass
sleeping there ad infinitum.
Submitted by: pjd
Reviewed by: scottl, pjd
Approved by: rwatson (mentor)
Sponsored by: FreeBSD Foundation
an unclean shutdown would make it impossible to mount rootfs at boot.
PR: kern/128529
Reviewed by: pjd
Approved by: rwatson (mentor)
Sponsored by: FreeBSD Foundation
practice, but it is a good programming practice and allows the kernel to not
depend on userland correctness.
- While there, make sizeof usage match the rest of the code.
Found with: Coverity Prevent(tm)
CID: 660, 662
practice, but it is a good programming practice nontheless and it allows the
kernel to not depend on userland correctness.
Found with: Coverity Prevent(tm)
CID: 655-659, 664-667
still valid. We were checking the state of the header and
not the table.
PR: 119868
Based on a patch from: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 1 week
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.
This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.
Discussed with: kib
Tested by: pho
a little refinement, but is good enough to commit as is.
# Should look to see if I should move swab(3) into the kernel or just
# provide the unoptimized routine here.
Reviewed by: marcel@
are possibly still being created. The d_secperunit field
contains the number of sectors of the disk and not of the
slice/partition to which the disklabel applies.
Rather than reject the disklabel, we now silently adjust
the field. Existing code, like bslabel(8), does not seem
to check the label that extensively and seems to adjust
fields as a side-effect as well.
In other words, it's not that important apparently, so
gpart should not be too strict about it.
Reported by: nyan@
Reported by: Andriy Gapon <avg@icyb.net.ua>
Export the active and bootable flags as attributes in
the configuration XML and allow them to be manipulated
with the set/unset commands.
Since libdisk treats the flags as part of the partition
type, preserve behavior by keeping them included in the
configuration text.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()
and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.
As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP
Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
the gvinum header in fields of fixed size and in a big endian byte order
rather than the size and byte order of the actual platform.
Note that the change is backwards compatible with the old gvinum configuration
format, but will save the configuration in the new format when the 'saveconfig'
command is executed.
Submitted by: Rick C. Petty <rick-freebsd -at- kiwi-computer.com>
if the probe succeeds. This guarantees that the BSD scheme
wins over the MBR scheme when MBR gets to probe first. Build-
or link-time conditions can cause schemes to end up in the
linker set in a different order. Normally BSD is before MBR
in the linker set and as such get to probe first. But typically
when the kernel gets rebuild or relinked, this can change.
that a nested partition (typically the BSD disklabel)
is not done tasting while the root file system is being
mounted. While this is rare, it's still possible.
When I changed kern_conf.c three months ago I made device unit numbers
equal to (unneeded) device minor numbers. We used to require
bitshifting, because there were eight bits in the middle that were
reserved for a device major number. Not very long after I turned
dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
The unit2minor() and minor2unit() macro's were no-ops.
We'd better not remove these four macro's from the kernel, because there
is a lot of (external) code that may still depend on them. For now it's
harmless to remove all invocations of unit2minor() and minor2unit().
Reviewed by: kib
- Add a routine for looking up a device and checking if it is a valid geom
provider given a partial or full path to its device node.
Reviewed by: phk
Approved by: pjd (mentor)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
to global hostname and domainname variables. Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout(). A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.
Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.
MFC after: 3 weeks
Even though we got rid of device major numbers some time ago, device
drivers still need to provide unique device minor numbers to make_dev().
These numbers are only used inside the kernel. They are not related to
device major and minor numbers which are visible in devfs. These are
actually based on the inode number of the device.
It would eventually be nice to remove minor numbers entirely, but we
don't want to be too agressive here.
Because the 8-15 bits of the device number field (si_drv0) are still
reserved for the major number, there is no 1:1 mapping of the device
minor and unit numbers. Because this is now unused, remove the
restrictions on these numbers.
The MAXMAJOR definition was actually used for two purposes. It was used
to convert both the userspace and kernelspace device numbers to their
major/minor pair, which is why it is now named UMINORMASK.
minor2unit() and unit2minor() have now become useless. Both minor() and
dev2unit() now serve the same purpose. We should eventually remove some
of them, at least turning them into macro's. If devfs would become
completely minor number unaware, we could consider using si_drv0 directly,
just like si_drv1 and si_drv2.
Approved by: philip (mentor)
the method for the (indent == NULL) case (i.e. the kern.geom.conftxt
sysctl). The purpose is to extend the conftxt output with scheme-
specific fields which can be used by libdisk. In particular, have
the schemes dump the xs and xt fields, which contain the backward
compatible values for class type and partition type. This allows
libdisk to work with the legacy slicers as well as with gpart and
helps/promotes migration.
SI_SUB_DRIVERS) to avoid loading schemes before all the GEOM
classes have been loaded and initialized. Otherwise we may
end up using mutexes that haven't been initialized (due to
g_retaste() posting an event).
allows the class to create a different GEOM for the same provider
as well as avoid that we end up with multiple GEOMs of the same
class with the same name.
For example, when a disk contains a PC98 partition table but
only MBR is supported, then the partition table can be treated
as a MBR. If support for PC98 is later loaded as a module, the
MBR scheme is pre-empted for the PC98 scheme as expected.
to declaring a proper module. The module event handler is part of the
gpart core and will add the scheme to an internal list on module load
and will remove the scheme from the internal list on module unload.
This makes it possible to dynamically load and unload partitioning
schemes.
to it for tasting. This is useful when the class, through means outside
the scope of GEOM, can claim providers previously unclaimed.
The g_retaste() function posts an event which is handled by the
g_retaste_event().
Event suggested by: phk
not have VTOC information about the partitions, it will be created.
This is because the VTOC information is used for the partition type
and FreeBSD's sunlabel(8) does not create nor use VTOC information.
For this purpose, new tags have been added to support FreeBSD's
partition types.
partition table is empty, check to see if we have something that
looks sufficiently like a BPB. On non-i386 machines, the boot
sector typically doesn't contain boot code; the end of the boot
sector is all zeroes. This is also where the partition table is
for MBRs.
We only check the sector size and cluster size, as that seems to
be the most reliable across implementations, BPB versions and
platforms.
only because there's a partition table where the boot sector has
boot code. Boot sectors without boot code look like a MBR for all
practical purposes. This change adds a check for the partition table
and fails the probe when it's obvously invalid. The assumption being
that the sector contains a boot sector and not a MBR.
More checks are needed to distinguish a boot secto without boot code
from a (empty) MBR.
The logical disks will appear as /dev/lvm/<vol group>-<logical vol>, for
instance /dev/lvm/vg0-home. G_LINUX_LVM currently supports linear stripes with
segments on multiple physical disks. The metadata is read only, logical
volumes can not be allocated or resized.
Reviewed by: Ivan Voras
Previously known as geom_lvm(4), rename requested by des, phk.
The logical disks will appear as /dev/lvm/<vol group>-<logical vol>, for
instance /dev/lvm/vg0-home. GLVM currently supports linear stripes with
segments on multiple physical disks. The metadata is read only, logical
volumes can not be allocated or resized.
Reviewed by: Ivan Voras
o BSD disklabels have relative offsets. Even for the BSD in MBR slice
setup, except when the mbroffset ioctl is supported. Since we don't
support that ioctl, bsdlabel(8) expects relative offsets. So, when
reading an existing disklabel, correct for disklabels that mistakenly
have the mbroffset offsets.
o Don't take the geometry seriously, because it's untrustworthy. We do
expect the numbers to be within range. This means that the secperunit
field will not be computed from secpercyl and ncyls, but simply is
the mediasize in sectors.
o Don't enforce partitions to be aligned to track boundaries. The
default label, constructed by bsdlabel(8), puts partition a at offset
BBSIZE bytes, which commonly means sector 16.
or any other bio chopping geom a reasonable size of work.
Check for delivered signals between chunks, because the request size
and service time is unbounded.
XXX: This only works currently with GEOM_GPT which only exists in 6.x.
XXX: I didn't add 'mbroffset' support for a GPT partition holding a BSD
label as I'm not sure if they use relative or absolute offsets.
MFC after: 3 days
o Disklabels can have between 8 and 20 partitions (inclusive).
o No device special file is created for the raw partition.
o Switch ia64 to use this backend.
o No support for boot code yet.
on i386 and amd64 machines. The overall process is that /boot/pmbr lives
in the PMBR (similar to /boot/mbr for MBR disks) and is responsible for
locating and loading /boot/gptboot. /boot/gptboot is similar to /boot/boot
except that it groks GPT rather than MBR + bsdlabel. Unlike /boot/boot,
/boot/gptboot lives in its own dedicated GPT partition with a new
"FreeBSD boot" type. This partition does not have a fixed size in that
/boot/pmbr will load the entire partition into the lower 640k. However,
it is limited in that it can only be 545k. That's still a lot better than
the current 7.5k limit for boot2 on MBR. gptboot mostly acts just like
boot2 in that it reads /boot.config and loads up /boot/loader. Some more
details:
- Include uuid_equal() and uuid_is_nil() in libstand.
- Add a new 'boot' command to gpt(8) which makes a GPT disk bootable using
/boot/pmbr and /boot/gptboot. Note that the disk must have some free
space for the boot partition.
- This required exposing the backend of the 'add' function as a
gpt_add_part() function to the rest of gpt(8). 'boot' uses this to
create a boot partition if needed.
- Don't cripple cgbase() in the UFS boot code for /boot/gptboot so that
it can handle a filesystem > 1.5 TB.
- /boot/gptboot has a simple loader (gptldr) that doesn't do any I/O
unlike boot1 since /boot/pmbr loads all of gptboot up front. The
C portion of gptboot (gptboot.c) has been repocopied from boot2.c.
The primary changes are to parse the GPT to find a root filesystem
and to use 64-bit disk addresses. Currently gptboot assumes that the
first UFS partition on the disk is the / filesystem, but this algorithm
will likely be improved in the future.
- Teach the biosdisk driver in /boot/loader to understand GPT tables.
GPT partitions are identified as 'disk0pX:' (e.g. disk0p2:) which is
similar to the /dev names the kernel uses (e.g. /dev/ad0p2).
- Add a new "freebsd-boot" alias to g_part() for the new boot UUID.
MFC after: 1 month
Discussed with: marcel (some things might still change, but am committing
what I have so far)
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.
I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.
Without this change the following situation was possible:
1. Provider is orphaned from within class' access() method on last write
close - orphan provider event is send.
2. GEOM detects last write close on a provider and sends new provider event.
3. g_orphan_register() is called, and calls all orphan methods of attached
consumers.
4. New provider event is executed on orphaned provider, all classes can
taste already orphaned provider, and some may attach consumers to it.
Those consumers will never go away, because the g_orphan_register()
was already called.
We end up with a zombie provider.
With this change, at step 3, we will cancel new provider event.
How to repeat this problem:
# mdconfig -a -t malloc -s 10m
# geli init -i 0 md0
# geli attach md0
# newfs -L test /dev/md0.eli
# mount /dev/ufs/test /mnt/tmp
# geli detach -l md0.eli
# umount /mnt/tmp
# glabel status
Name Status Components
ufs/test N/A N/A
Reviewed by: phk
Approved by: re (kensmith)
providers with limited physical storage and add physical storage as
needed.
Submitted by: Ivan Voras
Sponsored by: Google Summer of Code 2006
Approved by: re (kensmith)
don't have it. Some partitioning schemes, as well as file systems,
operate on the geometry and without it such schemes (e.g. MBR)
and file systems (e.g. FAT) can't be created. This is useful for
memory disks.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
sysctl_handle_int is not sizeof the int type you want to export.
The type must always be an int or an unsigned int.
Remove the instances where a sizeof(variable) is passed to stop
people accidently cut and pasting these examples.
In a few places this was sysctl_handle_int was being used on 64 bit
types, which would truncate the value to be exported. In these
cases use sysctl_handle_quad to export them and change the format
to Q so that sysctl(1) can still print them.
exists and contains the 'C' flag.
o The partition label can be the empty string. It's how labels are
cleared.
o When an action fails, lower permissions when they were raised
in order to allow the action. A failed action will not result
in any uncommitted changes.
o Allow the flags paremeter to be present but empty. It's the
equivalent of not being present.
119373: o Remove the query verb, along with the request and response
parameters.
o Add the version and output parameters.
119390: [APM,GPT] Properly clear deleted entries.
119394: o Make the alias the standard and use the '!' to prefix
literal partition types.
o Treat schemes and partition types as case insensitive.
119462: [GPT] Fix a page fault caused when modifying a partition entry
without a new partition type.
DIOCGFLUSH - Flush write cache (sends BIO_FLUSH).
DIOCGDELETE - Delete data (mark as unused) (sends BIO_DELETE).
DIOCGIDENT - Get provider's uniqe and fixed identifier (asks for
GEOM::ident attribute).
First two are self-explanatory, but the last one might not be. Here are
properties of provider's ident:
- ident value is preserved between reboots,
- provider can be detached/attached and ident is preserved,
- provider's name can change - ident can't,
- ident value should not be based on on-disk metadata; in other words
copying whole data from one disk to another should not yield the same
ident for the other disk,
- there could be more than one provider with the same ident, but only if
they point at exactly the same physical storage, this is the case for
multipathing for example,
- GEOM classes that consumes single providers and provide single providers,
like geli, gbde, should just attach class name to the ident of the
underlying provider,
- ident is an ASCII string (is printable),
- ident is optional and applications can't relay on its presence.
The main purpose for this is that application and remember provider's ident
and once it tries to open provider by its name again, it may compare idents
to be sure this is the right provider. If it is not (idents don't match),
then it can open provider by its ident.
OK'ed by: phk
o make all crypto drivers have a device_t; pseudo drivers like the s/w
crypto driver synthesize one
o change the api between the crypto subsystem and drivers to use kobj;
cryptodev_if.m defines this api
o use the fact that all crypto drivers now have a device_t to add support
for specifying which of several potential devices to use when doing
crypto operations
o add new ioctls that allow user apps to select a specific crypto device
to use (previous ioctls maintained for compatibility)
o overhaul crypto subsystem code to eliminate lots of cruft and hide
implementation details from drivers
o bring in numerous fixes from Michale Richardson/hifn; mostly for
795x parts
o add an optional mechanism for mmap'ing the hifn 795x public key h/w
to user space for use by openssl (not enabled by default)
o update crypto test tools to use new ioctl's and add cmd line options
to specify a device to use for tests
These changes will also enable much future work on improving the core
crypto subsystem; including proper load balancing and interposing code
between the core and drivers to dispatch small operations to the s/w
driver as appropriate.
These changes were instigated by the work of Michael Richardson.
Reviewed by: pjd
Approved by: re
to problems when the geli device is used with file system or as a swap.
Hopefully will prevent problems like kern/98742 in the future.
MFC after: 1 week
arrangement that has no intrinsic internal knowledge of whether devices
it is given are truly multipath devices. As such, this is a simplistic
approach, but still a useful one.
The basic approach is to (at present- this will change soon) use camcontrol
to find likely identical devices and and label the trailing sector of the
first one. This label contains both a full UUID and a name. The name is
what is presented in /dev/multipath, but the UUID is used as a true
distinguishor at g_taste time, thus making sure we don't have chaos
on a shared SAN where everyone names their data multipath as "Fred".
The first of N identical devices (and N *may* be 1!) becomes the active
path until a BIO request is failed with EIO or ENXIO. When this occurs,
the active disk is ripped away and the next in a list is picked to
(retry and) continue with.
During g_taste events new disks that meet the match criteria for existing
multipath geoms get added to the tail end of the list.
Thus, this active/passive setup actually does work for devices which
go away and come back, as do (now) mpt(4) and isp(4) SAN based disks.
There is still a lot to do to improve this- like about 5 of the 12
recommendations I've received about it, but it's been functional enough
for a while that it deserves a broader test base.
Reviewed by: pjd
Sponsored by: IronPort Systems
MFC: 2 months
flash card reader.
Also remove an 'Opened da0 -> <random number>' which is not needed on a daily
basis (available through bootverbose).
Reviewed by: phk, ken
MFC after: 1 week
partitioning class that supports multiple schemes. Current
schemes supported are APM (Apple Partition Map) and GPT.
Change all GEOM_APPLE anf GEOM_GPT options into GEOM_PART_APM
and GEOM_PART_GPT (resp).
The ctlreq interface supports verbs to create and destroy
partitioning schemes on a disk; to add, delete and modify
partitions; and to commit or undo changes made.
We can't bind to a CPU which is not yet on-line, so add code that wait for
CPUs to go on-line before binding to them.
Reported by: Alin-Adrian Anton <aanton@spintech.ro>
MFC after: 2 weeks
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.
Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
gmirror and graid3 in a way that it is not resynchronized after a
power failure or system crash.
It is safe when gjournal is running on top of gmirror/graid3.
we won't be able to exit from the thread.
Function g_eli_cpu_is_disabled() stoled from kern_pmc.c.
PR: 104669
Reported by: Nikolay Mirin <nik@optim.com.ru>
MFC after: 1 week
- Do not modify mnt_flag without mount interlock held.
- Do not touch MNT_ASYNC flag, as this can lead to a race with nmount(2).
Pointed out by: tegge
Reviewed by: tegge
journaling and can be tought about marking file system as clean before
doing journal switch, which easly allows to add journaling to file
systems that don't have this feature.
Sponsored by: home.pl
read requests to its consumer. It has been developed to address
the problem of a horrible read performance of a 64k blocksize FS
residing on a RAID3 array with 8 data components, where a single
disk component would only get 8k read requests, thus effectively
killing disk performance under high load. Documentation will be
provided later. I'd like to thank Vsevolod Lobko for his bright
ideas, and Pawel Jakub Dawidek for helping me fix the nasty bug.
request can still have bio_to set to sc_provider (this is READ part of a
synchronization request) and in this case g_{mirror,raid3}_sync() wasn't
called as it should be.
MFC after: 1 week
This way GEOM classes can safely detach from provider when an orphan
event is received. This fixes 'detach with active requests' panic for
gstripe/gconcat under load.
PR: kern/102766
Submitted by: mjacob
OK'ed by: phk
MFC after: 1 week
add count of active and total components to the launched line so you can
see at a glance if your mirror/raid3 is complete...
now:
GEOM_MIRROR: Device mirror/sam launched (2/2).
Reviewed by: pjd
- hold/release device in start/done routines, this will probably slow
down things a bit, but previous code was racy;
- only release device if g_gate_destroy() failed - if it succeeded device
is dead and there is nothing to release;
- various other changes which makes forcible destruction reliable.
MFC after: 3 days
created on Windows XP (and others maybe) were not detected.
We detected only those created with newfs_msdos(8).
Submitted by: Tobias Reifenberger <treif@mayn.de>
style(9)ified by: pjd
This way one will be able to use provider encrypted on eg. i386 on
eg. sparc64. This doesn't really buy us much today, because UFS isn't
endian agnostic.
We retain backward compatibility by setting G_ELI_FLAG_NATIVE_BYTE_ORDER
flag on devices with version number less than 2 and not converting the
offset.
o PMBR partitions count to the number of partitions on the disk, which
means that if a PMBR entry is invalid we will not treat the MBR as a
PMBR by virtue of it not describing any partitions.
Previously the checks were inconsistent in that an invalid PMBR entry
would be harmless when no other partitions exist (we would treat the
MBR as a PMBR by virtue of it being empty), but it would be fatal when
there is at least one other partition.
o The partition size of a PMBR partition is one less than the media size
because the GPT starts at the second sector (LBA 1) and extends to
the end of the media. For backward bug-compatibility we accept a size
that's exactly the media size (FreeBSD bug).
Also, when the partition size can not be represented in a 32-bit
integral, the partition size in the MBR is to be set to 0xFFFFFFFF.
Accept this as a valid size, even if the size can be represented.
we obtained access. It is possible that GPT gets to taste a disk
first, which means the disk has not been opened before and it will
not get opened until after we checked the mediasize and sectorsize.
However, since the mediasize and sectorsize are determined at open
and that happens when access is optained, checking the mediasize
and sectorsize before obtaining access may result in GPT rejecting
the disk.