ZFS large block support.
Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool). This *may* remain unchanged because of memory constraint.
Limited safety belt is provided for mounted root filesystem but use
caution is advised.
Illumos issue:
5027 zfs large block support
MFC after: 1 month
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
illumos/illumos-gate@43466aae47
NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.
MFC after: 2 weeks
.. so that consistent compilation algorithms are used for both
architectures as in practice the binaries are expected to be
interchangeable (for time being).
Previously i386 used default setting which were equivalent to
-march=i486 -mtune=generic.
The only difference is using smaller but slower "leave" instructions.
Discussed with: jhb, dim
MFC after: 29 days
- only filesystem datasets are supported
- children names are printed to stdout
To do: allow to iterate over the list and fetch names programatically
MFC after: 17 days
The previous one was totally bogus as it used hash value of
_output_ variable as an index for searching...
The only reliable way to do a reverse lookup here is to iterate
over all entries.
MFC after: 15 days
often modified directory created symbolic links points to - it cause
unnecessary full rebuilds each time make runs when directory is changed.
So do it only if symbolic link does not exists, which usually means that
objdir is clean anyway.
MFC after: 1 week
This is to silence warnings that result from different definitions of
uint64_t on different architectures, specifically i386 and sparc64.
MFC after: 1 month
In zfs loader zfs device name format now is "zfs:pool/fs",
fully qualified file path is "zfs:pool/fs:/path/to/file"
loader allows accessing files from various pools and filesystems as well
as changing currdev to a different pool/filesystem.
zfsboot accepts kernel/loader name in a format pool:fs:path/to/file or,
as before, pool:path/to/file; in the latter case a default filesystem
is used (pool root or bootfs). zfsboot passes guids of the selected
pool and dataset to zfsloader to be used as its defaults.
zfs support should be architecture independent and is provided
in a separate library, but architectures wishing to use this zfs support
still have to provide some glue code and their devdesc should be
compatible with zfs_devdesc.
arch_zfs_probe method is used to discover all disk devices that may
be part of ZFS pool(s).
libi386 unconditionally includes zfs support, but some zfs-specific
functions are stubbed out as weak symbols. The strong definitions
are provided in libzfsboot.
This change mean that the size of i386_devspec becomes larger
to match zfs_devspec.
Backward-compatibility shims are provided for recently added sparc64
zfs boot support. Currently that architecture still works the old
way and does not support the new features.
TODO:
- clear up pool root filesystem vs pool bootfs filesystem distinction
- update sparc64 support
- set vfs.root.mountfrom based on currdev (for zfs)
Mid-future TODO:
- loader sub-menu for selecting alternative boot environment
Distant future TODO:
- support accessing snapshots, using a snapshot as readonly root
Reviewed by: marius (sparc64),
Gavin Mu <gavin.mu@gmail.com> (sparc64)
Tested by: Florian Wagner <florian@wagner-flo.net> (x86),
marius (sparc64)
No objections: fs@, hackers@
MFC after: 1 month
V100, the firmware is known to be broken and not allowing to simultaneously
open disk devices, causing attempts to boot from a mirror or RAIDZ to cause
a crash. This will be worked around later. The firmwares of newer sun4u models
don't seem to exhibit this problem though.
Steps for ZFS booting:
1. create VTOC8 label
# gpart create -s vtoc8 da0
2. add partitions, f.e.:
# gpart add -t freebsd-zfs -s 60g da0
# gpart add -t freebsd-swap da0
resulting in something like:
# gpart show
=> 0 143331930 da0 VTOC8 (68G)
0 125821080 1 freebsd-zfs (60G)
125821080 17510850 2 freebsd-swap (8.4G)
3. create zpool
# zpool create bunker da0a
or for mirror/RAIDZ (after preparing additional disks as in steps 1. + 2.):
# zpool create bunker mirror da0a da1a
# zpool create bunker raidz da0a da1a da2a ...
4. set bootfs
# zpool set bootfs=bunker bunker
5. install zfsboot
# zpool export bunker
# gpart bootcode -p /boot/zfsboot da0
6. write zfsloader to the ZFS Boot Block (so far, there's no dedicated tool
for this, so dd(1) has to be used for this purpose)
When using mirror/RAIDZ, step 4. and the dd(1) invocation should be repeated
for the additional disks in order to be able to boot from another disk in
case of failure.
# sysctl kern.geom.debugflags=0x10
# dd if=/boot/zfsloader of=/dev/da0a bs=512 oseek=1024 conv=notrunc
# zpool import bunker
7. install system on ZFS filesystem
Don't forget to set 'zfs_load="YES"' and vfs.root.mountfrom="zfs:bunker" in
loader.conf as well as 'zfs_enable="YES"'in rc.conf.
8. copy zpool.cache to the ZFS filesystem
cp -p /boot/zfs/zpool.cache /bunker/boot/zfs/zpool.cache
9. set mountpoint
# zfs set mountpoint=/ bunker
10. Now, given that aliases for all disks in the zpool exists (check with
the `devalias` command on the boot monitor prompt) and disk0 corresponds
to da0 (likewise for additional disks), the system can be booted from the
ZFS with:
{1} ok boot disk0
PR: 165025
Submitted by: Gavin Mu
- Decompress assembled gang block data if compressed.
- Verify checksum of a gang header.
- Verify checksum of assembled gang block data.
- Verify checksum of uber block.
Submitted by: avg
MFC after: 3 days
physical block size declared in bp may not always be what we want.
For example in case of gang block header physical block size declared
in bp is much larger than SPA_GANGBLOCKSIZE (512 bytes) and checksum
calculation failed. This bug could lead to accessing unallocated
memory and resets/failures during boot.
MFC after: 3 days
handled by lower layers like vdev_raidz, which uses bp for checksum
verification. This bug could lead to NULL pointer reference and resets
during boot.
MFC after: 3 days
The utility is not connected to the build, so it should be safe
to update it.
To do: move the utility to tools/.
Some code is provided by Peter Jeremy <peterjeremy@acm.org>
Tested by: Sebastian Chmielewski <chmielsster@gmail.com>,
Peter Jeremy <peterjeremy@acm.org> (earlier versions)
Approved by: re (kib)
MFC after: 4 days
Few new things available from now on:
- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
transaction group.
- Possibility to import pool in read-only mode.
MFC after: 1 month
clean up most layering violations:
sys/boot/i386/common/rbx.h:
RBX_* defines
OPT_SET()
OPT_CHECK()
sys/boot/common/util.[ch]:
memcpy()
memset()
memcmp()
bcpy()
bzero()
bcmp()
strcmp()
strncmp() [new]
strcpy()
strcat()
strchr()
strlen()
printf()
sys/boot/i386/common/cons.[ch]:
ioctrl
putc()
xputc()
putchar()
getc()
xgetc()
keyhit() [now takes number of seconds as an argument]
getstr()
sys/boot/i386/common/drv.[ch]:
struct dsk
drvread()
drvwrite() [new]
drvsize() [new]
sys/boot/common/crc32.[ch] [new]
sys/boot/common/gpt.[ch] [new]
- Teach gptboot and gptzfsboot about new files. I haven't touched the
rest, but there is still a lot of code duplication to be removed.
- Implement full GPT support. Currently we just read primary header and
partition table and don't care about checksums, etc. After this change we
verify checksums of primary header and primary partition table and if
there is a problem we fall back to backup header and backup partition
table.
- Clean up most messages to use prefix of boot program, so in case of an
error we know where the error comes from, eg.:
gptboot: unable to read primary GPT header
- If we can't boot, print boot prompt only once and not every five
seconds.
- Honour newly added GPT attributes:
bootme - this is bootable partition
bootonce - try to boot from this partition only once
bootfailed - we failed to boot from this partition
- Change boot order of gptboot to the following:
1. Try to boot from all the partitions that have both 'bootme'
and 'bootonce' attributes one by one.
2. Try to boot from all the partitions that have only 'bootme'
attribute one by one.
3. If there are no partitions with 'bootme' attribute, boot from
the first UFS partition.
- The 'bootonce' functionality is implemented in the following way:
1. Walk through all the partitions and when 'bootonce'
attribute is found without 'bootme' attribute, remove
'bootonce' attribute and set 'bootfailed' attribute.
'bootonce' attribute alone means that we tried to boot from
this partition, but boot failed after leaving gptboot and
machine was restarted.
2. Find partition with both 'bootme' and 'bootonce' attributes.
3. Remove 'bootme' attribute.
4. Try to execute /boot/loader or /boot/kernel/kernel from that
partition. If succeeded we stop here.
5. If execution failed, remove 'bootonce' and set 'bootfailed'.
6. Go to 2.
If whole boot succeeded there is new /etc/rc.d/gptboot script coming
that will log all partitions that we failed to boot from (the ones with
'bootfailed' attribute) and will remove this attribute. It will also
find partition with 'bootonce' attribute - this is the partition we
booted from successfully. The script will log success and remove the
attribute.
All the GPT updates we do here goes to both primary and backup GPT if
they are valid. We don't touch headers or partition tables when
checksum doesn't match.
Reviewed by: arch (Message-ID: <20100917234542.GE1902@garage.freebsd.pl>)
Obtained from: Wheel Systems Sp. z o.o. http://www.wheelsystems.com
MFC after: 2 weeks
Before the change it wasn't possible and the following error was printed:
ZFS: can only boot from disk, mirror or raidz vdevs
Now if the original vdev (the one we are replacing) is still present we will
read from it, but if it is not present we won't read from the new vdev, as it
might not have enough valid data yet.
MFC after: 2 weeks
This fixes booting from a ZFS mirror with a unavailable primary device.
PR: kern/148655
Reviewed by: avg
Approved by: delphij (mentor)
MFC after: 3 days
- use correct size (512) while reading a gang block
- skip holes while reading child blocks
- advance buffer pointer while reading child blocks
PR: 144214
MFC after: 10 days