gmirror's sc_flags is shared between some on-disk state and some runtime
only state. There's no real reason for that and they could probably be
split up. Until they are, locate all of the flags for the same field
nearby each other in the source, for clarity.
No functional change.
Sponsored by: Dell EMC Isilon
g_handleattr() fills out bp->bio_completed; otherwise, g_getattr()
returns an error in response to the query. This caused BIO_DELETE
support to not be propagated through stacked configurations, e.g.,
a gconcat of gmirror volumes would not handle BIO_DELETE even when
the gmirrors do. g_io_getattr() was not affected by the problem.
PR: 232676
Reported and tested by: noah.bergbauer@tum.de
MFC after: 1 week
Mutexes in I/O path there were used twice per I/O to atomically access
several variables to close and/or destroy the device on last request
completion. I found the way to fit all required info into one integer,
suitable for atomic operations. It opened race window on device close,
but addition of timeout to the msleep() there should cover it.
Profiling shows removal of significant spinning time on those mutexes
and IOPS increase from ~600K to >800K to NVMe on 72-core systems.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
I mistakenly added a lock assertion to this routine at the last minute
without confirming it was held during g_mirror_create. It isn't (it isn't
even initialized yet). Mea culpa. Access is exclusive in both callers,
just not always by that particular lock.
Reported by: lwhsu
X-MFC-With: r341840, r341674
r341674 inadvertently introduced a bug where newer mirror components being
tasted would clear the high sc_flags that are not controlled by component
metadata, such as G_MIRROR_DEVICE_FLAG_TASTING. This could plausibly expose
a small window of time during STARTING where device destruction might race
with mirror component addition, probably resulting in a crash.
Reviewed by: markj
X-MFC-With: r341674
Differential Revision: https://reviews.freebsd.org/D18521
Re-apply r341665 with format strings fixed.
If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata as
we taste components.
Like mediasize beforehand, remove some checks from g_mirror_check_metadata
which would evict valid components due to metadata that can change over a
mirror's lifetime. g_mirror_check_metadata is invoked long before we check
genid/syncid and decide which component(s) are newest and whether or not we
have quorum.
Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
component is added, first remove any known stale or inconsistent disks from
the mirrorset, rather than removing them *after* deciding we have quorum.
Check if we have quorum after removing these components.
Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all eligible, boot-time available mirror components are tasted before
RUNNING a gmirror.
Add a basic test case for STARTING -> RUNNING startup behavior around stale
genids.
PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062
If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata as
we taste components.
Like mediasize beforehand, remove some checks from g_mirror_check_metadata
which would evict valid components due to metadata that can change over a
mirror's lifetime. g_mirror_check_metadata is invoked long before we check
genid/syncid and decide which component(s) are newest and whether or not we
have quorum.
Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
component is added, first remove any known stale or inconsistent disks from
the mirrorset, rather than removing them *after* deciding we have quorum.
Check if we have quorum after removing these components.
Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all eligible, boot-time available mirror components are tasted before
RUNNING a gmirror.
When we are instructed to forget mirror components, bump the generation id
to avoid confusion with such stale components later.
Add a basic test case for STARTING -> RUNNING startup behavior around stale
genids.
PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062
superblock has a check-hash error, an error message noting the
superblock check-hash failure is printed and the mount fails. The
administrator then runs fsck to repair the filesystem and when
successful, the filesystem can once again be mounted.
This approach fails if the filesystem in question is a root filesystem
from which you are trying to boot. Here, the loader fails when trying
to access the filesystem to get the kernel to boot. So it is necessary
to allow the loader to ignore the superblock check-hash error and make
a best effort to read the kernel. The filesystem may be suffiently
corrupted that the read attempt fails, but there is no harm in trying
since the loader makes no attempt to write to the filesystem.
Once the kernel is loaded and starts to run, it attempts to mount its
root filesystem. Once again, failure means that it breaks to its prompt
to ask where to get its root filesystem. Unless you have an alternate
root filesystem, you are stuck.
Since the root filesystem is initially mounted read-only, it is
safe to make an attempt to mount the root filesystem with the failed
superblock check-hash. Thus, when asked to mount a root filesystem
with a failed superblock check-hash, the kernel prints a warning
message that the root filesystem superblock check-hash needs repair,
but notes that it is ignoring the error and proceeding. It does
mark the filesystem as needing an fsck which prevents it from being
enabled for writing until fsck has been run on it. The net effect
is that the reboot fails to single user, but at least at that point
the administrator has the tools at hand to fix the problem.
Reported by: Rick Macklem (rmacklem@)
Discussed with: Warner Losh (imp@)
Sponsored by: Netflix
handling slightly out-of-bound requests properly (r340187).
Perform range check here rather then rely on g_delete_data() to DTRT.
The g_delete_data() would always return success for requests
starting just the next byte after providers media boundary.
MFC after: 4 weeks
from setting the volume serial number. This unbreaks older boot blocks
that don't support serial numbers, and allows boot0cfg to set the serial
number itself if requested by the user.
Submitted by: lev@, yuripv@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17386
i/o into last_sector+N is handled differently for N==1 and N>1 cases to
accomodate that, so some other approach would be needed to fix DIOCGDELETE
ioctl(2).
fully beyond the end of providers media. The only exception is made
for the zero length transfers which are allowed to be just on the
boundary. Previously, any requests starting on the boundary (i.e. next
byte after the last one) have been allowed to go through.
No response from: freebsd-geom@, phk
MFC after: 1 month
GEOM's stripeoffset overflows at 4 gigabyte margin (2^32)
because of its u_int type. This leads to incorrect data in the output
generated by "sysctl kern.geom.confxml" command, "graid list" etc.
when GEOM array has volumes larger than 4G, for example.
This change does not affect ABI but changes KBI. No MFC planned.
Differential Revision: https://reviews.freebsd.org/D13426
In r332361 and r333439, two new parameters were added to geli attach
verb using gctl_get_paraml, which requires the value to be present.
This would prevent old geli(8) binary from attaching geli(4) device
as they have no knowledge about the new parameters.
Restore backward compatibility by treating the absense of these two
values as seeing the default value supplied by userland.
PR: 232595
Reviewed by: oshogbo
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17680
Track session objects in the framework, and pass handles between the
framework (OCF), consumers, and drivers. Avoid redundancy and complexity in
individual drivers by allocating session memory in the framework and
providing it to drivers in ::newsession().
Session handles are no longer integers with information encoded in various
high bits. Use of the CRYPTO_SESID2FOO() macros should be replaced with the
appropriate crypto_ses2foo() function on the opaque session handle.
Convert OCF drivers (in particular, cryptosoft, as well as myriad others) to
the opaque handle interface. Discard existing session tracking as much as
possible (quick pass). There may be additional code ripe for deletion.
Convert OCF consumers (ipsec, geom_eli, krb5, cryptodev) to handle-style
interface. The conversion is largely mechnical.
The change is documented in crypto.9.
Inspired by
https://lists.freebsd.org/pipermail/freebsd-arch/2018-January/018835.html .
No objection from: ae (ipsec portion)
Reported by: jhb
If the 'n' flag is provided the provided key number will be used to
decrypt device. This can be used combined with dryrun to verify if the key
is set correctly. This can be also used to determine which key slot we want to
change on already attached device.
Reviewed by: allanjude
Differential Revision: https://reviews.freebsd.org/D15309
Doing so introduces races which can lead to a use-after-free when
grabbing a snapshot of the GEOM mesh.
To ensure that a mirror's disk list remains stable, change its locking
protocol: both the softc lock and the topology lock are now required
to modify the list, so either lock is sufficient for traversal.
Tested by: pho
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
FAT32 partition with LBA addressing.
Reviewed by: marcel
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D15266
GEOM ELI may double ask the password during boot. Once at loader time, and
once at init time.
This happens due a module loading bug. By default GEOM ELI caches the
password in the kernel, but without the MODULE_VERSION annotation, the
kernel loads over the kernel module, even if the GEOM ELI was compiled into
the kernel. In this case, the newly loaded module
purges/invalidates/overwrites the GEOM ELI's password cache, which causes
the double asking.
MFC Note: There's a pc98 component to the original submission that is
omitted here due to pc98 removal in head. This part will need to be revived
upon MFC.
Reviewed by: imp
Submitted by: op
Obtained from: opBSD
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14992
This will allow us to verify if passphrase and key is valid without
decrypting whole device.
Reviewed by: cem@, allanjude@
Differential Revision: https://reviews.freebsd.org/D15000
It's had a good life, but it's not really configurable and not really used.
Obtained from: opBSD (with some changes)
Differential Revision: https://reviews.freebsd.org/D14991
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
The problem is that g_access() must be called with the GEOM topology
lock held. And that gives a false impression that the lock is indeed
held across the call. But this isn't always true because many classes,
ZVOL being one of the many, need to drop the lock. It's either to
perform an I/O on the first open or to acquire a different lock (like in
g_mirror_access).
That, of course, can break many assumptions. For example,
g_slice_access() adds an extra exclusive count on the first open. As
described above, an underlying geom may drop the topology lock and that
would open a race with another thread that would also request another
extra exclusive count. In general, two consumers may be granted
incompatible accesses.
To avoid this problem the code is changed to mark a geom with special
flag before calling its access method and clear the flag afterwards. If
another thread sees that flag, then it means that the topology lock has
been dropped (either by the geom in question or downstream from it), so
it is not safe to make another access call. So, the second thread would
use g_topology_sleep() to wait until the flag is cleared and only then
would it proceed with the access.
Also see http://docs.freebsd.org/cgi/mid.cgi?809d9254-ee56-59d8-69a4-08838e985cea
PR: 225960
Reported by: asomers
Reviewed by: markj, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14533
If g_part_gpt_read() encountered a disk with bad primary and secondary
tables, it could leak memory.
Reported by: Coverity
Sponsored by: Dell EMC Isilon
to fix the memory leak that I introduced in r328426. Instead of
trying to clear up the possible memory leak in all the clients, I
ensure that it gets cleaned up in the source (e.g., ffs_sbget ensures
that memory is always freed if it returns an error).
The original change in r328426 was a bit sparse in its description.
So I am expanding on its description here (thanks cem@ and rgrimes@
for your encouragement for my longer commit messages).
In preparation for adding check hashing to superblocks, r328426 is
a refactoring of the code to get the reading/writing of the superblock
into one place. Unlike the cylinder group reading/writing which
ends up in two places (ffs_getcg/ffs_geom_strategy in the kernel
and cgget/cgput in libufs), I have the core superblock functions
just in the kernel (ffs_sbfetch/ffs_sbput in ffs_subr.c which is
already imported into utilities like fsck_ffs as well as libufs to
implement sbget/sbput). The ffs_sbfetch and ffs_sbput functions
take a function pointer to do the actual I/O for which there are
four variants:
ffs_use_bread / ffs_use_bwrite for the in-kernel filesystem
g_use_g_read_data / g_use_g_write_data for kernel geom clients
ufs_use_sa_read for the standalone code (stand/libsa/ufs.c
but not stand/libsa/ufsread.c which is size constrained)
use_pread / use_pwrite for libufs
Uses of these interfaces are in the UFS filesystem, geoms journal &
label, libsa changes, and libufs. They also permeate out into the
filesystem utilities fsck_ffs, newfs, growfs, clri, dump, quotacheck,
fsirand, fstyp, and quot. Some of these utilities should probably be
converted to directly use libufs (like dumpfs was for example), but
there does not seem to be much win in doing so.
Tested by: Peter Holm (pho@)
ffs_sbget() may return a superblock buffer even if it fails, so the
caller must be prepared to free it in this case. Moreover, when tasting
alternate superblock locations in a loop, ffs_sbget()'s readfunc
callback must free the previously allocated buffer.
Reported and tested by: pho
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D14390
If the underlying provider's physical path is null, then the gpart device's
physical path will be, too. Otherwise, it will append the partition name,
such as "/p1" or "/s1/a". This will make gpart work better with zfsd(8).
PR: 224965
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14010
If the underlying provider's physical path is null, then the geli device's
physical path will be, too. Otherwise, it will append "/eli". This will make
geli work better with zfsd(8).
PR: 224962
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D13979
Some GEOM partition tables may be destroyed with incomplete partition
entries. Guard against this with NULL checks.
Reported by: pholm,others
Reviewed by: markj
Tested by: pholm
A race in g_part_wither() can lead to I/O being performed with a freed GEOM
when the device disappears. Close the race as best as we can for now,
following the code patterns from g_part_ctl_destroy() and g_part_ctl_undo().
This also fixes a leak, as g_wither_geom() does not wither providers, it
only orphans them, so the partition entries would never get destroyed in
g_wither_washer().
Note, this is not a complete fix, it can still race with g_part_start(), the
race has merely been narrowed.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Since synchronization reads are performed by submitting a request to
the external mirror provider, we know that the request returns with an
error only when gmirror was unable to read a copy of the block from any
mirror. Thus, there is no need to retry the request from the
synchronization error handler.
Tested by: pho
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Most consumers of g_metadata_store were passing in partially unallocated
memory, resulting in stack garbage being written to disk labels. Fix them by
zeroing the memory first.
gvirstor repeated the same mistake, but in the kernel.
Also, glabel's label contained a fixed-size string that wasn't
initialized to zero.
PR: 222077
Reported by: Maxim Khitrov <max@mxcrypt.com>
Reviewed by: cem
MFC after: 3 weeks
X-MFC-With: 323314
X-MFC-With: 323338
Differential Revision: https://reviews.freebsd.org/D14164