zfsd uses a device's physical path attribute to automatically replace a
missing ZFS disk when a blank disk is inserted into the same physical
slot. Currently gmultipath passes through its underlying providers'
physical path attribute. That may cause zfsd to replace a missing
gmultipath provider with a newly arrived, single-path disk. That would
be bad.
This commit fixes that problem by simply appending "/mp" to the
underlying providers' physical path, in a manner similar to what geli
already does.
Sponsored by: Axcient
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D29941
Currently, OpenCrypto consumers can request asynchronous dispatch by
setting a flag in the cryptop. (Currently only IPSec may do this.) I
think this is a bit confusing: we (conditionally) set cryptop flags to
request async dispatch, and then crypto_dispatch() immediately examines
those flags to see if the consumer wants async dispatch. The flag names
are also confusing since they don't specify what "async" applies to:
dispatch or completion.
Add a new KPI, crypto_dispatch_async(), rather than encoding the
requested dispatch type in each cryptop. crypto_dispatch_async() falls
back to crypto_dispatch() if the session's driver provides asynchronous
dispatch. Get rid of CRYPTOP_ASYNC() and CRYPTOP_ASYNC_KEEPORDER().
Similarly, add crypto_dispatch_batch() to request processing of a tailq
of cryptops, rather than encoding the scheduling policy using cryptop
flags. Convert GELI, the only user of this interface (disabled by
default) to use the new interface.
Add CRYPTO_SESS_SYNC(), which can be used by consumers to determine
whether crypto requests will be dispatched synchronously. This is just
a helper macro. Use it instead of looking at cap flags directly.
Fix style in crypto_done(). Also get rid of CRYPTO_RETW_EMPTY() and
just check the relevant queues directly. This could result in some
unnecessary wakeups but I think it's very uncommon to be using more than
one queue per worker in a given workload, so checking all three queues
is a waste of cycles.
Reviewed by: jhb
Sponsored by: Ampere Computing
Submitted by: Klara, Inc.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28194
This fixes a failed assertion in scenario where the provider
disappears, disk_gone() gets called, and at the exact same
time something else closes the device node triggering a retaste.
Reviewed By: mav
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27330
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
On 32-bit platforms, the computed size of the BIO_SPEEDUP requested by
softdep_request_cleanup() may be negative when assigned to bp->b_bcount,
which has type "long".
Clamp the size to LONG_MAX. Also convert the unused g_io_speedup() to
use an off_t for the magnitude of the shortage for consistency with
softdep_send_speedup().
Reviewed by: chs, kib
Reported by: pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27081
Nothing implements this in the tree. Remove the ioctl and the
conversion to the geom atttribute stuff.
This was introduced in r94287 in 2002 and was retired in r113390
2003. It appeared in FreeBSD 5.0, but no other releases. This is a
vestige that was missed at the time and overlooked until now. No
compat is provided for this reason. And there's no implementation of
it today. And it was never part of a release from a stable branch.
Reviewed by: phk@
Differential Revision: https://reviews.freebsd.org/D26967
Before this GEOM passed bio pointer to transaction start, but not end.
It was irrelevant until devstat(9) got DTrace hooks, that appeared to
provide bio pointer on I/O completion, but not on submission.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
- Remove "opt_geom.h", no kernel options are used.
- Remove <sys/sysctl.h>, no sysctl functionality is used here.
- Remove <sys/bio.h>, requirements for bio moved out in r112534.
- Remove <sys/lock.h> and <sys/mutex.h>, last used by DROP_GIANT() and
PICKUP_GIANT(), which were removed in r115624.
- Remove <sys/disk.h> and <sys/kernel.h>, not used.
Reviewed by: phk, kevans (mentor)
Approved by: phk, kevans (mentor)
Differential Revision: https://reviews.freebsd.org/D26805
The kernel globals for kenv are confined to 2 files that need them and
a few that likely shouldn't (but as written the code does). Move them
from sys/systm.h to sys/kenv.h. This removed a XXX from systm.h and
cleans it up a little bit...
This is followup to r365477.
If pre-formatted device has GPT and a partition covering
last available LBAs and the device is attached using
a bridge reducing amount of LBAs, then it could be not enough
forcing GEOM to use primary GPT. Also, we should make it possible
to recover GPT and this requires either deleting or resizing the partition.
This change enables "gpart delete" and "gpart resize" commands
on corrupted GPT with following "gpart recover".
It still does not allow modifying corrupted GPT without
preliminary setting sysctl kern.geom.part.check_integrity=0
For example:
# gpart show da0
=> 34 3906963389 da0 GPT (1.8T) [CORRUPT]
34 262144 1 ms-reserved (128M)
262178 2014 - free - (1.0M)
264192 3906764943 2 freebsd-swap (1.8T)
# gpart resize -i 2 -s 3900000000 da0
# gpart recover da0
Reported by: Alex Korchmar
MFC after: 3 days
When we bring in geli into the boot loader, we are single threaded so
we don't have to worry about locking. We have no mutexes, and don't need
to use them, so comment it out.
MFC After: 3 days
There are multiple USB/SATA bridges on the market that unconditionally
cut some LBAs off connected media. This could be a problem
for pre-partitioned drives so GEOM complains and does not create
devices in /dev for slices/partitions preventing access to existing data.
We have kern.geom.part.check_integrity that allows us to correct
partitioning if changed from default 1 to 0 but it works for MBR only.
If backup copy of GPT is unavailable due to decreases number of LBAs,
kernel still does not give access to partitions and prints to dmesg:
GEOM: md0: corrupt or invalid GPT detected.
GEOM: md0: GPT rejected -- may not be recoverable.
This change makes it work for GPT too, so it created partitions in /dev
and prints to dmesg this instead:
GEOM: md0: the secondary GPT table is corrupt or invalid.
GEOM: md0: using the primary only -- recovery suggested.
Then "gpart recover" re-created backup copy of GPT
and allows further manipulations with partitions.
This change is no-op for default configuration having
kern.geom.part.check_integrity=1
Reported by: Alex Korchmar
MFC after: 3 days.
devctl_notify_f isn't needed, so retire it. The flags argument is now
unused, so rather than keep it around, retire it. Convert all old
users of it to devctl_notify(). This path no longer sleeps, so is safe
to call from any context. Since it doesn't sleep, it doesn't need to
know if it is OK to sleep or not.
Reviewed by: markj@
Differential Revision: https://reviews.freebsd.org/D26140
Use unmapped I/O for geli. Unlike most geom providers, geli needs to
manipulate data on every read or write. Previously it would always map bios.
On my 16-core, dual socket server using geli atop md(4) devices, with 512B
sectors, this change increases geli IOPs by about 3x.
Note that geli still can't use unmapped I/O when data integrity verification
is enabled (but it could, with a little more work). And it can't use
unmapped I/O in combination with ZFS, because ZFS uses mapped bios.
Reviewed by: markj, kib, jhb, mjg, mat, bcr (manpages)
MFC after: 1 week
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D25671
Introduce G_PART_ALIAS_SOLARIS_RESERVED, GPT_ENT_TYPE_SOLARIS_RESERVED et al.,
to make gpart show output more convenient on systems with illumos/openindiana
disks visible.
Submitted by: Juraj Lutter <otis AT sk.FreeBSD.org>
Reviewed by: bcr(manpages), delphij, myself
Differential Revision: https://reviews.freebsd.org/D26012
The caller from kernel is expected to provide an valid g_class
pointer, instead of traversing the global g_class list, just
use that pointer directly instead.
Reviewed by: mav
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25811
The two classes do not take any verbs and always gctl_error for
all requests, so don't bother to provide a ctlreq handler.
Reviewed by: mav
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25810
routines out.
While there, also simplify the creation of label paths a little bit
by requiring the / suffix for label directory prefixes (ld_dir renamed
to ld_dirprefix to indicate the change) and stop defining macros for
these when they are only used once.
Reviewed by: cem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25597
geli does all of its crypto operations in a separate thread pool, so
g_eli_start, g_eli_read_done, and g_eli_write_done don't actually do very
much work. Enabling direct dispatch eliminates the g_up/g_down bottlenecks,
doubling IOPs on my system. This change does not affect the thread pool.
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D25587
Take advantage of Warner's nice new real GEOM aliasing system and use it for
aliased partition names that actually work.
Our canonical EBR partition name is the weird, not-default-on-x86-prior-to-
this-revision "da1p4+00001234." However, if compatibility mode (tunable
kern.geom.part.ebr.compat_aliases) is enabled (1, default), we continue to
provide the alias names like "da1p5" in addition to the weird canonical
names.
Naming partition providers was just one aspect of the COMPAT knob; in
addition it limited mutability, in part because it did not preserve existing
EBR header content aside from that of LBA 0. This change saves the EBR
header for LBA 0, as well as for every EBR partition encountered. That way,
when we write out the EBR partition table on modification, we can restore
any bootloader or other metadata in both LBA0 (the first data-containing EBR
may start after 0) as well as every logical EBR we read from the disk, and
only update the geometry metadata and linked list pointers that describe the
actual partitioning.
(This change does not add support for the 'bootcode' verb to EBR.)
PR: 232463
Reported by: Manish Jain <bourne.identity AT hotmail.com>
Discussed with: ae (no objection)
Relnotes: maybe
Differential Revision: https://reviews.freebsd.org/D24939
These bzero's should have been explicit_bzero's.
Reviewed by: cem, delphij
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25437
In addition to reducing lines of code, this also ensures that the full
allocation is always zeroed avoiding possible bugs with incorrect
lengths passed to explicit_bzero().
Suggested by: cem
Reviewed by: cem, delphij
Approved by: csprng (cem)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25435
fs_summary_info structure. This change was originally done
by the CheriBSD project as they need larger pointers that
do not fit in the existing superblock.
This cleanup of the superblock eases the task of the commit
that immediately follows this one.
Suggested by: brooks
Reviewed by: kib
PR: 246983
Sponsored by: Netflix
Use this in GELI to print out a different message when accelerated
software such as AESNI is used vs plain software crypto.
While here, simplify the logic in GELI a bit for determing which type
of crypto driver was chosen the first time by examining the
capabilities of the matched driver after a single call to
crypto_newsession rather than making separate calls with different
flags.
Reviewed by: delphij
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D25126
For synthetic aliases (just pseudonyms inferred from metadata like GPT or
UFS labels, GPT UUIDs, etc), use the GEOM provider aliasing system to create
a symlink to the real device instead of creating an independent device.
This makes it more clear which labels and devices correspond, and we can
safely have multiple labels to a single device accessed at once.
The confusingly named geom_label on-disk construct continues to behave
identically to how it did before.
This requires teaching GEOM's provider aliasing about the possibility
that aliases might be added later in time, and GEOM's devfs interaction
layer not to worry about existing aliases during retaste.
Discussed with: imp
Relnotes: sure, if we don't end up reverting it
Differential Revision: https://reviews.freebsd.org/D24968
This allows partitions to create additional aliases of their own. The
default method implementations preserve the existing behavior.
No functional change.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D24938
During any kind of shutdown, kern_reboot calls geli's pre_sync event hook,
which tries to destroy all unused geli devices. But during a panic, geli
can't destroy any devices, because the scheduler is stopped, so it can't
switch threads. A livelock results, and the system never dumps core.
This commit fixes the problem by refusing to destroy any devices during
panic, used or otherwise.
PR: 246207
Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D24697
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.
The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.
There are two cases for disk I/O errors:
- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.
- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.
For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.
This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.
Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.
Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.
The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.
Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.
Coauthered by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088
Some crypto consumers such as GELI and KTLS for file-backed sendfile
need to store their output in a separate buffer from the input.
Currently these consumers copy the contents of the input buffer into
the output buffer and queue an in-place crypto operation on the output
buffer. Using a separate output buffer avoids this copy.
- Create a new 'struct crypto_buffer' describing a crypto buffer
containing a type and type-specific fields. crp_ilen is gone,
instead buffers that use a flat kernel buffer have a cb_buf_len
field for their length. The length of other buffer types is
inferred from the backing store (e.g. uio_resid for a uio).
Requests now have two such structures: crp_buf for the input buffer,
and crp_obuf for the output buffer.
- Consumers now use helper functions (crypto_use_*,
e.g. crypto_use_mbuf()) to configure the input buffer. If an output
buffer is not configured, the request still modifies the input
buffer in-place. A consumer uses a second set of helper functions
(crypto_use_output_*) to configure an output buffer.
- Consumers must request support for separate output buffers when
creating a crypto session via the CSP_F_SEPARATE_OUTPUT flag and are
only permitted to queue a request with a separate output buffer on
sessions with this flag set. Existing drivers already reject
sessions with unknown flags, so this permits drivers to be modified
to support this extension without requiring all drivers to change.
- Several data-related functions now have matching versions that
operate on an explicit buffer (e.g. crypto_apply_buf,
crypto_contiguous_subsegment_buf, bus_dma_load_crp_buf).
- Most of the existing data-related functions operate on the input
buffer. However crypto_copyback always writes to the output buffer
if a request uses a separate output buffer.
- For the regions in input/output buffers, the following conventions
are followed:
- AAD and IV are always present in input only and their
fields are offsets into the input buffer.
- payload is always present in both buffers. If a request uses a
separate output buffer, it must set a new crp_payload_start_output
field to the offset of the payload in the output buffer.
- digest is in the input buffer for verify operations, and in the
output buffer for compute operations. crp_digest_start is relative
to the appropriate buffer.
- Add a crypto buffer cursor abstraction. This is a more general form
of some bits in the cryptosoft driver that tried to always use uio's.
However, compared to the original code, this avoids rewalking the uio
iovec array for requests with multiple vectors. It also avoids
allocate an iovec array for mbufs and populating it by instead walking
the mbuf chain directly.
- Update the cryptosoft(4) driver to support separate output buffers
making use of the cursor abstraction.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24545