Follow-up on r322318 and r322319 and remove the deprecated modules.
Shift some now-unused kernel files into userspace utilities that incorporate
them. Remove references to removed GEOM classes in userspace utilities.
Reviewed by: imp (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21249
- Use new zlib headers;
- Removed z_alloc and z_free to use the common sys/dev/zlib version.
- Replace z_compressBound with compressBound from zlib.
While there, limit LZMA CFLAGS to apply only for g_uzip_lzma.c.
PR: 229763
Submitted by: Yoshihiro Ota <ota j email ne jp> (with changes,
bugs are mine)
Differential Revision: https://reviews.freebsd.org/D20271
Similar to what was done for device_printfs in r347229.
Convert g_print_bio() to a thin shim around g_format_bio(), which acts on an
sbuf; documented in g_bio.9.
Reviewed by: markj
Discussed with: rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21165
This allows to simulated disk that is responding slowly to the IO requests.
Reviewed by: markj, bcr, pjd (previous version)
Differential Revision: https://reviews.freebsd.org/D21052
If g_mirror_taste encountered an error at g_mirror_add_disk, it might
try to g_mirror_destroy the device with the G_MIRROR_DEVICE_FLAG_TASTING
flag still set. This would wait on a worker to complete the destruction
with g_mirror_try_destroy, but that function bails out if the tasting
flag is set, resulting in a deadlock. Clear the tasting flag before
trying to destroy the device.
Test Plan:
sysctl debug.fail_point.mnowait="1%return"
kyua test -k /usr/tests/sys/geom/class/mirror/Kyuafile
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20744
NANDFS has been broken for years. Remove it. The NAND drivers that
remain are for ancient parts that are no longer relevant. They are
polled, have terrible performance and just for ancient arm
hardware. NAND parts have evolved significantly from this early work
and little to none of it would be relevant should someone need to
update to support raw nand. This code has been off by default for
years and has violated the vnode protocol leading to panics since it
was committed.
Numerous posts to arch@ and other locations have found no actual users
for this software.
Relnotes: Yes
No Objection From: arch@
Differential Revision: https://reviews.freebsd.org/D20745
When it comes to megabytes of text, difference between sbuf_printf() and
sbuf_cat() becomes substantial.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
On large systems those sysctls may generate megabytes of output. Before
this change sbuf(9) code was resizing buffer by 4KB each time many times,
generating tons of TLB shootdowns. Unfortunately in this case existing
sbuf_new_for_sysctl() mechanism, supposed to help with this issue, is not
applicable, since all the sbuf writes are done in different kernel thread.
This change improves situation in two ways:
- on first sysctl call, not providing any output buffer, it sets special
sbuf drain function, just counting the data and so not needing big buffer;
- on second sysctl call it uses as initial buffer size value saved on
previous call, so that in most cases there will be no reallocation, unless
GEOM topology changed significantly.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
operations already in its queue were not being properly drained.
The GEOM framework does the queue draining, but the module needs
to wait for the draining to happen. The waiting is done by adding
a g_nop_providergone() function to wait for the I/O operations to
finish up. This change is similar to change -r345758 made to the
memory-disk driver.
Submitted by: Chuck Silvers
Tested by: Chuck Silvers
MFC after: 1 week
Sponsored by: Netflix
- Triple DES has been formally deprecated in Kerberos (RFC 8429)
and is soon to be deprecated in IPsec (RFC 8221).
- Blowfish is deprecated. FreeBSD doesn't support its successor
(Twofish).
- MD5 is generally considered a weak digest that has known attacks.
geli refuses to create new volumes using these algorithms via 'geli
init'. It also warns when attaching to existing volumes or creating
temporary volumes via 'geli onetime' . The plan is to fully remove
support for these algorithms in FreeBSD 13.
Note that none of these algorithms have ever been the default
algorithm used by geli(8). Users would have had to explicitly select
these algorithms when creating volumes in the past.
Reviewed by: cem, delphij
MFC after: 3 days
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D20344
Allow users to specify multiple dump configurations in a prioritized list.
This enables fallback to secondary device(s) if primary dump fails. E.g.,
one might configure a preference for netdump, but fallback to disk dump as a
second choice if netdump is unavailable.
This change does not list-ify netdump configuration, which is tracked
separately from ordinary disk dumps internally; only one netdump
configuration can be made at a time, for now. It also does not implement
IPv6 netdump.
savecore(8) is already capable of scanning and iterating multiple devices
from /etc/fstab or passed on the command line.
This change doesn't update the rc or loader variables 'dumpdev' in any way;
it can still be set to configure a single dump device, and rc.d/savecore
still uses it as a single device. Only dumpon(8) is updated to be able to
configure the more complicated configurations for now.
As part of revving the ABI, unify netdump and disk dump configuration ioctl
/ structure, and leave room for ipv6 netdump as a future possibility.
Backwards-compatibility ioctls are added to smooth ABI transition,
especially for developers who may not keep kernel and userspace perfectly
synced.
Reviewed by: markj, scottl (earlier version)
Relnotes: maybe
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19996
There's a race between the initialization of devsoftc.mtx (by devinit)
and the creation of the geom worker thread g_run_events, which calls
devctl_queue_data_f. Both of those are initialized at SI_SUB_DRIVERS
and SI_ORDER_FIRST, which means the geom worked thread can be created
before the mutex has been initialized, leading to the panic below:
wpanic: mtx_lock() of spin mutex (null) @ /usr/home/osstest/build.135317.build-amd64-freebsd/freebsd/sys/kern/subr_bus.c:620
cpuid = 3
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe003b968710
vpanic() at vpanic+0x19d/frame 0xfffffe003b968760
panic() at panic+0x43/frame 0xfffffe003b9687c0
__mtx_lock_flags() at __mtx_lock_flags+0x145/frame 0xfffffe003b968810
devctl_queue_data_f() at devctl_queue_data_f+0x6a/frame 0xfffffe003b968840
g_dev_taste() at g_dev_taste+0x463/frame 0xfffffe003b968a00
g_load_class() at g_load_class+0x1bc/frame 0xfffffe003b968a30
g_run_events() at g_run_events+0x197/frame 0xfffffe003b968a70
fork_exit() at fork_exit+0x84/frame 0xfffffe003b968ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003b968ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 13 tid 100029 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why
Fix this by initializing geom at SI_ORDER_SECOND instead of
SI_ORDER_FIRST.
Sponsored by: Citrix Systems R&D
Reviewed by: kevans, markj
Differential revision: https://reviews.freebsd.org/D20148
destroy_dev_sched_cb() is excessively asynchronous, and during media change
retaste new provider may appear sooner then device of the previous one get
destroyed.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
provider grows, GELI will expand automatically and will move the metadata
to the new location of the last sector.
This functionality is turned on by default. It can be turned off with the
-R flag, but it is not recommended - if the underlying provider grows and
automatic expansion is turned off, it won't be possible to attach this
provider again, as the metadata is no longer located in the last sector.
If the automatic expansion is turned off and the underlying provider grows,
GELI will only log a message with the previous size of the provider, so
recovery can be easier.
Obtained from: Fudo Security
providers mediasize changes.
While here, use GEOM nomenclature to describe providers instead of calling
them device nodes.
Obtained from: Fudo Security
Tested in: AWS
While geom_flashmap has always supported label names for its slices, it does
so by appending "s.labelname" to the provider device name, meaning you still
have to know the name and unit of the hardware device to use the labels.
These changes add support for device-independent geom_flashmap labels, using
the standard geom_label infrastructure. geom_flashmap now creates a softc
struct attached to its geom, and as it creates slices it stores the label
into an array in the softc. The new geom_label_flashmap uses those labels
when tasting a geom_flashmap provider.
Differential Revision: https://reviews.freebsd.org/D19535
In revision 254095, gpt_entries is not set to match the on-disk
hdr_entries, but rather is computed based on available space.
There are 2 problems with this:
1. The GPT backend respects hdr_entries and only reads and writes
that number of partition entries. On top of that, CRC32 is
computed over the table that has hdr_entries elements. When
the common code works on what is possibly a larger number, the
behaviour becomes inconsistent and problematic. In particular,
it would be possible to add a new partition that on a reboot
isn't there anymore.
2. The calculation of gpt_entries is based on flawed assumptions.
The GPT specification does not dictate that sectors are layed
out in a particular way that the available space can be
determined by looking at LBAs. In practice, implementations
do the same thing, because there's no reason to do it any
other way. Still, GPT allows certain freedoms that can be
exploited in some form or shape if the need arises.
PR: 229977
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19438
Embedded lzma decompression library becomes a module usable by other
consumers, in addition to geom_uzip.
Most important code changes are
- removal of XZ_DEC_SINGLE define, we need the code to work
with XZ_DEC_DYNALLOC;
- xz_crc32_init() call is removed from geom_uzip, xz module handles
initialization on its own.
xz is no longer embedded into geom_uzip, instead the depend line for
the module is provided, and corresponding kernel option is added to
each MIPS kernel config file using geom_uzip.
The commit also carries unrelated cleanup by removing excess "device geom_uzip"
in places which were missed in r344479.
Reviewed by: cem, hselasky, ray, slavash (previous versions)
Sponsored by: Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D19266
MFC after: 3 weeks
The DIOCGETZONE ioctl can be used to fetch the zone list of an SMR
drive, and the caller specifies the number of entries it wants to fetch.
Clamp the caller's request to a sane limit so that a user cannot attempt
large allocations. Callers already need to invoke the ioctl multiple
times to fetch the full list in general, so there's no harm in limiting
the number of entries returned.
Fix style while here.
admbug: 807
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: asomers, ken
Tested by: ken
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19249
Otherwise a privileged user can trigger a memory allocation of
unbounded size, or an integer overflow in the subsequent
geom_alloc_copyin() call, leading to out-of-bounds accesses.
Hard-code a large limit to circumvent this problem.
admbug: 854
Reported by: Anonymous of the Shellphish Grill Team
Reviewed by: ae
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19251
gmirror's sc_flags is shared between some on-disk state and some runtime
only state. There's no real reason for that and they could probably be
split up. Until they are, locate all of the flags for the same field
nearby each other in the source, for clarity.
No functional change.
Sponsored by: Dell EMC Isilon
g_handleattr() fills out bp->bio_completed; otherwise, g_getattr()
returns an error in response to the query. This caused BIO_DELETE
support to not be propagated through stacked configurations, e.g.,
a gconcat of gmirror volumes would not handle BIO_DELETE even when
the gmirrors do. g_io_getattr() was not affected by the problem.
PR: 232676
Reported and tested by: noah.bergbauer@tum.de
MFC after: 1 week
Mutexes in I/O path there were used twice per I/O to atomically access
several variables to close and/or destroy the device on last request
completion. I found the way to fit all required info into one integer,
suitable for atomic operations. It opened race window on device close,
but addition of timeout to the msleep() there should cover it.
Profiling shows removal of significant spinning time on those mutexes
and IOPS increase from ~600K to >800K to NVMe on 72-core systems.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
I mistakenly added a lock assertion to this routine at the last minute
without confirming it was held during g_mirror_create. It isn't (it isn't
even initialized yet). Mea culpa. Access is exclusive in both callers,
just not always by that particular lock.
Reported by: lwhsu
X-MFC-With: r341840, r341674
r341674 inadvertently introduced a bug where newer mirror components being
tasted would clear the high sc_flags that are not controlled by component
metadata, such as G_MIRROR_DEVICE_FLAG_TASTING. This could plausibly expose
a small window of time during STARTING where device destruction might race
with mirror component addition, probably resulting in a crash.
Reviewed by: markj
X-MFC-With: r341674
Differential Revision: https://reviews.freebsd.org/D18521
Re-apply r341665 with format strings fixed.
If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata as
we taste components.
Like mediasize beforehand, remove some checks from g_mirror_check_metadata
which would evict valid components due to metadata that can change over a
mirror's lifetime. g_mirror_check_metadata is invoked long before we check
genid/syncid and decide which component(s) are newest and whether or not we
have quorum.
Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
component is added, first remove any known stale or inconsistent disks from
the mirrorset, rather than removing them *after* deciding we have quorum.
Check if we have quorum after removing these components.
Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all eligible, boot-time available mirror components are tasted before
RUNNING a gmirror.
Add a basic test case for STARTING -> RUNNING startup behavior around stale
genids.
PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062
If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata as
we taste components.
Like mediasize beforehand, remove some checks from g_mirror_check_metadata
which would evict valid components due to metadata that can change over a
mirror's lifetime. g_mirror_check_metadata is invoked long before we check
genid/syncid and decide which component(s) are newest and whether or not we
have quorum.
Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
component is added, first remove any known stale or inconsistent disks from
the mirrorset, rather than removing them *after* deciding we have quorum.
Check if we have quorum after removing these components.
Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all eligible, boot-time available mirror components are tasted before
RUNNING a gmirror.
When we are instructed to forget mirror components, bump the generation id
to avoid confusion with such stale components later.
Add a basic test case for STARTING -> RUNNING startup behavior around stale
genids.
PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062
superblock has a check-hash error, an error message noting the
superblock check-hash failure is printed and the mount fails. The
administrator then runs fsck to repair the filesystem and when
successful, the filesystem can once again be mounted.
This approach fails if the filesystem in question is a root filesystem
from which you are trying to boot. Here, the loader fails when trying
to access the filesystem to get the kernel to boot. So it is necessary
to allow the loader to ignore the superblock check-hash error and make
a best effort to read the kernel. The filesystem may be suffiently
corrupted that the read attempt fails, but there is no harm in trying
since the loader makes no attempt to write to the filesystem.
Once the kernel is loaded and starts to run, it attempts to mount its
root filesystem. Once again, failure means that it breaks to its prompt
to ask where to get its root filesystem. Unless you have an alternate
root filesystem, you are stuck.
Since the root filesystem is initially mounted read-only, it is
safe to make an attempt to mount the root filesystem with the failed
superblock check-hash. Thus, when asked to mount a root filesystem
with a failed superblock check-hash, the kernel prints a warning
message that the root filesystem superblock check-hash needs repair,
but notes that it is ignoring the error and proceeding. It does
mark the filesystem as needing an fsck which prevents it from being
enabled for writing until fsck has been run on it. The net effect
is that the reboot fails to single user, but at least at that point
the administrator has the tools at hand to fix the problem.
Reported by: Rick Macklem (rmacklem@)
Discussed with: Warner Losh (imp@)
Sponsored by: Netflix
handling slightly out-of-bound requests properly (r340187).
Perform range check here rather then rely on g_delete_data() to DTRT.
The g_delete_data() would always return success for requests
starting just the next byte after providers media boundary.
MFC after: 4 weeks
from setting the volume serial number. This unbreaks older boot blocks
that don't support serial numbers, and allows boot0cfg to set the serial
number itself if requested by the user.
Submitted by: lev@, yuripv@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17386
i/o into last_sector+N is handled differently for N==1 and N>1 cases to
accomodate that, so some other approach would be needed to fix DIOCGDELETE
ioctl(2).
fully beyond the end of providers media. The only exception is made
for the zero length transfers which are allowed to be just on the
boundary. Previously, any requests starting on the boundary (i.e. next
byte after the last one) have been allowed to go through.
No response from: freebsd-geom@, phk
MFC after: 1 month
GEOM's stripeoffset overflows at 4 gigabyte margin (2^32)
because of its u_int type. This leads to incorrect data in the output
generated by "sysctl kern.geom.confxml" command, "graid list" etc.
when GEOM array has volumes larger than 4G, for example.
This change does not affect ABI but changes KBI. No MFC planned.
Differential Revision: https://reviews.freebsd.org/D13426
In r332361 and r333439, two new parameters were added to geli attach
verb using gctl_get_paraml, which requires the value to be present.
This would prevent old geli(8) binary from attaching geli(4) device
as they have no knowledge about the new parameters.
Restore backward compatibility by treating the absense of these two
values as seeing the default value supplied by userland.
PR: 232595
Reviewed by: oshogbo
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17680
Track session objects in the framework, and pass handles between the
framework (OCF), consumers, and drivers. Avoid redundancy and complexity in
individual drivers by allocating session memory in the framework and
providing it to drivers in ::newsession().
Session handles are no longer integers with information encoded in various
high bits. Use of the CRYPTO_SESID2FOO() macros should be replaced with the
appropriate crypto_ses2foo() function on the opaque session handle.
Convert OCF drivers (in particular, cryptosoft, as well as myriad others) to
the opaque handle interface. Discard existing session tracking as much as
possible (quick pass). There may be additional code ripe for deletion.
Convert OCF consumers (ipsec, geom_eli, krb5, cryptodev) to handle-style
interface. The conversion is largely mechnical.
The change is documented in crypto.9.
Inspired by
https://lists.freebsd.org/pipermail/freebsd-arch/2018-January/018835.html .
No objection from: ae (ipsec portion)
Reported by: jhb
If the 'n' flag is provided the provided key number will be used to
decrypt device. This can be used combined with dryrun to verify if the key
is set correctly. This can be also used to determine which key slot we want to
change on already attached device.
Reviewed by: allanjude
Differential Revision: https://reviews.freebsd.org/D15309
Doing so introduces races which can lead to a use-after-free when
grabbing a snapshot of the GEOM mesh.
To ensure that a mirror's disk list remains stable, change its locking
protocol: both the softc lock and the topology lock are now required
to modify the list, so either lock is sufficient for traversal.
Tested by: pho
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
FAT32 partition with LBA addressing.
Reviewed by: marcel
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D15266
GEOM ELI may double ask the password during boot. Once at loader time, and
once at init time.
This happens due a module loading bug. By default GEOM ELI caches the
password in the kernel, but without the MODULE_VERSION annotation, the
kernel loads over the kernel module, even if the GEOM ELI was compiled into
the kernel. In this case, the newly loaded module
purges/invalidates/overwrites the GEOM ELI's password cache, which causes
the double asking.
MFC Note: There's a pc98 component to the original submission that is
omitted here due to pc98 removal in head. This part will need to be revived
upon MFC.
Reviewed by: imp
Submitted by: op
Obtained from: opBSD
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D14992
This will allow us to verify if passphrase and key is valid without
decrypting whole device.
Reviewed by: cem@, allanjude@
Differential Revision: https://reviews.freebsd.org/D15000
It's had a good life, but it's not really configurable and not really used.
Obtained from: opBSD (with some changes)
Differential Revision: https://reviews.freebsd.org/D14991
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
The problem is that g_access() must be called with the GEOM topology
lock held. And that gives a false impression that the lock is indeed
held across the call. But this isn't always true because many classes,
ZVOL being one of the many, need to drop the lock. It's either to
perform an I/O on the first open or to acquire a different lock (like in
g_mirror_access).
That, of course, can break many assumptions. For example,
g_slice_access() adds an extra exclusive count on the first open. As
described above, an underlying geom may drop the topology lock and that
would open a race with another thread that would also request another
extra exclusive count. In general, two consumers may be granted
incompatible accesses.
To avoid this problem the code is changed to mark a geom with special
flag before calling its access method and clear the flag afterwards. If
another thread sees that flag, then it means that the topology lock has
been dropped (either by the geom in question or downstream from it), so
it is not safe to make another access call. So, the second thread would
use g_topology_sleep() to wait until the flag is cleared and only then
would it proceed with the access.
Also see http://docs.freebsd.org/cgi/mid.cgi?809d9254-ee56-59d8-69a4-08838e985cea
PR: 225960
Reported by: asomers
Reviewed by: markj, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14533
If g_part_gpt_read() encountered a disk with bad primary and secondary
tables, it could leak memory.
Reported by: Coverity
Sponsored by: Dell EMC Isilon
to fix the memory leak that I introduced in r328426. Instead of
trying to clear up the possible memory leak in all the clients, I
ensure that it gets cleaned up in the source (e.g., ffs_sbget ensures
that memory is always freed if it returns an error).
The original change in r328426 was a bit sparse in its description.
So I am expanding on its description here (thanks cem@ and rgrimes@
for your encouragement for my longer commit messages).
In preparation for adding check hashing to superblocks, r328426 is
a refactoring of the code to get the reading/writing of the superblock
into one place. Unlike the cylinder group reading/writing which
ends up in two places (ffs_getcg/ffs_geom_strategy in the kernel
and cgget/cgput in libufs), I have the core superblock functions
just in the kernel (ffs_sbfetch/ffs_sbput in ffs_subr.c which is
already imported into utilities like fsck_ffs as well as libufs to
implement sbget/sbput). The ffs_sbfetch and ffs_sbput functions
take a function pointer to do the actual I/O for which there are
four variants:
ffs_use_bread / ffs_use_bwrite for the in-kernel filesystem
g_use_g_read_data / g_use_g_write_data for kernel geom clients
ufs_use_sa_read for the standalone code (stand/libsa/ufs.c
but not stand/libsa/ufsread.c which is size constrained)
use_pread / use_pwrite for libufs
Uses of these interfaces are in the UFS filesystem, geoms journal &
label, libsa changes, and libufs. They also permeate out into the
filesystem utilities fsck_ffs, newfs, growfs, clri, dump, quotacheck,
fsirand, fstyp, and quot. Some of these utilities should probably be
converted to directly use libufs (like dumpfs was for example), but
there does not seem to be much win in doing so.
Tested by: Peter Holm (pho@)
ffs_sbget() may return a superblock buffer even if it fails, so the
caller must be prepared to free it in this case. Moreover, when tasting
alternate superblock locations in a loop, ffs_sbget()'s readfunc
callback must free the previously allocated buffer.
Reported and tested by: pho
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D14390
If the underlying provider's physical path is null, then the gpart device's
physical path will be, too. Otherwise, it will append the partition name,
such as "/p1" or "/s1/a". This will make gpart work better with zfsd(8).
PR: 224965
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14010
If the underlying provider's physical path is null, then the geli device's
physical path will be, too. Otherwise, it will append "/eli". This will make
geli work better with zfsd(8).
PR: 224962
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D13979
Some GEOM partition tables may be destroyed with incomplete partition
entries. Guard against this with NULL checks.
Reported by: pholm,others
Reviewed by: markj
Tested by: pholm
A race in g_part_wither() can lead to I/O being performed with a freed GEOM
when the device disappears. Close the race as best as we can for now,
following the code patterns from g_part_ctl_destroy() and g_part_ctl_undo().
This also fixes a leak, as g_wither_geom() does not wither providers, it
only orphans them, so the partition entries would never get destroyed in
g_wither_washer().
Note, this is not a complete fix, it can still race with g_part_start(), the
race has merely been narrowed.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Since synchronization reads are performed by submitting a request to
the external mirror provider, we know that the request returns with an
error only when gmirror was unable to read a copy of the block from any
mirror. Thus, there is no need to retry the request from the
synchronization error handler.
Tested by: pho
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Most consumers of g_metadata_store were passing in partially unallocated
memory, resulting in stack garbage being written to disk labels. Fix them by
zeroing the memory first.
gvirstor repeated the same mistake, but in the kernel.
Also, glabel's label contained a fixed-size string that wasn't
initialized to zero.
PR: 222077
Reported by: Maxim Khitrov <max@mxcrypt.com>
Reviewed by: cem
MFC after: 3 weeks
X-MFC-With: 323314
X-MFC-With: 323338
Differential Revision: https://reviews.freebsd.org/D14164
superblock, and the kernel will fail to link when UFS is not built
in. This commit makes it depend on a small portion of FFS bits and
thereby fixes build for this situation.
This is intended as an interim bandaid, and the actual superblock
reading code should probably be made independent of UFS, so we do
not need to depend on it (see kib@'s comment in the review for
details), and we will revisit this once the superblock check hashes
are all in place.
Differential Revision: https://reviews.freebsd.org/D14092
Specifically reading is done if ffs_sbget() and writing is done
in ffs_sbput(). These functions are exported to libufs via the
sbget() and sbput() functions which then used in the various
filesystem utilities. This work is in preparation for adding
subperblock check hashes.
No functional change intended.
Reviewed by: kib
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
Differential revision: https://reviews.freebsd.org/D13837
Ths change consists of two parts.
geom_disk: deny opening a disk for writing if it's marked as
write-protected. A new disk(9) flag is added to mark write protected
disks. A possible alternative could be to add another parameter to d_open,
so that the open mode could be passed to it and the disk drivers could
make the decision internally, but the flag required less churn.
scsi_da: add a new phase of disk probing to query the all pages mode
sense page. We can determine if the disk is write protected using bit 7
of the device specific field in the mode parameter header returned by
MODE SENSE.
PR: 224037
Reviewed by: mav
MFC after: 4 weeks
Differential Revision: https://reviews.freebsd.org/D13360
We would previously just free the request BIO, which would either cause
the disk to stay stuck in the SYNCHRONIZING state, or result in
synchronization completing without having copied the block which
returned an error.
With this change, if the disk which returned an error is the only active
disk in the mirror, the synchronizing disk is kicked out. Otherwise, the
read is retried.
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
g_mirror_regular_request() may free the gmirror consumer for a disk
if that disk is being disconnected, after which we must not dereference
the consumer pointer.
CID: 1384280
X-MFC with: r327496
- BIO_FLUSH requests were dispatched to the disks directly from
g_mirror_start() rather than going through the mirror's I/O request
queue, so they could have been reordered with preceding writes.
Address this by processing such requests from the queue, avoiding
direct dispatch.
- Handling for collisions with synchronization requests was too
fine-grained and could cause reordering of writes. In particular,
BIO_ORDERED was not being honoured. Address this by effectively
freezing the request queue any time a collision with a synchronization
request occurs. The queue is unfrozen once the collision with the
first frozen request is over.
- The above-mentioned collision handling allowed reads to jump ahead
of writes to the same offset. Address this by freezing all request
types when a collision occurs, not just BIO_WRITEs and BIO_DELETEs.
Also add some more fail points for use in testing error handling.
Reviewed by: imp
MFC after: 3 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13559
are places where the "main thread" of the booting kernel (either the
thread which later becomes swapper or the thread which later becomes
init) has to stop and wait for action to take place in another thread
before continuing.
There are currently three such holds:
1. The intr_config_hooks SYSINIT waits for hooks registered via the
config_intrhook_establish function; this allows (typically) devices
which need interrupts enabled to complete their initialization to do
so before root is mounted.
2. The g_waitidle function waits for the GEOM event queue to be empty;
this ensures that all of the disks which have been attached have been
tasted before we attempt to mount root.
3. The vfs_mountroot_wait function (in addition to calling g_waitidle)
waits for holds registered via root_mount_hold; among other things, this
is used by the USB subsystem to ensure that we don't fail to mount root
if it's located on a USB disk which takes a while to probe.
The license merging in r109471 didn't take into account that licensing
could change. Just removing the 3rd clause obviates the copyright
assignment to the NetBSD Foundation.
We do have plenty of files that have two or more licensing as in this
case, so fix this properly by splitting back the licenses as they are
upstream.
Obtained from: NetBSD
Part of this file originated in NetBSD, with the original file
carrying two versions of 4-clause BSD licenses. r109471 attempted to
simplify the situation by putting both licenses together.
Meanwhile, NetBSD dropped Clauses 3 and 4 from their own license, and
eventually NetBSD got permission from the University of Utah to drop the
3rd clause.
Keep the license "simple" by dropping the third clause since both TNF,
Utah/Berkeley and phk agree in principle that it can be dropped.
Obtained from: NetBSD (ccd.c CVS 1.128, 1.138)
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
gmirror does not perform any sorting of I/O requests, so the bioq API
doesn't provide any advantages over plain TAILQs. The API also does not
provide operations needed by an upcoming change.
No functional change intended. The diff shrinks the geom_mirror.ko
text and the gmirror softc slightly.
Tested by: pho (part of a larger patch)
MFC after: 1 week
Sponsored by: Dell EMC Isilon
g_mirror_event_send() acquires the I/O queue lock to deliver a wakeup
to the worker thread, and this is done after enqueuing the event.
So it's sufficient to check the event queue before atomically releasing
the queue lock and going to sleep.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Otherwise a gmirror that has received a BIO_DELETE request will never be
marked clean (unless sc_writes overflows).
MFC after: 1 week
Sponsored by: Dell EMC Isilon
We periodically record synchronization progress in the metadata
block of the disk being synchronized; this allows an interrupted
synchronization to be resumed. However, the frequency of these
updates heavily pessimized synchronization time on some media. This
change modifies gmirror to update metadata based on a time period,
and adds a sysctl to control that period. The default value results
in a much lower update frequency and increases the completion time
for an interrupted rebuild only marginally.
Reported by: Andre Albsmeier <andre@fbsd.e4m.org>
MFC after: 3 weeks
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
A negative value can be used to suppress all prints from the gmirror
kernel code, which can be useful when attempting to trigger race
conditions using stress tests.
MFC after: 1 week
produces hex numbers for the dsn. Since that come is from EDK2, change
this for symmetry, by generating the dsn as a hex number.
Noticed by: gpart list | grep efimedia | awk -F: '{print $2;}' | \
sed -e 's/^ *//g;s/,,/,/' | grep MBR | efidp -p | efidp -f
Sponsored by: Netflix
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Initially, only tag files that use BSD 4-Clause "Original" license.
RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133
This geom does not immediately detach its consumer relying on the
wither-washer to do that. Since that happens asynchronously we may get
additional spoiling events. So, we need to account for that.
There are multiple options for fixing this issue like detaching
immediately or checking for G_CF_ORPHAN in g_slice_spoiled().
The most reliable and least intrusive fix seems to be setting
geom->softc to NULL on the first call and checking for NULL on
subsequent calls. This is something that the code did before r325227.
Reported by: David Wolfskill <david@catwhisker.org>,
O. Hartmann <o.hartmann@walstatt.org>
Tested by: David Wolfskill <david@catwhisker.org> (earlier version)
Discussed with: mav
MFC after: 1 week
X-MFC with: r325227
At present, g_slice_orphan and g_slice_spoiled destroy the softc
(struct g_slicer) even before calling g_wither_geom, so there can
be active and incoming io requests at that time and g_slice_start
can access the softc.
This commit changes the code to destroy the softc only after all
providers are closed.
While there, a couple of small cleanups.
Reported by: Ben RUBSON <ben.rubson@gmail.com>
Tested by: Ben RUBSON <ben.rubson@gmail.com>
Reviewed by: mav, smh (earlier version)
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D12809
g_mirror_destroy() is supposed to unlock the softc before indicating
success, but it wasn't doing so if the caller raced with another
thread destroying the mirror.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
When using a kernel built with the GZIO config option, dumpon -z can be
used to configure gzip compression using the in-kernel copy of zlib.
This is useful on systems with large amounts of RAM, which require a
correspondingly large dump device. Recovery of compressed dumps is also
faster since fewer bytes need to be copied from the dump device.
Because we have no way of knowing the final size of a compressed dump
until it is written, the kernel will always attempt to dump when
compression is configured, regardless of the dump device size. If the
dump is aborted because we run out of space, an error is reported on
the console.
savecore(8) is modified to handle compressed dumps and save them to
vmcore.<index>.gz, as it does when given the -z option.
A new rc.conf variable, dumpon_flags, is added. Its value is added to
the boot-time dumpon(8) invocation that occurs when a dump device is
configured in rc.conf.
Reviewed by: cem (earlier version)
Discussed with: def, rgrimes
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11723
GEOM consumer can be orphaned, and then reattach to another provider.
From a user point of view, this makes gmountver(4) work again.
Reviewed by: avg, mav
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D12228
Like r266444, g_resize_provider_event can attempt to orphan an already
orphaned geom_dev consumer. This will cause a panic in g_dev_orphan. Apply
the same fix as was applied to g_orphan_register.
Reviewed by: ae
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12469
In theory, all data access errors mean that a member is out of sync
at most. But they were treated as more serious errors to avoid the
situation where a flaky disk gets repeatedly disconnected, re-synchronized,
reconnected and then disconnected again.
ENXIO is a special error that means that the member disk disappeared,
so it should get the same handling as the GEOM orphaning event.
There is a better chance that when the disk is reconnected, it will be
a good member again.
When ENXIO happens on a read we use the exisiting G_MIRROR_BUMP_SYNCID
mechanism which means that the mirror's syncid is increased as soon
as there is a write to the mirror. That's because no data has got out
of sync yet, but the problematic memeber is disconnected, so the future
write will make it stale.
When ENXIO happens on a write we use a new G_MIRROR_BUMP_SYNCID_NOW
mechanism which means that we update the mirror metadata as soon as
possible because the problematic memeber is already behind.
Reviewed by: markj, imp
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D9463
In integrity mode, a larger logical sector (e.g., 4096 bytes) spans several
physical sectors (e.g., 512 bytes) on the backing device. Due to hash
overhead, a 4096 byte logical sector takes 8.5625 512-byte physical sectors.
This means that only 288 bytes (256 data + 32 hash) of the last 512 byte
sector are used.
The memory allocation used to store the encrypted data to be written to the
physical sectors comes from malloc(9) and does not use M_ZERO.
Previously, nothing initialized the final physical sector backing each
logical sector, aside from the hash + encrypted data portion. So 224 bytes
of kernel heap memory was leaked to every block :-(.
This patch addresses the issue by initializing the trailing portion of the
physical sector in every logical sector to zeros before use. A much simpler
but higher overhead fix would be to tag the entire allocation M_ZERO.
PR: 222077
Reported by: Maxim Khitrov <max AT mxcrypt.com>
Reviewed by: emaste
Security: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12272
the g_journal level needs to check whether it is holding a newer
copy of the block than that which exists on the disk. If so, it
needs to return its copy. If not, it should pass the request down
to the disk to fulfill. It currently considers six queues:
0) delayed queue,
1) unsent (current queue),
2) in-flight to the journal (flush queue),
3) active journal (active queue),
4) inactive journal (inactive queue), and
5) inflight to the disk (copy queue).
Checking on two of these queues is unnecessary:
0) The delayed requests should not be used for reads because they
have not yet been entered into the journal, so their value should
reflect the disk contents, not the future contents that are not
yet committed.
2) Because all the bio's in the flush queue are also found on the
active queue, there is no need to inspect the flush queue for
reads since they will be found when searching the active queue.
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Discussed with: kib
MFC after: 1 week
geom_bsd, geom_mbr and geom_sunlabel have been obsolete since Marcel
Moolenaar's geom_part was in FreeBSD 7. They haven't been in GENERIC
since FreeBSD 8. Add warning when used.
geom_vol_ffs has been obsolete since ufs support to geom_label was
committed in FreeBSD 5. It hasn't been in GENERIC since FreeBSD 5.
Add warning when used.
geom_fox has been obsolete since gmultipath was committed in FreeBSD 7.
(no warning added, since this is a very obscure class).
These will all be removed in FreeBSD 12.
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D11935
Note: Classes will be removed after MFC
No need to set any fields in the cloned device. devfs uses symlinks,
so the adev entries returned won't be presented to the drivers. Since
we don't save copies, nothing else will see them. This code came from
the old compat code, and it appears to be obsolete or never needed.
Submitted by: kib@
Differential Review: https://reviews.freebsd.org/D11919
Implement disk_add_alias to allow aliases to be added to disks. All
disk have a primary name (say "foo") can also have secondary names
(say "bar") such that all instances of "foo" also have a "bar"
alias. So if you have foo0, foo0p1, foo1, foo1s1 and foo1s1a nodes
created by the foo driver and gpart, device nodes bar0, bar0p1, bar1,
bar1s1 and bar1s1a will appear as symlinks back to the original nodes.
This generalizes to multiple aliases. However, since the unit number
follows the primary name, multiple device drivers can't create the
same aliases unless those drives coorinate the unit number space (eg
you couldn't add an alias 'disk' to both 'da' and 'ada' because it's
possible to have da0 and ada0, because 'disk0' is ambiguous).
Differential Revision: https://reviews.freebsd.org/D11873
When we're creating new providers for each of the partitions, add
aliases to the geom before we create the provider so when geom_dev
tastes the provider, the aliases are in place so the proper /dev
entries are created. So foo5p6 gets created as an alias for bar5p6
when foo is an alias for bar in the geom we're partitioning with
g_part. This also copies aliases from the container geom (eg disk) to
the label geom (the disk with GPT partitioning) so that aliases nest
properly.
Differential Revision: https://reviews.freebsd.org/D11873
Add an alias name list to geoms. Use them in geom_dev to create
aliases. Previously, geom_dev would create an device node for the name
of the geom. Now, additional nodes are created pointing back to the
primary node with make_dev_alias_p. Aliases must be in place on the
geom before any tasting occurs.
Differential Revision: https://reviews.freebsd.org/D11873
in the flush_queue:
1 2 3 4 5 6 7 8 9 10
and another 10 bio's go into the flush queue after only the first five
bio's are removed from the flush queue, the queue should look like:
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,
but because of the bug we end up with
6 11 12 13 14 15 16 17 18 19 20 7 8 9 10.
So the sequence of the bio's is damaged in the flush queue (and
therefore in the journal on disk !). This error can be triggered by
ffs_snapshot() when a block is read with readblock() and gjournal finds
this block in the broken flush queue before it goes to the correct
active queue.
The fix is to place all new blocks at the end of the queue.
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Discussed with: kib
MFC after: 1 week
system having over 4GB RAM. That's due to:
1) the limit being u_int instead of u_long like vm.kmem_size (the limit is
half of vm.kmem_size by default for amd64);
2) sysctl handler g_journal_cache_limit_sysctl() using u_int instead of u_long.
The fix is to replace u_int with u_long for the kern.geom.journal.cache.limit
sysctl variable.
PR: 198500
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Reported by: Eugene Grosbein
Discussed with: kib
MFC after: 1 week
RPI1-B, Alix and APU2 boards as well as NanoBSD with the following message:
vnode_pager_generic_getpages_done: I/O read error 5
Seems the breakage was because it was missed to include acr in glabel update.
Reported by: Peter Blok <pblok@bsd4all.org>,
madpilot, imp and trasz.
Reviewed by: trasz
Tested by: Peter Blok and madpilot.
MFC after: 3 days.
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D11365
Add -o [no]verify option to mdconfig (and document in man page.)
Implement GEOM attribute MNT::verified to ask md if the backing vnode is
verified.
Check for MNT::verified in cd9660 mount to flag the mount as MNT_VERIFIED if
the underlying device has been verified.
Reviewed by: rwatson
Approved by: sjg (mentor)
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D2902