freebsd-dev

Author	SHA1	Message	Date
Jessica Clarke	34fb1c133c	Fix intra-object buffer overread for labeled msdosfs volumes Volume labels, like directory entries, are padded with spaces and so have no NUL terminator. Whilst the MIN for the dsize argument to strlcpy ensures that the copy does not overflow the destination, strlcpy is defined to return the number of characters in the source string, regardless of the provided dsize, and so keeps reading until it finds a NUL, which likely exists somewhere within the following fields, but On CHERI with the subobject bounds enabled in the compiler this buffer overread will be detected and trap with a bounds violation. Found by: CHERI Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D32579	2021-10-27 18:38:37 +01:00
Mark Johnston	f0a08fa9f5	geom_label: Add more validation for NTFS volume tasting - Ensure that the computed MFT record size isn't negative or larger than maxphys before trying to read $Volume. - Guard against truncated records in volume metadata. - Ensure that the record length is large enough to contain the volume name. - Verify that the (UTF-16-encoded) volume name's length is a multiple of two. PR: 258833, 258914 MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-10-04 18:15:06 -04:00
Gleb Smirnoff	b984d153e0	Don't set GELI UMA zone as UMA_ZONE_NOFREE. That fixes memory leak on last GELI provider destroyed, introduced in `2dbc9a388e`. This patch was originally developed late 2019 and the flag was necessary to prevent zone drainage under memory pressure. Today, with `f09cbea31a` the UMA is fixed not to drain into reserves. Discussed with: jtl, markj Fixes: `2dbc9a388e` PR: 258787	2021-10-01 10:31:17 -07:00
Gleb Smirnoff	2dbc9a388e	Fix memory deadlock when GELI partition is used for swap. When we get low on memory, the VM system tries to free some by swapping pages. However, if we are so low on free pages that GELI allocations block, then the swapout operation cannot complete. This keeps the VM system from being able to free enough memory so the allocation can complete. To alleviate this, keep a UMA pool at the GELI layer which is used for data buffer allocation in the fast path, and reserve some of that memory for swap operations. If an IO operation is a swap, then use the reserved memory. If the allocation still fails, return ENOMEM instead of blocking. For non-swap allocations, change the default to using M_NOWAIT. In general, this should be better, since it gives upper layers a signal of the memory pressure and a chance to manage their failure strategy appropriately. However, a user can set the kern.geom.eli.blocking_malloc sysctl/tunable to restore the previous M_WAITOK strategy. Submitted by: jtl Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D24400	2021-09-28 11:23:52 -07:00
Gleb Smirnoff	c6213beff4	Add flag BIO_SWAP to mark IOs that are associated with swap. Submitted by: jtl Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D24400	2021-09-28 11:23:51 -07:00
Mark Johnston	5402baa5b5	g_label: Handle small sector sizes when tasting Make sure that the provider sector size is large enough to contain a valid label before trying to read it. We performed this check already for most label types, but not for several filesystem labels. Reported by: syzbot+f52918174cdf193ae29c@syzkaller.appspotmail.com MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-09-07 11:19:29 -04:00
Mark Johnston	9e9ba9c73d	graid: Avoid tasting devices with small sector sizes The RAID metadata parsers effectively assume a sector size of 512 bytes or larger, but md(4) devices can be created with a sector size that's any power of 2. Add some seatbelts to graid tasting routines to ensure that the requested sector(s) are large enough for the device to plausibly contain RAID metadata. Reported by: syzbot+f43583c9bf8357c8b56f@syzkaller.appspotmail.com Reported by: syzbot+537dd9f22b91b698e161@syzkaller.appspotmail.com Reported by: syzbot+51509dd48871c57c6e47@syzkaller.appspotmail.com Reported by: syzbot+c882a31037ea2a54ff63@syzkaller.appspotmail.com MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-08-31 17:09:52 -04:00
Gordon Bergling	5bdf58e196	Fix some common typos in source code comments - s/priviledged/privileged/ - s/funtion/function/ - s/doens't/doesn't/ - s/sychronization/synchronization/ MFC after: 3 days	2021-08-28 18:57:23 +02:00
Mark Johnston	645b7efd49	geom_disk: Add KMSAN checks - In g_disk_start(), verify that the data to be written is initialized according to KMSAN shadow state. - In g_disk_done(), verify that the block driver updated shadow state as expected, so as to catch sources of false positives early. Sponsored by: The FreeBSD Foundation	2021-08-11 16:33:41 -04:00
Alexander Motin	c2da954203	geom(4): Mark all sysctls as CTLFLAG_MPSAFE. This code has not used the Giant lock for a very long time. MFC after: 2 weeks	2021-08-10 20:18:46 -04:00
John Baldwin	419d406e4e	geom_vfs: Pre-allocate event for g_vfs_destroy. When an active g_vfs is orphaned due to an underlying disk going away the destroy is deferred until the filesystem is unmounted in g_vfs_done(). However, g_vfs_done() is invoked from a non-sleepable context and cannot use M_WAITOK to allocate the event. Instead, allocate the event in g_vfs_orphan() and save it in the softc to be retrieved by the last call to g_vfs_done(). Reported by: Jithesh Arakkan @ Chelsio Reviewed by: imp Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D31354	2021-07-29 17:09:23 -07:00
John Baldwin	5b5d78897c	Use a more specific type for geom_disk.d_event. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D31353	2021-07-29 16:34:46 -07:00
Warner Losh	47aeda7b70	geom_disk: use a preallocated geom_event for disk destruction. Preallocate a geom_event (using the new geom_alloc_event) when we create a disk. When we create the disk, we're going to be in a sleepable context, so we can always allocate this extra bit of memory. Then use this preallocated memory to free the disk. CAM can try to free the disk from an unsleepable context if there was I/O outstanding when the disk was destroyed (say because the SIM said it had gone away). The I/O context isn't sleepable. Rather than trying to invent a retry mechanism and making sure all the other geom_disk consumers did it properly, preallocating the event ensures that the geom_disk will be properly torn down, even when there's memory pressure when the disk departs. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30544	2021-07-23 18:08:52 -06:00
Warner Losh	380710a5c8	geom: create an API to allocate events, and use that storage to send them g_alloc_event will allocate storage for an opaque event. g_post_event_ep can use memory returned by g_alloc_event to send an event from a context that might not be able to allocate the event. Occasionally, we can allocate memory when we create an object, but not while we're destroying it. This allows one to allocate at creation time memory to use when destroying the object. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30544	2021-07-23 18:08:45 -06:00
Jessica Clarke	e77ef47d36	geom_label: Partially reinstate old sysinstall(8) workaround This partially reverts commit `af433832f7`. Since such bogus disklabels still exist in the wild, we now probe for a disklabel to decide whether to ignore the UFS partition or not; if there is a label then we use the old behaviour, and if there isn't one then we use the new behaviour. Reviewed by: cy, mckusick Differential Revision: https://reviews.freebsd.org/D31068	2021-07-21 02:51:25 +01:00
Mark Johnston	0fcafe8516	eli: Zero pad bytes that arise when certain auth algorithms are used When authentication is configured, GELI ensures that the amount of data per sector is a multiple of 16 bytes. This is done in eli_metadata_softc(). When the digest size is not a multiple of 16 bytes, this leaves some extra pad bytes at the end of every sector, and they were not being zeroed before being written to disk. In particular, this happens with the HMAC/SHA1, HMAC/RIPEMD160 and HMAC/SHA384 data authentication algorithms. This change ensures that they are zeroed before being written to disk. Reported by: KMSAN Reviewed by: delphij, asomers MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31170	2021-07-15 12:23:04 -04:00
Mark Johnston	39552dff7b	graid3: Zero the metadata block before writing Ensure that string buffers and pad bytes are zero-filled before writing graid3 metadata. Reported by: KMSAN MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-13 17:46:02 -04:00
Mark Johnston	0f09ab89cc	gconcat: Zero the metadata block before writing Ensure that string buffers and pad bytes are zero-filled before writing gconcat metadata. Also make sure to zero the full block buffer before encoding the metadata and writing. Fix some style bugs in g_concat_write_metadata() while here. Reported by: KMSAN MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-13 17:45:59 -04:00
Mark Johnston	7f053a44ae	gmirror: Zero the metadata block before writing The mirror metadata fields contain string buffers and pad bytes, neither of which was being zeroed before metadata was written to disk. Also, the metadata structure is smaller than the sector size, and in one case gmirror was failing to zero-fill the full buffer before writing. Fix these problems by pre-zeroing the metadata structure and the sector buffer. Reported by: KMSAN MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2021-07-13 17:45:57 -04:00
Jessica Clarke	af433832f7	geom_label: Remove an old sysinstall(8) workaround We removed sysinstall(8) back in 2011, so this workaround should be long since unnecessary. This workaround can end up breaking cases that are hit in the real world, such as dd'ing a small pre-built disk image to a large partition that you intend to grow on first boot and uses a UFS disk label for / in its /etc/fstab (as the only reliable thing a raw UFS image can reference). Reviewed by: imp, mckusick Differential Revision: https://reviews.freebsd.org/D30825	2021-07-05 16:15:32 +01:00
Noah Bergbauer	d575e81fbc	gconcat: Implement new online append feature Implement the "gconcat append" command which can be used to append a disk to the end of an existing gconcat device without unmounting. If the gconcat device is using the "automatic" method, i.e., stores metadata on the devices, new metadata is written to all existing components, as well as to the newly added one. Pull Request: https://github.com/freebsd/freebsd-src/pull/472 Reviewed by: imp@	2021-06-14 11:42:03 -06:00
Noah Bergbauer	56fd97660a	gconcat: Add new lock to allow modifications to the disk list in preparation for online append In addition, rename existing sc_lock to sc_append_lock Reviewed by: imp@ Pull Request: https://github.com/freebsd/freebsd-src/pull/447 Sponsored by: Netflix	2021-06-02 15:59:25 -06:00
Noah Bergbauer	e61e072f3b	gconcat: Switch array to TAILQ to prepare for online append Reviewed by: imp@ Pull Request: https://github.com/freebsd/freebsd-src/pull/447 Sponsored by: Netflix	2021-06-02 15:50:27 -06:00
Alan Somers	420dbe763f	gmultipath: make physpath distinct from the underlying providers' zfsd uses a device's physical path attribute to automatically replace a missing ZFS disk when a blank disk is inserted into the same physical slot. Currently gmultipath passes through its underlying providers' physical path attribute. That may cause zfsd to replace a missing gmultipath provider with a newly arrived, single-path disk. That would be bad. This commit fixes that problem by simply appending "/mp" to the underlying providers' physical path, in a manner similar to what geli already does. Sponsored by: Axcient MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D29941	2021-05-06 12:32:27 -06:00
Mark Johnston	2f1cfb7f63	gmirror: Pre-allocate the timeout event structure We can't call malloc(M_WAITOK) in a callout handler. Reviewed by: imp Reported by: pho Tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29223	2021-03-11 15:45:15 -05:00
Mark Johnston	68f6800ce0	opencrypto: Introduce crypto_dispatch_async() Currently, OpenCrypto consumers can request asynchronous dispatch by setting a flag in the cryptop. (Currently only IPSec may do this.) I think this is a bit confusing: we (conditionally) set cryptop flags to request async dispatch, and then crypto_dispatch() immediately examines those flags to see if the consumer wants async dispatch. The flag names are also confusing since they don't specify what "async" applies to: dispatch or completion. Add a new KPI, crypto_dispatch_async(), rather than encoding the requested dispatch type in each cryptop. crypto_dispatch_async() falls back to crypto_dispatch() if the session's driver provides asynchronous dispatch. Get rid of CRYPTOP_ASYNC() and CRYPTOP_ASYNC_KEEPORDER(). Similarly, add crypto_dispatch_batch() to request processing of a tailq of cryptops, rather than encoding the scheduling policy using cryptop flags. Convert GELI, the only user of this interface (disabled by default) to use the new interface. Add CRYPTO_SESS_SYNC(), which can be used by consumers to determine whether crypto requests will be dispatched synchronously. This is just a helper macro. Use it instead of looking at cap flags directly. Fix style in crypto_done(). Also get rid of CRYPTO_RETW_EMPTY() and just check the relevant queues directly. This could result in some unnecessary wakeups but I think it's very uncommon to be using more than one queue per worker in a given workload, so checking all three queues is a waste of cycles. Reviewed by: jhb Sponsored by: Ampere Computing Submitted by: Klara, Inc. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D28194	2021-02-08 09:19:19 -05:00
Edward Tomasz Napierala	123019739c	geom(4): make g_newprovider_event() return if G_P_WITHER is set This fixes a failed assertion in scenario where the provider disappears, disk_gone() gets called, and at the exact same time something else closes the device node triggering a retaste. Reviewed By: mav Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D27330	2020-12-29 14:29:59 +00:00
Konstantin Belousov	cd85379104	Make MAXPHYS tunable. Bump MAXPHYS to 1M. Replace MAXPHYS by runtime variable maxphys. It is initialized from MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys. Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer cache buffers exactly to atop(maxbcachebuf) (currently it is sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among them vmapbuf(), to use unaligned buffers still sized to maxphys, esp. when such buffers come from userspace (). Overall, we save a significant amount of otherwise wasted memory in b_pages[] for buffer cache buffers, while bumping MAXPHYS to the desired high value. Eliminate all direct uses of the MAXPHYS constant in kernel and driver sources, except a place which initializes maxphys. Some random (and arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted straight. Some drivers, which use MAXPHYS to size embedded structures, get a private MAXPHYS-like constant; their conversion is out of scope for this work. Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs, dev/siis, were either submitted by, or based on changes by mav. Suggested by: mav () Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D27225	2020-11-28 12:12:51 +00:00
Mateusz Guzik	d5127d1ae2	gbde: replace malloc_last_fail with a kludge This facilitates removal of malloc_last_fail without really impacting anything.	2020-11-12 20:20:57 +00:00
Mark Johnston	f44994874b	ffs: Clamp BIO_SPEEDUP length On 32-bit platforms, the computed size of the BIO_SPEEDUP requested by softdep_request_cleanup() may be negative when assigned to bp->b_bcount, which has type "long". Clamp the size to LONG_MAX. Also convert the unused g_io_speedup() to use an off_t for the magnitude of the shortage for consistency with softdep_send_speedup(). Reviewed by: chs, kib Reported by: pho Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D27081	2020-11-11 13:48:07 +00:00
Warner Losh	a3f4217ec0	Remove frontstuff Nothing implements this in the tree. Remove the ioctl and the conversion to the geom attribute stuff. This was introduced in r94287 in 2002 and was retired in r113390 in 2003. It appeared in FreeBSD 5.0, but no other releases. This is a vestige that was missed at the time and overlooked until now. No compat is provided for this reason. And there's no implementation of it today. And it was never part of a release from a stable branch. Reviewed by: phk@ Differential Revision: https://reviews.freebsd.org/D26967	2020-10-27 06:43:24 +00:00
Alexander Motin	8b220f8915	Fix asymmetry in devstat(9) calls by GEOM. Before this GEOM passed bio pointer to transaction start, but not end. It was irrelevant until devstat(9) got DTrace hooks, that appeared to provide bio pointer on I/O completion, but not on submission. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2020-10-24 21:07:10 +00:00
Robert Wing	a2b559df1e	geom_ctl.c: remove stale header files - Remove "opt_geom.h", no kernel options are used. - Remove <sys/sysctl.h>, no sysctl functionality is used here. - Remove <sys/bio.h>, requirements for bio moved out in r112534. - Remove <sys/lock.h> and <sys/mutex.h>, last used by DROP_GIANT() and PICKUP_GIANT(), which were removed in r115624. - Remove <sys/disk.h> and <sys/kernel.h>, not used. Reviewed by: phk, kevans (mentor) Approved by: phk, kevans (mentor) Differential Revision: https://reviews.freebsd.org/D26805	2020-10-20 20:59:13 +00:00
Edward Tomasz Napierala	3001e97deb	Fix fallout from r366811. PR: 250442 Reported by: lwhsu Reviewed by: mav MFC after: 2 weeks Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26855	2020-10-19 20:26:37 +00:00
Edward Tomasz Napierala	d22ff249d9	Make g_attach() return ENXIO for orphaned providers; update various classes to add missing error checking. Reviewed by: imp MFC after: 2 weeks Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26658	2020-10-18 16:24:08 +00:00
Warner Losh	bc683a89a3	Move kernel env global variables, etc to sys/kenv.h The kernel globals for kenv are confined to 2 files that need them and a few that likely shouldn't (but as written the code does). Move them from sys/systm.h to sys/kenv.h. This removed a XXX from systm.h and cleans it up a little bit...	2020-10-07 06:16:37 +00:00
Eugene Grosbein	b2b5d4c07d	geom_part: make it possible recovering broken GPT after some LBAs cut off This is followup to r365477. If pre-formatted device has GPT and a partition covering last available LBAs and the device is attached using a bridge reducing amount of LBAs, then it could be not enough forcing GEOM to use primary GPT. Also, we should make it possible to recover GPT and this requires either deleting or resizing the partition. This change enables "gpart delete" and "gpart resize" commands on corrupted GPT with following "gpart recover". It still does not allow modifying corrupted GPT without preliminary setting sysctl kern.geom.part.check_integrity=0 For example: # gpart show da0 => 34 3906963389 da0 GPT (1.8T) [CORRUPT] 34 262144 1 ms-reserved (128M) 262178 2014 - free - (1.0M) 264192 3906764943 2 freebsd-swap (1.8T) # gpart resize -i 2 -s 3900000000 da0 # gpart recover da0 Reported by: Alex Korchmar MFC after: 3 days	2020-09-17 04:39:39 +00:00
Warner Losh	0c97af56a7	We don't need the sc_ekeys_lock in standalone environment. When we bring in geli into the boot loader, we are single threaded so we don't have to worry about locking. We have no mutexes, and don't need to use them, so comment it out. MFC After: 3 days	2020-09-14 23:51:14 +00:00
Edward Tomasz Napierala	60f083efe2	Move TDP_GEOM check from userret() to ast(); this code path is quite infrequent. Reviewed by: kib No objections: mav Tested by: pho MFC after: 2 weeks Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26374	2020-09-14 10:14:03 +00:00
Eugene Grosbein	cea05ed9a9	geom_part: extend kern.geom.part.check_integrity to work on GPT There are multiple USB/SATA bridges on the market that unconditionally cut some LBAs off connected media. This could be a problem for pre-partitioned drives, so GEOM complains and does not create devices in /dev for slices/partitions, preventing access to existing data. We have kern.geom.part.check_integrity that allows us to correct partitioning if changed from default 1 to 0 but it works for MBR only. If the backup copy of GPT is unavailable due to a decreased number of LBAs, the kernel still does not give access to partitions and prints to dmesg: GEOM: md0: corrupt or invalid GPT detected. GEOM: md0: GPT rejected -- may not be recoverable. This change makes it work for GPT too, so it creates partitions in /dev and prints to dmesg this instead: GEOM: md0: the secondary GPT table is corrupt or invalid. GEOM: md0: using the primary only -- recovery suggested. Then "gpart recover" re-creates the backup copy of GPT and allows further manipulations with partitions. This change is a no-op for the default configuration having kern.geom.part.check_integrity=1 Reported by: Alex Korchmar MFC after: 3 days.	2020-09-08 22:23:53 +00:00
Mateusz Guzik	d40bc60752	geom: clean up empty lines in .c and .h files	2020-09-01 22:14:09 +00:00
Warner Losh	887611b122	Retire devctl_notify_f() devctl_notify_f isn't needed, so retire it. The flags argument is now unused, so rather than keep it around, retire it. Convert all old users of it to devctl_notify(). This path no longer sleeps, so is safe to call from any context. Since it doesn't sleep, it doesn't need to know if it is OK to sleep or not. Reviewed by: markj@ Differential Revision: https://reviews.freebsd.org/D26140	2020-08-29 04:30:06 +00:00
Alan Somers	7d874f0f36	geli: use unmapped I/O Use unmapped I/O for geli. Unlike most geom providers, geli needs to manipulate data on every read or write. Previously it would always map bios. On my 16-core, dual socket server using geli atop md(4) devices, with 512B sectors, this change increases geli IOPs by about 3x. Note that geli still can't use unmapped I/O when data integrity verification is enabled (but it could, with a little more work). And it can't use unmapped I/O in combination with ZFS, because ZFS uses mapped bios. Reviewed by: markj, kib, jhb, mjg, mat, bcr (manpages) MFC after: 1 week Sponsored by: Axcient Differential Revision: https://reviews.freebsd.org/D25671	2020-08-26 02:44:35 +00:00
Warner Losh	773e541e8d	Use devctl.h instead of bus.h to reduce newbus pollution. There's no need for these parts of the kernel to know about newbus, so narrow what is included to devctl.h for device_notify_*. Suggested by: kib@	2020-08-21 00:03:24 +00:00
Conrad Meyer	cb1480f8d4	gpart(8): Recognize apple-zfs and solaris-reserved partition ids Introduce G_PART_ALIAS_SOLARIS_RESERVED, GPT_ENT_TYPE_SOLARIS_RESERVED et al., to make gpart show output more convenient on systems with illumos/openindiana disks visible. Submitted by: Juraj Lutter <otis AT sk.FreeBSD.org> Reviewed by: bcr(manpages), delphij, myself Differential Revision: https://reviews.freebsd.org/D26012	2020-08-17 17:07:05 +00:00
John Baldwin	e2bbd168ad	Fix indentation.	2020-07-27 16:31:21 +00:00
Xin LI	a450ecfdbd	gctl_get_geom: Skip validation of g_class. The caller from kernel is expected to provide a valid g_class pointer; instead of traversing the global g_class list, just use that pointer directly. Reviewed by: mav MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D25811	2020-07-26 22:30:55 +00:00
Xin LI	178d88fa39	geom_map and geom_redboot: Remove unused ctlreq handler. The two classes do not take any verbs and always gctl_error for all requests, so don't bother to provide a ctlreq handler. Reviewed by: mav MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D25810	2020-07-26 22:30:01 +00:00
Xin LI	7201590bbf	Use snprintf instead of sprintf. MFC after: 2 weeks	2020-07-26 01:45:26 +00:00
Xin LI	795c5f365e	geom_label: Make glabel labels more trivial by separating the tasting routines out. While there, also simplify the creation of label paths a little bit by requiring the / suffix for label directory prefixes (ld_dir renamed to ld_dirprefix to indicate the change) and stop defining macros for these when they are only used once. Reviewed by: cem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D25597	2020-07-26 00:44:59 +00:00
Xin LI	fcf69f3dbc	Consistently use gctl_get_provider instead of home-grown variants. Reviewed by: cem, imp MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D25739	2020-07-22 02:15:21 +00:00
Xin LI	0ab851aac3	gctl_get_class, gctl_get_geom and gctl_get_provider: provide feedback when the requested argument is missing. Reviewed by: cem MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D25738	2020-07-22 02:14:27 +00:00
Alan Somers	aafaa8b794	Fix geli's null cipher, and add a test case PR: 247954 Submitted by: jhb (sys), asomers (tests) Reviewed by: jhb (tests), asomers (sys) MFC after: 2 weeks Sponsored by: Axcient	2020-07-21 19:18:29 +00:00
Xin LI	82b17c8e91	Fix indent for if clause. MFC after: 2 weeks	2020-07-20 01:55:19 +00:00
Xin LI	b23a7fbaab	g_concat_find_device: trim /dev/ if it is present, like other GEOM classes. Reviewed by: cem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D25596	2020-07-09 08:00:46 +00:00
Xin LI	8510f61acd	sys/geom: consistently use _PATH_DEV instead of hardcoding "/dev/". Reviewed by: cem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D25565	2020-07-09 02:52:39 +00:00
Alan Somers	6f818c1fb0	geli: enable direct dispatch geli does all of its crypto operations in a separate thread pool, so g_eli_start, g_eli_read_done, and g_eli_write_done don't actually do very much work. Enabling direct dispatch eliminates the g_up/g_down bottlenecks, doubling IOPs on my system. This change does not affect the thread pool. Reviewed by: markj MFC after: 2 weeks Sponsored by: Axcient Differential Revision: https://reviews.freebsd.org/D25587	2020-07-08 17:12:12 +00:00
Conrad Meyer	64612d4e44	geom(4): Kill GEOM_PART_EBR_COMPAT option Take advantage of Warner's nice new real GEOM aliasing system and use it for aliased partition names that actually work. Our canonical EBR partition name is the weird, not-default-on-x86-prior-to- this-revision "da1p4+00001234." However, if compatibility mode (tunable kern.geom.part.ebr.compat_aliases) is enabled (1, default), we continue to provide the alias names like "da1p5" in addition to the weird canonical names. Naming partition providers was just one aspect of the COMPAT knob; in addition it limited mutability, in part because it did not preserve existing EBR header content aside from that of LBA 0. This change saves the EBR header for LBA 0, as well as for every EBR partition encountered. That way, when we write out the EBR partition table on modification, we can restore any bootloader or other metadata in both LBA0 (the first data-containing EBR may start after 0) as well as every logical EBR we read from the disk, and only update the geometry metadata and linked list pointers that describe the actual partitioning. (This change does not add support for the 'bootcode' verb to EBR.) PR: 232463 Reported by: Manish Jain <bourne.identity AT hotmail.com> Discussed with: ae (no objection) Relnotes: maybe Differential Revision: https://reviews.freebsd.org/D24939	2020-07-01 02:16:36 +00:00
John Baldwin	6572e5ff66	Use explicit_bzero() instead of bzero() for sensitive data. Reviewed by: delphij Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25441	2020-06-25 20:25:35 +00:00
John Baldwin	b172f23dd7	Use zfree() instead of bzero() and free(). These bzero's should have been explicit_bzero's. Reviewed by: cem, delphij Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25437	2020-06-25 20:20:22 +00:00
John Baldwin	4a711b8d04	Use zfree() instead of explicit_bzero() and free(). In addition to reducing lines of code, this also ensures that the full allocation is always zeroed avoiding possible bugs with incorrect lengths passed to explicit_bzero(). Suggested by: cem Reviewed by: cem, delphij Approved by: csprng (cem) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25435	2020-06-25 20:17:34 +00:00
Kirk McKusick	9407f25df2	Optimize g_journal's superblock update by noting that the summary information is neither read nor written so it need not be written out when updating the superblock. PR: 247425 Sponsored by: Netflix	2020-06-23 21:44:00 +00:00
Baptiste Daroussin	5b990a9463	Revert r362466 Such a change should not have happened without prior discussion and review. With hat: transitioning core	2020-06-22 07:46:24 +00:00
Hans Petter Selasky	7747001b12	Improve wording to be more precise and clear. No functional change intended. s/Master Boot/Main Boot/ (also called MBR) MFC after: 1 week Sponsored by: Mellanox Technologies	2020-06-21 13:34:08 +00:00
Kirk McKusick	34816cb9ae	Move the pointers stored in the superblock into a separate fs_summary_info structure. This change was originally done by the CheriBSD project as they need larger pointers that do not fit in the existing superblock. This cleanup of the superblock eases the task of the commit that immediately follows this one. Suggested by: brooks Reviewed by: kib PR: 246983 Sponsored by: Netflix	2020-06-19 01:02:53 +00:00
John Baldwin	a3d565a118	Add a crypto capability flag for accelerated software drivers. Use this in GELI to print out a different message when accelerated software such as AESNI is used vs plain software crypto. While here, simplify the logic in GELI a bit for determining which type of crypto driver was chosen the first time by examining the capabilities of the matched driver after a single call to crypto_newsession rather than making separate calls with different flags. Reviewed by: delphij Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25126	2020-06-09 22:26:07 +00:00
Conrad Meyer	a9ca503b52	Revert r361838 Reported by: delphij	2020-06-06 14:19:16 +00:00
Conrad Meyer	5b9b571cb3	geom_label: Use provider aliasing to alias upstream geoms For synthetic aliases (just pseudonyms inferred from metadata like GPT or UFS labels, GPT UUIDs, etc), use the GEOM provider aliasing system to create a symlink to the real device instead of creating an independent device. This makes it more clear which labels and devices correspond, and we can safely have multiple labels to a single device accessed at once. The confusingly named geom_label on-disk construct continues to behave identically to how it did before. This requires teaching GEOM's provider aliasing about the possibility that aliases might be added later in time, and GEOM's devfs interaction layer not to worry about existing aliases during retaste. Discussed with: imp Relnotes: sure, if we don't end up reverting it Differential Revision: https://reviews.freebsd.org/D24968	2020-06-05 16:12:21 +00:00
Conrad Meyer	c726a670df	geom: Don't re-add duplicate aliases Reviewed by: imp (informal +1; extracted from phab 24968)	2020-06-05 16:05:09 +00:00
Conrad Meyer	b71dc87559	geom_part: Dispatch to partitions to create providers and aliases This allows partitions to create additional aliases of their own. The default method implementations preserve the existing behavior. No functional change. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D24938	2020-05-29 19:44:18 +00:00
Alan Somers	2a2306099d	geli: fix a livelock during panic During any kind of shutdown, kern_reboot calls geli's pre_sync event hook, which tries to destroy all unused geli devices. But during a panic, geli can't destroy any devices, because the scheduler is stopped, so it can't switch threads. A livelock results, and the system never dumps core. This commit fixes the problem by refusing to destroy any devices during panic, used or otherwise. PR: 246207 Reviewed by: jhb MFC after: 2 weeks Sponsored by: Axcient Differential Revision: https://reviews.freebsd.org/D24697	2020-05-27 19:13:26 +00:00
Chuck Silvers	d79ff54b5c	This commit enables a UFS filesystem to do a forcible unmount when the underlying media fails or becomes inaccessible. For example when a USB flash memory card hosting a UFS filesystem is unplugged. The strategy for handling disk I/O errors when soft updates are enabled is to stop writing to the disk of the affected file system but continue to accept I/O requests and report that all future writes by the file system to that disk actually succeed. Then initiate an asynchronous forced unmount of the affected file system. There are two cases for disk I/O errors: - ENXIO, which means that this disk is gone and the lower layers of the storage stack already guarantee that no future I/O to this disk will succeed. - EIO (or most other errors), which means that this particular I/O request has failed but subsequent I/O requests to this disk might still succeed. For ENXIO, we can just clear the error and continue, because we know that the file system cannot affect the on-disk state after we see this error. For EIO or other errors, we arrange for the geom_vfs layer to reject all future I/O requests with ENXIO just like is done when the geom_vfs is orphaned. In both cases, the file system code can just clear the error and proceed with the forcible unmount. This new treatment of I/O errors is needed for writes of any buffer that is involved in a dependency. Most dependencies are described by a structure attached to the buffer's b_dep field. But some are created and processed as a result of the completion of the dependencies attached to the buffer. Clearing of some dependencies requires a read. For example if there is a dependency that requires an inode to be written, the disk block containing that inode must be read, the updated inode copied into place in that buffer, and the buffer then written back to disk. Often the needed buffer is already in memory and can be used. 
But if it needs to be read from the disk, the read will fail, so we fabricate a buffer full of zeroes and pretend that the read succeeded. This zero'ed buffer can be updated and written back to disk. The only case where a buffer full of zeros causes the code to do the wrong thing is when reading an inode buffer containing an inode that still has an inode dependency in memory that will reinitialize the effective link count (i_effnlink) based on the actual link count (i_nlink) that we read. To handle this case we now store the i_nlink value that we wrote in the inode dependency so that it can be restored into the zero'ed buffer thus keeping the tracking of the inode link count consistent. Because applications depend on knowing when an attempt to write their data to stable storage has failed, the fsync(2) and msync(2) system calls need to return errors if data fails to be written to stable storage. So these operations return ENXIO for every call made on files in a file system where we have otherwise been ignoring I/O errors. Coauthored by: mckusick Reviewed by: kib Tested by: Peter Holm Approved by: mckusick (mentor) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24088	2020-05-25 23:47:31 +00:00
John Baldwin	9c0e3d3a53	Add support for optional separate output buffers to in-kernel crypto. Some crypto consumers such as GELI and KTLS for file-backed sendfile need to store their output in a separate buffer from the input. Currently these consumers copy the contents of the input buffer into the output buffer and queue an in-place crypto operation on the output buffer. Using a separate output buffer avoids this copy. - Create a new 'struct crypto_buffer' describing a crypto buffer containing a type and type-specific fields. crp_ilen is gone, instead buffers that use a flat kernel buffer have a cb_buf_len field for their length. The length of other buffer types is inferred from the backing store (e.g. uio_resid for a uio). Requests now have two such structures: crp_buf for the input buffer, and crp_obuf for the output buffer. - Consumers now use helper functions (crypto_use_*, e.g. crypto_use_mbuf()) to configure the input buffer. If an output buffer is not configured, the request still modifies the input buffer in-place. A consumer uses a second set of helper functions (crypto_use_output_*) to configure an output buffer. - Consumers must request support for separate output buffers when creating a crypto session via the CSP_F_SEPARATE_OUTPUT flag and are only permitted to queue a request with a separate output buffer on sessions with this flag set. Existing drivers already reject sessions with unknown flags, so this permits drivers to be modified to support this extension without requiring all drivers to change. - Several data-related functions now have matching versions that operate on an explicit buffer (e.g. crypto_apply_buf, crypto_contiguous_subsegment_buf, bus_dma_load_crp_buf). - Most of the existing data-related functions operate on the input buffer. However crypto_copyback always writes to the output buffer if a request uses a separate output buffer. 
- For the regions in input/output buffers, the following conventions are followed: - AAD and IV are always present in input only and their fields are offsets into the input buffer. - payload is always present in both buffers. If a request uses a separate output buffer, it must set a new crp_payload_start_output field to the offset of the payload in the output buffer. - digest is in the input buffer for verify operations, and in the output buffer for compute operations. crp_digest_start is relative to the appropriate buffer. - Add a crypto buffer cursor abstraction. This is a more general form of some bits in the cryptosoft driver that tried to always use uio's. However, compared to the original code, this avoids rewalking the uio iovec array for requests with multiple vectors. It also avoids allocating an iovec array for mbufs and populating it by instead walking the mbuf chain directly. - Update the cryptosoft(4) driver to support separate output buffers making use of the cursor abstraction. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24545	2020-05-25 22:12:04 +00:00
Warner Losh	ae1cce524e	Reimplement aliases in geom The alias needs to be part of the provider instead of the geom to work properly. To bind the DEV geom, we need to look at the provider's names and aliases and create the dev entries from there. If this lives in the GEOM, then it won't propagate down the tree properly. Remove it from the geom, add it to the provider. Update geli, gmountver, gnop, gpart, and guzip to use it, which handles the bulk of the uses in FreeBSD. I think this is all the providers that create a new name based on their parent's name.	2020-05-13 19:17:28 +00:00
Conrad Meyer	844b743d31	geom(4) mirror: Do not panic on gmirror(8) insert, resize Geom_mirror initialization occurs in spurts and the presence of a non-destroyed g_mirror softc does not always indicate that the geom has launched (i.e., has an sc_provider). Some gmirror(8) commands (via g_mirror_ctl) depend on a g_mirror's sc_provider (insert and resize). For those commands, g_mirror_ctl is modified to sleep-poll in an interruptible way until the target geom is either launched or destroyed. Reviewed by: markj Tested by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D24780	2020-05-11 22:39:53 +00:00
Pawel Jakub Dawidek	cefbc0d19b	Add g_topology_locked() macro that returns true if we already hold the GEOM topology lock.	2020-04-25 21:41:09 +00:00
John Baldwin	bfe26b9707	Mark eli_metadata_crypto_supported inline. This quiets warnings about it not being always used. Reported by: kevans	2020-04-15 18:27:28 +00:00
John Baldwin	e2b9919398	Remove support for geli(4) algorithms deprecated in r348206. This removes support for reading and writing volumes using the following algorithms: - Triple DES - Blowfish - MD5 HMAC integrity In addition, this commit adds an explicit whitelist of supported algorithms to give a better error message when an invalid or unsupported algorithm is used by an existing volume. Reviewed by: cem Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24343	2020-04-15 00:14:50 +00:00
Warner Losh	9cf738228d	Now that we don't have special-case geom hacking defined in md_var.h, stop including it. sparc64 was the last straggler here, but these weren't removed at the time.	2020-04-07 22:23:22 +00:00
Mark Johnston	c205ac921b	geom_journal: Only stop the switcher process if one was started. PR: 243196 MFC after: 1 week	2020-04-03 13:57:41 +00:00
John Baldwin	c034143269	Refactor driver and consumer interfaces for OCF (in-kernel crypto). - The linked list of cryptoini structures used in session initialization is replaced with a new flat structure: struct crypto_session_params. This session includes a new mode to define how the other fields should be interpreted. Available modes include: - COMPRESS (for compression/decompression) - CIPHER (for simple encryption/decryption) - DIGEST (computing and verifying digests) - AEAD (combined auth and encryption such as AES-GCM and AES-CCM) - ETA (combined auth and encryption using encrypt-then-authenticate) Additional modes could be added in the future (e.g. if we wanted to support TLS MtE for AES-CBC in the kernel we could add a new mode for that. TLS modes might also affect how AAD is interpreted, etc.) The flat structure also includes the key lengths and algorithms as before. However, code doesn't have to walk the linked list and switch on the algorithm to determine which key is the auth key vs encryption key. The 'csp_auth_*' fields are always used for auth keys and settings and 'csp_cipher_*' for cipher. (Compression algorithms are stored in csp_cipher_alg.) - Drivers no longer register a list of supported algorithms. This doesn't quite work when you factor in modes (e.g. a driver might support both AES-CBC and SHA2-256-HMAC separately but not combined for ETA). Instead, a new 'crypto_probesession' method has been added to the kobj interface for symmetric crypto drivers. This method returns a negative value on success (similar to how device_probe works) and the crypto framework uses this value to pick the "best" driver. There are three constants for hardware (e.g. ccr), accelerated software (e.g. aesni), and plain software (cryptosoft) that give preference in that order. One effect of this is that if you request only hardware when creating a new session, you will no longer get a session using accelerated software. 
Another effect is that the default setting to disallow software crypto via /dev/crypto now disables accelerated software. Once a driver is chosen, 'crypto_newsession' is invoked as before. - Crypto operations are now solely described by the flat 'cryptop' structure. The linked list of descriptors has been removed. A separate enum has been added to describe the type of data buffer in use instead of using CRYPTO_F_* flags to make it easier to add more types in the future if needed (e.g. wired userspace buffers for zero-copy). It will also make it easier to re-introduce separate input and output buffers (in-kernel TLS would benefit from this). Try to make the flags related to IV handling less insane: - CRYPTO_F_IV_SEPARATE means that the IV is stored in the 'crp_iv' member of the operation structure. If this flag is not set, the IV is stored in the data buffer at the 'crp_iv_start' offset. - CRYPTO_F_IV_GENERATE means that a random IV should be generated and stored into the data buffer. This cannot be used with CRYPTO_F_IV_SEPARATE. If a consumer wants to deal with explicit vs implicit IVs, etc. it can always generate the IV however it needs and store partial IVs in the buffer and the full IV/nonce in crp_iv and set CRYPTO_F_IV_SEPARATE. The layout of the buffer is now described via fields in cryptop. crp_aad_start and crp_aad_length define the boundaries of any AAD. Previously with GCM and CCM you defined an auth crd with this range, but for ETA your auth crd had to span both the AAD and plaintext (and they had to be adjacent). crp_payload_start and crp_payload_length define the boundaries of the plaintext/ciphertext. Modes that only do a single operation (COMPRESS, CIPHER, DIGEST) should only use this region and leave the AAD region empty. If a digest is present (or should be generated), its starting location is marked by crp_digest_start. 
Instead of using the CRD_F_ENCRYPT flag to determine the direction of the operation, cryptop now includes an 'op' field defining the operation to perform. For digests I've added a new VERIFY digest mode which assumes a digest is present in the input and fails the request with EBADMSG if it doesn't match the internally-computed digest. GCM and CCM already assumed this, and the new AEAD mode requires this for decryption. The new ETA mode now also requires this for decryption, so IPsec and GELI no longer do their own authentication verification. Simple DIGEST operations can also do this, though there are no in-tree consumers. To eventually support some refcounting to close races, the session cookie is now passed to crypto_getop() and clients should no longer set crp_session directly. - Asymmetric crypto operation structures should be allocated via crypto_getkreq() and freed via crypto_freekreq(). This permits the crypto layer to track open asym requests and close races with a driver trying to unregister while asym requests are in flight. - crypto_copyback, crypto_copydata, crypto_apply, and crypto_contiguous_subsegment now accept the 'crp' object as the first parameter instead of individual members. This makes it easier to deal with different buffer types in the future as well as separate input and output buffers. It's also simpler for driver writers to use. - bus_dmamap_load_crp() loads a DMA mapping for a crypto buffer. This understands the various types of buffers so that drivers that use DMA do not have to be aware of different buffer types. - Helper routines now exist to build an auth context for HMAC IPAD and OPAD. This reduces some duplicated work among drivers. - Key buffers are now treated as const throughout the framework and in device drivers. However, session key buffers provided when a session is created are expected to remain alive for the duration of the session. - GCM and CCM sessions now only specify a cipher algorithm and a cipher key. 
The redundant auth information is not needed or used. - For cryptosoft, split up the code a bit such that the 'process' callback now invokes a function pointer in the session. This function pointer is set based on the mode (in effect) though it simplifies a few edge cases that would otherwise be in the switch in 'process'. It does split up GCM vs CCM which I think is more readable even if there is some duplication. - I changed /dev/crypto to support GMAC requests using CRYPTO_AES_NIST_GMAC as an auth algorithm and updated cryptocheck to work with it. - Combined cipher and auth sessions via /dev/crypto now always use ETA mode. The COP_F_CIPHER_FIRST flag is now a no-op that is ignored. This was actually documented as being true in crypto(4) before, but the code had not implemented this before I added the CIPHER_FIRST flag. - I have not yet updated /dev/crypto to be aware of explicit modes for sessions. I will probably do that at some point in the future as well as teach it about IV/nonce and tag lengths for AEAD so we can support all of the NIST KAT tests for GCM and CCM. - I've split up the existing crypto.9 manpage into several pages of which many are written from scratch. - I have converted all drivers and consumers in the tree and verified that they compile, but I have not tested all of them. I have tested the following drivers: - cryptosoft - aesni (AES only) - blake2 - ccr and the following consumers: - cryptodev - IPsec - ktls_ocf - GELI (lightly) I have not tested the following: - ccp - aesni with sha - hifn - kgssapi_krb5 - ubsec - padlock - safe - armv8_crypto (aarch64) - glxsb (i386) - sec (ppc) - cesa (armv7) - cryptocteon (mips64) - nlmsec (mips64) Discussed with: cem Relnotes: yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D23677	2020-03-27 18:25:23 +00:00
John Baldwin	47172feb8d	Use the newer EINTEGRITY error when authentication fails. GELI used to fail with EINVAL when a read request spanned a disk sector whose contents did not match the sector's authentication tag. The recently-added EINTEGRITY more closely matches to the error in this case. Reviewed by: cem, mckusick MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24131	2020-03-23 21:26:32 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT. Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Pawel Biernacki	53a6215c83	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (12 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Approved by: kib (mentor, blanket) Differential Revision: https://reviews.freebsd.org/D23637	2020-02-24 10:42:56 +00:00
Kyle Evans	c81929d343	geli taste: allow GELIBOOT tagged providers as well Currently the installer will tag geliboot partitions with both BOOT and GELIBOOT; the former allows the kernel to taste it at boot, while the latter is what loaders keys off of. However, it seems reasonable to assume that if a provider's been tagged with GELIBOOT that the kernel should also take that as a hint to taste/attach at boot. This would allow us to stop tagging GELIBOOT partitions with BOOT in bsdinstall, but I'm not sure that there's a compelling reason to do so any time soon. Reviewed by: oshogbo Differential Revision: https://reviews.freebsd.org/D23387	2020-02-07 21:36:14 +00:00
Warner Losh	9133f3d097	Suppress not supported message For the moment, suppress the operation not supported messages at this level. In the fullness of time, we will have better error tracking so we can diagnose issues in the future. Reviewed by: scottl@	2020-02-07 17:47:08 +00:00
Pawel Jakub Dawidek	76b47dfb8f	The error variable is not really needed. Remove it.	2020-02-01 10:15:23 +00:00
Konstantin Belousov	fd99699d7e	Fix aggregating geoms for BIO_SPEEDUP. If the bio was split into several bios going down, completion computes bio_completed of the original bio as the sum of the bio_completed values of the splits. For BIO_SPEEDUP, bio_length means something different than the length. It is the requested speedup amount, and is duplicated into the splits, which is in fact reasonable, since we cannot know how the previous activity was distributed among subordinate geoms. Obviously, the sum of n bio_length is greater than bio_length for n > 1, which triggers the assert that bio_length >= bio_completed for e.g. geom_stripe and geom_raid3. Fix this by reassigning bio_completed from bio_length for completed BIO_SPEEDUP; I do not think it really matters what we return in bio_completed. Reported and tested by: pho Reviewed by: imp MFC after: 1 week Differential revision: https://reviews.freebsd.org/D23380	2020-01-27 13:15:16 +00:00
Conrad Meyer	151e04b3fe	GEOM label: strip leading/trailing space synthesizing devfs names %20%20%20 is ugly and doesn't really help make human-readable devfs names. PR: 243318 Reported by: Peter Eriksson <pen AT lysator.liu.se> Relnotes: yes	2020-01-18 03:33:44 +00:00
Warner Losh	3cf5dd8401	Use buf to send speedup It turns out there's a problem with using g_io to send the speedup. It leads to a race when there's a resource shortage when a disk fails. Instead, send BIO_SPEEDUP via struct buf. This is pretty straightforward, except we need to transfer the bio_flags from b_ioflags for BIO_SPEEDUP commands in g_vfs_strategy. Reviewed by: kirk, chs Differential Revision: https://reviews.freebsd.org/D23117	2020-01-17 01:16:19 +00:00
Warner Losh	8b522bdae6	Pass BIO_SPEEDUP through all the geom layers While some geom layers pass unknown commands down, not all do. For the ones that don't, pass BIO_SPEEDUP down to the providers that constitute the geom, as applicable. No changes to vinum or virstor because I was unsure how to add this support, and I'm also unsure how to test these. gvinum doesn't implement BIO_FLUSH either, so it may just be poorly maintained. gvirstor is for testing and not supporting BIO_SPEEDUP is fine. Reviewed by: chs Differential Revision: https://reviews.freebsd.org/D23183	2020-01-17 01:15:55 +00:00
Mateusz Guzik	879e0604ee	Add KERNEL_PANICKED macro for use in place of direct panicstr tests	2020-01-12 06:07:54 +00:00
Mateusz Guzik	c8b3463dd0	vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT) The previous behavior of leaving VI_OWEINACT vnodes on the active list without a hold count is eliminated. Hold count is kept and inactive processing gets explicitly deferred by setting the VI_DEFINACT flag. The syncer is then responsible for vdrop. Reviewed by: kib (previous version) Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D23036	2020-01-07 15:56:24 +00:00
Alexander Motin	0aabbeff36	Remove extra check for provider being closed. We already checked for that earlier, and since we hold topology lock it could not change. MFC after: 1 week	2020-01-02 20:30:53 +00:00
Alexander Motin	4aa1289a38	Avoid few memory accesses in g_disk_done().	2019-12-31 03:43:13 +00:00
Alexander Motin	024932aae9	Use atomic for start_count in devstat_start_transaction(). Combined with earlier nstart/nend removal it allows to remove several locks from request path of GEOM and few other places. It would be cool if we had more SMP-friendly statistics, but this helps too. Sponsored by: iXsystems, Inc.	2019-12-30 03:13:38 +00:00
Alexander Motin	9794a803fd	Retire nstart/nend counters. Those counters were abused for a decade to work around the broken orphanization process in different classes by delaying the call while there are active requests. But on one side it did not close all the races, while on the other it was quite expensive on SMP due to thrashing the cache lines of consumer and provider twice per request and requiring locks. It lost its sense after I manually went through all the GEOM classes in base and made orphanization wait for either provider close or request completion. Consumer counters are still used under INVARIANTS to detect premature consumer close and detach. Provider counters are removed completely. Sponsored by: iXsystems, Inc.	2019-12-30 00:46:10 +00:00
Alexander Motin	86c06ff886	Remove GEOM_SCHED class and gsched tool. This code has not been actively maintained since it was introduced 10 years ago. It lacks support for many later GEOM features, such as direct dispatch, unmapped I/O, stripesize/stripeoffset, resize, etc. Plus it is the only remaining use of GEOM nstart/nend request counters, used there to implement live insertion/removal, questionable by itself. Plus, as a number of people commented, GEOM is not the best place for an I/O scheduler, since it has limited information about layers both above and below it, required for efficient scheduling. Plus with the modern shift to SSDs there is just no more significant need for this kind of scheduling. Approved by: imp, phk, luigi Relnotes: yes	2019-12-29 21:16:03 +00:00
Alexander Motin	cfdb91850c	Missed part of r356162. If we postpone consumer destruction till close, then the close calls should not be ignored. Delay geom withering till the last close too. MFC after: 2 weeks X-MFC-with: r356162 Sponsored by: iXsystems, Inc.	2019-12-29 19:33:41 +00:00
Alexander Motin	1d301810d3	Fix GEOM_VIRSTOR orphanization. Previous code closed and destroyed consumer even with I/O in progress. This patch postpones the destruction till the last close. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-12-29 19:21:29 +00:00

2416 Commits