g_mirror: don't fail reads while losing next-to-last disk

I observed some read requests failing when a 2-way geom mirror lost one disk. The problem appears to be in the logic that skips retrying a failed read when the mirror has only one active disk. Generally, that makes sense: with a single disk there is nowhere else to retry. But during a transition from two disks to one, the request may have failed on the failing disk before that disk was deactivated, in which case the remaining active disk has not been tried yet and is the one to retry on. This change adds a check that the disk already tried is, in fact, the only active disk before giving up on the read.

Reviewed by: mav
MFC after: 3 weeks
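The decision described above can be modeled outside the kernel as a small predicate. This is only an illustrative sketch: `enum disk_state`, `struct disk`, and `read_should_fail` are hypothetical stand-ins, not the real g_mirror types or functions, and `nactive` plays the role of the active-disk count the driver obtains from `g_mirror_ndisks()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's disk states (assumption). */
enum disk_state { DISK_ACTIVE, DISK_DISCONNECTED };

/* Hypothetical, reduced view of a mirror component (assumption). */
struct disk {
	enum disk_state state;
};

/*
 * Return true when a failed read should be reported to the caller,
 * false when it should be requeued for a retry.
 *
 * nactive: number of currently active disks in the mirror.
 * tried:   the disk the failed request was sent to, or NULL if that
 *          disk is no longer attached.
 *
 * Giving up is only correct when the single remaining active disk is
 * the very disk that already failed the request.
 */
static bool
read_should_fail(int nactive, const struct disk *tried)
{
	return (nactive == 1 && tried != NULL &&
	    tried->state == DISK_ACTIVE);
}
```

With the old condition (`nactive == 1` alone), a read that failed on a disk which was then removed would be reported as failed even though the surviving disk was never asked. In this sketch that case returns false, so the request would be retried.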
2022-01-27 12:49:04 +02:00 · 2022-01-27 12:49:04 +02:00 · 5d5f44623e
commit 5d5f44623e
parent a95fcd81d5
1 changed file with 12 additions and 2 deletions
--- a/sys/geom/mirror/g_mirror.c
+++ b/sys/geom/mirror/g_mirror.c
@@ -1035,9 +1035,19 @@ g_mirror_regular_request(struct g_mirror_softc *sc, struct bio *bp)
 	case BIO_READ:
 		if (pbp->bio_inbed < pbp->bio_children)
 			break;
-		if (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) == 1)
+
+		/*
+		 * If there is only one active disk we want to double-check that
+		 * it is, in fact, the disk that we already tried.  This is
+		 * necessary because we might have just lost a race with a
+		 * removal of the tried disk (likely because of the same error)
+		 * and the only remaining disk is still viable for a retry.
+		 */
+		if (g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) == 1 &&
+		    disk != NULL &&
+		    disk->d_state == G_MIRROR_DISK_STATE_ACTIVE) {
 			g_io_deliver(pbp, pbp->bio_error);
-		else {
+		} else {
 			pbp->bio_error = 0;
 			mtx_lock(&sc->sc_queue_mtx);
 			TAILQ_INSERT_TAIL(&sc->sc_queue, pbp, bio_queue);