freebsd-skq/sys
Kirk McKusick 47806d1b93
Occasional cylinder-group check-hash errors were being reported on
systems running with a heavy filesystem load. Tracking down this
bug was elusive because there were actually two problems. Sometimes
the in-memory check hash was wrong and sometimes the check hash
computed when doing the read was wrong. The occurrence of either
error caused a check-hash mismatch to be reported.

The first error was that the check hash in the in-memory cylinder
group was incorrect. This error was caused by the following
sequence of events:

- We read a cylinder-group buffer and the check hash is valid.
- We update its cg_time and cg_old_time, which makes the in-memory
  check-hash value invalid, but we do not mark the cylinder group dirty.
- We do not make any other changes to the cylinder group, so we
  never mark it dirty, thus do not write it out, and hence never
  update the incorrect check hash for the in-memory buffer.
- Later, the buffer gets freed, but the page with the old incorrect
  check hash is still in the VM cache.
- Later, we read the cylinder group again, and the first page with
  the old check hash is still in the VM cache, but some other pages
  are not, so we have to do a read.
- The read does not actually get the first page from disk, but rather
  from the VM cache, resulting in the old check hash in the buffer.
- The check hash computed after doing the read does not match the old
  check hash stored in the buffer, causing the error to be printed.

The fix for this problem is to only set cg_time and cg_old_time as
the cylinder group is being written to disk. This keeps the in-memory
check-hash valid unless the cylinder group has had other modifications
which will require it to be written with a new check hash calculated.
It also requires that the check hash be recalculated in the in-memory
cylinder group when it is marked clean after doing a background write.
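
The first problem and its fix can be sketched in a few lines of
user-space C. This is only an illustration under stated assumptions:
struct toy_cg is a hypothetical, heavily reduced stand-in for the real
cylinder-group structure, every toy_* name is invented for this sketch,
and a simple FNV-1a hash stands in for the real check-hash function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for a cylinder group. */
struct toy_cg {
	uint32_t cg_ckhash;	/* check hash over the fields below */
	uint32_t cg_nfree;	/* some payload field */
	int64_t	 cg_time;	/* time last written */
};

/* One FNV-1a pass over a byte range (a stand-in, not the real hash). */
static uint32_t
toy_hash_step(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return (h);
}

/* Hash every field except cg_ckhash itself. */
static uint32_t
toy_cg_hash(const struct toy_cg *cg)
{
	uint32_t h = 2166136261u;

	h = toy_hash_step(h, &cg->cg_time, sizeof(cg->cg_time));
	h = toy_hash_step(h, &cg->cg_nfree, sizeof(cg->cg_nfree));
	return (h);
}

/* Does the stored check hash match the current contents? */
static int
toy_cg_valid(const struct toy_cg *cg)
{
	return (cg->cg_ckhash == toy_cg_hash(cg));
}

/* Old behaviour: stamp the time on access, leaving the hash stale. */
static void
toy_touch_old(struct toy_cg *cg, int64_t now)
{
	cg->cg_time = now;
}

/* New behaviour: stamp the time only at write time, then rehash. */
static void
toy_write_new(struct toy_cg *cg, int64_t now)
{
	cg->cg_time = now;
	cg->cg_ckhash = toy_cg_hash(cg);
	assert(toy_cg_valid(cg));
}
```

Calling toy_touch_old() on a group whose hash was valid leaves
toy_cg_valid() returning 0, which mirrors the stale in-memory check
hash described above. Routing the timestamp update through
toy_write_new() keeps the stored hash and the contents in step.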

The second problem was that the check hash computed at the end of the
read was incorrect because the calculation of the check hash on
completion of the read was being done too soon.

- When a read completed, we had the following call sequence:

    bufdone()
      b_ckhashcalc (calculates check hash)
      bufdone_finish()
        vfs_vmio_iodone() (replaces bogus pages with the cached ones)

- When we are reading a buffer where one or more pages are already
  in memory (but not all pages, or we wouldn't be doing the read),
  the I/O is done with bogus_page mapped in for the pages that exist
  in the VM cache. This mapping is done to avoid corrupting the
  cached pages if there is any I/O overrun. The vfs_vmio_iodone()
  function is responsible for replacing the bogus_page(s) with the
  cached ones. But we were calculating the check hash before the
  bogus_page(s) were replaced. Hence, when we were calculating the
  check hash, we were partly reading from bogus_page, which means
  we calculated a bad check hash (e.g., because multiple pages have
  been mapped to bogus_page, so its contents are indeterminate).

The second fix is to move the check-hash calculation from bufdone()
to bufdone_finish() after the call to vfs_vmio_iodone() so that it
computes the check hash over the correct set of pages.
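
The ordering problem can be modelled in user-space C. This is a
minimal sketch, not the kernel code: struct toy_buf, toy_bogus_page and
all toy_* functions are hypothetical names invented here, pages are
shrunk to 8 bytes, and FNV-1a stands in for the real check hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	TOY_PAGE_SIZE	8
#define	TOY_NPAGES	2

/* Shared scratch page; it holds whatever the last I/O left in it. */
static unsigned char toy_bogus_page[TOY_PAGE_SIZE];

struct toy_buf {
	unsigned char *pages[TOY_NPAGES];	/* pages mapped for the I/O */
	unsigned char *cached[TOY_NPAGES];	/* non-NULL if already cached */
};

/* FNV-1a over all currently mapped pages (a stand-in hash). */
static uint32_t
toy_buf_hash(const struct toy_buf *bp)
{
	uint32_t h = 2166136261u;

	for (int i = 0; i < TOY_NPAGES; i++) {
		for (int j = 0; j < TOY_PAGE_SIZE; j++) {
			h ^= bp->pages[i][j];
			h *= 16777619u;
		}
	}
	return (h);
}

/* Plays the role of vfs_vmio_iodone(): swap bogus mappings back. */
static void
toy_vmio_iodone(struct toy_buf *bp)
{
	for (int i = 0; i < TOY_NPAGES; i++) {
		if (bp->pages[i] == toy_bogus_page) {
			assert(bp->cached[i] != NULL);
			bp->pages[i] = bp->cached[i];
		}
	}
}

/* Old ordering: hash first, so the bogus page is hashed. */
static uint32_t
toy_bufdone_old(struct toy_buf *bp)
{
	uint32_t h = toy_buf_hash(bp);

	toy_vmio_iodone(bp);
	return (h);
}

/* New ordering: restore the cached pages, then hash. */
static uint32_t
toy_bufdone_new(struct toy_buf *bp)
{
	toy_vmio_iodone(bp);
	return (toy_buf_hash(bp));
}
```

With the first page mapped to toy_bogus_page during the read,
toy_bufdone_old() hashes the scratch contents while toy_bufdone_new()
hashes the real cached page, so only the latter agrees with a hash
taken over the true buffer contents.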

With these two changes, the occasional cylinder-group check-hash
errors are gone.

Submitted by: David Pfitzner <dpfitzner@netflix.com>
Reviewed by: kib
Tested by: David Pfitzner
2018-02-06 00:19:46 +00:00
amd64 Additional linuxolator whitespace cleanup, missed in r328890 2018-02-05 18:39:06 +00:00
arm Implement mitigation for Spectre version 2 attacks on ARMv7. 2018-01-27 11:19:41 +00:00
arm64 Only promote userspace mappings to superpages. This was dropped in r328510, 2018-02-01 14:26:26 +00:00
bsm sys: further adoption of SPDX licensing ID tags. 2017-11-20 19:43:44 +00:00
cam Do the book-keeping on release before we release the reference. The 2018-01-29 18:07:14 +00:00
cddl zfs: move a utility function, ioflags, closer to its consumers 2018-02-05 14:19:36 +00:00
compat Linuxolator whitespace cleanup 2018-02-05 17:29:12 +00:00
conf Move signal trampolines out of locore.s into separate source file. 2018-02-06 00:02:30 +00:00
contrib MFV r328490: Update libfdt to github:f1879e1 2018-01-27 21:25:45 +00:00
crypto ccp(4): Store IV in output buffer in GCM software fallback when requested 2018-01-27 07:41:31 +00:00
ddb Implement 'domainset', a cpuset based NUMA policy mechanism. This allows 2018-01-12 22:48:23 +00:00
dev bwn(4): migrate bwn(4) to the native bhnd(9) interface, and drop siba_bwn. 2018-02-05 23:38:15 +00:00
dts Add a skeleton Clock Manager for RPi2/3, and use that from pwm 2018-01-22 07:10:30 +00:00
fs ext2fs: remove EXT4F_RO_INCOMPAT_SUPP 2018-02-05 15:14:01 +00:00
gdb sys/gdb: further adoption of SPDX licensing ID tags. 2017-11-27 15:16:59 +00:00
geom geom: don't write stack garbage in disk labels 2018-02-04 14:49:55 +00:00
gnu bwn(4): migrate bwn(4) to the native bhnd(9) interface, and drop siba_bwn. 2018-02-05 23:38:15 +00:00
i386 Move signal trampolines out of locore.s into separate source file. 2018-02-06 00:02:30 +00:00
isa Add ISA PNP tables to ISA drivers. Fix a few incidental comments. 2018-01-29 00:22:30 +00:00
kern Occasional cylinder-group check-hash errors were being reported on 2018-02-06 00:19:46 +00:00
kgssapi sys/kgssapi: general adoption of SPDX licensing ID tags. 2017-11-27 15:49:00 +00:00
libkern SPDX: fix wrong license ID tag in libkern. 2017-12-28 01:20:30 +00:00
mips Garbage collect trailing whitespace. 2018-02-05 18:06:54 +00:00
modules bwn(4): migrate bwn(4) to the native bhnd(9) interface, and drop siba_bwn. 2018-02-05 23:38:15 +00:00
net BPF: Switch to 32 bit compatible mode only when thread is 32 bit 2018-01-25 12:13:41 +00:00
net80211 net80211: sanitize input for ieee80211_output() 2017-12-30 00:40:34 +00:00
netgraph Revert r327828, r327949, r327953, r328016-r328026, r328041: 2018-01-21 15:42:36 +00:00
netinet Export tcp_always_keepalive for use by the Chelsio TOM module. 2018-01-30 23:01:37 +00:00
netinet6 Modify ip6_get_prevhdr() to be able use it safely. 2018-02-05 09:22:07 +00:00
netipsec Adopt revision 1.76 and 1.77 from NetBSD: 2018-01-24 19:48:25 +00:00
netpfil pf: Avoid warning without INVARIANTS 2018-02-01 07:52:06 +00:00
netsmb Unsign some values related to allocation. 2018-01-22 02:08:10 +00:00
nfs Do pass removing some write-only variables from the kernel. 2017-12-25 04:48:39 +00:00
nfsclient style: Remove remaining deprecated MALLOC/FREE macros 2018-01-25 22:25:13 +00:00
nfsserver sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
nlm Do pass removing some write-only variables from the kernel. 2017-12-25 04:48:39 +00:00
ofed sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
opencrypto Move per-operation data out of the csession structure. 2018-01-26 23:21:50 +00:00
powerpc Only look for L2 cache controllers for mpc85xx_cache 2018-02-04 20:07:08 +00:00
riscv Remove SFBUF_OPTIONAL_DIRECT_MAP and such hacks, replacing them across the 2018-01-19 17:46:31 +00:00
rpc Do pass removing some write-only variables from the kernel. 2017-12-25 04:48:39 +00:00
security Do pass removing some write-only variables from the kernel. 2017-12-25 04:48:39 +00:00
sparc64 Add ISA PNP tables to ISA drivers. Fix a few incidental comments. 2018-01-29 00:22:30 +00:00
sys psm(4): Add support for HP EliteBook 1040 ForcePads. 2018-01-31 21:14:59 +00:00
teken sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
tests
tools Avoid using \$. It's an unknown escape sequence. Some awks warn about 2018-01-28 05:13:08 +00:00
ufs Occasional cylinder-group check-hash errors were being reported on 2018-02-06 00:19:46 +00:00
vm On munlock(), unwire correct page. 2018-02-05 12:49:20 +00:00
x86 Expand IBRS TLA in sysctl help lines. 2018-01-31 16:54:05 +00:00
xdr sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
xen sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
Makefile Move sys/boot to stand. Fix all references to new location 2017-11-14 23:02:19 +00:00