57aa9163fd
When we get a device departed message from the firmware, we send a TARGET_REST to the device to let the firmware know we're done and as part of the recovery process. This will abort all the commands. While the documentation says the IOC is responsible for writing the completion message for all the commands pending with an aborted status, we sometimes have queued commands for the target that haven't been completed so are in the INQUEUE state. So, when we later complete the pending CCB as aborted, these commands are freed and we hit the "state not busy" panic. Elsewhere where we dequeue commands, we move the state to BUSY from INQUEUE. Do that here as well. In talking to Ken, Scott and Justin, they recommended a series of tests to see if this is 100% safe. Those tests are ongoing, but preliminary tests suggest this is safe as we see no duplicate completions when we hit this case at work. We have a machine that has a dodgy powersupply which usually doesn't apply power to a few drives, but sometimes does when the machine is under heavy load so we get a rash of the connect / disconnect messages over half an hour. Without this change, we'd see state not busy panic. With this change, the drives just annoyingly come and go without affecting the rest of the machine, but without a complete error injection test suite, it's hard to know if all edge cases are now covered or not. Discussed with: scottl, ken, gibbs |
||
---|---|---|
.. | ||
mpi | ||
mpr_config.c | ||
mpr_ioctl.h | ||
mpr_mapping.c | ||
mpr_mapping.h | ||
mpr_pci.c | ||
mpr_sas_lsi.c | ||
mpr_sas.c | ||
mpr_sas.h | ||
mpr_table.c | ||
mpr_table.h | ||
mpr_user.c | ||
mpr.c | ||
mprvar.h |