checks for the swap space consumption plus checks that the amount of
the free pages exceeds some limit, in case pagedeamon did not coped
with the page shortage in one of the late passes. This is wrong
because it does not account for the presence of the reclamaible pages
in the queues which are not selectable for reclaim immediately. E.g.,
on the swap-less systems, large active queue easily triggered OOM.
Instead, only raise OOM when pagedaemon is unable to produce a free
page in several back-to-back passes. Track the failed passes per
pagedaemon thread.
The number of passes to trigger OOM was selected empirically and
tested both on small (32M-64M i386 VM) and large (32G amd64)
configurations. If the specifics of the load require tuning, sysctl
vm.pageout_oom_seq sets the number of back-to-back passes which must
fail before OOM is raised. Each pass takes 1/2 of seconds. Less the
value, more sensible the pagedaemon is to the page shortage.
In future, some heuristic to calculate the value of the tunable might
be designed based on the system configuration and load. But before it
can be done, the i/o system must be fixed to reliably time-out
pagedaemon writes, even if waiting for the memory to proceed. Then,
code can account for the in-flight page-outs and postpone OOM until
all of them finished, which should reduce the need in tuning. Right
now, ignoring the in-flight writes and the counter allows to break
deadlocks due to write path doing sleepable memory allocations.
Reported by: Dmitry Sivachenko, bde, many others
Tested by: pho, bde, tuexen (arm)
Reviewed by: alc
Discussed with: bde, imp
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks