mjg ac1eb54956 amd64: finish the tail in memset with an overlapping store
Instead of finding the exact size to fit in we can just shift the target
by -8 + tail. Doing a blind write to a previously rep stosq'ed area comes
with a penalty so do it conditionally.

Sample win on EPYC when zeroing a 257 sized buffer (tail = 1) aligned to
16 bytes:
before: 44782846 ops/s
after:  46118614 ops/s

Idea stolen from NetBSD.

Sponsored by:	The FreeBSD Foundation
2018-10-22 06:44:20 +00:00
..
2018-10-22 02:36:18 +00:00