Current UMA internals are not suited for efficient operation in
multi-socket environments. In particular there is very common use of
MAXCPU arrays and other fields which are not always properly aligned and
are not local for target threads (apart from the first node of course).
Turns out the existing UMA_ALIGN macro can be used to mostly work around
the problem until the code get fixed. The current setting of 64 bytes
runs into trouble when adjacent cache line prefetcher gets to work.
An example 128-way benchmark doing a lot of malloc/frees has the following
instruction samples:
before:
kernel`lf_advlockasync+0x43b 32940
kernel`malloc+0xe5 42380
kernel`bzero+0x19 47798
kernel`spinlock_exit+0x26 60423
kernel`0xffffffff80 78238
0x0 136947
kernel`uma_zfree_arg+0x46 159594
kernel`uma_zalloc_arg+0x672 180556
kernel`uma_zfree_arg+0x2a 459923
kernel`uma_zalloc_arg+0x5ec 489910
after:
kernel`bzero+0xd 46115
kernel`lf_advlockasync+0x25f 46134
kernel`lf_advlockasync+0x38a 49078
kernel`fget_unlocked+0xd1 49942
kernel`lf_advlockasync+0x43b 55392
kernel`copyin+0x4a 56963
kernel`bzero+0x19 81983
kernel`spinlock_exit+0x26 91889
kernel`0xffffffff80 136357
0x0 239424
See the review for more details.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D15346