Disable SSE in libthr

Clang emits SSE instructions on amd64 in the common path of
pthread_mutex_unlock.  If the thread does not otherwise use SSE,
this usage incurs a context-switch of the FPU/SSE state, which
reduces the performance of multiple real-world applications by a
non-trivial amount (3-5% in one application).

Instead of this change, I experimented with eagerly switching the
FPU state at context-switch time.  This did not help.  Most of the
cost seems to be in the read/write of memory--as kib@ stated--and
not in the #NM handling.  I tested on machines with and without
XSAVEOPT.

One counter-argument to this change is that most applications already
use SIMD, and the number of applications and amount of SIMD usage
are only increasing.  This is absolutely true.  I agree that--in
general and in principle--this change is in the wrong direction.
However, there are applications that do not use enough SSE to offset
the extra context-switch cost.  SSE does not provide a clear benefit
in the current libthr code with the current compiler, but it does
provide a clear loss in some cases.  Therefore, disabling SSE in
libthr is a non-loss for most, and a gain for some.

I refrained from disabling SSE in libc--as was suggested--because
I can't make the above argument for libc.  It provides a wide variety
of code; each case should be analyzed separately.

https://lists.freebsd.org/pipermail/freebsd-current/2015-March/055193.html

Suggestions from:	dim, jmg, rpaulo
Approved by:	kib (mentor)
MFC after:	2 weeks
Sponsored by:	Dell Inc.
Eric van Gyzen 2015-08-05 12:53:55 +00:00
parent c8fbdcc10d
commit ddab052725
5 changed files with 35 additions and 2 deletions

@@ -1,3 +1,9 @@
 #$FreeBSD$
 SRCS+= _umtx_op_err.S
+# With the current compiler and libthr code, using SSE in libthr
+# does not provide enough performance improvement to outweigh
+# the extra context switch cost. This can measurably impact
+# performance when the application also does not use enough SSE.
+CFLAGS+=${CFLAGS_NO_SIMD}

@@ -1,3 +1,9 @@
 # $FreeBSD$
 SRCS+= _umtx_op_err.S
+# With the current compiler and libthr code, using SSE in libthr
+# does not provide enough performance improvement to outweigh
+# the extra context switch cost. This can measurably impact
+# performance when the application also does not use enough SSE.
+CFLAGS+=${CFLAGS_NO_SIMD}

@@ -1,6 +1,6 @@
 # $FreeBSD$
-CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -msoft-float
+CFLAGS+= ${CFLAGS_NO_SIMD} -msoft-float
 # Uncomment this to build the dynamic linker as an executable instead
 # of a shared library:
 #LDSCRIPT= ${.CURDIR}/${MACHINE_CPUARCH}/elf_rtld.x

@@ -1,6 +1,6 @@
 # $FreeBSD$
-CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -msoft-float
+CFLAGS+= ${CFLAGS_NO_SIMD} -msoft-float
 # Uncomment this to build the dynamic linker as an executable instead
 # of a shared library:
 #LDSCRIPT= ${.CURDIR}/${MACHINE_CPUARCH}/elf_rtld.x

@@ -282,6 +282,27 @@ _CPUCFLAGS += -mfloat-abi=softfp
 CFLAGS += ${_CPUCFLAGS}
 .endif
+#
+# Prohibit the compiler from emitting SIMD instructions.
+# These flags are added to CFLAGS in areas where the extra context-switch
+# cost outweighs the advantages of SIMD instructions.
+#
+# gcc:
+# Setting -mno-mmx implies -mno-3dnow
+# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3 and -mfpmath=387
+#
+# clang:
+# Setting -mno-mmx implies -mno-3dnow and -mno-3dnowa
+# Setting -mno-sse implies -mno-sse2, -mno-sse3, -mno-ssse3, -mno-sse41 and
+# -mno-sse42
+# (-mfpmath= is not supported)
+#
+.if ${MACHINE_CPUARCH} == "i386" || ${MACHINE_CPUARCH} == "amd64"
+CFLAGS_NO_SIMD.clang= -mno-avx
+CFLAGS_NO_SIMD= -mno-mmx -mno-sse
+.endif
+CFLAGS_NO_SIMD += ${CFLAGS_NO_SIMD.${COMPILER_TYPE}}
 # Add in any architecture-specific CFLAGS.
 # These come from make.conf or the command line or the environment.
 CFLAGS += ${CFLAGS.${MACHINE_ARCH}}
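
For readers who want to see how the per-compiler indirection above resolves, the following is a minimal standalone sketch for BSD make, not part of the commit. The file name demo.mk and the `show` target are hypothetical, and the `?=` defaults stand in for values that the real build system is assumed to set elsewhere.

```make
# demo.mk (hypothetical): reproduces the CFLAGS_NO_SIMD logic in isolation.
# In a real build COMPILER_TYPE and MACHINE_CPUARCH come from the build
# system; the defaults below are assumptions for this sketch only.
COMPILER_TYPE?=		clang
MACHINE_CPUARCH?=	amd64

.if ${MACHINE_CPUARCH} == "i386" || ${MACHINE_CPUARCH} == "amd64"
# Compiler-specific additions live in CFLAGS_NO_SIMD.<compiler type>.
CFLAGS_NO_SIMD.clang=	-mno-avx
CFLAGS_NO_SIMD=		-mno-mmx -mno-sse
.endif

# "+=" appends without expanding, so the nested reference is resolved when
# CFLAGS_NO_SIMD is finally used. If no CFLAGS_NO_SIMD.<type> variable
# exists for the selected compiler, the reference expands to nothing.
CFLAGS_NO_SIMD+=	${CFLAGS_NO_SIMD.${COMPILER_TYPE}}

# A consumer opts in the same way the libthr Makefile.inc files do.
CFLAGS+=		${CFLAGS_NO_SIMD}

show:
	@echo "CFLAGS_NO_SIMD = ${CFLAGS_NO_SIMD}"
```

Running something like `make -f demo.mk show` should list the x86 flags plus `-mno-avx` for clang. Overriding `COMPILER_TYPE=gcc` on the command line should drop the clang-only flag, and overriding `MACHINE_CPUARCH` with a non-x86 value should leave the variable effectively empty, so the same `CFLAGS+=` line stays harmless on other architectures.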