bbfb40721e
and medium size args too: instead of conditionally subtracting a float 17+24, 17+17+24 or 17+17+17+24 bit approximation to pi/2, always subtract a double 33+53 bit one. The float version is now closer to the double version than to old versions of itself -- it uses the same 33+53 bit approximation as the simplest cases in the double version, and where the float version had to switch to the slow general case at |x| == 2^7*pi/2, it now switches at |x| == 2^19*pi/2 the same as the double version. This speeds up arg reduction by a factor of 2 for |x| between 3*pi/4 and 2^7*pi/4, and by a factor of 7 for |x| between 2^7*pi/4 and 2^19*pi/4.