e.g. "__equal()" -> "__equal_float()", etc.
No functional change; this is necessary groundwork for a forthcoming
peephole optimization that eliminates ANDs of masks in some cases.
This should help with performance of the generated code.
Updated the relevant header files (sse4.h, generic-16.h, generic-32.h, generic-64.h)
Updated generic-32.h and generic-64.h to the new memory API
There were a number of situations where we were left-shifting 1 by a
lane index that were failing due to shifting beyond 32-bits. Fixed
by shifting the 64-bit constant value 1ull.