Now that we never ever run with the mask all off, we no longer need
that logic in a built-in function so that we can check the mask. In
the one place where it was used (turning gathers to the same location
into a load and broadcast), we now just emit the code for that
directly.
Previously, we'd bitcast e.g. a vector of floats to a vector of i32s and then
use the i32 variant of masked_load/masked_store/gather/scatter. Now, we have
separate float/double variants of each of those.
Change function suffix to "_i32", etc, from "_32"
Improve load_and_broadcast macro in util.m4 to grab vector width from
WIDTH variable rather than taking it as a parameter.