These compute the average of two given values, rounding up and down,
respectively, if the result isn't exact. When possible, these are
mapped to target-specific intrinsics (PADD[BW] on IA and VH[R]ADD[US]
on NEON.)
A subsequent commit will add pattern-matching to generate calls to
these intrinsincs when the corresponding patterns are detected in the
IR.)
The stdilb code just calls the signed int{32,64} functions,
which gives the right result for the unsigned case anyway.
The various targets didn't consistently define the unsigned
variants in any case.
Now that we never ever run with the mask all off, we no longer need
that logic in a built-in function so that we can check the mask. In
the one place where it was used (turning gathers to the same location
into a load and broadcast), we now just emit the code for that
directly.
Previously, we'd bitcast e.g. a vector of floats to a vector of i32s and then
use the i32 variant of masked_load/masked_store/gather/scatter. Now, we have
separate float/double variants of each of those.
Change function suffix to "_i32", etc, from "_32"
Improve load_and_broadcast macro in util.m4 to grab vector width from
WIDTH variable rather than taking it as a parameter.