We should never be running with an all-off mask, and thus should never
enter a function with an all-off mask. There's no performance change from
removing this, however.
Issue #282.
The "base+offsets" variants of gather decompose the integer offsets into
compile-time constant and compile-time unknown elements. (The coalescing
optimization, then, depends on this decomposition being done well--having
as much as possible in the constant component.) We now make multiple
efforts to improve this decomposition as we run optimization passes; in
some cases we're able to move more over to the constant side than was
first possible.
This in particular fixes issue #276, a case where coalescing was expected
but didn't actually happen.
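For illustration, a hypothetical access with this shape (the struct, array,
and names below are made up, not taken from the change):

struct Point { float x, y, z; };

float gather_y(uniform Point * uniform a, uniform int base) {
    // The per-lane byte offset of a[base + programIndex].y splits into a
    // compile-time-constant part (programIndex * sizeof(Point) plus the
    // field offset of .y) and an unknown part (base * sizeof(Point));
    // keeping as much as possible on the constant side is what lets
    // neighboring lanes' gathers be coalesced into wider loads.
    return a[base + programIndex].y;
}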
Rather than having separate passes that, when possible, convert:
- General gather/scatter of a vector of pointers to g/s of
a base pointer and integer offsets
- Gather/scatter to masked load/store, load+broadcast
- Masked load/store to regular load/store
Now all of these conversions are done in a single ImproveMemoryOps pass.
This change in particular addresses phase-ordering issues that showed up
with multidimensional array accesses: after determining that an outer
dimension had the same index value in every program instance, we previously
weren't able to take advantage of the uniformity of the resulting pointer.
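As a made-up example of the kind of access involved (the array, its
dimensions, and the function are illustrative only):

uniform float grid[64][64];

float load_element(uniform int base) {
    // 'index' is varying, but every program instance holds the same value.
    // Once the optimizer determines that the outer dimension's index is
    // identical across the gang, the addresses are a uniform row base plus
    // per-lane offsets from programIndex, so the general gather for
    // grid[index][programIndex] can become an ordinary vector load.
    int index = base;
    return grid[index][programIndex];
}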
Now that we never run with the mask all off, we no longer need to keep
that logic in a built-in function just so that the mask can be checked. In
the one place where it was used (turning gathers from the same location
into a load and broadcast), we now just emit the code for that directly.
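A made-up example of that pattern (the names are illustrative only):

float broadcast_load(uniform float * uniform values, uniform int i) {
    // 'index' is varying but holds the same value in every program
    // instance, so all active lanes gather from the same address; the
    // compiler can emit a single scalar load and broadcast the result
    // instead of a gather.
    int index = i;
    return values[index];
}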
When the outermost dimension(s) were partially active, but the innermost
dimension was all on, we'd inadvertently use an incorrect "all on"
execution mask.
Fixes issues #177 and #200.
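One plausible shape of code that hits this (the extents are made up, and
the exact mapping of program instances to dimensions depends on the
target):

uniform float img[7][8];

void clear_image() {
    foreach (y = 0 ... 7, x = 0 ... 8) {
        // On a target that tiles the gang across both dimensions, the
        // final span of the 7-element outer dimension leaves some program
        // instances inactive even though the 8-element inner dimension is
        // fully covered; the bug was that the all-on inner dimension led
        // to an all-on execution mask being used.
        img[y][x] = 0;
    }
}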
Previously, we'd bitcast e.g. a vector of floats to a vector of i32s and then
use the i32 variant of masked_load/masked_store/gather/scatter. Now, we have
separate float/double variants of each of those.
Change the function suffixes to "_i32", etc., from "_32".
Improve the load_and_broadcast macro in util.m4 to take the vector width
from the WIDTH variable rather than taking it as a parameter.
If we have a simple varying 'if' statement where the only code in the body
is a single 'break', we now emit special-case code that just updates the
execution mask directly.
Surprisingly, this leads to better generated code (e.g., the Mandelbrot
example improves from a 5.8x to a 7.1x speedup on AVX). It's not clear why
the general code-generation path for break doesn't generate equivalent
code; this should be investigated further. (Issue #277).
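A sketch of the pattern, loosely following the shape of the Mandelbrot
example (the details below are illustrative, not the exact example code):

int mandel_iters(float c_re, float c_im, uniform int max_iter) {
    float z_re = c_re, z_im = c_im;
    int i;
    for (i = 0; i < max_iter; ++i) {
        // The body of this varying 'if' is a lone 'break': the emitted
        // code can simply clear the execution-mask bits for the lanes
        // whose condition is true and keep looping with the rest.
        if (z_re * z_re + z_im * z_im > 4.)
            break;
        float new_re = z_re * z_re - z_im * z_im;
        float new_im = 2. * z_re * z_im;
        z_re = c_re + new_re;
        z_im = c_im + new_im;
    }
    return i;
}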
This allows adding types to the list of those included in the
automatically-generated header files:
struct Foo { . . . };
struct Bar { . . . };
export { Foo, Bar };
Previously, we were trying to take a uniform seed and then shuffle that
around to initialize the state for each of the program instances. This
was becoming increasingly untenable and brittle.
Now a varying seed is expected and used.
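A usage sketch (assuming the standard library's RNGState type and
seed_rng() entry point; mixing in programIndex is just one way to give
each program instance a distinct seed):

RNGState state;

void init_rng(uniform unsigned int base_seed) {
    // The seed is varying: each program instance passes its own value.
    unsigned int seed = base_seed + programIndex;
    seed_rng(&state, seed);
}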
There were a number of situations where we were left-shifting 1 by a lane
index; these failed when the shift went beyond 32 bits. Fixed by shifting
the 64-bit constant value 1ull instead.
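A minimal illustration (the function and variable names are made up):

uniform unsigned int64 lane_bit(uniform int lane) {
    // Previously: 1 << lane, where the 32-bit literal overflows once the
    // shift amount reaches 32; shifting the 64-bit constant stays in range.
    return 1ull << lane;
}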
It can sometimes be useful to know roughly where we were in the program
when an assertion fires; when the position is available and applicable,
this macro is now used.
Issue #268.