The "base+offsets" variants of gather decompose the integer offsets into
compile-time constant and compile-time unknown elements. (The coalescing
optimization, then, depends on this decomposition being done well--having
as much as possible in the constant component.) We now make multiple
efforts to improve this decomposition as we run optimization passes; in
some cases we're able to move more over to the constant side than was
first possible.
This in particular fixes issue #276, a case where coalescing was expected
but didn't actually happen.
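As a minimal sketch of how such a decomposition can work (using the LLVM
C++ API; the helper and its contract are illustrative, not the actual
ispc code), constant terms can be recursively peeled out of integer adds,
so that e.g. "(a + 8) + b" yields a constant part of 8 and a variable
part of a + b:

    #include "llvm/Constants.h"     // LLVM 3.x-era path; later llvm/IR/
    #include "llvm/Instructions.h"

    // Split "offset" into a compile-time constant part plus a variable
    // part. *varPart is NULL if the whole expression folds to a constant.
    static void lDecomposeOffset(llvm::Value *offset,
                                 llvm::Constant **constPart,
                                 llvm::Value **varPart,
                                 llvm::Instruction *insertBefore) {
        if (llvm::Constant *c = llvm::dyn_cast<llvm::Constant>(offset)) {
            *constPart = c;
            *varPart = NULL;
            return;
        }
        llvm::BinaryOperator *bop =
            llvm::dyn_cast<llvm::BinaryOperator>(offset);
        if (bop != NULL && bop->getOpcode() == llvm::Instruction::Add) {
            llvm::Constant *c0, *c1;
            llvm::Value *v0, *v1;
            lDecomposeOffset(bop->getOperand(0), &c0, &v0, insertBefore);
            lDecomposeOffset(bop->getOperand(1), &c1, &v1, insertBefore);
            // Fold the two constant components together at compile time...
            *constPart = llvm::ConstantExpr::getAdd(c0, c1);
            // ...and re-emit an add for whatever remains variable.
            if (v0 == NULL)      *varPart = v1;
            else if (v1 == NULL) *varPart = v0;
            else *varPart = llvm::BinaryOperator::Create(
                     llvm::Instruction::Add, v0, v1, "var_offset",
                     insertBefore);
            return;
        }
        // Anything else goes entirely to the variable side.
        *constPart = llvm::Constant::getNullValue(offset->getType());
        *varPart = offset;
    }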
Rather than having separate passes that each perform one of the following
conversions, when possible:
- General gather/scatter of a vector of pointers to g/s of
  a base pointer and integer offsets
- Gather/scatter to masked load/store, or load+broadcast
- Masked load/store to regular load/store
all of these are now done in a single ImproveMemoryOps pass. This change
in particular addresses some phase-ordering issues that showed up with
multidimensional array accesses: after determining that an outer
dimension had the same index value across the program instances, we
previously weren't able to take advantage of the uniformity of the
resulting pointer.
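A minimal sketch of the combined pass structure (the helper names here
are hypothetical, purely for illustration): each rewrite is retried
whenever another one makes progress, which the separate passes couldn't
do:

    #include "llvm/BasicBlock.h"   // LLVM 3.x-era path; later llvm/IR/

    // One hypothetical helper per rewrite; each returns true if it
    // changed the IR.
    bool lGSToGSBaseOffsets(llvm::Instruction *inst);  // g/s -> base+offsets
    bool lGSToLoadStore(llvm::Instruction *inst);      // g/s -> masked ld/st
    bool lImproveMaskedLoadStore(llvm::Instruction *inst); // masked -> regular

    bool lImproveMemoryOps(llvm::BasicBlock &bb) {
        bool modifiedAny = false;
    restart:
        for (llvm::BasicBlock::iterator iter = bb.begin();
             iter != bb.end(); ++iter) {
            llvm::Instruction *inst = &*iter;
            // Try each rewrite in turn, rescanning after any change: e.g.
            // finding a common base pointer may let a gather become a
            // masked load, which in turn may become a regular load, all
            // in a single run of the pass.
            if (lGSToGSBaseOffsets(inst) || lGSToLoadStore(inst) ||
                lImproveMaskedLoadStore(inst)) {
                modifiedAny = true;
                goto restart;
            }
        }
        return modifiedAny;
    }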
Now that we never run with the mask all off, we no longer need to route
this logic through a built-in function just so that the mask could be
checked at run time. In the one place where it was used (turning gathers
from the same location into a load and broadcast), we now just emit the
code for that directly.
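A minimal sketch of the direct emission (LLVM C++ API; the helper name
is illustrative):

    #include "llvm/IRBuilder.h"   // LLVM 3.x-era path; later llvm/IR/

    // All lanes read the same address: do one scalar load, then splat it.
    static llvm::Value *lEmitLoadBroadcast(llvm::IRBuilder<> &builder,
                                           llvm::Value *scalarPtr,
                                           int width) {
        llvm::Value *scalar = builder.CreateLoad(scalarPtr, "load_one");
        llvm::Value *vec = llvm::UndefValue::get(
            llvm::VectorType::get(scalar->getType(), width));
        vec = builder.CreateInsertElement(vec, scalar,
                                          builder.getInt32(0), "insert");
        // A zero shuffle mask replicates lane 0 across the whole vector.
        llvm::Constant *zeroMask = llvm::ConstantVector::getSplat(
            width, builder.getInt32(0));
        return builder.CreateShuffleVector(
            vec, llvm::UndefValue::get(vec->getType()), zeroMask,
            "broadcast");
    }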
Previously, we'd bitcast e.g. a vector of floats to a vector of i32s and
then use the i32 variant of masked_load/masked_store/gather/scatter. Now,
we have separate float/double variants of each of those.
Also change the function suffixes from "_32", etc., to "_i32".
Improve the load_and_broadcast macro in util.m4 to take the vector width
from the WIDTH variable rather than taking it as a parameter.
Performance and code quality of the performance suite are unchanged;
compilation times are improved by another 20% or so for simple
programs (e.g. rt.ispc). One very complex program now compiles
about 2.4x faster.
We now try harder to keep the names of instructions related to the
names of the variables they were originally derived from. This is
useful for making both LLVM IR and generated C++ code easier to
correlate back to the original ispc source code.
Issue #244.
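For instance (a sketch, not the actual ispc code), when emitting a load
for an ispc variable, the source-level name can be threaded through to
the instruction name; "_load" and the names here are illustrative:

    #include "llvm/IRBuilder.h"   // LLVM 3.x-era path; later llvm/IR/

    static llvm::Value *lLoadWithName(llvm::IRBuilder<> &builder,
                                      llvm::Value *ptr,
                                      const llvm::Twine &symbolName) {
        // A symbol "foo" yields IR like "%foo_load = load ...", which is
        // far easier to map back to the source than "%13 = load ...".
        return builder.CreateLoad(ptr, symbolName + "_load");
    }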
When we're manually scalarizing the extraction of the first element
of a vector value, we need to be careful about handling constant values
and about where new instructions are inserted. The old code was
sloppy about this, which in turn led to invalid IR in some cases.
For example, the two bugs below were essentially due to generating
an extractelement instruction from a zeroinitializer value and then
inserting it in the wrong basic block, such that a phi node that used
the value was malformed.
Fixes issues #240 and #229.
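A minimal sketch of the two precautions (LLVM C++ API; the helper is
illustrative): constants fold directly with no new instruction at all,
and any instruction we do create goes at an insertion point supplied by
the caller, chosen to dominate the use:

    #include "llvm/Constants.h"     // LLVM 3.x-era paths; later llvm/IR/
    #include "llvm/Instructions.h"

    static llvm::Value *lExtractFirstElement(llvm::Value *v,
                                             llvm::Instruction *insertBefore) {
        // zeroinitializer: fold straight to a scalar zero; creating an
        // extractelement from it (as the old code did) is never needed.
        if (llvm::isa<llvm::ConstantAggregateZero>(v))
            return llvm::Constant::getNullValue(
                llvm::cast<llvm::VectorType>(v->getType())->getElementType());
        // Otherwise emit the extract at a point the caller knows
        // dominates the use, so e.g. phi nodes consuming the result
        // stay well-formed.
        llvm::Value *zero = llvm::ConstantInt::get(
            llvm::Type::getInt32Ty(v->getContext()), 0);
        return llvm::ExtractElementInst::Create(v, zero, "first_elt",
                                                insertBefore);
    }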
Various optimization passes depend on turning a compile-time constant
mask into a bit vector; it turns out that in LLVM 3.1, constant vectors
of ints/floats are represented with llvm::ConstantDataVector, but
constant vectors of bools use llvm::ConstantVector (which is what LLVM
3.0 uses for all constant vectors). Now lGetMask() always tries the
llvm::ConstantVector path as well, to cover this case.
This improves generated C++ code by eliminating things like selects
with an all-on or all-off mask, turning movmask calls with constant
masks into constant values, etc.
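A minimal sketch of the llvm::ConstantVector path (the signature mirrors
what such a helper might look like, not necessarily the actual
lGetMask()):

    #include <stdint.h>
    #include "llvm/Constants.h"   // LLVM 3.x-era path; later llvm/IR/

    // Turn a compile-time-constant mask vector into one bit per lane.
    // Returns false if the value isn't a constant vector of const ints.
    static bool lGetMaskFromConstantVector(llvm::Value *mask,
                                           uint64_t *bits) {
        *bits = 0;
        llvm::ConstantVector *cv =
            llvm::dyn_cast<llvm::ConstantVector>(mask);
        if (cv == NULL)
            return false;  // (an LLVM 3.1 ConstantDataVector path goes here)
        for (unsigned i = 0; i < cv->getNumOperands(); ++i) {
            llvm::ConstantInt *ci =
                llvm::dyn_cast<llvm::ConstantInt>(cv->getOperand(i));
            if (ci == NULL)
                return false;      // not a constant int lane; give up
            if (!ci->isZero())
                *bits |= (1ull << i);  // lane i is on
        }
        return true;
    }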
When the mask was all off, we'd choose the incorrect operand!
(This bug was masked since this optimization wasn't triggering as
intended, due to other issues to be fixed in a forthcoming commit.)
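A sketch of the corrected fold (assuming, purely as illustration, blends
of the form select(mask, new, old)): an all-off mask must select the old
value, operand 2, not the new one:

    #include <stdint.h>
    #include "llvm/Instructions.h"  // LLVM 3.x-era path; later llvm/IR/

    // Fold a blend whose mask is a compile-time constant; returns NULL
    // for mixed masks, which must keep the select.
    static llvm::Value *lFoldBlendWithConstantMask(llvm::SelectInst *sel,
                                                   uint64_t maskBits,
                                                   int vectorWidth) {
        uint64_t allOn = (1ull << vectorWidth) - 1;
        if (maskBits == allOn)
            return sel->getTrueValue();   // all on: take the new value
        if (maskBits == 0)
            return sel->getFalseValue();  // all off: keep the old value
        return NULL;
    }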
With AOS data, we can often coalesce the gather accesses for the main
part of foreach loops, failing only on the final iterations where the
mask is not all on (since the coalescing code doesn't yet handle mixed
masks). Before, we'd report success with coalescing and then also report
that gathers were needed for the same accesses that had been coalesced,
which was (a) confusing, and (b) didn't accurately represent what was
going on for the majority of the loop iterations.
Clean up the API so that the caller no longer has to pass in a vector
for the function to track PHI nodes in; that's now done internally.
Handle casts in lValuesAreEqual().
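A sketch of the cast handling (illustrative; the real lValuesAreEqual()
also tracks PHI nodes): two values compare equal if they are the same
kind of cast, to the same type, of equal operands:

    #include "llvm/Instructions.h"  // LLVM 3.x-era path; later llvm/IR/

    static bool lValuesAreEqualSketch(llvm::Value *v0, llvm::Value *v1) {
        if (v0 == v1)
            return true;
        // Look through matching casts: same opcode plus same destination
        // type plus equal operands implies equal values.
        llvm::CastInst *c0 = llvm::dyn_cast<llvm::CastInst>(v0);
        llvm::CastInst *c1 = llvm::dyn_cast<llvm::CastInst>(v1);
        if (c0 != NULL && c1 != NULL &&
            c0->getOpcode() == c1->getOpcode() &&
            c0->getType() == c1->getType())
            return lValuesAreEqualSketch(c0->getOperand(0),
                                         c1->getOperand(0));
        return false;
    }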
For cases where it turns out that we just need the first element of
a vector (e.g. because we've determined that all of the values are
equal), it's often more efficient to compute just that one value with
scalar operations than to compute the whole vector's worth and then
use only one element. This function tries to rewrite a vector
computation to the scalar equivalent, if possible.
(Partial work-around to http://llvm.org/bugs/show_bug.cgi?id=11775.)
Note that sometimes this is the wrong thing to do, for example if we
also need the entire vector value for other purposes.
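A minimal sketch of the rewrite (illustrative, handling only binary
operators; the real function covers more cases): rather than extracting
lane 0 of the vector result, recurse into the operands and redo the
operation on scalars:

    #include "llvm/Constants.h"     // LLVM 3.x-era paths; later llvm/IR/
    #include "llvm/Instructions.h"

    static llvm::Value *lScalarizeFirstElement(llvm::Value *v,
                                               llvm::Instruction *insertBefore) {
        // A vector add/mul/etc. becomes the same op on scalarized
        // operands, so the other vector lanes are never computed at all.
        if (llvm::BinaryOperator *bop =
                llvm::dyn_cast<llvm::BinaryOperator>(v)) {
            llvm::Value *s0 = lScalarizeFirstElement(bop->getOperand(0),
                                                     insertBefore);
            llvm::Value *s1 = lScalarizeFirstElement(bop->getOperand(1),
                                                     insertBefore);
            return llvm::BinaryOperator::Create(bop->getOpcode(), s0, s1,
                                                "scalar_op", insertBefore);
        }
        // Leaf values: fall back to an explicit extract of lane 0.
        llvm::Value *zero = llvm::ConstantInt::get(
            llvm::Type::getInt32Ty(v->getContext()), 0);
        return llvm::ExtractElementInst::Create(v, zero, "extract",
                                                insertBefore);
    }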
The main issue is that these optimizations end up generating a number
of smaller vector ops (e.g. 4-wide and 8-wide on the 16-wide generic
target), which the examples/intrinsics implementations don't currently
support.
This fixes a number of failing tests for now; it may be worth
generalizing the code in examples/intrinsics at some point, since as a
general principle the coalescing optimizations are still desirable
when, e.g., generating LLVM IR output.
Issue #175.
There are two related optimizations that happen now. (These currently
apply only to gathers where the mask is known to be all on and that are
accessing 32-bit sized elements, but both may be generalized in the
future.)
First, for any single gather, we are now more flexible in mapping it
to individual memory operations. Previously, we would map it only to
either a general gather (one scalar load per SIMD lane) or an unaligned
vector load (if the program instances could be determined to be
accessing a sequential set of locations in memory).
Now, we are able to break gathers into scalar, 2-wide (i.e. 64-bit),
4-wide, or 8-wide loads, and we generate the shuffles needed to put the
loaded values into the right lanes. Doing fewer, larger loads in this
manner, when possible, can be more efficient.
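As an illustration (a sketch with hypothetical offsets, not the actual
ispc code): a 4-wide gather whose lanes were determined to read 32-bit
elements at offsets {0, 1, 8, 9} from a common base can become two
64-bit loads plus one shuffle:

    #include "llvm/IRBuilder.h"   // LLVM 3.x-era path; later llvm/IR/

    static llvm::Value *lGatherAsTwo64BitLoads(llvm::IRBuilder<> &builder,
                                               llvm::Value *base /* i32* */) {
        llvm::Type *vec2Ty = llvm::VectorType::get(builder.getInt32Ty(), 2);
        // Load elements {0,1} and {8,9} as two <2 x i32> (i.e. 64-bit)
        // loads; alignment handling is elided here.
        llvm::Value *p0 = builder.CreateBitCast(
            builder.CreateConstGEP1_32(base, 0), vec2Ty->getPointerTo());
        llvm::Value *p1 = builder.CreateBitCast(
            builder.CreateConstGEP1_32(base, 8), vec2Ty->getPointerTo());
        llvm::Value *lo = builder.CreateLoad(p0, "load64_a");  // lanes 0, 1
        llvm::Value *hi = builder.CreateLoad(p1, "load64_b");  // lanes 2, 3
        // Concatenate the two halves into the 4-wide gather result.
        llvm::Constant *concat[] = { builder.getInt32(0), builder.getInt32(1),
                                     builder.getInt32(2), builder.getInt32(3) };
        return builder.CreateShuffleVector(
            lo, hi, llvm::ConstantVector::get(concat), "gather_coalesced");
    }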
Second, we can coalesce memory accesses across multiple gathers. If
we have a series of gathers without any memory writes in the middle,
then we try to analyze their reads collectively and choose an efficient
set of loads for them. Not only does this help if different gathers
reuse values from the same location in memory, but it's specifically
helpful when data with AOS layout is being accessed; in this case,
we're often able to generate wide vector loads and appropriate shuffles
automatically.
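For example (again a sketch with an assumed layout): two 4-wide gathers
reading pts[i].x and pts[i].y for consecutive i, with struct
{ float x, y; }, can share a single 8-wide load, with one shuffle per
original gather:

    #include "llvm/IRBuilder.h"   // LLVM 3.x-era path; later llvm/IR/

    static void lCoalesceAOSGathers(llvm::IRBuilder<> &builder,
                                    llvm::Value *base /* float*, &pts[0].x */,
                                    llvm::Value **xResult,
                                    llvm::Value **yResult) {
        // One wide load covers all four {x, y} pairs; alignment elided.
        llvm::Type *vec8Ty = llvm::VectorType::get(builder.getFloatTy(), 8);
        llvm::Value *wide = builder.CreateLoad(
            builder.CreateBitCast(base, vec8Ty->getPointerTo()), "aos_load");
        llvm::Value *undef = llvm::UndefValue::get(vec8Ty);
        // x values sit in even lanes of the wide load, y values in odd.
        llvm::Constant *xMask[] = { builder.getInt32(0), builder.getInt32(2),
                                    builder.getInt32(4), builder.getInt32(6) };
        llvm::Constant *yMask[] = { builder.getInt32(1), builder.getInt32(3),
                                    builder.getInt32(5), builder.getInt32(7) };
        *xResult = builder.CreateShuffleVector(
            wide, undef, llvm::ConstantVector::get(xMask), "gather_x");
        *yResult = builder.CreateShuffleVector(
            wide, undef, llvm::ConstantVector::get(yMask), "gather_y");
    }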