We were previously emitting 64-bit indexing for some gathers where
32-bit was actually fine, due to some adds of constant vectors
that hadn't been simplified down to a single constant result.
We now have two ways of approaching gather/scatters with a common base
pointer and with offset vectors. For targets with native gather/scatter,
we just turn those into base + {1/2/4/8}*offsets. For targets without,
we turn those into base + {1/2/4/8}*varying_offsets + const_offsets,
where const_offsets is a compile-time constant.
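To make the two addressing forms concrete, here's a small standalone C++
sketch with made-up values (not the compiler's codegen) that computes the
per-lane addresses both ways for a hypothetical 4-wide gather of 32-bit
elements and checks that they agree:

    #include <cstdint>
    #include <cstdio>

    int main() {
        float data[64] = {};
        const char *base = reinterpret_cast<const char *>(data);
        // Per-lane element indices as the front-end might see them.
        int32_t offsets[4]         = { 3, 19, 35, 51 };
        // The same offsets split into a compile-time-unknown part and a
        // compile-time-constant byte offset.
        int32_t varying_offsets[4] = { 0, 16, 32, 48 };
        int32_t const_offsets[4]   = { 12, 12, 12, 12 };

        for (int i = 0; i < 4; ++i) {
            // Targets with native gather/scatter: base + 4*offsets
            const char *a = base + 4 * offsets[i];
            // Other targets: base + 4*varying_offsets + const_offsets
            const char *b = base + 4 * varying_offsets[i] + const_offsets[i];
            printf("lane %d: %s\n", i, (a == b) ? "same address" : "MISMATCH");
        }
        return 0;
    }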
Infrastructure for issue #325.
No functional change; just preparation for having a path that doesn't
factor the offsets into constant and varying parts, which will be better
for AVX2 and KNC.
The "base+offsets" variants of gather decompose the integer offsets into
compile-time constant and compile-time unknown elements. (The coalescing
optimization, then, depends on this decomposition being done well--having
as much as possible in the constant component.) We now make multiple
attempts to improve this decomposition as the optimization passes run; in
some cases we're able to move more over to the constant side than was
possible at first.
This in particular fixes issue #276, a case where coalescing was expected
but didn't actually happen.
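As a rough, invented illustration of what "moving more over to the constant
side" means (plain C++ with made-up values, not the compiler's own
representation):

    #include <array>
    #include <cstdio>

    using Vec4 = std::array<int, 4>;

    int main() {
        Vec4 varying = { 0, 16, 32, 48 };              // compile-time unknown
        Vec4 c0 = { 4, 4, 4, 4 }, c1 = { 8, 8, 8, 8 }; // compile-time constants

        for (int i = 0; i < 4; ++i) {
            int offset = (varying[i] + c0[i]) + c1[i];
            // A decomposition that only peels the outermost constant add gets
            // constant part c1 and varying part (varying + c0)...
            int const_before = c1[i];
            // ...but once c0 + c1 is folded to a single constant, the whole
            // constant amount can move to the constant component, which is
            // what the coalescing optimization wants to see.
            int const_after = c0[i] + c1[i];
            printf("lane %d: offset %d, constant part %d -> %d\n",
                   i, offset, const_before, const_after);
        }
        return 0;
    }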
Rather than having separate passes to do conversion, when possible, of:
- General gather/scatter of a vector of pointers to g/s of
a base pointer and integer offsets
- Gather/scatter to masked load/store, load+broadcast
- Masked load/store to regular load/store
Now all of them are done in a single ImproveMemoryOps pass. This change was
made in particular to address some phase-ordering issues that showed up with
multidimensional array access, where, after determining that an outer
dimension had the same index value across the program instances, we
previously weren't able to take advantage of the uniformity of the resulting
pointer.
Now that we never ever run with the mask all off, we no longer need to
put that logic in a built-in function just so that the mask can be
checked. In the one place where it was used (turning gathers from the
same location into a load and broadcast), we now just emit the code for
that directly.
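A plain C++ sketch of the transformation in scalar terms (invented names and
values; the real change is in the code the optimizer emits): when every lane
of a gather reads the same address, one scalar load plus a broadcast gives
the same result.

    #include <array>
    #include <cstdio>

    int main() {
        float value = 42.0f;
        const float *addrs[8];
        for (int i = 0; i < 8; ++i)
            addrs[i] = &value;              // all lanes gather the same location

        // gather: one load per lane
        std::array<float, 8> gathered;
        for (int i = 0; i < 8; ++i)
            gathered[i] = *addrs[i];

        // load + broadcast: one scalar load, replicated across the vector
        float scalar = *addrs[0];
        std::array<float, 8> broadcast;
        broadcast.fill(scalar);

        printf("%s\n", gathered == broadcast ? "equal" : "different");
        return 0;
    }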
Previously, we'd bitcast e.g. a vector of floats to a vector of i32s and then
use the i32 variant of masked_load/masked_store/gather/scatter. Now, we have
separate float/double variants of each of those.
Change the function name suffix from "_32" to "_i32", etc.
Improve the load_and_broadcast macro in util.m4 to grab the vector width
from the WIDTH variable rather than taking it as a parameter.
Performance and code quality of the performance suite are unchanged;
compilation times are improved by another 20% or so for simple
programs (e.g. rt.ispc). One very complex program compiles
about 2.4x faster now.
We now try harder to keep the names of instructions related to the
names of the variables they're originally derived from. This is useful
for making both the LLVM IR and the generated C++ code easier to
correlate back to the original ispc source code.
Issue #244.
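For illustration only (this is not ispc's code, and it's written against a
newer LLVM API than the 3.x versions in use at the time): the name hints
passed when creating instructions are what make the resulting IR correlate
back to source variables.

    #include <llvm/IR/IRBuilder.h>
    #include <llvm/IR/LLVMContext.h>
    #include <llvm/IR/Module.h>
    #include <llvm/Support/raw_ostream.h>

    int main() {
        llvm::LLVMContext ctx;
        llvm::Module m("demo", ctx);
        llvm::IRBuilder<> b(ctx);

        llvm::Type *i32 = b.getInt32Ty();
        llvm::FunctionType *ft = llvm::FunctionType::get(i32, { i32, i32 }, false);
        llvm::Function *f = llvm::Function::Create(
            ft, llvm::Function::ExternalLinkage, "scale_and_add", &m);

        // Name the arguments after the "source" variables...
        llvm::Function::arg_iterator ai = f->arg_begin();
        llvm::Value *count = &*ai++;  count->setName("count");
        llvm::Value *delta = &*ai++;  delta->setName("delta");

        b.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", f));
        // ...and carry those names through to derived instructions, so the
        // printed IR reads "%count_scaled", "%count_plus_delta", etc.
        llvm::Value *scaled = b.CreateMul(count, b.getInt32(4), "count_scaled");
        llvm::Value *sum = b.CreateAdd(scaled, delta, "count_plus_delta");
        b.CreateRet(sum);

        m.print(llvm::outs(), nullptr);
        return 0;
    }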
When we're manually scalarizing the extraction of the first element
of a vector value, we need to be careful about handling constant values
and about where new instructions are inserted. The old code was
sloppy about this, which in turn led to invalid IR in some cases.
For example, the two bugs below were essentially due to generating
an extractelement inst from a zeroinitializer value and then inserting
it in the wrong basic block, such that a phi node that used that value was
malformed.
Fixes issues #240 and #229.
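A much-simplified sketch of the kind of care involved (invented name, written
against a newer LLVM API than the 3.x tree the commit targeted, and not the
actual ispc code): constants get folded directly instead of becoming
extractelement instructions, and any new instruction gets an explicit
insertion point.

    #include <llvm/ADT/Twine.h>
    #include <llvm/IR/Constants.h>
    #include <llvm/IR/Instructions.h>
    #include <llvm/Support/Casting.h>

    // Return a scalar holding lane 0 of the vector value 'v'.
    static llvm::Value *lExtractFirstLane(llvm::Value *v,
                                          llvm::Instruction *insertBefore) {
        // Constant vectors (including zeroinitializer) are folded directly;
        // no instruction, and hence no insertion point, is needed.
        if (llvm::Constant *c = llvm::dyn_cast<llvm::Constant>(v))
            return c->getAggregateElement(0u);

        // Otherwise emit an explicit extractelement at a point the caller
        // guarantees is dominated by the definition of 'v'.
        llvm::Value *zero =
            llvm::ConstantInt::get(llvm::Type::getInt32Ty(v->getContext()), 0);
        return llvm::ExtractElementInst::Create(v, zero,
                                                v->getName() + "_lane0",
                                                insertBefore);
    }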
Various optimization passes depend on turning a compile-time constant
mask into a bit vector; it turns out that in LLVM 3.1, constant vectors
of ints/floats are represented with llvm::ConstantDataVector, but
constant vectors of bools use llvm::ConstantVector (which is what LLVM
3.0 uses for all constant vectors). lGetMask() now always handles the
llvm::ConstantVector case as well, to cover this.
This improves generated C++ code by eliminating things like select
with an all on/off mask, turning movmask calls with constants into
constant values, etc.
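A hypothetical sketch of handling both representations (not the actual
lGetMask() source; the packed-uint64_t interface and the treatment of any
nonzero element as an "on" lane are assumptions here):

    #include <cstdint>
    #include <llvm/IR/Constants.h>
    #include <llvm/Support/Casting.h>

    // Pack the lanes of a compile-time-constant mask vector into bits of a
    // uint64_t; returns false if the value isn't a constant vector in either
    // representation.
    static bool lGetMaskBits(llvm::Value *mask, uint64_t *bits) {
        *bits = 0;
        if (llvm::ConstantDataVector *cdv =
                llvm::dyn_cast<llvm::ConstantDataVector>(mask)) {
            // LLVM 3.1 representation for constant int/float vectors.
            for (unsigned i = 0; i < cdv->getNumElements(); ++i)
                if (!cdv->getElementAsConstant(i)->isNullValue())
                    *bits |= (1ull << i);
            return true;
        }
        if (llvm::ConstantVector *cv =
                llvm::dyn_cast<llvm::ConstantVector>(mask)) {
            // LLVM 3.0 representation, still used for bool vectors in 3.1.
            for (unsigned i = 0; i < cv->getNumOperands(); ++i)
                if (!llvm::cast<llvm::Constant>(cv->getOperand(i))->isNullValue())
                    *bits |= (1ull << i);
            return true;
        }
        // A fuller version would also handle llvm::ConstantAggregateZero
        // (an all-zero, i.e. all-off, mask).
        return false;
    }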
When the mask was all off, we'd choose the incorrect operand!
(This bug was masked since this optimization wasn't triggering as
intended, due to other issues to be fixed in a forthcoming commit.)
With AOS data, we can often coalesce the accesses into gathers for the main
part of foreach loops, but fail only on the last iterations, where the mask
is not all on (since the coalescing code doesn't yet handle mixed masks). Before,
we'd report success with coalescing and then also report that gathers were needed
for the same accesses that were coalesced, which was a) confusing, and b)
didn't accurately represent what was going on for the majority of the loop
iterations.
Clean up the API so the caller doesn't have to pass in a vector for the
function to track PHI nodes (that's now done internally).
Handle casts in lValuesAreEqual().
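A much-reduced sketch of what handling casts there can look like (invented
body; it omits the PHI-node tracking the real function now does internally):

    #include <llvm/IR/InstrTypes.h>
    #include <llvm/Support/Casting.h>

    // Do the two values provably compute the same thing?  (Simplified.)
    static bool lValuesAreEqual(llvm::Value *a, llvm::Value *b) {
        if (a == b)
            return true;
        // Look through matching casts: two casts of the same kind to the same
        // type are equal if their source operands are equal.
        llvm::CastInst *ca = llvm::dyn_cast<llvm::CastInst>(a);
        llvm::CastInst *cb = llvm::dyn_cast<llvm::CastInst>(b);
        if (ca && cb && ca->getOpcode() == cb->getOpcode() &&
            ca->getDestTy() == cb->getDestTy())
            return lValuesAreEqual(ca->getOperand(0), cb->getOperand(0));
        return false;
    }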
For cases where it turns out that we just need the first element of
a vector (e.g. because we've determined that all of the values are
equal), it's often more efficient to only compute that one value
with scalar operations than to compute the whole vector's worth and
then just use one value. This function tries to rewrite a vector
computation to the scalar equivalent, if possible.
(Partial work-around to http://llvm.org/bugs/show_bug.cgi?id=11775.)
Note that sometimes this is the wrong thing to do--if we need the entire
vector value for other purposes, for example.
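A very reduced sketch of the idea (invented name, written against a newer
LLVM API, and handling only constants and binary operators; the real function
covers more cases):

    #include <llvm/ADT/Twine.h>
    #include <llvm/IR/Constants.h>
    #include <llvm/IR/InstrTypes.h>
    #include <llvm/Support/Casting.h>

    // Try to compute just lane 0 of the vector value 'v' with scalar ops;
    // returns nullptr if we don't know how, so the caller can fall back to
    // extracting the element from the full vector computation.
    static llvm::Value *lScalarizeLane0(llvm::Value *v,
                                        llvm::Instruction *insertBefore) {
        if (llvm::Constant *c = llvm::dyn_cast<llvm::Constant>(v))
            return c->getAggregateElement(0u);

        if (llvm::BinaryOperator *bop = llvm::dyn_cast<llvm::BinaryOperator>(v)) {
            llvm::Value *s0 = lScalarizeLane0(bop->getOperand(0), insertBefore);
            llvm::Value *s1 = lScalarizeLane0(bop->getOperand(1), insertBefore);
            if (s0 && s1)
                // Redo the same operation on the scalar lane-0 operands.
                return llvm::BinaryOperator::Create(bop->getOpcode(), s0, s1,
                                                    bop->getName() + "_lane0",
                                                    insertBefore);
        }
        return nullptr;
    }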
The main issue is that the coalescing optimizations end up generating a
number of smaller vector ops (e.g. 4-wide and 8-wide on the 16-wide
generic target), which the examples/intrinsics implementations don't
currently support.
This fixes a number of failing tests for now; it may be worth
generalizing the code in examples/intrinsics at some point, since the
coalescing optimizations are still desirable in general (e.g. when
generating LLVM IR output).
Issue #175.