We now try harder to keep the names of instructions related to the
initial names of variables they're derived from and so forth. This
is useful for making both LLVM IR as well as generated C++ code
easier to correlate back to the original ispc source code.
Issue #244.
When we're manually scalarizing the extraction of the first element
of a vector value, we need to be careful about handling constant values
and about where new instructions are inserted. The old code was
sloppy about this, which in turn lead to invalid IR in some cases.
For example, the two bugs below were essentially due to generating
an extractelement inst from a zeroinitializer value and then inserting
it in the wrong bblock such that a phi node that used that value was
malformed.
Fixes issues #240 and #229.
Various optimization passes depend on turning a compile-time constant
mask into a bit vector; it turns out that in LLVM3.1, constant vectors
of ints/floats are represented with llvM::ConstantDataVector, but
constant vectors of bools use llvm::ConstantVector (which is what LLVM
3.0 uses for all constant vectors). Now lGetMask() always does the
llvm::ConstantVector path, to cover this case.
This improves generated C++ code by eliminating things like select
with an all on/off mask, turning movmask calls with constants into
constant values, etc.
When the mask was all off, we'd choose the incorrect operand!
(This bug was masked since this optimization wasn't triggering as
intended, due to other issues to be fixed in a forthcoming commit.
With AOS data, we can often coalesce the accesses into gathers for the main
part of foreach loops but only fail on the last bits where the mask is not
all on (since the coalescing code doesn't handle mixed masks, yet.) Before,
we'd report success with coalescing and then also report that gathers were needed
for the same accesses that were coalesced, which was a) confusing, and b)
didn't accurately represent what was going on for the majority of the loop
iterations.
Clean up the API, so the caller doesn't have to pass in a vector so
the function can track PHI nodes (do that internally instead.)
Handle casts in lValuesAreEqual().
For cases where it turns out that we just need the first element of
a vector (e.g. because we've determined that all of the values are
equal), it's often more efficient to only compute that one value
with scalar operations than to compute the whole vector's worth and
then just use one value. This function tries to rewrite a vector
computation to the scalar equivalent, if possible.
(Partial work-around to http://llvm.org/bugs/show_bug.cgi?id=11775.)
Note that sometimes this is the wrong thing to do--if we need the entire
vector value for other purposes, for example.
The main issue is that they end up generating a number of smaller
vector ops (e.g. 4-wide and 8-wide on the 16-wide generic target,
which the examples/intrinsics implementations don't currently
support.
This fixes a number of failing tests for now; it may be worth
generalizing the stuff in examples/intrinsics at some point,
since as a general principle, e.g. if generating LLVM IR output,
the coalescing optimizations are still desirable.
Issue #175.
There are two related optimizations that happen now. (These
currently only apply for gathers where the mask is known to be
all on, and to gathers that are accessing 32-bit sized elements,
but both of these may be generalized in the future.)
First, for any single gather, we are now more flexible in mapping it
to individual memory operations. Previously, we would only either map
it to a general gather (one scalar load per SIMD lane), or an
unaligned vector load (if the program instances could be determined
to be accessing a sequential set of locations in memory.)
Now, we are able to break gathers into scalar, 2-wide (i.e. 64-bit),
4-wide, or 8-wide loads. Further, we now generate code that shuffles
these loads around. Doing fewer, larger loads in this manner, when
possible, can be more efficient.
Second, we can coalesce memory accesses across multiple gathers. If
we have a series of gathers without any memory writes in the middle,
then we try to analyze their reads collectively and choose an efficient
set of loads for them. Not only does this help if different gathers
reuse values from the same location in memory, but it's specifically
helpful when data with AOS layout is being accessed; in this case,
we're often able to generate wide vector loads and appropriate shuffles
automatically.
Previously, we'd pick one lane and generate a regular store for its value.
This was the wrong thing to do, since we also should have been checking
that the mask was on (for the lane that was chosen). This bug didn't
become evident until the scalar target was added, since many stores fall
into this case with that target.
Now, we just leave those as regular scatters.
Fixes most of the failing tests for the scalar target listed in issue #167.
Don't issue warnings about all instances writing to the same location if
there is only one program instance in the gang.
Be sure to report that all values are equal in one-element vectors in
LLVMVectorValuesAllEqual().
Issue #166.
When we're able to turn a general gather/scatter into the "base + offsets"
form, we now try to extract out any constant components of the offsets and
then pass them as a separate parameter to the gather/scatter function
implementation.
We then in turn carefully emit code for the addressing calculation so that
these constant offsets match LLVM's patterns to detect this case, such that
we get the constant offsets directly encoded in the instruction's addressing
calculation in many cases, saving arithmetic instructions to do these
calculations.
Improves performance of stencil by ~15%. Other workloads unchanged.
Effectively, the patterns that detected when given a gather or
scatter in base+offsets form, the offsets were actually a multiple
of 2/4/8, were no longer working.
This change not only fixes this, but also expands the set of
patterns that are matched by this. For example, given offsets of
the form 4*v1 + 16*v2, it identifies a scale of 4 and new offsets
of v1 + 4*v2.
This fix makes the volume renderer run 1.19x faster, and noise 1.54x
faster.
In this case, we now emit calls to potentially-specialized functions for the
left/right shifts that take a single integer value for the shift amount. These
in turn can be matched to the corresponding intrinsics for the SSE target.
Issue #145.
More specifically, we do a proper masked store (rather than a load-
blend-store) unless we can determine that we're accessing a stack-allocated
"varying" variable. This fixes a number of nefarious bugs where given
code like:
uniform float a[21];
foreach (i = 0 … 21)
a[i] = 0;
We'd use a blend and in turn read past the end of a[] in the last
iteration.
Also made slight changes to inlining in aobench; this keeps compiles
to ~5s, versus ~45s without them (with this change).
Fixes issue #160.