Don't issue warnings about all instances writing to the same location if
there is only one program instance in the gang.
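For illustration, a hypothetical snippet (the function name is made up) of the
kind of code that draws this warning on wider targets: every running instance
scatters to the same element, which is harmless when the gang holds a single
instance.

    export void write_first(uniform int a[]) {
        // All program instances write to a[0]; with a one-instance
        // gang there can be no conflict, so no warning is needed.
        a[0] = programIndex;
    }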
Be sure that LLVMVectorValuesAllEqual() reports that all values are equal
for one-element vectors.
Issue #166.
We now follow C's approach to evaluating the logical && and || operators:
we don't evaluate the second expression in the operator if the value of the
first one determines the overall result. Thus, these can now be used
idiomatically, as in (index < limit && array[index] > 0) and such.
For varying expressions, the mask is set appropriately when evaluating
the second expression.
(For expressions that can be determined to be both simple and safe to
evaluate with the mask all off, we still evaluate both sides and compute
the logical op result directly, which saves a number of branches and tests.
However, the effect of this should never be visible to the programmer.)
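A minimal sketch of the idiom this enables (the function and parameter names
are illustrative): the second operand of && is evaluated only for the program
instances where the first operand holds, so the out-of-range access never
happens.

    export void mark_positive(uniform int array[], uniform int n,
                              uniform int out[]) {
        foreach (i = 0 ... n) {
            int index = i * 2;  // varying; may run past n for some instances
            // array[index] is read only under the mask of instances
            // where index < n holds.
            out[i] = (index < n && array[index] > 0) ? 1 : 0;
        }
    }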
Issue #4.
Don't include declarations of malloc/free in the generated code (get
the standard ones from system headers instead).
Add a (uint8_t *) cast to the result of calls to malloc: malloc returns a
void *, and C++ (unlike C) requires an explicit cast to convert that to
another pointer type.
When we're able to turn a general gather/scatter into the "base + offsets"
form, we now try to extract out any constant components of the offsets and
then pass them as a separate parameter to the gather/scatter function
implementation.
We then carefully emit the code for the addressing calculation so that these
constant offsets match the patterns LLVM uses to detect this case; in many
cases this gets the constant offsets encoded directly in the instruction's
addressing calculation, saving the arithmetic instructions that would
otherwise compute them.
Improves performance of stencil by ~15%. Other workloads unchanged.
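As a hedged example of the kind of access this helps (names are
illustrative), the +1 below contributes a constant component to each
instance's gather offset; it can now be folded directly into the
instruction's addressing calculation:

    export void gather_next(uniform float a[], uniform int idx[],
                            uniform float out[], uniform int n) {
        foreach (i = 0 ... n)
            // Gather with base a, varying offsets idx[i], and a
            // constant offset of +1 element.
            out[i] = a[idx[i] + 1];
    }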
We now do a single atomic hardware swap and then effectively perform swaps
among the running program instances, so that the result is the same as if
each instance had executed its own hardware swap in some particular serial
order.
Also cleaned up __atomic_swap_uniform_* built-in implementations
to not take the mask, which they weren't using anyway.
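A sketch of the user-visible semantics via ISPC's atomic_swap_global()
standard library routine (the wrapper name and parameters are illustrative):
each instance receives a return value consistent with some serialization of
per-instance swaps, though only one hardware swap is issued per gang.

    export void swap_all(uniform int * uniform p, uniform int vals[],
                         uniform int old[], uniform int n) {
        foreach (i = 0 ... n)
            // One hardware swap per gang, not one per instance; the
            // per-instance results match some serial ordering.
            old[i] = atomic_swap_global(p, vals[i]);
    }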
Finishes Issue #56.
Effectively, the patterns that detected when the offsets of a gather or
scatter in base+offsets form were actually a multiple of 2/4/8 were no
longer working.
This change not only fixes that, but also expands the set of patterns that
are matched. For example, given offsets of the form 4*v1 + 16*v2, it
identifies a scale of 4 and new offsets of v1 + 4*v2.
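A hypothetical gather that hits this pattern (names are illustrative): with
float data and a row stride of 4 elements, the byte offsets below are
4*col + 16*row, so the matcher factors out a scale of 4 and leaves offsets
of col + 4*row.

    export void gather_2d(uniform float grid[], uniform int row[],
                          uniform int col[], uniform float out[],
                          uniform int n) {
        foreach (i = 0 ... n)
            out[i] = grid[col[i] + 4 * row[i]];
    }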
This fix makes the volume renderer run 1.19x faster, and noise 1.54x
faster.
For shifts by an amount that is uniform across the gang, we now emit calls to
potentially-specialized functions for the left/right shifts that take a
single integer value for the shift amount. These in turn can be matched to
the corresponding intrinsics for the SSE target.
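A minimal sketch of code that benefits (names are illustrative): the shift
amount is the same for all program instances, so the vector shift intrinsics
that take a single count apply.

    export void scale_down(uniform int a[], uniform int n,
                           uniform int shift) {
        foreach (i = 0 ... n)
            a[i] = a[i] >> shift;  // varying values, uniform shift amount
    }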
Issue #145.
Previously, when we had a switch statement with a uniform switch condition
but a 'break' statement that was under varying control flow inside the
switch, we'd promote the switch condition to be varying so that the
break would work correctly.
Now, we leave the condition as uniform and are thus able to use the
more-efficient LLVM switch instruction in this case.
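A sketch of the case in question (names are illustrative): 'mode' is uniform,
while the first 'break' executes under the varying 'if'; the condition now
stays uniform and the switch can still be lowered to an LLVM switch
instruction.

    export void apply(uniform int mode, uniform float v[], uniform int n) {
        foreach (i = 0 ... n) {
            switch (mode) {      // uniform switch condition
            case 0:
                if (v[i] < 0)    // varying condition
                    break;       // break under varying control flow
                v[i] = -v[i];
                break;
            default:
                v[i] = 0;
            }
        }
    }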
Issue #156.
Specifically, don't use a vector select for the masked store blend there,
but instead emit calls to undefined __masked_store_blend_*() functions.
Added implementations of these functions to sse4.h and generic-16.h in
examples/intrinsics. (Calls to these will never be generated with LLVM 3.1.)
More specifically, we do a proper masked store (rather than a load-
blend-store) unless we can determine that we're accessing a stack-allocated
"varying" variable. This fixes a number of nefarious bugs where given
code like:
    uniform float a[21];
    foreach (i = 0 ... 21)
        a[i] = 0;
We'd use a blend and in turn read past the end of a[] in the last
iteration.
Also made slight changes to inlining in aobench; with this change applied,
they keep compiles to ~5s, versus ~45s without them.
Fixes issue #160.