The ? : operator now short-circuits evaluation of the expressions following
the boolean test when the test is varying. (It already did this for
uniform tests.)
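For illustration, a minimal standalone C++ sketch of the idea (WIDTH,
eval_true, and eval_false are invented for the example; this is only an
emulation of the per-lane behavior, not the generated code): each side of a
varying (test ? a : b) is evaluated only for the lanes that select it, and
skipped entirely when no lane needs it.

    #include <cstdio>

    #define WIDTH 4

    static int eval_true(int lane)  { std::printf("true arm, lane %d\n", lane);  return 1; }
    static int eval_false(int lane) { std::printf("false arm, lane %d\n", lane); return 0; }

    static void varying_select(const bool test[WIDTH], int out[WIDTH]) {
        bool anyTrue = false, anyFalse = false;
        for (int i = 0; i < WIDTH; ++i) {
            if (test[i]) anyTrue = true; else anyFalse = true;
        }
        if (anyTrue)                        // true arm skipped entirely if no lane needs it
            for (int i = 0; i < WIDTH; ++i)
                if (test[i]) out[i] = eval_true(i);
        if (anyFalse)                       // false arm skipped entirely if no lane needs it
            for (int i = 0; i < WIDTH; ++i)
                if (!test[i]) out[i] = eval_false(i);
    }

    int main() {
        bool test[WIDTH] = {true, true, true, true};
        int out[WIDTH];
        varying_select(test, out);          // "false arm" is never printed
        return 0;
    }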
Issue #169.
This gets the deferred example closer to working with the scalar target, but
there are still some issues (partially in gamma correction / final clamping,
it seems).
This fix causes a ~0.5% performance degradation with e.g. the AVX target,
but it's not clear that it's worth maintaining a separate code path just to
avoid losing that small amount of performance.
(Partially addresses issue #167)
Previously, we'd pick one lane and generate a regular store for its value.
This was the wrong thing to do: we should also have been checking
that the mask was on for the lane that was chosen. This bug didn't
become evident until the scalar target was added, since many stores fall
into this case with that target.
Now, we just leave those as regular scatters.
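As a standalone illustration of why the old lowering was wrong (plain C++
with invented names; WIDTH and the bitmask layout are assumptions, not the
actual code generation), a store to the common location has to be predicated
on the per-lane mask rather than issued unconditionally for one lane:

    #include <cstdint>
    #include <cstdio>

    #define WIDTH 4

    // All lanes scatter to the same address; only lanes whose mask bit is set
    // may actually write.
    static void scatter_same_location(int *addr, const int values[WIDTH],
                                      std::uint32_t mask) {
        // Old, buggy idea: *addr = values[0]; -- wrong when lane 0 is masked off.
        for (int lane = 0; lane < WIDTH; ++lane)
            if (mask & (1u << lane))
                *addr = values[lane];
    }

    int main() {
        int x = 7;
        int values[WIDTH] = {1, 2, 3, 4};
        scatter_same_location(&x, values, 0x0);   // no lanes active: x stays 7
        std::printf("%d\n", x);
        return 0;
    }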
Fixes most of the failing tests for the scalar target listed in issue #167.
Also reworked TypeCastExpr::GetConstant() to just forward the request along;
the code that previously lived there to handle uniform->varying smears of
function pointers has moved to FunctionSymbolExpr::GetConstant().
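For illustration only, a self-contained C++ mock of the delegation (Constant,
Expr, and the simplified GetConstant() signature are stand-ins, not the
actual ispc AST interfaces):

    struct Constant { int value; };

    struct Expr {
        virtual ~Expr() {}
        virtual const Constant *GetConstant() const { return nullptr; }
    };

    // Operand-specific handling (e.g. the uniform->varying smear of a function
    // pointer) lives in the operand's own class...
    struct FunctionSymbolExpr : Expr {
        Constant c{42};
        const Constant *GetConstant() const override { return &c; }
    };

    // ...so the cast node simply forwards the request along.
    struct TypeCastExpr : Expr {
        const Expr *expr;
        explicit TypeCastExpr(const Expr *e) : expr(e) {}
        const Constant *GetConstant() const override {
            return expr ? expr->GetConstant() : nullptr;
        }
    };

    int main() {
        FunctionSymbolExpr fn;
        TypeCastExpr cast(&fn);
        return cast.GetConstant()->value == 42 ? 0 : 1;
    }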
Fixes issue #168.
Don't issue warnings about all instances writing to the same location if
there is only one program instance in the gang.
Also, make LLVMVectorValuesAllEqual() report that all values are equal for
one-element vectors.
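A trivial standalone illustration of the one-element case (plain C++ over
std::vector, not the LLVM-value-based routine in the compiler):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    static bool valuesAllEqual(const std::vector<int> &v) {
        // A one-element vector trivially has all of its values equal.
        if (v.size() == 1)
            return true;
        for (std::size_t i = 1; i < v.size(); ++i)
            if (v[i] != v[0])
                return false;
        return true;
    }

    int main() {
        std::printf("%d\n", valuesAllEqual({3}) ? 1 : 0);   // prints 1
        return 0;
    }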
Issue #166.
We now follow C's approach to evaluating the logical && and || operators:
the second expression isn't evaluated if the value of the first one
determines the overall result. Thus, they can now be used idiomatically,
as in (index < limit && array[index] > 0).
For varying expressions, the mask is set appropriately when evaluating
the second expression.
(For expressions that can be determined to be both simple and safe to
evaluate with the mask all off, we still evaluate both sides and compute
the logical op result directly, which saves a number of branches and tests.
However, the effect of this should never be visible to the programmer.)
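For illustration, a standalone C++ sketch of the mask-based evaluation
(WIDTH and gang_and are invented names emulating a gang of program
instances; this is only an emulation of the per-lane behavior, not the
generated code):

    #include <cstdio>

    #define WIDTH 4

    // Emulates evaluating (index < limit && array[index] > 0) across the gang:
    // the right-hand side runs only for lanes where the left-hand side is true,
    // and is skipped entirely if no lane needs it.
    static void gang_and(const int index[WIDTH], const int *array, int limit,
                         int result[WIDTH]) {
        int lhs[WIDTH];
        bool anyTrue = false;
        for (int i = 0; i < WIDTH; ++i) {
            lhs[i] = (index[i] < limit) ? 1 : 0;
            anyTrue = anyTrue || lhs[i];
            result[i] = 0;
        }
        if (!anyTrue)
            return;                        // short-circuit: rhs never evaluated
        for (int i = 0; i < WIDTH; ++i)
            if (lhs[i])                    // rhs evaluated only under the mask
                result[i] = (array[index[i]] > 0) ? 1 : 0;
    }

    int main() {
        int array[4] = {1, 0, 3, -2};
        int index[WIDTH] = {0, 2, 5, 7};   // lanes 2 and 3 would read out of bounds
        int result[WIDTH];
        gang_and(index, array, 4, result);
        for (int i = 0; i < WIDTH; ++i)
            std::printf("%d ", result[i]); // prints: 1 1 0 0
        std::printf("\n");
        return 0;
    }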
Issue #4.
Don't include declarations of malloc/free in the generated code (get
the standard ones from system headers instead).
Also add a cast to (uint8_t *) on the result of calls to malloc; C++ requires
the explicit cast since malloc returns a void *.
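A minimal example of the pattern in question (the buffer size is arbitrary):

    #include <cstdint>
    #include <cstdlib>

    int main() {
        // C++ won't implicitly convert the void * returned by malloc, so an
        // explicit cast is required.
        std::uint8_t *buf = (std::uint8_t *)std::malloc(1024);
        std::free(buf);
        return 0;
    }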
When we're able to turn a general gather/scatter into the "base + offsets"
form, we now try to extract any constant components of the offsets and pass
them as a separate parameter to the gather/scatter function implementation.
We then emit the addressing calculation carefully so that these constant
offsets match the patterns LLVM uses to detect this case; in many cases the
constant offsets then end up encoded directly in the instruction's addressing
calculation, saving the arithmetic instructions that would otherwise compute
them.
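A standalone C++ sketch of the decomposition (invented names; the compiler
performs this on its IR, not on C++ source): the per-lane offsets are split
into a varying part plus a compile-time constant, and the constant part can
then be folded into each lane's address as a fixed displacement.

    #include <cstdint>
    #include <cstdio>

    #define WIDTH 4

    // General form would add the full per-lane offset at run time:
    //     out[i] = *(float *)((char *)base + offsets[i]);
    // With the constant component split out, the varying part stays as-is and
    // constOffset can be encoded directly in each address calculation.
    static void gather_const_offset(const float *base, const std::int64_t offsets[WIDTH],
                                    std::int64_t constOffset, float out[WIDTH]) {
        for (int i = 0; i < WIDTH; ++i)
            out[i] = *(const float *)((const char *)base + offsets[i] + constOffset);
    }

    int main() {
        float data[16];
        for (int i = 0; i < 16; ++i)
            data[i] = (float)i;
        // Byte offsets {16*i + 4} split into a varying part {16*i} plus the constant 4.
        std::int64_t varyingPart[WIDTH] = {0, 16, 32, 48};
        float out[WIDTH];
        gather_const_offset(data, varyingPart, 4, out);
        for (int i = 0; i < WIDTH; ++i)
            std::printf("%g ", out[i]);    // prints: 1 5 9 13
        std::printf("\n");
        return 0;
    }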
Improves performance of stencil by ~15%. Other workloads unchanged.
We now do a single hardware atomic swap and then effectively exchange values
among the running program instances, so that the result is the same as if
each instance had performed its own hardware swap in some particular order.
Also cleaned up __atomic_swap_uniform_* built-in implementations
to not take the mask, which they weren't using anyway.
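A standalone C++ sketch of the idea (invented names; std::atomic stands in
for the hardware operation and WIDTH for the gang size): one hardware swap
stores the last active lane's value, and results are then handed along the
chain of active lanes so each one sees what it would have under a serial
ordering of per-lane swaps.

    #include <atomic>
    #include <cstdio>

    #define WIDTH 4

    static void gang_atomic_swap(std::atomic<int> *ptr, const int val[WIDTH],
                                 const bool mask[WIDTH], int result[WIDTH]) {
        // The value of the last active lane is what ends up in memory.
        int last = -1;
        for (int i = 0; i < WIDTH; ++i)
            if (mask[i])
                last = i;
        if (last < 0)
            return;                             // no active lanes: nothing to do

        int old = ptr->exchange(val[last]);     // the single hardware swap

        // Hand values down the chain of active lanes: the first active lane
        // gets the original memory contents, each later active lane gets the
        // value its predecessor "wrote".
        int prev = old;
        for (int i = 0; i < WIDTH; ++i) {
            if (!mask[i])
                continue;
            result[i] = prev;
            prev = val[i];
        }
    }

    int main() {
        std::atomic<int> loc(100);
        int vals[WIDTH] = {1, 2, 3, 4};
        bool mask[WIDTH] = {true, false, true, true};
        int result[WIDTH] = {0, 0, 0, 0};
        gang_atomic_swap(&loc, vals, mask, result);
        // loc ends up as 4; results are {100, 0, 1, 3}, as if the active lanes
        // had performed their swaps one after another.
        std::printf("%d: %d %d %d %d\n", loc.load(),
                    result[0], result[1], result[2], result[3]);
        return 0;
    }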
Finishes Issue #56.