The main issue is that they end up generating a number of smaller
vector ops (e.g. 4-wide and 8-wide on the 16-wide generic target),
which the examples/intrinsics implementations don't currently
support.
This fixes a number of failing tests for now; it may be worth
generalizing the implementations in examples/intrinsics at some point,
since the coalescing optimizations are still desirable as a general
principle (e.g. when generating LLVM IR output).
Issue #175.
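A rough sketch of what generalizing examples/intrinsics might involve:
a 4-wide masked load in the style of the generic target's C++ intrinsics
headers. All names here (__vec4_i32, __vec4_i1, __masked_load_i32_4) are
illustrative assumptions, not the headers' actual API.

    #include <cstdint>

    // Hypothetical 4-wide types in the style of examples/intrinsics/.
    struct __vec4_i32 { int32_t v[4]; };
    struct __vec4_i1  { bool v[4]; };

    static inline __vec4_i32 __masked_load_i32_4(const int32_t *ptr,
                                                 __vec4_i1 mask) {
        __vec4_i32 result = {};
        for (int i = 0; i < 4; ++i)
            if (mask.v[i])              // touch memory only for active lanes
                result.v[i] = ptr[i];
        return result;
    }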
There are two related optimizations that happen now. (These
currently apply only to gathers where the mask is known to be
all on and that access 32-bit elements, but both restrictions
may be relaxed in the future.)
First, for any single gather, we are now more flexible in mapping it
to individual memory operations. Previously, we would only map it
either to a general gather (one scalar load per SIMD lane) or to an
unaligned vector load (if the program instances could be determined
to be accessing a sequential set of locations in memory).
Now, we are able to break gathers into scalar, 2-wide (i.e. 64-bit),
4-wide, or 8-wide loads, and we generate code that shuffles the loaded
values into the requested lanes. Doing fewer, larger loads in this
manner, when possible, can be more efficient.
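As an illustration of this flexibility (a C++ stand-in for the LLVM IR
the compiler actually emits; the index pattern is made up for the
example): an 8-wide gather whose indices are known to cover two
4-element runs can be done as two 4-wide loads plus a lane shuffle,
rather than eight scalar loads.

    #include <cstdint>

    // Suppose the gather's indices are known to be {3,2,1,0, 11,10,9,8}:
    // two reversed 4-element runs.
    void gather8AsTwoVec4Loads(const int32_t *base, int32_t out[8]) {
        int32_t all[8];
        // Two 4-wide (128-bit) loads instead of eight scalar loads.
        for (int i = 0; i < 4; ++i) all[i]     = base[i];      // elements 0..3
        for (int i = 0; i < 4; ++i) all[4 + i] = base[8 + i];  // elements 8..11
        // Shuffle the loaded lanes into the order the gather requested.
        static const int perm[8] = { 3, 2, 1, 0, 7, 6, 5, 4 };
        for (int i = 0; i < 8; ++i)
            out[i] = all[perm[i]];
    }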
Second, we can coalesce memory accesses across multiple gathers. If
we have a series of gathers without any memory writes in the middle,
then we try to analyze their reads collectively and choose an efficient
set of loads for them. Not only does this help if different gathers
reuse values from the same location in memory, but it's specifically
helpful when data with AOS layout is being accessed; in this case,
we're often able to generate wide vector loads and appropriate shuffles
automatically.
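A C++ stand-in for the AOS case (again, the compiler emits LLVM IR;
this just models the generated memory operations): three gathers
reading p[i].x, p[i].y, and p[i].z for i = 0..3 touch twelve
consecutive floats, so three 4-wide loads plus shuffles replace twelve
scalar loads.

    struct Point { float x, y, z; };

    void loadXYZCoalesced(const Point *p, float x[4], float y[4], float z[4]) {
        const float *f = &p[0].x;           // twelve consecutive floats
        float v0[4], v1[4], v2[4];
        for (int i = 0; i < 4; ++i) v0[i] = f[i];       // x0 y0 z0 x1
        for (int i = 0; i < 4; ++i) v1[i] = f[4 + i];   // y1 z1 x2 y2
        for (int i = 0; i < 4; ++i) v2[i] = f[8 + i];   // z2 x3 y3 z3
        // Shuffle the loaded lanes back into SOA order.
        x[0] = v0[0]; x[1] = v0[3]; x[2] = v1[2]; x[3] = v2[1];
        y[0] = v0[1]; y[1] = v1[0]; y[2] = v1[3]; y[3] = v2[2];
        z[0] = v0[2]; z[1] = v1[1]; z[2] = v2[0]; z[3] = v2[3];
    }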
Specifically, if both of the expressions are compile-time constants
and the condition is a varying compile-time constant (even if not
all true or all false), then we can assemble a compile-time constant
result.
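A rough model of the folding (the real code works on the compiler's
constant-expression representation, not on arrays; the function name is
made up):

    // Each output lane picks the corresponding lane of one of the two
    // constant operands, so the result is itself fully constant.
    static void foldSelect(int n, const bool *cond, const int *a,
                           const int *b, int *out) {
        for (int i = 0; i < n; ++i)
            out[i] = cond[i] ? a[i] : b[i];
    }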
Add a number of additional error cases in the grammar.
Enable bison's extended error reporting, to get better messages about the
context of errors and the expected (but not found) tokens at errors.
Improve the printing of these by providing an implementation of yytnamerr
that rewrites things like "TOKEN_MUL_ASSIGN" to "*=" in error messages.
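A minimal sketch of such a yytnamerr(), assuming bison's usual contract
(return the needed length when yyres is NULL, otherwise copy the name
and return its length); the remap table is abbreviated, and entries
beyond the first are illustrative:

    #include <cstring>

    static size_t yytnamerr(char *yyres, const char *yystr) {
        static const struct { const char *token, *pretty; } remap[] = {
            { "TOKEN_MUL_ASSIGN", "'*='" },
            { "TOKEN_ADD_ASSIGN", "'+='" },   // illustrative entry
        };
        const char *name = yystr;
        for (const auto &r : remap)
            if (!strcmp(yystr, r.token)) { name = r.pretty; break; }
        if (yyres != NULL)
            strcpy(yyres, name);
        return strlen(name);   // length needed / length written
    }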
Print the source location (using Error()) when yyerror() is called; wiring
this up seems to require no longer building a 'pure parser' but having
yylloc as a global, which in turn meant updating all of the uses of
it (which previously accessed it as a pointer).
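A sketch of the wiring, with stand-ins for ispc's SourcePos and Error(),
whose exact shapes are assumed here:

    #include <cstdarg>
    #include <cstdio>

    // Stand-in for ispc's SourcePos; the real fields are assumed.
    struct SourcePos { const char *file; int line, column; };

    // Stand-in for ispc's Error(); assumed to take a position plus
    // printf-style arguments.
    static void Error(SourcePos pos, const char *fmt, ...) {
        fprintf(stderr, "%s:%d:%d: Error: ", pos.file, pos.line, pos.column);
        va_list args;
        va_start(args, fmt);
        vfprintf(stderr, fmt, args);
        va_end(args);
        fputc('\n', stderr);
    }

    SourcePos yylloc;   // a global now that the parser is no longer "pure"

    void yyerror(const char *msg) {
        Error(yylloc, "%s", msg);   // attach the current source location
    }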
Updated a number of tests_errors for resulting changesin error text.
When the --fuzz-test command-line option is given, the input program
will be randomly perturbed by the lexer in an effort to trigger
assertions or crashes in the compiler (neither of which should ever
happen, even for malformed programs).
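A minimal sketch of what lexer-level perturbation can look like; the
flag variable, token table, and helper below are hypothetical, not the
compiler's actual names:

    #include <cstdlib>

    static bool gFuzzTestEnabled = false;   // set when --fuzz-test is given
    static const int kTokenTable[] = { ';', '+', '(' /* ..., every token */ };
    static const int kNumTokens = sizeof(kTokenTable) / sizeof(kTokenTable[0]);

    static int maybePerturbToken(int token) {
        if (gFuzzTestEnabled && (rand() % 100) == 0)   // mutate ~1% of tokens
            return kTokenTable[rand() % kNumTokens];   // hand back a random one
        return token;
    }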
Don't let the preprocessor remove comments anymore, so that the rules
in lex.ll can handle them. Fix lCComment() to update the source
position as it eats characters in comments.
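A sketch of the position-tracking fix, with getNextChar() standing in
for however the scanner reads its input, and a simplified line/column
pair standing in for the real source position:

    #include <cstdio>

    struct SourcePos { int line, column; };

    extern int getNextChar();   // hypothetical input helper

    static void lCComment(SourcePos *pos) {
        int c, prev = 0;
        while ((c = getNextChar()) != EOF) {
            if (c == '\n') {        // keep the reported position accurate
                ++pos->line;
                pos->column = 0;
            } else
                ++pos->column;
            if (prev == '*' && c == '/')
                return;             // found the end of the comment
            prev = c;
        }
        // hit EOF inside the comment; the caller reports the error
    }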
? : now short-circuits evaluation of the expressions following
the boolean test for varying test types. (It already did this
for uniform tests).
Issue #169.
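A C++ model of the per-lane rule (names illustrative): each expression
is evaluated only for the lanes that select it, so an arm that no lane
takes is never evaluated.

    static void selectShortCircuit(int n, const bool *test,
                                   int (*expr1)(int), int (*expr2)(int),
                                   int *out) {
        for (int i = 0; i < n; ++i)
            if (test[i])
                out[i] = expr1(i);   // runs only for lanes where test is true
        for (int i = 0; i < n; ++i)
            if (!test[i])
                out[i] = expr2(i);   // skipped entirely if all lanes are true
    }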
This gets the deferred example closer to working with the scalar target,
but there are still some issues (at least partly in gamma correction /
final clamping, it seems).
This fix causes a ~0.5% performance degradation with e.g. the AVX target,
though it's not clear that avoiding such a small loss would be worth
a separate code path.
(Partially addresses issue #167)