Previously, we'd bitcast e.g. a vector of floats to a vector of i32s and then
use the i32 variant of masked_load/masked_store/gather/scatter. Now, we have
separate float/double variants of each of those.
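A minimal sketch of what the per-type interface might look like, using
hypothetical declarations and stand-in vector types (the names and signatures
in the actual generated code may differ):
    #include <cstdint>

    struct __vec4_i1  { bool    v[4]; };  // stand-in per-lane mask type
    struct __vec4_i32 { int32_t v[4]; };
    struct __vec4_f   { float   v[4]; };

    // old: float data had to be bitcast to i32 and use the i32 variant
    __vec4_i32 __masked_load_i32(void *ptr, __vec4_i1 mask);
    // new: dedicated float (and double) variants, no bitcast needed
    __vec4_f   __masked_load_float(void *ptr, __vec4_i1 mask);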
Previously, we were trying to take a uniform seed and then shuffle that
around to initialize the state for each of the program instances. This
was becoming increasingly untenable and brittle.
Now a varying seed is expected and used.
There were a number of places where we left-shifted 1 by a lane index;
these failed when the shift went beyond 32 bits. Fixed by shifting the
64-bit constant value 1ull instead.
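A minimal before/after sketch in C++ (the helper name is illustrative):
    #include <cstdint>

    static inline uint64_t laneBit(int lane) {
        // return 1 << lane;   // old: undefined / drops the bit once lane >= 32
        return 1ull << lane;   // fixed: 64-bit shift is valid for lanes up to 63
    }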
Rather than XOR'ing with a temporary 'all-on' vector, we call
__not. Also, we call out to __and_not1 and __and_not2, for an
AND where the first or second operand, respectively, has had
NOT applied to it.
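A minimal sketch of the intended semantics, using a hypothetical stand-in
mask type (the real implementations operate on the target's vector/mask
types):
    #include <cstdint>

    struct Mask { uint64_t bits; };

    static inline Mask __not(Mask a)              { return { ~a.bits }; }
    // AND with NOT applied to the first operand:
    static inline Mask __and_not1(Mask a, Mask b) { return { ~a.bits & b.bits }; }
    // AND with NOT applied to the second operand:
    static inline Mask __and_not2(Mask a, Mask b) { return { a.bits & ~b.bits }; }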
Now, the __smear* functions in generated C++ code have an unused first
parameter of the desired return type; this allows us to have headers
that include variants of __smear for multiple target widths. (This
approach is necessary since we can't overload by return type in C++.)
Issue #256.
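A minimal sketch of the __smear dummy-parameter trick in isolation, with
stand-in vector types (the real headers use the target's types, with one
__smear_* function per element type):
    struct Vec4f  { float v[4];  };
    struct Vec16f { float v[16]; };

    // The first argument is never read; it only selects the overload, since
    // C++ can't overload on return type alone.
    static inline Vec4f __smear_float(Vec4f, float f) {
        Vec4f r;
        for (int i = 0; i < 4; ++i) r.v[i] = f;
        return r;
    }
    static inline Vec16f __smear_float(Vec16f, float f) {
        Vec16f r;
        for (int i = 0; i < 16; ++i) r.v[i] = f;
        return r;
    }
    // usage in generated code (illustrative): __smear_float(Vec16f(), 1.0f)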
Now, when we're printing out a constant vector value, we check to see
if it's a splat and, if so, call out to one of the __splat_* functions in
the generated code.
Now, if a struct member has an explicit 'uniform' or 'varying'
qualifier, then that member has that variability, regardless of the
variability of the struct itself. Members without 'uniform' or
'varying' have unbound variability and in turn inherit the variability
of the struct.
As a result of this, now structs can properly be 'varying' by default,
just like all the other types, while still having sensible semantics.
This gets the 'deferred' example closer to working with the scalar target,
but there are still some issues. (Partially in gamma correction / final
clamping, it seems.)
This fix causes a ~0.5% performance degradation with e.g. the AVX target,
though it's not clear that it's worth having a separate code path in order to
not lose this small amount of perf.
(Partially addresses issue #167)
When we're able to turn a general gather/scatter into the "base + offsets"
form, we now try to extract out any constant components of the offsets and
then pass them as a separate parameter to the gather/scatter function
implementation.
We then emit the addressing calculation carefully so that these constant
offsets match the patterns LLVM uses to detect this case; in many cases the
constant offsets are then encoded directly in the instruction's addressing
computation, saving the arithmetic instructions that would otherwise compute
them.
Improves performance of stencil by ~15%. Other workloads unchanged.
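A per-lane sketch of why this helps (illustrative code, not the actual
builtin signatures):
    #include <cstdint>

    // With the offset split into a varying part plus a constant part, the
    // per-lane address is base + varyingOffset + constOffset; that fits x86's
    // base + index*scale + displacement addressing, so the constant part
    // needs no separate add.
    static inline float gatherLane(const float *base, int32_t varyingOffset,
                                   int32_t constOffset) {
        return base[varyingOffset + constOffset];
    }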
For shifts where the shift amount is a single uniform integer value, we now
emit calls to potentially-specialized left/right shift functions that take
that scalar amount. These in turn can be matched to the corresponding
intrinsics for the SSE target.
Issue #145.
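A hypothetical sse4.h-style implementation of one such shift function (the
actual names in the header may differ):
    #include <emmintrin.h>

    // Every lane is shifted by the same scalar count, so a single SSE shift
    // instruction suffices instead of shifting lane by lane.
    static inline __m128i __shl_uniform_i32(__m128i v, int count) {
        return _mm_sll_epi32(v, _mm_cvtsi32_si128(count));
    }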
Specifically, don't use a vector select for the masked store blend there,
but instead emit calls to undefined __masked_store_blend_*() functions.
Added implementations of these functions to sse4.h and generic-16.h
in examples/intrinsics. (Calls to these will never be generated with
LLVM 3.1.)
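An illustrative sse4.h-style implementation of one such variant (the exact
signatures in examples/intrinsics may differ):
    #include <smmintrin.h>  // SSE4.1

    static inline void __masked_store_blend_i32(__m128i *ptr, __m128i val,
                                                __m128i mask) {
        __m128i old = _mm_loadu_si128(ptr);  // read the existing contents
        // keep 'old' where the mask is off, take 'val' where it is on
        _mm_storeu_si128(ptr, _mm_blendv_epi8(old, val, mask));
    }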
More specifically, we do a proper masked store (rather than a load-
blend-store) unless we can determine that we're accessing a stack-allocated
"varying" variable. This fixes a number of nefarious bugs where given
code like:
uniform float a[21];
foreach (i = 0 … 21)
a[i] = 0;
We'd use a blend and in turn read past the end of a[] in the last
iteration.
Also made slight changes to inlining in aobench; with this change, those
keep compile time to ~5s, versus ~45s without them.
Fixes issue #160.
ispc now supports goto, but only under uniform control flow--i.e.
it must be possible for the compiler to statically determine that
all program instances will follow the goto. An error is issued at
compile time if a goto is used when this is not the case.
The examples' makefiles are all based on a common examples/common.mk file,
so the individual makefiles are now quite simple.
The common.mk file also provides targets to build the examples using C++
output with the generic-16.h or sse4.h files. These targets aren't built by
default, but are built when 'make all' is run.
The compiler now supports an --emit-c++ option, which generates generic
vector C++ code. To actually compile this code, the user must provide
C++ code that implements a variety of types and operations (e.g. adding
two floating-point vector values together, comparing them, etc).
There are two examples of this required code in examples/intrinsics:
generic-16.h is a "generic" 16-wide implementation that does everything
required with scalar math; it's useful for demonstrating the requirements
of the implementation. Then, sse4.h shows a simple implementation of an
SSE4 target that maps the emitted function calls to SSE intrinsics.
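A minimal sketch of the kind of code such a header must supply, in the
generic-16.h style (the type and function names here are illustrative, not
the actual ones the compiler emits calls to):
    struct Vec16f { float v[16]; };

    // 16-wide float addition done with scalar math, as generic-16.h does
    static inline Vec16f vecAdd(Vec16f a, Vec16f b) {
        Vec16f r;
        for (int i = 0; i < 16; ++i)
            r.v[i] = a.v[i] + b.v[i];
        return r;
    }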
When using these example implementations with the ispc test suite,
all but one or two tests pass with gcc and clang on Linux and OSX.
There are currently ~10 failures with icc on Linux, and ~50 failures with
MSVC 2010. (To be fixed in coming days.)
Performance varies: when running the examples through the sse4.h
implementation, some (e.g. options) have the same performance as when
compiled with --target=sse4 from ispc directly, while noise is 12% slower,
rt is 26% slower, and aobench is 2.2x slower. The details of this haven't
yet been carefully investigated, but will be in coming days as well.
Issue #92.