For KNC (gather/scatter), it's not helpful to factor base+offsets gathers
and scatters into base_ptr + {1/2/4/8} * varying_offsets + const_offsets.
Now, if a HW instruction is available for gather/scatter, we just factor
into base + {1/2/4/8} * offsets (if possible). Not only is this simpler,
but it's also what we need to pass a value along to the scale by
2/4/8 available directly in those instructions.
Finishes issue #325.
No functional change; just preparation for having a path that doesn't
factor the offsets into constant and varying parts, which will be better
for AVX2 and KNC.
Add peephole optimization to eliminate some mask AND operations.
On KNC, the various vector comparison instructions can optionally
be masked; if a mask is provided, the result is effectively that
the value returned is the AND of the mask with the result of the
comparison.
This change adds an optimization pass to the C++ backend that looks
for vector ANDs where one operand is a comparison and rewrites
them--e.g. "and(equalfloat(a, b), c)" is changed to
"_equal_float_and_mask(a, b, c)", saving an instruction in the end.
Issue #319.
Merge commit '8ef6bc16364d4c08aa5972141748110160613087'
Conflicts:
examples/intrinsics/knc.h
examples/intrinsics/sse4.h
Fixes include adding "_float" and "_double" suffixes as appropriate as well
as providing a number of missing implementations.
This fixes a number of failures in the half* tests.
On KNC, the various vector comparison instructions can optionally
be masked; if a mask is provided, the result is effectively that
the value returned is the AND of the mask with the result of the
comparison.
This change adds an optimization pass to the C++ backend that looks
for vector ANDs where one operand is a comparison and rewrites
them--e.g. "__and(__equal_float(a, b), c)" is changed to
"__equal_float_and_mask(a, b, c)", saving an instruction in the end.
Issue #319.
e.g. "__equal()" -> "__equal_float()", etc.
No functional change; this is necessary groundwork for a forthcoming
peephole optimization that eliminates ANDs of masks in some cases.
If we have a vector of all zeros, a __setzero_* function call is emitted,
permitting calling specialized intrinsics for this. Undefined values
are reflected with an __undef_* call, which similarly allows passing that
information along.
This change also includes a cleanup to the signature of the __smear_*
functions; since they already have different names depending on the
scalar value type, we don't need to use the trick of passing an
undefined value of the return vector type as the first parameter as
an indirect way to overload by return value.
Issue #317.
This should help with performance of the generated code.
Updated the relevant header files (sse4.h, generic-16.h, generic-32.h, generic-64.h)
Updated generic-32.h and generic-64.h to the new memory API
Previously, we'd bitcast e.g. a vector of floats to a vector of i32s and then
use the i32 variant of masked_load/masked_store/gather/scatter. Now, we have
separate float/double variants of each of those.
Now, the __smear* functions in generated C++ code have an unused first
parameter of the desired return type; this allows us to have headers
that include variants of __smear for multiple target widths. (This
approach is necessary since we can't overload by return type in C++.)
Issue #256.
Now, when we're printing out a constant vector value, we check to see
if it's a splat and call out to one of the __splat_* functions in
the generated code if to.
When we're able to turn a general gather/scatter into the "base + offsets"
form, we now try to extract out any constant components of the offsets and
then pass them as a separate parameter to the gather/scatter function
implementation.
We then in turn carefully emit code for the addressing calculation so that
these constant offsets match LLVM's patterns to detect this case, such that
we get the constant offsets directly encoded in the instruction's addressing
calculation in many cases, saving arithmetic instructions to do these
calculations.
Improves performance of stencil by ~15%. Other workloads unchanged.
In this case, we now emit calls to potentially-specialized functions for the
left/right shifts that take a single integer value for the shift amount. These
in turn can be matched to the corresponding intrinsics for the SSE target.
Issue #145.
Specifically, don't use vector select for masked store blend there,
but emit a call to a undefined __masked_store_blend_*() functions.
Added implementations of these functions to the sse4.h and generic-16.h
in examples/instrinsics. (Calls to these will never be generated with
LLVM 3.1).
ispc now supports goto, but only under uniform control flow--i.e.
it must be possible for the compiler to statically determine that
all program instances will follow the goto. An error is issued at
compile time if a goto is used when this is not the case.
The compiler now supports an --emit-c++ option, which generates generic
vector C++ code. To actually compile this code, the user must provide
C++ code that implements a variety of types and operations (e.g. adding
two floating-point vector values together, comparing them, etc).
There are two examples of this required code in examples/intrinsics:
generic-16.h is a "generic" 16-wide implementation that does all required
with scalar math; it's useful for demonstrating the requirements of the
implementation. Then, sse4.h shows a simple implementation of a SSE4
target that maps the emitted function calls to SSE intrinsics.
When using these example implementations with the ispc test suite,
all but one or two tests pass with gcc and clang on Linux and OSX.
There are currently ~10 failures with icc on Linux, and ~50 failures with
MSVC 2010. (To be fixed in coming days.)
Performance varies: when running the examples through the sse4.h
target, some have the same performance as when compiled with --target=sse4
from ispc directly (options), while noise is 12% slower, rt is 26%
slower, and aobench is 2.2x slower. The details of this haven't yet been
carefully investigated, but will be in coming days as well.
Issue #92.