Like SSE4-8 and SSE4-16, these use 8-bit and 16-bit values for mask
elements, respectively, and thus should generate the best code when used
for computation with datatypes of those sizes.
Various LLVM optimization passes are turning code like:
%cmp = icmp slt <8 x i32> %foo, %bar
%cmp32 = sext <8 x i1> %cmp to <8 x i32>
. . .
%cmp1 = trunc <8 x i32> %cmp32 to <8 x i1>
%result = select <8 x i1> %cmp1, . . .
Into:
%cmp = icmp slt <8 x i32> %foo, %bar
%cmp32 = zext <8 x i1> %cmp to <8 x i32> ; note: zext
. . .
%cmp1 = icmp ne <8 x i32> %cmp32, zeroinitializer
%result = select <8 x i1> %cmp1, . . .
This pattern isn't matched well by the LLVM code generators, which leads
to fairly inefficient code (i.e., they don't just emit a vector compare
and a blend instruction).
Also, renamed VSelMovmskOptPass to InstructionSimplifyPass to better
describe its functionality.
Along the lines of sse4-8, this is an 8-wide target for SSE4, using
16-bit elements for the mask. It's thus (in principle) the best
target for SIMD computation with 16-bit datatypes.
Commit 53414f12e6 introduced a bug where lEmitVaryingSelect() would try
to truncate a vector of i1s to a vector of i1s, which made LLVM's IR
verifier unhappy (trunc requires a strictly narrower destination type).
This change adds a new 'sse4-8' target, where programCount is 16 and
the mask element size is 8 bits. (i.e. the most appropriate sizing of
the mask for SIMD computation with 8-bit datatypes.)
On a target with a 16-bit mask (for example), we now choose the type
of an integer literal like "1024" to be an int16. Previously, we used
an int32, which is a worse fit on such a target and leads to less
efficient code than an int16 does. (However, we still give an integer
literal like 1000000 the type int32, even on a 16-bit-mask target,
since it doesn't fit in 16 bits.)
Updated the tests to still pass with 8- and 16-bit targets, given this
change.
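A rough sketch of the rule (the names and structure here are
hypothetical, not ispc's actual implementation):

#include <cstdint>

enum BasicType { Int8, Int16, Int32 };

// Choose the literal's type from the target's mask element width, falling
// back to int32 whenever the value doesn't fit in a narrower type.
static BasicType lTypeOfIntLiteral(int64_t value, int maskBitWidth) {
    if (maskBitWidth == 8 && value >= INT8_MIN && value <= INT8_MAX)
        return Int8;    // smallest fit on an 8-bit-mask target
    if (maskBitWidth <= 16 && value >= INT16_MIN && value <= INT16_MAX)
        return Int16;   // e.g. 1024 on a 16-bit-mask target
    return Int32;       // e.g. 1000000, which doesn't fit in 16 bits
}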
There were a number of places throughout the system that assumed that the
execution mask would only have either 32-bit or 1-bit elements. This
commit makes it possible to have a target with an 8- or 16-bit mask.
For varying int8/16/32 types, divides by small constants can be
implemented efficiently through multiplies and shifts with integer
types of twice the bit-width; this commit adds this optimization.
(Implementation is based on Halide.)
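For example, here is the scalar version of the trick for unsigned 8-bit
division by 3 (ispc applies the same idea lane-wise across the vector):

#include <cstdint>

// x / 3 == (x * 171) >> 9 for all x in [0, 255], where 171 = ceil(2^9 / 3);
// performing the multiply at twice the bit-width keeps the intermediate
// product (at most 255 * 171 = 43605) from overflowing.
static uint8_t lDivideBy3(uint8_t x) {
    return (uint8_t)(((uint16_t)x * 171) >> 9);
}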
Fixes:
- Don't issue a warning when the shift is by the same amount in all
vector lanes.
- Do issue a warning when it's a compile-time constant but the values
are different in different lanes.
Previously, we warned iff the shift amount wasn't a compile-time constant.
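A sketch of the revised check at the LLVM level (illustrative, not the
actual ispc code):

#include <llvm/IR/Constants.h>

// Warn unless every lane provably shifts by the same amount: a constant
// splat is fine, a constant vector with differing lane values is not.
static bool lShouldWarnVaryingShift(llvm::Value *shiftAmount) {
    llvm::Constant *c = llvm::dyn_cast<llvm::Constant>(shiftAmount);
    if (c != nullptr && c->getType()->isVectorTy())
        return c->getSplatValue() == nullptr;  // constant: warn iff lanes differ
    return c == nullptr;  // not a compile-time constant: keep warning
}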
We can now do constant folding with all basic datatypes (the previous
implementation handled int32 well, but had limited, if any, coverage
for other datatypes).
Reduced a bit of repeated code in the constant folding implementation
through template helper functions.
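For instance, a single template helper along these lines (illustrative,
not the actual code) can replace a near-identical case per type:

// One instantiation each for int8_t/int16_t/int32_t/int64_t, their
// unsigned counterparts, float, and double covers the basic datatypes.
template <typename T> static T lFoldBinaryOp(char op, T a, T b) {
    switch (op) {
    case '+': return a + b;
    case '-': return a - b;
    case '*': return a * b;
    default:  return a;  // other operators handled elsewhere
    }
}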
Initial support for ARM NEON on Cortex-A9 and A15 CPUs. All but ~10 tests
pass, and all examples compile and run correctly. Most of the examples
show a ~2x speedup on a single A15 core versus scalar code.
Current open issues/TODOs:
- Code quality looks decent, but hasn't been carefully examined. Known
issues/opportunities for improvement include:
- fp32 vector divide is done as a series of scalar divides rather than
a vector divide (which I believe exists, but I may be mistaken).
This is particularly harmful to examples/rt, which only runs ~1.5x
faster with ispc, likely due to long chains of scalar divides.
- The compiler isn't generating a vmin.f32 for e.g. the final scalar
min in reduce_min(); instead it's generating a compare and then a
select instruction (and similarly elsewhere).
- There are some additional FIXMEs in builtins/target-neon.ll that
include both a few pieces of missing functionality (e.g. rounding
doubles) as well as places that deserve attention for possible
code quality improvements.
- Currently only the "cortex-a9" and "cortex-a15" CPU targets are
supported; LLVM supports many other ARM CPUs and ispc should provide
access to all of the ones that have NEON support (and aren't too
obscure).
- ~5 of the reduce-* tests hit an assertion inside LLVM (unfortunately,
only when the compiler runs on an ARM host).
- The Windows build hasn't been tested (though I've tried to update
ispc.vcxproj appropriately). It may just work, but will more likely
have various small issues.
- Anything related to 64-bit ARM has seen no attention.
LLVM 3.4 seems to be turning multiplies by a constant power of 2 into
the equivalent shl, which in turn was thwarting the pattern matching
for turning gathers/scatters into vector loads/stores.
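The fix, roughly, is to accept the shl form wherever a multiply used to
be matched; a sketch using LLVM's pattern-matching helpers (the function
name is illustrative):

#include <llvm/IR/PatternMatch.h>
#include <llvm/Support/MathExtras.h>

// Match either 'mul %base, scale' or the equivalent 'shl %base,
// log2(scale)' when recognizing a gather/scatter index computation;
// assumes scale is a power of two, so the shl form is exact.
static bool lMatchScaledIndex(llvm::Value *v, uint64_t scale,
                              llvm::Value *&base) {
    using namespace llvm::PatternMatch;
    return match(v, m_Mul(m_Value(base), m_SpecificInt(scale))) ||
           match(v, m_Shl(m_Value(base), m_SpecificInt(llvm::Log2_64(scale))));
}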