Using vector select versus a store and masked load for varying vector
selects seems to give worse code. This may be related to
http://llvm.org/bugs/show_bug.cgi?id=16941.
This should be a bool, not a one-wide vector of bools. The equivalent
fix was previously made in generic-16.h, but not made here. (Note that
many tests are still failing with these targets, but at least they
compile properly now.)
These compute the average of two given values, rounding up and down,
respectively, if the result isn't exact. When possible, these are
mapped to target-specific intrinsics (PAVG[BW] on IA and V[R]HADD.[US]
on NEON).
A subsequent commit will add pattern-matching to generate calls to
these intrinsics when the corresponding patterns are detected in the
IR.
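
As a rough scalar sketch of the two rounding modes described above, for
unsigned 8-bit inputs only. The function names avg_up_u8 and avg_down_u8
are hypothetical and are not taken from this commit; the intermediate sum
is widened so it cannot overflow.

```c
#include <stdint.h>

/* Hypothetical helper: average of two unsigned 8-bit values, rounding up
   when the sum is odd. The sum is computed in a wider type so that
   255 + 255 + 1 does not wrap. */
static uint8_t avg_up_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((uint16_t)a + (uint16_t)b + 1u) >> 1);
}

/* Hypothetical helper: same average, rounding down when the sum is odd. */
static uint8_t avg_down_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((uint16_t)a + (uint16_t)b) >> 1);
}
```

The two functions differ only when a + b is odd; for an even sum both
return the exact average.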
Like SSE4-8 and SSE4-16, these use 8-bit and 16-bit values for mask
elements, respectively, and thus should generate the best code when used
for computation with datatypes of those sizes.
1. builtins/target-nvptx64.ll still needs to be written; for now it is just a copy of target-generic-1.ll
2. add __global__ & __device__ scope
3. make code work for a single CUDA thread
4. use tasks to work as a block grid and programIndex as laneIdx, programCount as warpSize
5. ... and more...
Various LLVM optimization passes are turning code like:
%cmp = icmp lt <8 x i32> %foo, %bar
%cmp32 = sext <8 x i1> %cmp to <8 x i32>
. . .
%cmp1 = trunc <8 x i32> %cmp32 to <8 x i1>
%result = select <8 x i1> %cmp1, . . .
Into:
%cmp = icmp lt <8 x i32> %foo, %bar
%cmp32 = zext <8 x i1> %cmp to <8 x i32> # note: zext
. . .
%cmp1 = icmp ne <8 x i32> %cmp32, zeroinitializer
%result = select <8 x i1> %cmp1, . . .
The latter form isn't matched well by the LLVM code generators, which
leads to fairly inefficient code (i.e., they don't just emit a vector
compare and blend instruction).
Also, renamed VSelMovmskOptPass to InstructionSimplifyPass to better
describe its functionality.
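
For reference, a scalar model of one lane of the two IR sequences above,
written as a sketch in C. The function names are hypothetical. It only
illustrates that both sequences hand the select the same i1 value as the
original compare; it says nothing about how the pass itself is
implemented.

```c
#include <stdint.h>

/* One lane of the first pattern: sext i1 -> i32, then trunc i32 -> i1. */
static int lane_sext_trunc(int cmp) {
    int32_t wide = cmp ? -1 : 0;   /* sext: true becomes all ones */
    return (int)(wide & 1);        /* trunc: keep only the low bit */
}

/* One lane of the second pattern: zext i1 -> i32, then icmp ne 0. */
static int lane_zext_icmpne(int cmp) {
    int32_t wide = cmp ? 1 : 0;    /* zext: true becomes 1 */
    return wide != 0;              /* icmp ne against zeroinitializer */
}

/* One lane of the final select. */
static int32_t lane_select(int cond, int32_t a, int32_t b) {
    return cond ? a : b;
}
```

In both cases the condition that reaches the select equals the original
compare result, so the widening and narrowing steps are redundant from
the select's point of view.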
Along the lines of sse4-8, this is an 8-wide target for SSE4, using
16-bit elements for the mask. It's thus (in principle) the best
target for SIMD computation with 16-bit datatypes.
Commit 53414f12e6 introduced a bug where lEmitVaryingSelect() would
try to truncate a vector of i1s to a vector of i1s, which in turn
made LLVM's IR analyzer unhappy.
This change adds a new 'sse4-8' target, where programCount is 16 and
the mask element size is 8 bits (i.e., the most appropriate sizing of
the mask for SIMD computation with 8-bit datatypes).
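
To illustrate why matching the mask element size to the data size helps,
here is a scalar sketch of a 16-lane blend over 8-bit data with an 8-bit
mask per lane. LANES and blend_u8 are hypothetical names, and the
convention that 0xFF means "on" and 0x00 means "off" is an assumption
made for this example. With 16 lanes of 8 bits each, the mask and each
operand occupy 128 bits, the size of one SSE register.

```c
#include <stdint.h>

#define LANES 16  /* stands in for programCount on this target */

/* Hypothetical helper: per-lane blend with an 8-bit mask element.
   Assumes each mask element is either 0xFF (take a) or 0x00 (take b). */
static void blend_u8(const uint8_t *mask, const uint8_t *a,
                     const uint8_t *b, uint8_t *out) {
    for (int i = 0; i < LANES; ++i)
        out[i] = (uint8_t)((mask[i] & a[i]) | (~mask[i] & b[i]));
}
```

Because mask and data elements have the same width, no widening or
narrowing of the mask is needed before it is applied to the data.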