Like SSE4-8 and SSE4-16, these use 8-bit and 16-bit values for mask
elements, respectively, and thus should generate the best code when used
for computation with datatypes of those sizes.
Along the lines of sse4-8, this is an 8-wide target for SSE4, using
16-bit elements for the mask. It's thus (in principle) the best
target for SIMD computation with 16-bit datatypes.
This change adds a new 'sse4-8' target, where programCount is 16 and
the mask element size is 8-bits. (i.e. the most appropriate sizing of
the mask for SIMD computation with 8-bit datatypes.)
Initial support for ARM NEON on Cortex-A9 and A15 CPUs. All but ~10 tests
pass, and all examples compile and run correctly. Most of the examples
show a ~2x speedup on a single A15 core versus scalar code.
Current open issues/TODOs
- Code quality looks decent, but hasn't been carefully examined. Known
issues/opportunities for improvement include:
- fp32 vector divide is done as a series of scalar divides rather than
a vector divide (which I believe exists, but I may be mistaken.)
This is particularly harmful to examples/rt, which only runs ~1.5x
faster with ispc, likely due to long chains of scalar divides.
- The compiler isn't generating a vmin.f32 for e.g. the final scalar
min in reduce_min(); instead it's generating a compare and then a
select instruction (and similarly elsewhere).
- There are some additional FIXMEs in builtins/target-neon.ll that
include both a few pieces of missing functionality (e.g. rounding
doubles) as well as places that deserve attention for possible
code quality improvements.
- Currently only the "cortex-a9" and "cortex-15" CPU targets are
supported; LLVM supports many other ARM CPUs and ispc should provide
access to all of the ones that have NEON support (and aren't too
obscure.)
- ~5 of the reduce-* tests hit an assertion inside LLVM (unfortunately
only when the compiler runs on an ARM host, though).
- The Windows build hasn't been tested (though I've tried to update
ispc.vcxproj appropriately). It may just work, but will more likely
have various small issues.)
- Anything related to 64-bit ARM has seen no attention.
This forces all vector loads/stores to be done assuming that the given
pointer is aligned to the vector size, thus allowing the use of sometimes
more-efficient instructions. (If it isn't the case that the memory is
aligned, the program will fail!).
For KNC (gather/scatter), it's not helpful to factor base+offsets gathers
and scatters into base_ptr + {1/2/4/8} * varying_offsets + const_offsets.
Now, if a HW instruction is available for gather/scatter, we just factor
into base + {1/2/4/8} * offsets (if possible). Not only is this simpler,
but it's also what we need to pass a value along to the scale by
2/4/8 available directly in those instructions.
Finishes issue #325.
We now have two ways of approaching gather/scatters with a common base
pointer and with offset vectors. For targets with native gather/scatter,
we just turn those into base + {1/2/4/8}*offsets. For targets without,
we turn those into base + {1/2/4/8}*varying_offsets + const_offsets,
where const_offsets is a compile-time constant.
Infrastructure for issue #325.
Some modules require an include of unistd.h (e.g. for getcwd and isatty
definitions).
These changes were required to build successfully on a Fedora 17 system,
using GCC 4.7.0 & glibc-headers 2.15.
The intent of this was to indicate whether it was safe to run code
with an 'all of' mask on the given target (and then sometimes be
more flexible about e.g. running both true and false blocks of if
statements, etc.)
The problem is that even if the architecture has full native mask support,
it's still not safe to run 'uniform' memory operations with the mask all
off. Even more tricky, we sometimes transform masked varying memory operations
to uniform ones during optimization (e.g. gather->load and broadcast).
This fixes a number of the tests/switch-* tests that were failing on the
generic targets due to this issue.