1. builtins/target-nvptx64.ll to write, now it is just a copy of target-generic-1.ll
2. add __global__ & __device__ scope
2. make code work for a single cuda thread
3. use tasks to work as a block grid and programIndex as laneIdx, programCount as warpSize
4. ... and more...
Initial support for ARM NEON on Cortex-A9 and A15 CPUs. All but ~10 tests
pass, and all examples compile and run correctly. Most of the examples
show a ~2x speedup on a single A15 core versus scalar code.
Current open issues/TODOs
- Code quality looks decent, but hasn't been carefully examined. Known
issues/opportunities for improvement include:
- fp32 vector divide is done as a series of scalar divides rather than
a vector divide (which I believe exists, but I may be mistaken.)
This is particularly harmful to examples/rt, which only runs ~1.5x
faster with ispc, likely due to long chains of scalar divides.
- The compiler isn't generating a vmin.f32 for e.g. the final scalar
min in reduce_min(); instead it's generating a compare and then a
select instruction (and similarly elsewhere).
- There are some additional FIXMEs in builtins/target-neon.ll that
include both a few pieces of missing functionality (e.g. rounding
doubles) as well as places that deserve attention for possible
code quality improvements.
- Currently only the "cortex-a9" and "cortex-15" CPU targets are
supported; LLVM supports many other ARM CPUs and ispc should provide
access to all of the ones that have NEON support (and aren't too
obscure.)
- ~5 of the reduce-* tests hit an assertion inside LLVM (unfortunately
only when the compiler runs on an ARM host, though).
- The Windows build hasn't been tested (though I've tried to update
ispc.vcxproj appropriately). It may just work, but will more likely
have various small issues.)
- Anything related to 64-bit ARM has seen no attention.
This forces all vector loads/stores to be done assuming that the given
pointer is aligned to the vector size, thus allowing the use of sometimes
more-efficient instructions. (If it isn't the case that the memory is
aligned, the program will fail!).
Some modules require an include of unistd.h (e.g. for getcwd and isatty
definitions).
These changes were required to build successfully on a Fedora 17 system,
using GCC 4.7.0 & glibc-headers 2.15.