Initial support for ARM NEON on Cortex-A9 and A15 CPUs. All but ~10 tests
pass, and all examples compile and run correctly. Most of the examples
show a ~2x speedup on a single A15 core versus scalar code.
Current open issues/TODOs:
- Code quality looks decent, but hasn't been carefully examined. Known
issues/opportunities for improvement include:
  - fp32 vector divide is done as a series of scalar divides rather than
    a vector divide (which I believe exists, but I may be mistaken.)
    This is particularly harmful to examples/rt, which only runs ~1.5x
    faster with ispc, likely due to long chains of scalar divides.
  - The compiler isn't generating a vmin.f32 for e.g. the final scalar
    min in reduce_min(); instead it's generating a compare and then a
    select instruction (and similarly elsewhere).
  - There are some additional FIXMEs in builtins/target-neon.ll that
    include both a few pieces of missing functionality (e.g. rounding
    doubles) as well as places that deserve attention for possible
    code quality improvements.
- Currently only the "cortex-a9" and "cortex-a15" CPU targets are
supported; LLVM supports many other ARM CPUs and ispc should provide
access to all of the ones that have NEON support (and aren't too
obscure.)
- ~5 of the reduce-* tests hit an assertion inside LLVM (unfortunately
only when the compiler runs on an ARM host, though).
- The Windows build hasn't been tested (though I've tried to update
ispc.vcxproj appropriately). It may just work, but will more likely
have various small issues.
- Anything related to 64-bit ARM has seen no attention.
This forces all vector loads/stores to be done assuming that the given
pointer is aligned to the vector size, thus allowing the use of sometimes
more-efficient instructions. (If the memory is not actually aligned, the
program will fail!)
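Since the alignment assumption above is unchecked, a caller may want to
verify it before handing a pointer to such code. A minimal sketch follows;
is_aligned() is a hypothetical helper for illustration and not part of
ispc, and vector_bytes stands for whatever the target's vector size is.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of ispc): returns true if p is aligned to
// vector_bytes, which is assumed to be a power of two (e.g. 16 for a
// 128-bit vector).
bool is_aligned(const void *p, std::size_t vector_bytes) {
    assert(vector_bytes != 0 && (vector_bytes & (vector_bytes - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (vector_bytes - 1)) == 0;
}
```

For example, a buffer declared with alignas(16) passes the check at its
start, while the address one float past the start does not.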
For KNC (gather/scatter), it's not helpful to factor base+offsets gathers
and scatters into base_ptr + {1/2/4/8} * varying_offsets + const_offsets.
Now, if a HW instruction is available for gather/scatter, we just factor
into base + {1/2/4/8} * offsets (if possible). Not only is this simpler,
but it's also the form we need in order to pass the 2/4/8 scale factor
along to those instructions, which can apply it directly.
Finishes issue #325.
We now have two ways of approaching gather/scatters with a common base
pointer and with offset vectors. For targets with native gather/scatter,
we just turn those into base + {1/2/4/8}*offsets. For targets without,
we turn those into base + {1/2/4/8}*varying_offsets + const_offsets,
where const_offsets is a compile-time constant.
Infrastructure for issue #325.
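The base + {1/2/4/8}*offsets form for native gather/scatter can be
illustrated numerically. The sketch below is not the compiler's code (the
real transformation works on LLVM IR expressions, not on concrete numbers);
factor_scale() and ScaledOffsets are hypothetical names. It picks the
largest of 8/4/2 that divides every per-lane byte offset and falls back to
a scale of 1 otherwise.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical illustration: lane address = base + scale * offsets[i].
struct ScaledOffsets {
    int scale;                          // one of 1, 2, 4, 8
    std::vector<std::int64_t> offsets;  // per-lane offsets after dividing out scale
};

ScaledOffsets factor_scale(const std::vector<std::int64_t> &byte_offsets) {
    for (int scale : {8, 4, 2}) {
        bool divides_all = true;
        for (std::int64_t o : byte_offsets) {
            if (o % scale != 0) {
                divides_all = false;
                break;
            }
        }
        if (!divides_all)
            continue;
        ScaledOffsets result{scale, {}};
        for (std::int64_t o : byte_offsets) {
            result.offsets.push_back(o / scale);
            // Invariant: the factored form reproduces the original offset.
            assert(result.offsets.back() * scale == o);
        }
        return result;
    }
    return ScaledOffsets{1, byte_offsets};  // no common 2/4/8 factor
}
```

Byte offsets {0, 8, 16, 24} factor into scale 8 with offsets {0, 1, 2, 3};
offsets {1, 2, 3} have no common factor and keep a scale of 1.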
Some modules require an include of unistd.h (e.g. for getcwd and isatty
definitions).
These changes were required to build successfully on a Fedora 17 system,
using GCC 4.7.0 & glibc-headers 2.15.
The intent of this was to indicate whether it was safe to run code
with an 'all off' mask on the given target (and then sometimes be
more flexible about e.g. running both true and false blocks of if
statements, etc.)
The problem is that even if the architecture has full native mask support,
it's still not safe to run 'uniform' memory operations with the mask all
off. Even more tricky, we sometimes transform masked varying memory operations
to uniform ones during optimization (e.g. gather->load and broadcast).
This fixes a number of the tests/switch-* tests that were failing on the
generic targets due to this issue.
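The distinction between masked varying stores and uniform stores can be
sketched in scalar C++. This is an illustration only, not the compiler's
code: kLanes, Mask, and the function names are all hypothetical. A masked
varying store checks each lane's bit, so an all-off mask writes nothing; a
uniform store is a single write with no per-lane predicate, so it needs an
explicit "any lane on" guard.

```cpp
#include <array>
#include <cassert>

constexpr int kLanes = 4;  // illustrative width, not tied to any target
using Mask = std::array<bool, kLanes>;

bool any_on(const Mask &mask) {
    for (bool lane : mask)
        if (lane)
            return true;
    return false;
}

// Varying store: each lane is individually predicated, so running this
// with an all-off mask is harmless.
void masked_varying_store(int *dst, const std::array<int, kLanes> &values,
                          const Mask &mask) {
    assert(dst != nullptr);
    for (int i = 0; i < kLanes; ++i)
        if (mask[i])
            dst[i] = values[i];
}

// Uniform store: one scalar write shared by all lanes. Without the guard
// it would execute even when no program instance is active.
void guarded_uniform_store(int *dst, int value, const Mask &mask) {
    assert(dst != nullptr);
    if (any_on(mask))
        *dst = value;
}
```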
In InitSymbol(), we try to be smart and emit a memcpy when there
are a number of values to store (e.g. for arrays, structs, etc.)
Unfortunately, this wasn't working as desired for bools (i.e. i1 types),
since the SizeOf() call that tried to figure out how many bytes to
copy would return 0 bytes, due to dividing the number of bits to copy
by 8.
Fixes issue #234.
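The arithmetic behind the failure is easy to reproduce. The sketch below
shows the truncating division and, as one common remedy, rounding up to
whole bytes. The message above does not say how the fix was actually made,
so bytes_rounded_up() is an assumption for illustration; both function
names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Truncating division: a 1-bit value (i1) yields 0 bytes.
std::uint64_t bytes_truncated(std::uint64_t bits) {
    return bits / 8;
}

// Rounding up instead: any nonzero bit count occupies at least one byte.
std::uint64_t bytes_rounded_up(std::uint64_t bits) {
    assert(bits <= UINT64_MAX - 7);  // keep bits + 7 from overflowing
    return (bits + 7) / 8;
}
```

With bits = 1, the first function returns 0 (so a memcpy sized this way
copies nothing), while the second returns 1.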
Now, if the user specified a CPU then we base the ISA choice on that--only
if no CPU and no target is specified do we use the CPUID-based check to
pick a vector ISA.
Improves on the fix for issue #205.
This was causing functions like round() to fail on SSE2, since it has code
that does:
    x += 0x1.0p23f;
    x -= 0x1.0p23f;
which was in turn being undesirably optimized away.
Fixes issue #211.
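For context, the add/subtract pair rounds because floats in [2^23, 2^24)
have no fractional bits: adding 2^23 to a non-negative value below 2^23
forces rounding to an integer, and subtracting it again recovers that
integer. If the compiler folds the two operations together, the rounding
is lost. A minimal C++17 sketch follows; round_via_magic() is a
hypothetical name, and the volatile temporary is just one way to keep the
pair from being folded, not necessarily what was done here. It assumes the
default round-to-nearest-even mode and no fast-math flags.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical illustration of the 2^23 rounding trick.
float round_via_magic(float x) {
    const float magic = 0x1.0p23f;  // 2^23; hex float literals need C++17
    float ax = std::fabs(x);
    if (!(ax < magic))
        return x;  // already integral, or NaN/infinity
    // volatile forces the intermediate sum to be materialized as a float,
    // so the add and subtract cannot be cancelled against each other.
    volatile float t = ax + magic;
    float r = t - magic;
    assert(r == std::floor(r));  // the result has no fractional part
    return std::copysign(r, x);
}
```

Ties go to the even neighbor under the default rounding mode, so 2.5f
becomes 2.0f and 3.5f becomes 4.0f.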
There are two related optimizations that happen now. (These
currently only apply for gathers where the mask is known to be
all on, and to gathers that are accessing 32-bit sized elements,
but both of these may be generalized in the future.)
First, for any single gather, we are now more flexible in mapping it
to individual memory operations. Previously, we would only either map
it to a general gather (one scalar load per SIMD lane), or an
unaligned vector load (if the program instances could be determined
to be accessing a sequential set of locations in memory.)
Now, we are able to break gathers into scalar, 2-wide (i.e. 64-bit),
4-wide, or 8-wide loads. Further, we now generate code that shuffles
these loads around. Doing fewer, larger loads in this manner, when
possible, can be more efficient.
Second, we can coalesce memory accesses across multiple gathers. If
we have a series of gathers without any memory writes in the middle,
then we try to analyze their reads collectively and choose an efficient
set of loads for them. Not only does this help if different gathers
reuse values from the same location in memory, but it's specifically
helpful when data with AOS layout is being accessed; in this case,
we're often able to generate wide vector loads and appropriate shuffles
automatically.
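To make the load-selection idea concrete, here is a small sketch that
takes the set of 32-bit element offsets read by one or more gathers and
picks scalar, 2-wide, 4-wide, or 8-wide loads to cover them. It is not the
algorithm used in the compiler: the greedy "more than half of the loaded
elements must be used" rule, choose_loads(), and the Load struct are all
assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one load: `width` consecutive 32-bit
// elements starting at element offset `start`.
struct Load {
    int start;
    int width;  // 1, 2, 4, or 8
};

std::vector<Load> choose_loads(std::vector<int> offsets) {
    // Gathers that touch the same location only need it loaded once.
    std::sort(offsets.begin(), offsets.end());
    offsets.erase(std::unique(offsets.begin(), offsets.end()), offsets.end());

    std::vector<Load> loads;
    std::size_t i = 0;
    while (i < offsets.size()) {
        int start = offsets[i];
        for (int width : {8, 4, 2, 1}) {
            // Count the needed elements that fall in [start, start + width).
            auto first = offsets.begin() + static_cast<std::ptrdiff_t>(i);
            auto last = std::lower_bound(first, offsets.end(), start + width);
            int used = static_cast<int>(last - first);
            assert(used >= 1);  // offsets[i] itself is always covered
            // Greedy rule (an assumption): accept the widest load for
            // which more than half of the loaded elements are needed.
            if (2 * used > width) {
                loads.push_back(Load{start, width});
                i += static_cast<std::size_t>(used);
                break;
            }
        }
    }
    return loads;
}
```

For an AOS layout with four floats per struct, four gathers that read x,
y, z, and w across four lanes together touch elements 0 through 15, which
this sketch covers with two 8-wide loads; the shuffles that route elements
back to lanes are not shown. A single strided gather of offsets
{0, 4, 8, 12} stays as four scalar loads.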