Generalize the lScalarizeVector() utility routine (used to determine when
gathers and scatters can be changed into vector loads and stores,
respectively) to handle vector shuffles and vector loads. This fixes issue
#79, which reported a case where a gather was being performed even though a
vector load was possible.
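As a rough illustration of the transformation this analysis enables (a
standalone C++ sketch with invented names, not the ispc implementation):
when the per-lane offsets of a gather turn out to be consecutive, the gather
can be replaced with a single contiguous vector load.

    #include <cstdint>
    #include <cstring>

    const int kVectorWidth = 8;

    // Returns true if offsets[i] == offsets[0] + i for every lane, i.e. the
    // gather touches a contiguous run of elements.
    static bool offsetsAreConsecutive(const int32_t offsets[kVectorWidth]) {
        for (int i = 1; i < kVectorWidth; ++i)
            if (offsets[i] != offsets[0] + i)
                return false;
        return true;
    }

    // Gather floats from base[offsets[i]] into result[i]; when the offsets
    // are consecutive, do one contiguous copy (the moral equivalent of a
    // single, possibly unaligned, vector load) instead of per-lane loads.
    void gatherOrVectorLoad(const float *base,
                            const int32_t offsets[kVectorWidth],
                            float result[kVectorWidth]) {
        if (offsetsAreConsecutive(offsets)) {
            std::memcpy(result, base + offsets[0],
                        kVectorWidth * sizeof(float));
        } else {
            for (int i = 0; i < kVectorWidth; ++i)
                result[i] = base[offsets[i]];
        }
    }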
All of the masked store calls were preventing values from being kept in
registers, which in turn led to a lot of unnecessary stack traffic. This
approach seems to give better code in the end.
When replacing an 'all on' masked store with a regular store, set the
alignment to the vector element alignment, not the alignment of a whole
vector (i.e., 4- or 8-byte alignment, not 32 or 64 bytes).
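To illustrate the constraint (an AVX intrinsics sketch, not the code the
compiler actually emits): the pointer handed to a masked store is only
guaranteed to be aligned for its element type, so the regular store that
replaces an 'all on' masked store has to be an unaligned one.

    #include <immintrin.h>

    // p may point at any legally-aligned float (4 bytes), e.g. &array[1],
    // so a 32-byte-aligned vector store could fault.
    void storeAllOn(float *p, __m256 v) {
        _mm256_storeu_ps(p, v);    // correct: assumes only float alignment
        // _mm256_store_ps(p, v);  // wrong here: requires 32-byte alignment
    }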
Emit calls to masked_store, not masked_store_blend, when handling
masked stores emitted by the frontend.
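For reference, a scalar C++ sketch of the semantic difference (this stands
in for, and does not reproduce, the actual builtins): a true masked store
writes only the active lanes, while a blend-based store reads the full
destination vector, blends, and writes every lane back.

    const int kWidth = 8;

    void maskedStore(float *dst, const float *val, const bool mask[kWidth]) {
        for (int i = 0; i < kWidth; ++i)
            if (mask[i])
                dst[i] = val[i];    // lanes that are off are never written
    }

    void maskedStoreBlend(float *dst, const float *val,
                          const bool mask[kWidth]) {
        float tmp[kWidth];
        for (int i = 0; i < kWidth; ++i)         // full-width read of dst
            tmp[i] = mask[i] ? val[i] : dst[i];  // blend by mask
        for (int i = 0; i < kWidth; ++i)         // full-width write-back
            dst[i] = tmp[i];
    }

Writing the inactive lanes back is what makes the blend form undesirable
when memory for those lanes shouldn't be touched at all.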
Fix a bug in the binary8to16 macro in builtins.m4.
Fix a bug in the 16-wide version of __reduce_add_float.
Remove the blend-based implementations of masked_store_blend for AVX; just
forward those calls on to the corresponding real masked store functions.
Add optimization patterns to detect and simplify masked loads and stores
with the mask all on / all off.
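As a hypothetical sketch of those rules (the names and the 8-lane bitmask
encoding are invented for illustration): with a compile-time-known mask, a
masked store either becomes a regular store (all lanes on) or goes away
entirely (all lanes off), and masked loads simplify analogously.

    #include <cstdint>

    enum class MaskStatus { AllOn, AllOff, Unknown };

    // Classify an 8-lane mask whose bits are known at compile time.
    MaskStatus classifyMask(uint8_t maskBits) {
        if (maskBits == 0xFF) return MaskStatus::AllOn;
        if (maskBits == 0x00) return MaskStatus::AllOff;
        return MaskStatus::Unknown;
    }

    // Returns true if the masked store could be simplified away.
    bool simplifyMaskedStore(float *dst, const float *val, uint8_t maskBits) {
        switch (classifyMask(maskBits)) {
        case MaskStatus::AllOn:
            for (int i = 0; i < 8; ++i)
                dst[i] = val[i];    // degenerates to a plain vector store
            return true;
        case MaskStatus::AllOff:
            return true;            // the store is a no-op; drop it
        default:
            return false;           // mask unknown; keep the masked store
        }
    }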
Enable AVX for LLVM 3.0 builds (still generally hits bugs / unimplemented
stuff on the LLVM side, but it's getting there).
The generated load and store instructions were expecting vector-width-aligned
pointers, when in fact there's no guarantee of such alignment in general.
Removed the aligned memory allocation routines from some of the examples;
they're no longer needed.
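For example, an example program's buffers can now come from ordinary
allocation; the sketch below is only illustrative, with posix_memalign
standing in for whatever aligned-allocation routine a given example used.

    #include <cstddef>

    float *allocateBuffer(std::size_t count) {
        // Previously, something along the lines of:
        //   void *p = nullptr;
        //   posix_memalign(&p, 32, count * sizeof(float));  // 32-byte aligned
        //   return static_cast<float *>(p);
        // A plain allocation is now sufficient:
        return new float[count];
    }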
No performance difference on Core 2 / Core i5 CPUs; older CPUs may see some
regressions.
Still need to update the documentation for this change and finish reviewing
alignment issues in Load/Store instructions generated by .cpp files.