Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
Fixed usage of loadunpack and packstore to use proper memory offset
Fixed implementation of __masked_load_*() __masked_store_*() incorrectly (un)packing the lanes loaded
Cleaned up usage of _mm512_undefined_*(), it is now mostly confined to constructor
Minor cleanups
knc2x.h
Fixed usage of loadunpack and packstore to use proper memory offset
Fixed implementation of __masked_load_*() __masked_store_*() incorrectly (un)packing the lanes loaded
Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
__any() and __none() speedups.
Cleaned up usage of _mm512_undefined_*(), it is now mostly confined to constructor
Introduced knc2x.h which supprts 2x interleaved code generation for KNC (use the target generic-32).
This implementation is even more experimental and incomplete than knc.h but is useful already (mandelbrot works for example)
knc.h:
Switch to new intrinsic names _mm512_set_1to16_epi32() -> _mm512_set1_epi32(), etc...
Fix the declaration of the unspecialized template for __smear_*(), __setzero_*(), __undef_*()
Specifically mark _mm512_undefined_*() a few vectors in __load<>()
Fixed implementations of some implementations of __smear_*(), __setzero_*(), __undef_*() to remove unecessary dependent instructions.
Implemented ISPC reductions by simply calling existing intrinsic reductions, which are slightly more efficient than our precendent implementation. Also added reductions for double types.