This should be a bool, not a one-wide vector of bools. The equivalent
fix was previously made in generic-16.h, but not made here. (Note that
many tests are still failing with these targets, but at least they
compile properly now.)
Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
Fixed usage of loadunpack and packstore to use proper memory offset
Fixed __masked_load_*() and __masked_store_*(), which incorrectly (un)packed the loaded lanes
Cleaned up usage of _mm512_undefined_*(); it is now mostly confined to constructors
Minor cleanups
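
The difference between a lane-preserving masked load and an unpacking load can be illustrated with a scalar reference model. This is only a sketch, not the code in knc.h: the names masked_load_ref() and expand_load_ref() are hypothetical, the mask is modelled as a plain 16-bit integer, and inactive lanes are assumed to keep a caller-supplied value.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 16;                      // assumed vector width
using Vec16i = std::array<int32_t, kLanes>;

// Lane-preserving masked load (hypothetical reference model):
// active lane i reads src[i]; inactive lanes keep the value from 'old'.
inline Vec16i masked_load_ref(const int32_t* src, uint16_t mask, Vec16i old) {
    for (int i = 0; i < kLanes; ++i)
        if (mask & (1u << i))
            old[i] = src[i];
    return old;
}

// Unpacking (expanding) load (hypothetical reference model):
// active lanes consume consecutive elements src[0], src[1], ...
// regardless of their lane index; inactive lanes keep the value from 'old'.
inline Vec16i expand_load_ref(const int32_t* src, uint16_t mask, Vec16i old) {
    int next = 0;
    for (int i = 0; i < kLanes; ++i)
        if (mask & (1u << i))
            old[i] = src[next++];
    return old;
}
```

With a mask that enables only lanes 1 and 3, the first function reads src[1] and src[3], while the second reads src[0] and src[1]. Substituting one behaviour for the other is presumably the kind of lane mix-up the fix above refers to.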
knc2x.h
Fixed usage of loadunpack and packstore to use proper memory offset
Fixed __masked_load_*() and __masked_store_*(), which incorrectly (un)packed the loaded lanes
Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
__any() and __none() speedups.
Cleaned up usage of _mm512_undefined_*(); it is now mostly confined to constructors
Introduced knc2x.h, which supports 2x interleaved code generation for KNC (use the target generic-32).
This implementation is even more experimental and incomplete than knc.h, but it is already useful (mandelbrot works, for example)
knc.h:
Switch to new intrinsic names _mm512_set_1to16_epi32() -> _mm512_set1_epi32(), etc...
Fix the declaration of the unspecialized template for __smear_*(), __setzero_*(), __undef_*()
Explicitly mark a few vectors in __load<>() as _mm512_undefined_*()
Fixed some implementations of __smear_*(), __setzero_*(), __undef_*() to remove unnecessary dependent instructions.
Implemented ISPC reductions by simply calling existing intrinsic reductions, which are slightly more efficient than our previous implementation. Also added reductions for double types.
vec16_i64 improved with the addition of the following: __extract_element(), __insert_element(), __sub(), __mul(),
__sdiv(), __udiv(), __and(), __or(), __xor(), __shl(), __lshr(), __ashr(), __select()
Fixed a bug in the __mul(__vec16_i64, __vec16_i32) implementation
Constructors are all explicitly inlined, copy constructor and operator=() explicitly provided
Loads and stores for __vec16_i64 and __vec16_d use aligned instructions when possible
__rotate_i32() now has a vector implementation
Added several reductions: __reduce_add_i32(), __reduce_min_i32(), __reduce_max_i32(),
__reduce_add_f(), __reduce_min_f(), __reduce_max_f()
__min_varying_int32(), __min_varying_uint32(), __max_varying_int32(), __max_varying_uint32()
Fixed the signature of __smear_i64() to match current codegen
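
For reference, the intended results of __rotate_i32() and the integer reductions listed above can be described with scalar loops. The sketch below is not the vector code in knc.h. The *_ref names are hypothetical, a 16-lane vector is modelled as a std::array, and the rotate direction (lane i takes the value from lane i + index, modulo 16) is an assumption based on the generic scalar targets.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <numeric>

constexpr int kWidth = 16;                      // assumed vector width
using Lanes = std::array<int32_t, kWidth>;

// Scalar model of a lane rotate: lane i receives lane (i + index) mod 16.
// The bitwise AND also handles negative indices on two's-complement targets.
inline Lanes rotate_i32_ref(const Lanes& v, int index) {
    Lanes out{};
    for (int i = 0; i < kWidth; ++i)
        out[i] = v[(i + index) & (kWidth - 1)];
    return out;
}

// Scalar models of the horizontal reductions (assumes the sum does not overflow).
inline int32_t reduce_add_i32_ref(const Lanes& v) {
    return std::accumulate(v.begin(), v.end(), int32_t{0});
}
inline int32_t reduce_min_i32_ref(const Lanes& v) {
    return *std::min_element(v.begin(), v.end());
}
inline int32_t reduce_max_i32_ref(const Lanes& v) {
    return *std::max_element(v.begin(), v.end());
}
```

A scalar model like this can serve as an oracle when checking a vector implementation lane by lane.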