Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
Fixed usage of loadunpack and packstore to use proper memory offset
Fixed __masked_load_*() and __masked_store_*(), which incorrectly (un)packed the lanes loaded
Cleaned up usage of _mm512_undefined_*(); it is now mostly confined to constructors
Minor cleanups
knc2x.h:
Fixed usage of loadunpack and packstore to use proper memory offset
Fixed __masked_load_*() and __masked_store_*(), which incorrectly (un)packed the lanes loaded
Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
__any() and __none() speedups.
Cleaned up usage of _mm512_undefined_*(); it is now mostly confined to constructors
Introduced knc2x.h, which supports 2x interleaved code generation for KNC (use the target generic-32).
This implementation is even more experimental and incomplete than knc.h, but it is already useful (mandelbrot works, for example).
knc.h:
Switch to new intrinsic names _mm512_set_1to16_epi32() -> _mm512_set1_epi32(), etc...
Fix the declaration of the unspecialized template for __smear_*(), __setzero_*(), __undef_*()
Explicitly mark a few vectors in __load<>() as _mm512_undefined_*()
Fixed some implementations of __smear_*(), __setzero_*(), __undef_*() to remove unnecessary dependent instructions.
Implemented ISPC reductions by simply calling the existing intrinsic reductions, which are slightly more efficient than our previous implementation. Also added reductions for double types.
__vec16_i64 improved with the addition of the following: __extract_element(), __insert_element(), __sub(), __mul(),
__sdiv(), __udiv(), __and(), __or(), __xor(), __shl(), __lshr(), __ashr(), __select()
Fixed a bug in the __mul(__vec16_i64, __vec16_i32) implementation
Constructors are all explicitly inlined, copy constructor and operator=() explicitly provided
Loads and stores for __vec16_i64 and __vec16_d use aligned instructions when possible
__rotate_i32() now has a vector implementation
Added several reductions: __reduce_add_i32(), __reduce_min_i32(), __reduce_max_i32(),
__reduce_add_f(), __reduce_min_f(), __reduce_max_f()
__min_varying_int32(), __min_varying_uint32(), __max_varying_int32(), __max_varying_uint32()
Fixed the signature of __smear_i64() to match current codegen
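
For reference, the lane semantics that a vectorized __rotate_i32() has to
reproduce can be modeled in scalar code. This is only a sketch: the name
rotate_i32, the std::array representation, and the assumption that lane i
receives the value from lane (i + index) mod 16 are illustrative and are
not taken from knc.h.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 16;  // assumed vector width (16 x 32-bit lanes)

// Scalar reference model of a lane rotate (hypothetical helper, not the
// knc.h code): lane i of the result takes the value that lane
// (i + index) mod 16 held in the input. Masking with 0xf also handles
// negative indices on two's complement targets.
inline std::array<int32_t, kLanes> rotate_i32(
    const std::array<int32_t, kLanes> &v, int index) {
    std::array<int32_t, kLanes> ret{};
    for (int i = 0; i < kLanes; ++i)
        ret[i] = v[(i + index) & 0xf];
    return ret;
}
```

A vector implementation can be checked lane by lane against a model like
this one.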
For KNC (gather/scatter), it's not helpful to factor base+offsets gathers
and scatters into base_ptr + {1/2/4/8} * varying_offsets + const_offsets.
Now, if a HW instruction is available for gather/scatter, we just factor
into base + {1/2/4/8} * offsets (if possible). Not only is this simpler,
but it's also what we need in order to pass the 2/4/8 scale along to
those instructions, which support it directly.
Finishes issue #325.
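
The simpler factoring described above, base + {1/2/4/8} * offsets, can be
modeled with a scalar loop. This is a minimal sketch, not the code from this
change: the function name, the 16-lane width, the bitmask representation and
the choice to leave inactive lanes at zero are all assumptions made for
illustration.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

constexpr int kLanes = 16;  // assumed vector width (16 x 32-bit lanes)

// Scalar model of a gather that consumes the factored form directly:
// each active lane i reads one float from base + scale * offsets[i].
// Hypothetical name and signature, for illustration only.
inline std::array<float, kLanes> gather_base_scaled_offsets(
    const unsigned char *base, uint32_t scale,
    const std::array<int32_t, kLanes> &offsets, uint16_t mask) {
    std::array<float, kLanes> result{};  // inactive lanes stay 0 in this model
    for (int i = 0; i < kLanes; ++i) {
        if (mask & (1u << i)) {
            const unsigned char *p =
                base + static_cast<int64_t>(scale) * offsets[i];
            std::memcpy(&result[i], p, sizeof(float));
        }
    }
    return result;
}
```

Because the scale stays a separate argument instead of being folded into the
offsets, it can be handed unchanged to a hardware gather that accepts a
2/4/8 scale.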
No functional change; just preparation for having a path that doesn't
factor the offsets into constant and varying parts, which will be better
for AVX2 and KNC.
Add peephole optimization to eliminate some mask AND operations.
On KNC, the various vector comparison instructions can optionally
be masked; if a mask is provided, the result is effectively that
the value returned is the AND of the mask with the result of the
comparison.
This change adds an optimization pass to the C++ backend that looks
for vector ANDs where one operand is a comparison and rewrites
them--e.g. "__and(__equal_float(a, b), c)" is changed to
"__equal_float_and_mask(a, b, c)", saving an instruction in the end.
Issue #319.
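
The rewrite is safe because a masked comparison yields the same bits as an
unmasked comparison followed by an AND with the mask. The sketch below
models that equivalence in scalar code. The 16-bit mask type, the std::array
vectors and the function names (written without leading underscores) are
assumptions for illustration, not the generated code or the knc.h
implementations.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 16;                 // assumed vector width
using Mask16 = uint16_t;                   // one bit per lane
using Vec16f = std::array<float, kLanes>;  // stand-in for a float vector

// Unmasked comparison: bit i is set when a[i] == b[i].
inline Mask16 equal_float(const Vec16f &a, const Vec16f &b) {
    Mask16 m = 0;
    for (int i = 0; i < kLanes; ++i)
        if (a[i] == b[i]) m |= static_cast<Mask16>(1u << i);
    return m;
}

// Masked comparison: a lane can only set its bit if it is enabled in mask,
// so the result equals equal_float(a, b) & mask without a separate AND.
inline Mask16 equal_float_and_mask(const Vec16f &a, const Vec16f &b,
                                   Mask16 mask) {
    Mask16 m = 0;
    for (int i = 0; i < kLanes; ++i)
        if ((mask & (1u << i)) && a[i] == b[i])
            m |= static_cast<Mask16>(1u << i);
    return m;
}
```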
Merge commit '8ef6bc16364d4c08aa5972141748110160613087'
Conflicts:
examples/intrinsics/knc.h
examples/intrinsics/sse4.h
Fixes include adding "_float" and "_double" suffixes as appropriate as well
as providing a number of missing implementations.
This fixes a number of failures in the half* tests.
On KNC, the various vector comparison instructions can optionally
be masked; if a mask is provided, the result is effectively that
the value returned is the AND of the mask with the result of the
comparison.
This change adds an optimization pass to the C++ backend that looks
for vector ANDs where one operand is a comparison and rewrites
them--e.g. "__and(__equal_float(a, b), c)" is changed to
"__equal_float_and_mask(a, b, c)", saving an instruction in the end.
Issue #319.
e.g. "__equal()" -> "__equal_float()", etc.
No functional change; this is necessary groundwork for a forthcoming
peephole optimization that eliminates ANDs of masks in some cases.
If we have a vector of all zeros, a __setzero_* function call is emitted,
permitting calling specialized intrinsics for this. Undefined values
are reflected with an __undef_* call, which similarly allows passing that
information along.
This change also includes a cleanup to the signature of the __smear_*
functions; since they already have different names depending on the
scalar value type, we don't need to use the trick of passing an
undefined value of the return vector type as the first parameter as
an indirect way to overload by return value.
Issue #317.
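
The signature cleanup can be illustrated with a small sketch. The struct and
function names below are hypothetical stand-ins (written without leading
underscores), not the declarations from the intrinsics headers; the sketch
only contrasts the dummy-parameter trick with functions whose names already
encode the element type.

```cpp
#include <array>
#include <cstdint>

// Stand-in vector types (assumed 16 lanes).
struct Vec16i32 { std::array<int32_t, 16> v; };
struct Vec16f   { std::array<float, 16> v; };

// Old style: the unused first parameter exists only so that overload
// resolution can pick the return type.
inline Vec16i32 smear_old(Vec16i32 /*unused*/, int32_t x) {
    Vec16i32 r;
    r.v.fill(x);
    return r;
}
inline Vec16f smear_old(Vec16f /*unused*/, float x) {
    Vec16f r;
    r.v.fill(x);
    return r;
}

// New style: the name carries the type, so no dummy argument is needed.
inline Vec16i32 smear_i32(int32_t x) {
    Vec16i32 r;
    r.v.fill(x);
    return r;
}
inline Vec16f smear_float(float x) {
    Vec16f r;
    r.v.fill(x);
    return r;
}

// An all-zeros vector gets its own entry point, which a target header
// could map to a dedicated zeroing instruction.
inline Vec16i32 setzero_i32() { return Vec16i32{}; }
```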
Fixes to __load and __store.
Added __add, __mul, __equal, __not_equal, __extract_elements, __smear_i64, __cast_sext, __cast_zext,
and __scatter_base_offsets32_float.
__rcp_varying_float now has a fast-math and full-precision implementation.