Cilk (cilk_for), OpenMP (#pragma omp parallel for), TBB(tbb::task_group and tbb::parallel_for)
as well as a new pthreads-based model that fully subscribes the machine (good for KNC).
With major contributions from Ingo Wald and James Brodman.
Introduced knc2x.h which supprts 2x interleaved code generation for KNC (use the target generic-32).
This implementation is even more experimental and incomplete than knc.h but is useful already (mandelbrot works for example)
knc.h:
Switch to new intrinsic names _mm512_set_1to16_epi32() -> _mm512_set1_epi32(), etc...
Fix the declaration of the unspecialized template for __smear_*(), __setzero_*(), __undef_*()
Specifically mark _mm512_undefined_*() a few vectors in __load<>()
Fixed implementations of some implementations of __smear_*(), __setzero_*(), __undef_*() to remove unecessary dependent instructions.
Implemented ISPC reductions by simply calling existing intrinsic reductions, which are slightly more efficient than our precendent implementation. Also added reductions for double types.
vec16_i64 improved with the addition of the following: __extract_element(), insert_element(), __sub(), __mul(),
__sdiv(), __udiv(), __and(), __or(), __xor(), __shl(), __lshr(), __ashr(), __select()
Fixed a bug in the __mul(__vec16_i64, __vec16_i32) implementation
Constructors are all explicitly inlined, copy constructor and operator=() explicitly provided
Load and stores for __vec16_i64 and __vec16_d use aligned instructions when possible
__rotate_i32() now has a vector implementation
Added several reductions: __reduce_add_i32(), __reduce_min_i32(), __reduce_max_i32(),
__reduce_add_f(), __reduce_min_f(), __reduce_max_f()
This ensures that they have static linkage, which in turn lets one
have multiple object files compiled to multiple targets without having
those cause link errors.
Issue #355.
(This in particular would happen when there was an error in the body of a struct
definition and we were left with an UndefinedStructType and then later tried to
do loads/stores from/to it.)
Issue #356.
Also, removed erroneous checks about the type of the test expression
in DoStmt and ForStmt.
These together were preventing conversion of pointer types to boolean
values, so things like "while (ptr)" would improperly not compile.
Issue #346.
Explicitly documented that fact that ICC needs the -mmic flag to compile for KNC.
Updated ISPC User Guide with details on ICC compiler options that impact FP performance in generated code.
We were previously emitting 64-bit indexing for some gathers where
32-bit was actually fine, due to some adds of constant vectors
that hadn't been simplified to the result.
__min_varying_in32(), __min_varying_uint32(), __max_varying_int32(), __max_varying_uint32()
Fixed the signature of __smear_i64() to match current codegen
For KNC (gather/scatter), it's not helpful to factor base+offsets gathers
and scatters into base_ptr + {1/2/4/8} * varying_offsets + const_offsets.
Now, if a HW instruction is available for gather/scatter, we just factor
into base + {1/2/4/8} * offsets (if possible). Not only is this simpler,
but it's also what we need to pass a value along to the scale by
2/4/8 available directly in those instructions.
Finishes issue #325.
We now have two ways of approaching gather/scatters with a common base
pointer and with offset vectors. For targets with native gather/scatter,
we just turn those into base + {1/2/4/8}*offsets. For targets without,
we turn those into base + {1/2/4/8}*varying_offsets + const_offsets,
where const_offsets is a compile-time constant.
Infrastructure for issue #325.
No functional change; just preparation for having a path that doesn't
factor the offsets into constant and varying parts, which will be better
for AVX2 and KNC.