Commit Graph

369 Commits

Matt Pharr
7ab4c5391c Fix build with LLVM 3.2 and generic-4 / examples/sse4.h target. 2013-08-09 19:56:43 -07:00
Matt Pharr
cd9afe946c Merge branch 'master' into arm
Conflicts:
	Makefile
	builtins.cpp
	ispc.cpp
	ispc.h
	ispc.vcxproj
	opt.cpp
2013-08-06 17:39:21 -07:00
Dmitry Babokin
43423c276f Merge pull request #560 from ifilippov/perf
Supporting perf.py on Mac OS
2013-08-01 13:20:01 -07:00
Ilia Filippov
3c06924a02 Supporting perf.py on Mac OS 2013-08-01 12:47:37 +04:00
Dmitry Babokin
220f0b0b40 Renaming mandelbrot_tasks files to be different from mandelbrot 2013-07-30 19:53:12 -07:00
Dmitry Babokin
fa93cb7d0b InterlockedAdd -> InterlockedExchangeAdd for better portability (InterlockedAdd is not always supported) 2013-07-29 22:46:36 -07:00
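A minimal sketch of the more portable call (the wrapper name is illustrative): InterlockedExchangeAdd is available across Win32 targets and returns the previous value, while InterlockedAdd is not universally supported and returns the new one.

    #include <windows.h>

    // Hedged sketch: InterlockedExchangeAdd returns the value *target
    // held before the addition.
    static inline LONG atomicAdd(volatile LONG *target, LONG value) {
        return InterlockedExchangeAdd(target, value);
    }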
Matt Pharr
b6df447b55 Add reduce_add() for int8 and int16 types.
This maps to specialized instructions (e.g. PSADBW) when available.
2013-07-25 09:46:01 -07:00
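A sketch of the PSADBW mapping for the unsigned int8 case (the function name is illustrative; signed int8 and int16 need different sequences): a sum of absolute differences against zero horizontally adds eight bytes per 64-bit lane.

    #include <emmintrin.h>
    #include <stdint.h>

    // PSADBW against zero yields two 16-bit partial sums, one in bits
    // 0..15 and one in bits 64..79; add the two.
    static inline uint32_t reduce_add_u8(__m128i v) {
        __m128i sums = _mm_sad_epu8(v, _mm_setzero_si128());
        return (uint32_t)(_mm_cvtsi128_si32(sums) +
                          _mm_cvtsi128_si32(_mm_srli_si128(sums, 8)));
    }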
Matt Pharr
d7b0c5794e Add support for ARM NEON targets.
Initial support for ARM NEON on Cortex-A9 and A15 CPUs.  All but ~10 tests
pass, and all examples compile and run correctly.  Most of the examples
show a ~2x speedup on a single A15 core versus scalar code.

Current open issues/TODOs
- Code quality looks decent, but hasn't been carefully examined.  Known
  issues/opportunities for improvement include:
  - fp32 vector divide is done as a series of scalar divides rather than
    a vector divide (which I believe exists, but I may be mistaken.)
    This is particularly harmful to examples/rt, which only runs ~1.5x
    faster with ispc, likely due to long chains of scalar divides.
  - The compiler isn't generating a vmin.f32 for e.g. the final scalar
    min in reduce_min(); instead it's generating a compare and then a
    select instruction (and similarly elsewhere).
  - There are some additional FIXMEs in builtins/target-neon.ll that
    include both a few pieces of missing functionality (e.g. rounding
    doubles) as well as places that deserve attention for possible
    code quality improvements.

- Currently only the "cortex-a9" and "cortex-a15" CPU targets are
  supported; LLVM supports many other ARM CPUs and ispc should provide
  access to all of the ones that have NEON support (and aren't too
  obscure.)

- ~5 of the reduce-* tests hit an assertion inside LLVM (though,
   unfortunately, only when the compiler runs on an ARM host).

- The Windows build hasn't been tested (though I've tried to update
  ispc.vcxproj appropriately).  It may just work, but will more likely
  have various small issues.

- Anything related to 64-bit ARM has seen no attention.
2013-07-19 23:07:24 -07:00
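The reciprocal-based alternative mentioned in the fp32-divide item above, as a hedged sketch (the function name is illustrative; how many refinement steps to use is an accuracy/speed trade-off):

    #include <arm_neon.h>

    // Approximate a/b with a reciprocal estimate refined by two
    // Newton-Raphson steps (vrecpe/vrecps).
    static inline float32x4_t approx_div(float32x4_t a, float32x4_t b) {
        float32x4_t r = vrecpeq_f32(b);        // ~8-bit estimate of 1/b
        r = vmulq_f32(r, vrecpsq_f32(b, r));   // first refinement step
        r = vmulq_f32(r, vrecpsq_f32(b, r));   // second step, near-fp32 accuracy
        return vmulq_f32(a, r);
    }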
Matt Pharr
b007bba59f Replace inline assembly in task system with equivalent gcc intrinsics.
gcc/icc build only: the Windows build still uses the Win32 calls for
these.
2013-07-19 23:07:24 -07:00
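A sketch of the kind of replacement involved on the gcc/icc side (wrapper names are illustrative, not necessarily those in tasksys.cpp):

    #include <stdint.h>

    // The __sync_* builtins replace the old inline assembly;
    // __sync_fetch_and_add returns the previous value.
    static inline int32_t atomicAdd32(volatile int32_t *v, int32_t delta) {
        return __sync_fetch_and_add(v, delta);
    }

    static inline bool compareAndSwap32(volatile int32_t *v, int32_t oldVal,
                                        int32_t newVal) {
        return __sync_bool_compare_and_swap(v, oldVal, newVal);
    }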
Ilia Filippov
fd7f87b55e Supporting perf.py on Windows and making some small corrections to it 2013-07-02 19:23:18 +04:00
Dmitry Babokin
8be4128c5a Merge pull request #534 from ifilippov/perf
add script for measuring performance
2013-07-01 05:09:03 -07:00
Ilia Filippov
806e37338c add script for measuring performance 2013-07-01 13:30:49 +04:00
Dmitry Babokin
ec1095624a Merge pull request #527 from tkoziara/master
examples/sort added
2013-06-25 10:11:39 -07:00
Tomasz Koziara
a23d69ebe8 Copyright changed to simplify legal matters. 2013-06-25 17:28:27 +01:00
Tomasz Koziara
86ee8db778 Parallel prefix sum added + minor amendments. 2013-06-25 12:45:51 +01:00
Ilia Filippov
9fb981e9a0 Fixed support for the --instrument option 2013-06-25 12:33:23 +04:00
Tomasz Koziara
f2452f040d First commit of the radix sort example. 2013-06-24 18:37:44 +01:00
james.brodman
6211966c55 Change mask to use __mmask16 instead of a struct. 2013-05-30 16:04:44 -04:00
james.brodman
7b2eaf63af knc.h cleanup 2013-05-10 13:36:18 -04:00
Dmitry Babokin
1069a3c77e Removing some sources of warnings in sse4.h and trailing spaces 2013-04-25 03:40:32 +04:00
james.brodman
52dcbf087a Implemented 3 more intrinsics on double precision vectors 2013-03-28 11:55:53 -04:00
james.brodman
ef1af547e2 Change sse4.h to enable inlining. 2013-03-13 10:55:53 -04:00
Jean-Luc Duprat
24087ff3cc Expose none() in the ISPC standard library.
On KNC: all(), any() and none() do not generate a redundant movmsk instruction.
2012-11-27 13:38:28 -08:00
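Why no movmsk is needed on KNC: the execution mask is already a scalar 16-bit value there, so all three tests reduce to integer compares. A sketch under that assumption (the typedef stands in for __mmask16):

    typedef unsigned short mask16_t;  // illustrative stand-in for __mmask16

    static inline bool mask_any(mask16_t m)  { return m != 0; }
    static inline bool mask_all(mask16_t m)  { return m == 0xFFFF; }
    static inline bool mask_none(mask16_t m) { return m == 0; }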
Jean-Luc Duprat
2129b1e27d knc.h: Fixed __rsqrt_varying_float() to use _mm512_invsqrt_ps() instead of _mm512_invsqrt_pd()
This was a typo.
2012-11-21 15:40:35 -08:00
Jean-Luc Duprat
d3b86dcc90 KNC: fix implementation of __all() to use KNCni mask test instructions... 2012-11-14 09:24:01 -08:00
Jean-Luc Duprat
b601331362 Approximations for inverse sqrt and reciprocal provided in fast math mode.
RCP was actually slow in fast math mode;
inverse sqrt did not expose a fast approximation.
2012-11-13 14:01:35 -08:00
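A sketch of the usual fast-math pattern this refers to, shown with the SSE estimate instruction (one Newton-Raphson step brings the ~12-bit estimate to roughly 22 bits):

    #include <xmmintrin.h>

    // r' = 0.5 * r * (3 - x * r * r), starting from the hardware estimate.
    static inline __m128 fast_rsqrt(__m128 x) {
        __m128 r = _mm_rsqrt_ps(x);
        __m128 half = _mm_set1_ps(0.5f), three = _mm_set1_ps(3.0f);
        return _mm_mul_ps(_mm_mul_ps(half, r),
                          _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(r, r))));
    }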
james.brodman
97ddc1ed10 Fixed =/== error in __all() 2012-11-08 16:30:12 -05:00
jbrodman
e323b1d0ad Fixed compile error: == instead of = 2012-10-26 16:55:28 -04:00
Matt Pharr
406fbab40e Fix bugs in declarations of __any, __all, and __none in examples/intrinsics.
They return bool, not vector of bool.
2012-10-17 10:55:50 -07:00
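The corrected shape of those declarations, sketched with an illustrative mask type (the real headers define several __vecN_i1 types):

    struct __vec4_i1 { unsigned int v[4]; };  // illustrative mask type

    // The reductions collapse a mask to a single scalar bool:
    bool __any(__vec4_i1 mask);   // was: __vec4_i1 __any(...)
    bool __all(__vec4_i1 mask);
    bool __none(__vec4_i1 mask);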
Matt Pharr
9002837750 Remove incorrect assert in tasksys.cpp 2012-10-15 10:43:46 -07:00
Matt Pharr
538d51cbfe Add GMRES example 2012-09-20 14:06:55 -07:00
Jean-Luc Duprat
3dd9ff3d84 knc.h:
	Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
	Fixed usage of loadunpack and packstore to use the proper memory offset (sketched below)
	Fixed implementations of __masked_load_*() and __masked_store_*() that incorrectly (un)packed the loaded lanes
	Cleaned up usage of _mm512_undefined_*(); it is now mostly confined to constructors
	Minor cleanups

knc2x.h:
	Fixed usage of loadunpack and packstore to use the proper memory offset
	Fixed implementations of __masked_load_*() and __masked_store_*() that incorrectly (un)packed the loaded lanes
	Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
	__any() and __none() speedups
	Cleaned up usage of _mm512_undefined_*(); it is now mostly confined to constructors
2012-09-19 17:11:04 -07:00
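A hedged sketch of the loadunpack pairing and the memory offset in question (icc -mmic KNC intrinsics; the function name is illustrative):

    #include <immintrin.h>

    // Unaligned masked load: loadunpacklo reads up to the next 64-byte
    // boundary, and loadunpackhi must continue from base + 64 -- the
    // offset the commit fixes.
    static inline __m512 masked_load_ps(const void *p, __mmask16 mask,
                                        __m512 old) {
        __m512 v = _mm512_mask_loadunpacklo_ps(old, mask, p);
        v = _mm512_mask_loadunpackhi_ps(v, mask, (const char *)p + 64);
        return v;
    }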
Ingo Wald
7f386923b0 Merge branch 'master' of https://github.com/ispc/ispc 2012-09-17 15:54:25 +02:00
Ingo Wald
d2312b1fbd now using the ASSUME_ALIGNED flag in knc.h 2012-09-17 15:54:00 +02:00
Ingo Wald
6655373ac3 commit test 2012-09-17 15:51:37 +02:00
Ingo Wald
d492af7bc0 64-bit gather/scatter, aligned load/store, i8 support 2012-09-17 03:39:02 +02:00
Jean-Luc Duprat
0e88d5f97f Fixed unaligned masked stores on KNC 2012-09-14 14:11:41 -07:00
Jean-Luc Duprat
f0b0618484 Added the following mask tests: __any(), __all(), __none() for all supported targets.
This allows for more efficient code generation of KNC.
2012-09-14 11:06:18 -07:00
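On the non-KNC targets, where the mask lives in a vector register, the natural implementation collapses it with a movmsk; a 4-wide SSE sketch (names illustrative):

    #include <xmmintrin.h>

    static inline bool any4(__m128 mask)  { return _mm_movemask_ps(mask) != 0; }
    static inline bool all4(__m128 mask)  { return _mm_movemask_ps(mask) == 0xF; }
    static inline bool none4(__m128 mask) { return _mm_movemask_ps(mask) == 0; }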
Jean-Luc Duprat
11db466a88 Implement the KNC prefetch API so that ISPC prefetch_*() stdlib functions may be used. 2012-08-30 10:24:31 -07:00
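An illustrative mapping from the stdlib prefetch_*() calls down to a prefetch intrinsic (on KNC the backend uses vprefetch variants rather than the SSE hint form shown here):

    #include <xmmintrin.h>

    static inline void prefetch_l1(const void *p) {
        _mm_prefetch((const char *)p, _MM_HINT_T0);  // into L1
    }
    static inline void prefetch_l2(const void *p) {
        _mm_prefetch((const char *)p, _MM_HINT_T1);  // into L2
    }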
Jean-Luc Duprat
09bb36f58c Updated the task system in the example directory to support:
Cilk (cilk_for), OpenMP (#pragma omp parallel for), TBB (tbb::task_group and tbb::parallel_for),
as well as a new pthreads-based model that fully subscribes the machine (good for KNC).
With major contributions from Ingo Wald and James Brodman.
2012-08-28 11:13:12 -07:00
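A minimal sketch of the OpenMP flavor, under an assumed task-function signature (the real tasksys.cpp interface carries more state, e.g. thread indices and launch handles):

    // Assumed signature; illustrative only.
    typedef void (*TaskFun)(void *data, int taskIndex, int taskCount);

    // Compile with -fopenmp; each launched task becomes one loop iteration.
    static void launchTasks(TaskFun fun, void *data, int taskCount) {
    #pragma omp parallel for
        for (int i = 0; i < taskCount; ++i)
            fun(data, i, taskCount);
    }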
Jean-Luc Duprat
8a22c63889 knc2x.h:
Introduced knc2x.h, which supports 2x interleaved code generation for KNC (use the target generic-32).
This implementation is even more experimental and incomplete than knc.h, but it is already useful (mandelbrot works, for example).

knc.h:
Switched to the new intrinsic names: _mm512_set_1to16_epi32() -> _mm512_set1_epi32(), etc.
Fixed the declaration of the unspecialized template for __smear_*(), __setzero_*(), __undef_*()
Explicitly marked a few vectors as _mm512_undefined_*() in __load<>()
Fixed some implementations of __smear_*(), __setzero_*(), __undef_*() to remove unnecessary dependent instructions.
Implemented ISPC reductions by simply calling existing intrinsic reductions, which are slightly more efficient than our previous implementation.  Also added reductions for double types.
2012-08-15 17:41:10 -07:00
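A sketch of the "call the existing intrinsic reductions" approach (these composite reductions are provided by icc for KNC; wrapper names are illustrative):

    #include <immintrin.h>

    static inline float  reduce_add_f(__m512 v)  { return _mm512_reduce_add_ps(v); }
    static inline float  reduce_min_f(__m512 v)  { return _mm512_reduce_min_ps(v); }
    static inline double reduce_add_d(__m512d v) { return _mm512_reduce_add_pd(v); }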
Jean-Luc Duprat
165a13b13e knc.h:
vec16_i64 improved with the addition of the following: __extract_element(), __insert_element(), __sub(), __mul(),
		   __sdiv(), __udiv(), __and(), __or(), __xor(), __shl(), __lshr(), __ashr(), __select()
	Fixed a bug in the __mul(__vec16_i64, __vec16_i32) implementation
	Constructors are all explicitly inlined, copy constructor and operator=() explicitly provided
	Load and stores for __vec16_i64 and __vec16_d use aligned instructions when possible
	__rotate_i32() now has a vector implementation
	Added several reductions: __reduce_add_i32(), __reduce_min_i32(), __reduce_max_i32(),
	       __reduce_add_f(), __reduce_min_f(), __reduce_max_f()
2012-08-10 12:20:10 -07:00
Jean-Luc Duprat
a2d42c3242 KNC: all masked_load_*() and masked_store_*() functions need to do unaligned accesses 2012-08-01 14:37:25 -07:00
Jean-Luc Duprat
aecd6e0878 All the smear(), setzero() and undef() APIs are now templated on the return type.
Modified ISPC's internal mangling to pass these through unchanged.
Tried hard to make sure this is not going to introduce an ABI change.
2012-07-17 17:06:36 -07:00
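A sketch of the templating scheme (shown with __m512i standing in for the header's vector types): the return type becomes a template parameter, so the compiler's mangled calls resolve unchanged while each target specializes per type.

    #include <immintrin.h>
    #include <stdint.h>

    // The unspecialized template is declared but never defined...
    template <class RetVecType> RetVecType __smear_i32(int32_t v);

    // ...and each vector type provides its own specialization.
    template <> inline __m512i __smear_i32<__m512i>(int32_t v) {
        return _mm512_set1_epi32(v);
    }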
Jean-Luc Duprat
e09e953bbb Added a few functions: __setzero_i64() __cast_sext(__vec16_i64, __vec16_i32), __cast_zext(__vec16_i32)
__min_varying_int32(), __min_varying_uint32(), __max_varying_int32(), __max_varying_uint32()
Fixed the signature of __smear_i64() to match current codegen
2012-07-12 10:32:38 -07:00
Jean-Luc Duprat
df18b2a150 Fixed missing tmp var needed for use with gather intrinsic 2012-07-11 15:43:11 -07:00
Matt Pharr
216ac4b1a4 Stop factoring out constant offsets for gather/scatter if instr is available.
For KNC (gather/scatter), it's not helpful to factor base+offsets gathers
and scatters into base_ptr + {1/2/4/8} * varying_offsets + const_offsets.
Now, if a HW instruction is available for gather/scatter, we just factor
into base + {1/2/4/8} * offsets (if possible).  Not only is this simpler,
but it's also what we need in order to use the scale-by-2/4/8 addressing
that is available directly in those instructions.

Finishes issue #325.
2012-07-11 14:52:29 -07:00
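An AVX2 illustration of why base + {1/2/4/8} * offsets is the form the hardware wants: the scale is an immediate in the gather instruction, and there is no separate constant-offset operand to factor into.

    #include <immintrin.h>

    // vgatherdps loads base[offsets[i]] per lane; the scale (4 here)
    // must be a compile-time constant of 1, 2, 4, or 8.
    static inline __m256 gatherFloats(const float *base, __m256i offsets) {
        return _mm256_i32gather_ps(base, offsets, 4);
    }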
Matt Pharr
ec0280be11 Rename gather/scatter_base_offsets functions to *factored_base_offsets*.
No functional change; just preparation for having a path that doesn't
factor the offsets into constant and varying parts, which will be better
for AVX2 and KNC.
2012-07-11 14:16:39 -07:00
Jean-Luc Duprat
7a7c54bd59 Minor fixes to knc.h that resulted from integrating bea88ab122 2012-07-10 16:10:48 -07:00
Jean-Luc Duprat
bea88ab122 Integrated changes from mmp/and-fold-opt:
Add peephole optimization to eliminate some mask AND operations.

On KNC, the various vector comparison instructions can optionally
be masked; if a mask is provided, the result is effectively that
the value returned is the AND of the mask with the result of the
comparison.

This change adds an optimization pass to the C++ backend that looks
for vector ANDs where one operand is a comparison and rewrites
them--e.g. "and(equalfloat(a, b), c)" is changed to
"_equal_float_and_mask(a, b, c)", saving an instruction in the end.

Issue #319.

Merge commit '8ef6bc16364d4c08aa5972141748110160613087'

Conflicts:
	examples/intrinsics/knc.h
	examples/intrinsics/sse4.h
2012-07-10 10:33:24 -07:00
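A sketch of the rewrite's effect, using AVX-512 intrinsic names (the KNC forms are analogous): the mask operand of the compare performs the AND, so the separate mask AND disappears.

    #include <immintrin.h>

    // before: _mm512_kand(_mm512_cmpeq_ps_mask(a, b), c) -- compare, then AND
    // after:  one masked compare does both.
    static inline __mmask16 equalFloatAndMask(__m512 a, __m512 b, __mmask16 c) {
        return _mm512_mask_cmp_ps_mask(c, a, b, _CMP_EQ_OQ);
    }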