Initial support for ARM NEON on Cortex-A9 and A15 CPUs. All but ~10 tests
pass, and all examples compile and run correctly. Most of the examples
show a ~2x speedup on a single A15 core versus scalar code.
Current open issues/TODOs:
- Code quality looks decent, but hasn't been carefully examined. Known
issues/opportunities for improvement include:
- fp32 vector divide is done as a series of scalar divides rather than
a vector divide (which I believe exists, but I may be mistaken).
This is particularly harmful to examples/rt, which only runs ~1.5x
faster with ispc, likely due to long chains of scalar divides.
- The compiler isn't generating a vmin.f32 for e.g. the final scalar
min in reduce_min(); instead it's generating a compare and then a
select instruction (and similarly elsewhere); a sketch follows this list.
- There are some additional FIXMEs in builtins/target-neon.ll that
include both a few pieces of missing functionality (e.g. rounding
doubles) as well as places that deserve attention for possible
code quality improvements.
- Currently only the "cortex-a9" and "cortex-a15" CPU targets are
supported; LLVM supports many other ARM CPUs and ispc should provide
access to all of the ones that have NEON support (and aren't too
obscure).
- ~5 of the reduce-* tests hit an assertion inside LLVM (unfortunately,
only when the compiler runs on an ARM host).
- The Windows build hasn't been tested (though I've tried to update
ispc.vcxproj appropriately). It may just work, but will more likely
have various small issues.
- Anything related to 64-bit ARM has seen no attention.
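For reference, the missing vmin.f32 noted above corresponds to NEON's pairwise-min operations. A minimal C++ sketch of a horizontal fp32 min (illustrative only; not what builtins/target-neon.ll currently emits, and reduce_min_f32x4 is a hypothetical name):

    #include <arm_neon.h>

    // Horizontal min of four fp32 lanes: two pairwise-min steps rather than
    // a compare followed by a select for the final scalar min.
    static inline float reduce_min_f32x4(float32x4_t v) {
        float32x2_t m = vpmin_f32(vget_low_f32(v), vget_high_f32(v));
        m = vpmin_f32(m, m);
        return vget_lane_f32(m, 0);
    }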
Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
Fixed usage of loadunpack and packstore to use the proper memory offsets
Fixed implementations of __masked_load_*() and __masked_store_*(), which were incorrectly (un)packing the loaded lanes
Cleaned up usage of _mm512_undefined_*(); it is now mostly confined to the constructors
Minor cleanups
knc2x.h:
Fixed usage of loadunpack and packstore to use the proper memory offsets (see the sketch below)
Fixed implementations of __masked_load_*() and __masked_store_*(), which were incorrectly (un)packing the loaded lanes
Properly pick up on ISPC_FORCE_ALIGNED_MEMORY when --opt=force-aligned-memory is used
__any() and __none() speedups.
Cleaned up usage of _mm512_undefined_*(); it is now mostly confined to the constructors
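For context on the loadunpack/packstore offset fix above: the KNC idiom pairs a *lo and a *hi intrinsic, and the *hi half must be given the address advanced by 64 bytes (the next cache line). A minimal C++ sketch of that idiom (illustrative only; load16f/store16f are hypothetical names, not the actual knc.h code):

    #include <immintrin.h>

    // Unaligned 16 x fp32 load/store on KNC via loadunpack/packstore pairs;
    // the *hi variants take the address of the following 64-byte line.
    static inline __m512 load16f(const void *p) {
        __m512 v = _mm512_undefined_ps();
        v = _mm512_loadunpacklo_ps(v, p);
        v = _mm512_loadunpackhi_ps(v, (const char *)p + 64);
        return v;
    }

    static inline void store16f(void *p, __m512 v) {
        _mm512_packstorelo_ps(p, v);
        _mm512_packstorehi_ps((char *)p + 64, v);
    }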
Cilk (cilk_for), OpenMP (#pragma omp parallel for), and TBB (tbb::task_group and tbb::parallel_for),
as well as a new pthreads-based model that fully subscribes the machine (good for KNC).
With major contributions from Ingo Wald and James Brodman.
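As a rough illustration of how a launch of taskCount tasks can map onto one of these models, here is a hedged TBB sketch; TaskFn, launchTasks, and the parameter list are hypothetical and simplified, not the actual tasksys.cpp interface:

    #include <tbb/parallel_for.h>

    typedef void (*TaskFn)(void *data, int threadIndex, int threadCount,
                           int taskIndex, int taskCount);

    // Hypothetical mapping of a task launch onto tbb::parallel_for; each task
    // index is handed to TBB as one iteration of the parallel loop.
    static void launchTasks(TaskFn f, void *data, int taskCount) {
        tbb::parallel_for(0, taskCount, [=](int i) {
            f(data, /*threadIndex=*/0, /*threadCount=*/1, i, taskCount);
        });
    }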
Introduced knc2x.h, which supports 2x interleaved code generation for KNC (use the target generic-32).
This implementation is even more experimental and incomplete than knc.h, but it is already useful (mandelbrot works, for example).
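One way to picture the 2x scheme: a 32-wide generic vector is carried in two native 16-wide KNC registers, and each operation is applied to both halves. A hypothetical sketch (the actual type and function names in knc2x.h may differ):

    #include <immintrin.h>

    // 32-wide fp32 vector held as a pair of 16-wide KNC registers.
    struct vec32_f {
        __m512 v0, v1;
    };

    // Each generic-32 operation expands to the same KNC instruction twice.
    static inline vec32_f add32f(vec32_f a, vec32_f b) {
        vec32_f r;
        r.v0 = _mm512_add_ps(a.v0, b.v0);
        r.v1 = _mm512_add_ps(a.v1, b.v1);
        return r;
    }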
knc.h:
Switch to new intrinsic names _mm512_set_1to16_epi32() -> _mm512_set1_epi32(), etc...
Fix the declaration of the unspecialized template for __smear_*(), __setzero_*(), __undef_*()
Explicitly marked a few vectors in __load<>() as _mm512_undefined_*()
Fixed some implementations of __smear_*(), __setzero_*(), __undef_*() to remove unnecessary dependent instructions.
Implemented ISPC reductions by simply calling the existing intrinsic reductions, which are slightly more efficient than our previous implementation; also added reductions for double types (a sketch follows this list).
__vec16_i64 improved with the addition of the following: __extract_element(), __insert_element(), __sub(), __mul(),
__sdiv(), __udiv(), __and(), __or(), __xor(), __shl(), __lshr(), __ashr(), __select()
Fixed a bug in the __mul(__vec16_i64, __vec16_i32) implementation
Constructors are all explicitly inlined; the copy constructor and operator=() are explicitly provided
Loads and stores for __vec16_i64 and __vec16_d use aligned instructions when possible
__rotate_i32() now has a vector implementation
Added several reductions: __reduce_add_i32(), __reduce_min_i32(), __reduce_max_i32(),
__reduce_add_f(), __reduce_min_f(), __reduce_max_f(), as well as
__min_varying_int32(), __min_varying_uint32(), __max_varying_int32(), __max_varying_uint32()
Fixed the signature of __smear_i64() to match current codegen
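A minimal sketch of the "call the existing intrinsic reductions" approach mentioned in this list (illustrative only; names and signatures simplified to raw __m512/__m512i rather than the header's wrapper types):

    #include <immintrin.h>

    // Reductions expressed in terms of the compiler-provided _mm512_reduce_*
    // helpers instead of hand-rolled shuffle/extract sequences.
    static inline int reduce_add_i32(__m512i v) { return _mm512_reduce_add_epi32(v); }
    static inline float reduce_min_f(__m512 v)  { return _mm512_reduce_min_ps(v); }
    static inline float reduce_max_f(__m512 v)  { return _mm512_reduce_max_ps(v); }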
For KNC (gather/scatter), it's not helpful to factor base+offsets gathers
and scatters into base_ptr + {1/2/4/8} * varying_offsets + const_offsets.
Now, if a HW instruction is available for gather/scatter, we just factor
into base + {1/2/4/8} * offsets (if possible). Not only is this simpler,
but it's also exactly the form we need in order to pass a value along for the
scale-by-{1/2/4/8} that is available directly in those instructions.
Finishes issue #325.
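For context: the hardware gather/scatter instructions themselves compute base + scale * index with scale limited to 1/2/4/8, so factoring addresses into exactly that shape lets the scale value be passed straight through. A minimal sketch using the AVX-512-style intrinsic signature for illustration (gather_scaled is a hypothetical name):

    #include <immintrin.h>

    // Gathers 16 floats from byte addresses base + 4 * offsets[i]; the scale
    // operand of the hardware gather carries the {1/2/4/8} factor directly.
    static inline __m512 gather_scaled(const float *base, __m512i offsets) {
        return _mm512_i32gather_ps(offsets, base, /*scale=*/4);
    }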
No functional change; just preparation for having a path that doesn't
factor the offsets into constant and varying parts, which will be better
for AVX2 and KNC.
Add peephole optimization to eliminate some mask AND operations.
On KNC, the various vector comparison instructions can optionally
be masked; if a mask is provided, the result is effectively that
the value returned is the AND of the mask with the result of the
comparison.
This change adds an optimization pass to the C++ backend that looks
for vector ANDs where one operand is a comparison and rewrites
them; e.g. "__and(__equal_float(a, b), c)" is changed to
"__equal_float_and_mask(a, b, c)", saving an instruction in the end.
Issue #319.
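The saved instruction is easiest to see at the intrinsics level; a minimal C++ sketch in KNC/AVX-512 style (illustrative only; cmp_then_and and masked_cmp are hypothetical names):

    #include <immintrin.h>

    // Without the rewrite: a compare producing a mask, then a separate kand.
    static inline __mmask16 cmp_then_and(__m512 a, __m512 b, __mmask16 c) {
        return _mm512_kand(_mm512_cmpeq_ps_mask(a, b), c);
    }

    // With the rewrite: the comparison itself is masked by c, and its result
    // is already ANDed with c, so the separate kand disappears.
    static inline __mmask16 masked_cmp(__m512 a, __m512 b, __mmask16 c) {
        return _mm512_mask_cmpeq_ps_mask(c, a, b);
    }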
Merge commit '8ef6bc16364d4c08aa5972141748110160613087'
Conflicts:
examples/intrinsics/knc.h
examples/intrinsics/sse4.h