Commit Graph

1354 Commits

Author SHA1 Message Date
Matt Pharr
5b20b06bd9 Add avg_{up,down}_int{8,16} routines to stdlib
These compute the average of two given values, rounding up and down,
respectively, if the result isn't exact.  When possible, these are
mapped to target-specific intrinsics (PADD[BW] on IA and VH[R]ADD[US]
on NEON.)

A subsequent commit will add pattern-matching to generate calls to
these intrinsincs when the corresponding patterns are detected in the
IR.)
2013-08-06 08:41:12 -07:00
Matt Pharr
4f48d3258a Documentation updates for NEON 2013-07-31 20:06:04 -07:00
Matt Pharr
d9c38b5c1f Remove support for using SVML for math lib routines.
This path was poorly maintained and wasn't actually available on most
targets.
2013-07-31 06:56:48 -07:00
Matt Pharr
d3c567503b Remove support for building with LLVM 3.1 2013-07-31 06:46:45 -07:00
Matt Pharr
d7562d3836 Merge branch 'master' into arm 2013-07-31 06:38:17 -07:00
Matt Pharr
48ff03112f Remove __pause from stdlib_core() in utils.m4.
It wasn't ever being used, and was breaking compilation on ARM.
2013-07-30 08:44:22 -07:00
Matt Pharr
ab3b633733 Add 8-bit and 16-bit specialized NEON targets.
Like SSE4-8 and SSE4-16, these use 8-bit and 16-bit values for mask
elements, respectively, and thus should generate the best code when used
for computation with datatypes of those sizes.
2013-07-30 08:44:16 -07:00
Matt Pharr
b6df447b55 Add reduce_add() for int8 and int16 types.
This maps to specialized instructions (e.g. PSADBW) when available.
2013-07-25 09:46:01 -07:00
Matt Pharr
2d063925a1 Explicitly call the PBLENDVB intrinsic for i8 blending with sse4-8.
This is slightly cleaner than trunc-ing the i8 mask to i1 and using
a vector select.  (And is probably more safe in terms of good code.)
2013-07-25 09:46:01 -07:00
Matt Pharr
bba84f247c Improved optimization of vector select instructions.
Various LLVM optimization passes are turning code like:

%cmp = icmp lt <8 x i32> %foo, %bar
%cmp32 = sext <8 x i1> %cmp to <8 x i32>
. . .
%cmp1 = trunc <8 x i32> %cmp32 to <8 x i1>
%result = select <8 x i1> %cmp1, . . .

Into:

%cmp = icmp lt <8 x i32> %foo, %bar
%cmp32 = zext <8 x i1> %cmp to <8 x i32>   # note: zext
. . .
%cmp1 = icmp ne <8 x i32> %cmp32, zeroinitializer
%result = select <8 x i1> %cmp1, …

Which in turn isn't matched well by the LLVM code generators, which
in turn leads to fairly inefficient code.  (i.e. it doesn't just emit
a vector compare and blend instruction.)

Also, renamed VSelMovmskOptPass to InstructionSimplifyPass to better
describe its functionality.
2013-07-25 09:46:01 -07:00
Matt Pharr
780b0dfe47 Add SSE4-16 target.
Along the lines of sse4-8, this is an 8-wide target for SSE4, using
16-bit elements for the mask.  It's thus (in principle) the best
target for SIMD computation with 16-bit datatypes.
2013-07-25 09:46:01 -07:00
Matt Pharr
04d61afa23 Fix bug in lEmitVaryingSelect() for targets with i1 mask types.
Commit 53414f12e6 introduced a but where lEmitVaryingSelect() would
try to truncate a vector of i1s to a vector of i1s, which in turn
made LLVM's IR analyzer unhappy.
2013-07-25 09:45:20 -07:00
Dmitry Babokin
663ebf7857 Merge pull request #551 from mmp/constfold
Improvements to constant folding.
2013-07-24 10:27:04 -07:00
Matt Pharr
53414f12e6 Add SSE4 target optimized for computation with 8-bit datatypes.
This change adds a new 'sse4-8' target, where programCount is 16 and
the mask element size is 8-bits.  (i.e. the most appropriate sizing of
the mask for SIMD computation with 8-bit datatypes.)
2013-07-23 17:30:32 -07:00
Matt Pharr
15a3ef370a Use @llvm.readcyclecounter to implement stdlib clock() function.
Also added a test for the clock builtin.
2013-07-23 17:24:57 -07:00
Matt Pharr
c14659c675 Fix bug in lGetConstantInt() in parse.yy.
Previously, we weren't handling signed/unsigned constant types correctly.
2013-07-23 17:24:57 -07:00
Matt Pharr
f7f281a256 Choose type for integer literals to match the target mask size (if possible).
On a target with a 16-bit mask (for example), we would choose the type
of an integer literal "1024" to be an int16.  Previously, we used an int32,
which is a worse fit and leads to less efficient code than an int16
on a 16-bit mask target.  (However, we'd still give an integer literal
1000000 the type int32, even in a 16-bit target.)

Updated the tests to still pass with 8 and 16-bit targets, given this
change.
2013-07-23 17:24:50 -07:00
Matt Pharr
9ba49eabb2 Reduce estimated costs for 8 and 16-bit min() and max() in stdlib.
These actually compile to a single instruction.
2013-07-23 16:52:43 -07:00
Matt Pharr
e7abf3f2ea Add support for mask vectors of 8 and 16-bit element types.
There were a number of places throughout the system that assumed that the
execution mask would only have either 32-bit or 1-bit elements.  This
commit makes it possible to have a target with an 8- or 16-bit mask.
2013-07-23 16:50:11 -07:00
Matt Pharr
83e1630fbc Add support for fast division of varying int values by small constants.
For varying int8/16/32 types, divides by small constants can be
implemented efficiently through multiplies and shifts with integer
types of twice the bit-width; this commit adds this optimization.
    
(Implementation is based on Halide.)
2013-07-23 16:49:56 -07:00
Matt Pharr
0277ba1aaa Improve warnings for right shift by varying amounts.
Fixes:
- Don't issue a warning when the shift is a by the same amount in all
  vector lanes.
- Do issue a warning when it's a compile-time constant but the values
  are different in different lanes.

Previously, we warned iff the shift amount wasn't a compile-time constant.
2013-07-23 16:49:07 -07:00
Matt Pharr
753c001e69 Merge branch 'master' of https://github.com/ispc/ispc into constfold 2013-07-23 16:12:04 -07:00
Dmitry Babokin
10c0b42d0d Merge pull request #549 from mmp/fix-tot
Fix build with LLVM top-of-tree.
2013-07-23 09:14:08 -07:00
Matt Pharr
564e61c828 Improvements to constant folding.
We can now do constant folding with all basic datatypes (the previous
implementation handled int32 well, but had limited, if any, coverage
for other datatypes.)

Reduced a bit of repeated code in the constant folding implementation
through template helper functions.
2013-07-22 16:12:02 -07:00
Matt Pharr
946c39a5df Fix build with LLVM top-of-tree.
The DIBuilder::getCU() method has been removed; we now just store the
compilation unit returned when we call DIBuilder::createCompileUnit.
2013-07-22 15:42:52 -07:00
Jean-Luc Duprat
2948e84846 Merge pull request #547 from mmp/arm-merge
Add ARM NEON support
2013-07-22 09:24:16 -07:00
Matt Pharr
068fd8098c Explicitly set armv7-eabi target triple on ARM.
This lets the compiler generate FMA instructions, which seems
desirable.
2013-07-20 11:19:10 -07:00
Matt Pharr
d7b0c5794e Add support for ARM NEON targets.
Initial support for ARM NEON on Cortex-A9 and A15 CPUs.  All but ~10 tests
pass, and all examples compile and run correctly.  Most of the examples
show a ~2x speedup on a single A15 core versus scalar code.

Current open issues/TODOs
- Code quality looks decent, but hasn't been carefully examined.  Known
  issues/opportunities for improvement include:
  - fp32 vector divide is done as a series of scalar divides rather than
    a vector divide (which I believe exists, but I may be mistaken.)
    This is particularly harmful to examples/rt, which only runs ~1.5x
    faster with ispc, likely due to long chains of scalar divides.
  - The compiler isn't generating a vmin.f32 for e.g. the final scalar
    min in reduce_min(); instead it's generating a compare and then a
    select instruction (and similarly elsewhere).
  - There are some additional FIXMEs in builtins/target-neon.ll that
    include both a few pieces of missing functionality (e.g. rounding
    doubles) as well as places that deserve attention for possible
    code quality improvements.

- Currently only the "cortex-a9" and "cortex-15" CPU targets are
  supported; LLVM supports many other ARM CPUs and ispc should provide
  access to all of the ones that have NEON support (and aren't too
  obscure.)

- ~5 of the reduce-* tests hit an assertion inside LLVM (unfortunately
   only when the compiler runs on an ARM host, though).

- The Windows build hasn't been tested (though I've tried to update
  ispc.vcxproj appropriately).  It may just work, but will more likely
  have various small issues.)

- Anything related to 64-bit ARM has seen no attention.
2013-07-19 23:07:24 -07:00
Matt Pharr
b007bba59f Replace inline assembly in task system with equivalent gcc intrinsics.
gcc/icc build only: the Windows build still uses the Win32 calls for
these.
2013-07-19 23:07:24 -07:00
Dmitry Babokin
abf43ad01d Merge pull request #546 from dbabokin/release
Release 1.4.4
2013-07-19 18:49:07 -07:00
Dmitry Babokin
922895de69 Changing ISPC version to 1.4.5dev 2013-07-19 18:47:43 -07:00
Dmitry Babokin
28f0bce9f2 Release 1.4.4 v1.4.4 2013-07-19 16:22:10 -07:00
Dmitry Babokin
0f82f216a2 Merge pull request #544 from mmp/master
Handle SHL with a constant vector in LLVMVectorIsLinear().
2013-07-18 11:46:11 -07:00
Matt Pharr
7454b1399c Handle SHL with a constant vector in LLVMVectorIsLinear().
LLVM3.4 seems to be turning multiplies by a constant power of 2 into
the equivalent SHL, which was in turn thwarting the pattern matching
for turning gathers/scatters into vector loads/stores.
2013-07-17 14:12:43 -07:00
jbrodman
4ebf46bd63 Merge pull request #543 from mmp/master
Fix build with LLVM top-of-tree
2013-07-17 10:38:06 -07:00
Matt Pharr
f1cce0ef5f Fix build with LLVM top-of-tree 2013-07-17 09:25:00 -07:00
Dmitry Babokin
8c9e873c10 Merge pull request #540 from dbabokin/embree_bug
Fix for the bug introduced by --intrumentation fix
2013-07-04 10:45:06 -07:00
Dmitry Babokin
c85439e7bb Fix for the bug introduced by --intrumentation fix 2013-07-04 21:41:57 +04:00
Ilia Filippov
fd7f87b55e Supporting perf.py on Windows and some small corrections in it 2013-07-02 19:23:18 +04:00
Dmitry Babokin
8be4128c5a Merge pull request #534 from ifilippov/perf
add script for measuring performance
2013-07-01 05:09:03 -07:00
Ilia Filippov
806e37338c add script for measuring performance 2013-07-01 13:30:49 +04:00
Dmitry Babokin
ec1095624a Merge pull request #527 from tkoziara/master
examples/sort added
2013-06-25 10:11:39 -07:00
Tomasz Koziara
a23d69ebe8 Copyright changed to simplify legal matters. 2013-06-25 17:28:27 +01:00
Dmitry Babokin
0aff61ffc6 Merge pull request #533 from dbabokin/patch
Quick fix for LLVM 3.3 patch
2013-06-25 08:50:32 -07:00
Dmitry Babokin
05aa540984 Quick fix for LLVM 3.3 patch 2013-06-25 19:49:41 +04:00
Dmitry Babokin
033e83e490 Merge pull request #532 from dbabokin/release_1_4_3
Release 1.4.3
2013-06-25 07:42:08 -07:00
Dmitry Babokin
594485c38c Release 1.4.3 v1.4.3 2013-06-25 18:38:21 +04:00
Dmitry Babokin
d52e2d5a8d License update (just dates) 2013-06-25 17:02:42 +04:00
Dmitry Babokin
1e5d852e2f Merge pull request #531 from ifilippov/qsize_fail
replacement of qsize due to it's fails on MacOS
2013-06-25 05:36:45 -07:00
Ilia Filippov
cc32d913a0 replacement of qsize due to it's fails on MacOS 2013-06-25 16:27:25 +04:00