Commit Graph

180 Commits

Author SHA1 Message Date
Matt Pharr
cc86e4a7d2 Disable coalescing optimizations when using generic target.
The main issue is that they end up generating a number of smaller
vector ops (e.g. 4-wide and 8-wide on the 16-wide generic target),
which the examples/intrinsics implementations don't currently
support.

This fixes a number of failing tests for now; it may be worth
generalizing the code in examples/intrinsics at some point, since,
as a general principle (e.g. if generating LLVM IR output), the
coalescing optimizations are still desirable.

Issue #175.
2012-02-13 16:52:01 -08:00
Matt Pharr
e864447e4a Fix silly bug in vector scale extraction optimization.
(Introduced in f20a2d2ee.  How did this ever pass tests?)
2012-02-13 12:06:45 -08:00
Matt Pharr
73bf552cd6 Add support for coalescing memory accesses from gathers.
There are two related optimizations that happen now.  (These
currently only apply for gathers where the mask is known to be
all on, and to gathers that are accessing 32-bit sized elements,
but both of these may be generalized in the future.)

First, for any single gather, we are now more flexible in mapping it
to individual memory operations.  Previously, we would only map it
either to a general gather (one scalar load per SIMD lane) or to an
unaligned vector load (if the program instances could be determined
to be accessing a sequential set of locations in memory).

Now, we are able to break gathers into scalar, 2-wide (i.e. 64-bit),
4-wide, or 8-wide loads.  Further, we now generate code that shuffles
these loads around.  Doing fewer, larger loads in this manner, when
possible, can be more efficient.

Second, we can coalesce memory accesses across multiple gathers. If 
we have a series of gathers without any memory writes in the middle,
then we try to analyze their reads collectively and choose an efficient
set of loads for them.  Not only does this help if different gathers
reuse values from the same location in memory, but it's specifically
helpful when data with AOS layout is being accessed; in this case,
we're often able to generate wide vector loads and appropriate shuffles
automatically.
2012-02-10 13:10:39 -08:00
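Below is a hedged, simplified sketch of the first optimization described in
commit 73bf552cd6 above (plain C++, not the actual opt.cpp code; planLoads
and coalescedGather are hypothetical names): lanes with consecutive 32-bit
offsets are grouped into 1/2/4/8-element loads, and the loaded values are
then copied ("shuffled") into the gather result.

    // Hypothetical sketch, not ispc's actual implementation: given an all-on
    // gather of 32-bit elements, greedily group lanes whose offsets are
    // consecutive so that a run of lanes can be satisfied by one wider load
    // instead of one scalar load per lane.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct Load { int32_t startOffset; int numElements; };  // one planned load

    // Plan loads for per-lane element offsets (in units of 4-byte elements).
    static std::vector<Load> planLoads(const std::vector<int32_t> &offsets) {
        std::vector<Load> loads;
        for (size_t i = 0; i < offsets.size();) {
            size_t run = 1;
            while (i + run < offsets.size() && run < 8 &&
                   offsets[i + run] == offsets[i] + (int32_t)run)
                ++run;
            // Round the run down to 1, 2, 4, or 8 elements, mirroring the
            // scalar / 64-bit / 4-wide / 8-wide loads described above.
            int size = run >= 8 ? 8 : run >= 4 ? 4 : run >= 2 ? 2 : 1;
            loads.push_back({offsets[i], size});
            i += (size_t)size;
        }
        return loads;
    }

    // Issue the planned loads, then "shuffle" (copy) the loaded values into
    // the gather's result lanes.
    static void coalescedGather(const float *base, const std::vector<int32_t> &offsets,
                                float *result) {
        size_t lane = 0;
        for (const Load &ld : planLoads(offsets)) {
            float tmp[8];
            std::memcpy(tmp, base + ld.startOffset, ld.numElements * sizeof(float));
            for (int j = 0; j < ld.numElements; ++j)
                result[lane++] = tmp[j];
        }
    }

The real pass works on LLVM IR and emits vector loads plus shuffles rather
than memcpy and scalar copies, but the grouping decision is the same idea.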
Matt Pharr
f20a2d2ee9 Generalize code to extract scales by 2/4/8 from addressing calculations.
Now, if we have a scale by 16, say, we extract out the scalar scale
of 8 and leave an explicit scale by 2.
2012-02-10 12:35:44 -08:00
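A minimal sketch of that factoring, with a hypothetical helper name (this is
not the opt.cpp code): pull the largest of 8/4/2 out of a constant scale and
leave the remainder as an explicit multiply, so 16 becomes a hardware scale
of 8 times an explicit 2.

    #include <cstdint>
    #include <utility>

    // Hypothetical helper: factor a constant scale into (hwScale, remainder),
    // where hwScale is the largest of 8/4/2 that divides it and remainder is
    // left as an explicit multiply.  e.g. 16 -> {8, 2}.
    static std::pair<int32_t, int32_t> extractHWScale(int32_t scale) {
        const int32_t candidates[] = { 8, 4, 2 };
        for (int32_t hw : candidates)
            if (scale % hw == 0)
                return std::make_pair(hw, scale / hw);
        return std::make_pair(1, scale);  // no free x86 scale applies
    }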
Matt Pharr
0c25bc063c Add lGEPInst() utility routine to opt.cpp.
Deal with the messiness of LLVM API changes when creating
these in a single place.
2012-02-10 12:32:15 -08:00
Matt Pharr
5b4673e8eb Fix build with LLVM 2.9. 2012-02-07 08:37:13 -08:00
Matt Pharr
0432f97555 Fix build with LLVM 3.1 TOT 2012-01-31 14:10:07 -08:00
Matt Pharr
f73abb05a7 Fix bug in handling scatters where all instances go to the same location.
Previously, we'd pick one lane and generate a regular store for its value.
This was the wrong thing to do, since we also should have been checking
that the mask was on (for the lane that was chosen).  This bug didn't
become evident until the scalar target was added, since many stores fall
into this case with that target.

Now, we just leave those as regular scatters.

Fixes most of the failing tests for the scalar target listed in issue #167.
2012-01-31 11:06:14 -08:00
Matt Pharr
d71c49494f Missed pass that should be skipped when pseudo memory ops are supposed to be left unchanged. 2012-01-31 11:02:23 -08:00
Matt Pharr
1eec27f890 Scalar target fixes.
Don't issue warnings about all instances writing to the same location if
there is only one program instance in the gang.

Be sure to report that all values are equal in one-element vectors in
LLVMVectorValuesAllEqual().

Issue #166.
2012-01-31 08:52:11 -08:00
Matt Pharr
b7f17d435f Fix crash in gather/scatter optimization pass. 2012-01-27 14:44:35 -08:00
Matt Pharr
5893a9c49d Remove incorrect assert 2012-01-27 09:14:45 -08:00
Matt Pharr
177e6312b4 Fix build with LLVM ToT (ConstantVector::getVectorElements() is gone now). 2012-01-27 09:07:58 -08:00
Matt Pharr
a5b7fca7e0 Extract constant offsets from gather/scatter base+offsets offset vectors.
When we're able to turn a general gather/scatter into the "base + offsets"
form, we now try to extract out any constant components of the offsets and
then pass them as a separate parameter to the gather/scatter function
implementation.

We then carefully emit code for the addressing calculation so that
these constant offsets match the patterns LLVM uses to detect this case;
as a result, the constant offsets are often encoded directly in the
instruction's addressing calculation, saving the arithmetic instructions
that would otherwise compute them.

Improves performance of stencil by ~15%.  Other workloads unchanged.
2012-01-24 14:41:15 -08:00
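As a hedged illustration of the addressing this enables (scalarized C++ with
hypothetical names, not the actual {gather,scatter}_base_offsets_* builtins):
the constant part of the offsets is carried as a separate scalar, so each
lane's load can fold it into an x86 [base + index*scale + disp] addressing
mode.

    #include <cstdint>

    // Hypothetical, scalarized stand-in for the real builtins: the varying
    // offsets are split into a per-lane variable part plus a compile-time
    // constant displacement, passed separately.
    static void gatherBaseOffsets32(float *result, const unsigned char *base,
                                    int32_t scale, const int32_t *varOffsets,
                                    int32_t constOffset, int width) {
        for (int lane = 0; lane < width; ++lane)
            // With constOffset a compile-time constant, this load can become a
            // single mov with the displacement folded into [base + idx*scale + disp].
            result[lane] = *(const float *)(base +
                                            (int64_t)scale * varOffsets[lane] +
                                            constOffset);
    }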
Matt Pharr
7be2c399b1 Rename various optimization passes to have more descriptive names.
No functionality change.
2012-01-23 14:49:48 -08:00
Matt Pharr
d6337b3b22 Code cleanups in opt.cpp; no functional change 2012-01-23 14:36:32 -08:00
Matt Pharr
91ac3b9d7c Back out WIP changes to opt.cpp that were inadvertently checked in. 2012-01-21 07:34:53 -08:00
Matt Pharr
d65bf2eb2f Doxygen number bump and release notes for 1.1.3 2012-01-20 17:04:16 -08:00
Matt Pharr
4388338dad Fix performance regression introduced in be0c77d556
Effectively, the patterns that detected when the offsets of a gather
or scatter in base+offsets form were actually a multiple of 2/4/8
were no longer working.

This change not only fixes the regression, but also expands the set of
patterns that are matched.  For example, given offsets of
the form 4*v1 + 16*v2, it identifies a scale of 4 and new offsets
of v1 + 4*v2.

This fix makes the volume renderer run 1.19x faster, and noise 1.54x
faster.
2012-01-19 17:57:59 -08:00
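A minimal sketch of the expanded matching, with a hypothetical helper (not
the actual pattern-matching code): find the largest 2/4/8 factor common to
all of the constant multipliers in the offset sum and divide it out.

    #include <cstdint>
    #include <vector>

    // Hypothetical helper: given the constant multipliers of the terms in an
    // offset sum, divide out the largest 2/4/8 factor common to all of them.
    // {4, 16} -> returns 4 and leaves multipliers {1, 4}, i.e.
    // 4*v1 + 16*v2 == 4*(v1 + 4*v2).
    static int32_t extractCommonScale(std::vector<int32_t> &multipliers) {
        const int32_t candidates[] = { 8, 4, 2 };
        for (int32_t scale : candidates) {
            bool dividesAll = true;
            for (int32_t m : multipliers)
                if (m % scale != 0) { dividesAll = false; break; }
            if (dividesAll) {
                for (int32_t &m : multipliers)
                    m /= scale;
                return scale;
            }
        }
        return 1;  // no common 2/4/8 factor
    }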
Matt Pharr
68f6ea8def For << and >> with C++, detect when all instances are shifting by the same amount.
In this case, we now emit calls to potentially-specialized functions for the
left/right shifts that take a single integer value for the shift amount.  These
in turn can be matched to the corresponding intrinsics for the SSE target.

Issue #145.
2012-01-19 10:04:32 -07:00
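A hedged illustration of why the uniform-amount case matters on SSE
(illustrative intrinsics code, not ispc's generated output): SSE4 has no
per-lane variable shift, so a varying shift amount falls back to scalar
code, while a uniform amount maps to a single _mm_sll_epi32.

    #include <emmintrin.h>
    #include <cstdint>

    // All lanes shift left by the same (run-time) amount: one instruction.
    static __m128i shlUniform(__m128i v, int32_t amount) {
        return _mm_sll_epi32(v, _mm_cvtsi32_si128(amount));
    }

    // General case: per-lane shift amounts, handled with scalar code on SSE4.
    static __m128i shlVarying(__m128i v, __m128i amounts) {
        alignas(16) int32_t vals[4], amts[4];
        _mm_store_si128((__m128i *)vals, v);
        _mm_store_si128((__m128i *)amts, amounts);
        for (int i = 0; i < 4; ++i)
            vals[i] <<= amts[i];
        return _mm_load_si128((const __m128i *)vals);
    }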
Matt Pharr
3bf3ac7922 Be more conservative about using blending in place of masked store.
More specifically, we do a proper masked store (rather than a load-
blend-store) unless we can determine that we're accessing a stack-allocated
"varying" variable.  This fixes a number of nefarious bugs where given
code like:

    uniform float a[21];
    foreach (i = 0 ... 21)
        a[i] = 0;

We'd use a blend and in turn read past the end of a[] in the last
iteration.

Also made slight changes to inlining in aobench; with this change applied,
those keep compile times to ~5s, versus ~45s without them.

Fixes issue #160.
2012-01-17 23:42:22 -07:00
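A hedged, plain-C++ illustration of the difference for the partial last
iteration above, assuming a 4-wide gang where only lane 20 of a[21] is still
active (these helpers are illustrative, not the generated code):

    #include <cstring>

    // Unsafe when the array ends mid-vector: reads (and rewrites) a[21..23].
    static void blendStore(float *ptr, const float newVal[4], const bool mask[4]) {
        float old[4];
        std::memcpy(old, ptr, sizeof(old));          // full-width read past the end
        float blended[4];
        for (int i = 0; i < 4; ++i)
            blended[i] = mask[i] ? newVal[i] : old[i];
        std::memcpy(ptr, blended, sizeof(blended));  // full-width write past the end
    }

    // Safe: only the enabled lanes touch memory.
    static void maskedStore(float *ptr, const float newVal[4], const bool mask[4]) {
        for (int i = 0; i < 4; ++i)
            if (mask[i])
                ptr[i] = newVal[i];
    }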
Matt Pharr
0f8eee9809 Fix cases in optimization code to not inadvertently match calls to func ptrs.
If we call a function pointer, CallInst::getCalledFunction() returns NULL; we
need to be careful about this case when we're matching various function calls
in optimization passes.

(Fixes a crash.)
2012-01-12 10:33:06 -08:00
Pierre-Antoine Lacaze
da9200fcee Fix alloca use on mingw. 2012-01-09 10:19:09 +01:00
Matt Pharr
be0c77d556 Detect more gather/scatter cases that are actually base+offsets.
We now recognize patterns like (ptr + offset1 + offset2) as being
cases we can handle with the base_offsets variants of the gather/scatter
functions.  (This can come up with multidimensional array indexing,
for example.)

Issue #150.
2012-01-08 14:06:44 -08:00
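A hedged sketch of the decomposition idea on a hypothetical mini-IR (scalar
offsets for simplicity; not the real pass, which works on LLVM IR with
vector offsets): peel nested adds apart into a single pointer base plus
accumulated integer offsets.

    #include <cstdint>

    // A tiny stand-in expression node; in the real pass these are LLVM IR values.
    struct Expr {
        enum Kind { Pointer, Constant, Add } kind;
        const void *ptr;            // valid when kind == Pointer
        int64_t value;              // valid when kind == Constant
        const Expr *lhs, *rhs;      // valid when kind == Add
    };

    // Peel nested adds apart into a single pointer base plus a summed offset.
    // The caller starts with *base == nullptr and *offset == 0 and must also
    // check that a base pointer was actually found.
    static bool decompose(const Expr *e, const void **base, int64_t *offset) {
        switch (e->kind) {
        case Expr::Pointer:
            if (*base != nullptr)
                return false;       // at most one pointer operand allowed
            *base = e->ptr;
            return true;
        case Expr::Constant:
            *offset += e->value;
            return true;
        case Expr::Add:
            return decompose(e->lhs, base, offset) &&
                   decompose(e->rhs, base, offset);
        }
        return false;
    }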
Matt Pharr
ff6971fb15 Use Assert() rather than assert() 2012-01-08 14:06:44 -08:00
Matt Pharr
71317e6aa6 Fix bug in gather/scatter optimization passes.
When flattening chains of insertelement instructions, we didn't
handle the case where the initial insertelement was to a constant
vector (with one value set and the other values undef).

Also generalized the "do all of the instances access the same location"
check to handle the case where some of them are accessing undef
locations; these are ignored in this check, as they should correspond to the
mask being off for that lane anyway.

Fixes issue #149.
2012-01-06 09:19:18 -08:00
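A hedged sketch of the generalized check (plain C++ with a hypothetical lane
representation, using std::optional to stand in for undef): lanes with undef
values are simply skipped when testing whether all instances access the same
location.

    #include <cstdint>
    #include <optional>
    #include <vector>

    // std::nullopt models an undef lane value.
    static bool allLanesEqualIgnoringUndef(
            const std::vector<std::optional<int64_t>> &lanes) {
        std::optional<int64_t> seen;
        for (const std::optional<int64_t> &lane : lanes) {
            if (!lane.has_value())
                continue;                      // undef lane: ignore it
            if (seen.has_value() && *seen != *lane)
                return false;
            seen = lane;
        }
        return true;                           // all defined lanes matched
    }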
Matt Pharr
8938e14442 Add support for emitting ~generic vectorized C++ code.
The compiler now supports an --emit-c++ option, which generates generic
vector C++ code.  To actually compile this code, the user must provide
C++ code that implements a variety of types and operations (e.g. adding
two floating-point vector values together, comparing them, etc).

There are two examples of this required code in examples/intrinsics:
generic-16.h is a "generic" 16-wide implementation that does everything
required with scalar math; it's useful for demonstrating the requirements of
the implementation.  Then, sse4.h shows a simple implementation of an SSE4
target that maps the emitted function calls to SSE intrinsics.

When using these example implementations with the ispc test suite,
all but one or two tests pass with gcc and clang on Linux and OSX.
There are currently ~10 failures with icc on Linux, and ~50 failures with
MSVC 2010.  (To be fixed in coming days.)

Performance varies: when running the examples through the sse4.h
target, some (e.g. options) have the same performance as when compiled with
--target=sse4 from ispc directly, while noise is 12% slower, rt is 26%
slower, and aobench is 2.2x slower.  The details of this haven't yet been
carefully investigated, but will be in coming days as well.

Issue #92.
2012-01-04 12:59:03 -08:00
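To give a feel for what such user-provided code looks like, here is a hedged
sketch of a scalar-math 16-wide float type and two operations of the kind
the emitted C++ calls (the names and exact required interface here are
illustrative; generic-16.h is the authoritative example):

    #include <cstdint>

    // Illustrative 16-wide float vector done entirely with scalar math.
    struct vec16_f32 {
        float v[16];
    };

    static inline vec16_f32 vec16_add(vec16_f32 a, vec16_f32 b) {
        vec16_f32 r;
        for (int i = 0; i < 16; ++i)
            r.v[i] = a.v[i] + b.v[i];
        return r;
    }

    // Comparison producing a 16-bit lane mask (one bit per program instance).
    static inline uint16_t vec16_less_than(vec16_f32 a, vec16_f32 b) {
        uint16_t mask = 0;
        for (int i = 0; i < 16; ++i)
            if (a.v[i] < b.v[i])
                mask |= (uint16_t)(1u << i);
        return mask;
    }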
Matt Pharr
dea13979e0 Fix bug in lIs248Splat() in opt.cpp 2012-01-04 11:55:02 -08:00
Matt Pharr
052d34bf5b Various cleanups to optimization code.
Stop using the PassManagerBuilder and instead add all of the passes directly in code here.
This currently leads to no change in behavior, but was useful when experimenting
with disabling the SROA pass when compiling to generic targets.
2012-01-04 11:54:44 -08:00
Matt Pharr
d4c5e82896 Add VSelMovMsk optimization pass.
Various peephole improvements to vector select instructions.
2012-01-04 11:52:27 -08:00
Matt Pharr
562d61caff Added masked load optimization pass.
This pass handles the "all on" and "all off" mask cases appropriately.

Also renamed load_masked stuff in built-ins to masked_load for consistency with
masked_store.
2012-01-04 11:51:26 -08:00
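A hedged, runtime-flavored sketch of what the two specializations amount to
(hypothetical helper; the actual pass resolves the all-on/all-off cases
statically in the IR): an all-on mask is just an ordinary full-width load,
and an all-off mask touches no memory at all.

    #include <cstring>

    static void maskedLoad4(float result[4], const float *ptr, const bool mask[4]) {
        bool anyOn = mask[0] || mask[1] || mask[2] || mask[3];
        bool allOn = mask[0] && mask[1] && mask[2] && mask[3];
        if (!anyOn)
            return;                                       // all off: no load at all
        if (allOn) {
            std::memcpy(result, ptr, 4 * sizeof(float));  // all on: plain vector load
            return;
        }
        for (int i = 0; i < 4; ++i)                       // general case: per lane
            if (mask[i])
                result[i] = ptr[i];
    }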
Matt Pharr
1d9201fe3d Add "generic" 4, 8, and 16-wide targets.
When used, these targets end up with calls to undefined functions for all
of the various special vector stuff ispc needs to compile ispc programs
(masked store, gather, min/max, sqrt, etc.).

These targets are not yet useful for anything, but are a step toward
having an option to emit C++ code with calls out to intrinsics.

Reorganized the directory structure a bit and put the LLVM bitcode used
to define target-specific stuff (as well as some generic built-ins stuff)
into a builtins/ directory.

Note that for building on Windows, it's now necessary to set a LLVM_VERSION
environment variable (with values like LLVM_2_9, LLVM_3_0, LLVM_3_1svn, etc.)
2011-12-19 13:46:50 -08:00
Matt Pharr
6dbb15027a Take advantage of x86's free "scale by 2, 4, or 8" in addressing calculations
When loading from an address that's computed by adding two registers
together, x86 can scale one of them by 2, 4, or 8, for free as part
of the addressing calculation.  This change makes the code generated
for gather and scatter use this.

For the cases where gather/scatter is based on a base pointer and
an integer offset vector, the GSImprovementsPass looks to see if the
integer offsets are being computed as 2/4/8 times some other value.
If so, it extracts the 2x/4x/8x part and leaves the rest as
the offsets.  The {gather,scatter}_base_offsets_* functions take
an i32 scale factor, and they carefully generate IR so that it hits
LLVM's pattern matching for these scales.

This is a particular win on AVX, since it saves us two 4-wide integer
multiplies.

Noise runs 14% faster with this.
Issue #132.
2011-12-16 15:55:44 -08:00
Matt Pharr
e82a720223 Fix various warnings / build issues on Windows 2011-12-15 12:06:38 -08:00
Matt Pharr
8d1b77b235 Have assertion macro and FATAL() text ask user to file a bug, provide URL to do so.
Switch to Assert() from assert() to make it clear it's not the C stdlib one we're
using any more.
2011-12-15 11:11:16 -08:00
Matt Pharr
46bfef3fce Add option to turn off codegen improvements when mask 'all on' is statically known. 2011-12-11 16:16:36 -08:00
Matt Pharr
e2b6ed3db8 Fix build for LLVM 2.9 and 3.1svn 2011-12-06 08:08:41 -08:00
Matt Pharr
b48775a549 Handle global arrays better in varying pointer analysis.
Specifically, indexing into global arrays sometimes comes in as a big 
llvm::ConstantVector, so we need to handle traversing those as well when
we do the corresponding checks in GatherScatterFlattenOpt so that we
still detect cases where we can convert them into the base pointer +
offsets form that's used in later analysis.
2011-11-30 12:29:49 -08:00
Matt Pharr
e52104ff55 Pointer fixes/improvements.
Allow <, <=, >, >= comparisons of pointers
Allow explicit type-casting of pointers to and from integers
Fix bug in handling expressions of the form "int + ptr" ("ptr + int"
  was fine).
Fix a bug in TypeCastExpr where varying -> uniform typecasts
  would be allowed (leading to a crash later)
2011-11-29 13:22:36 -08:00
Matt Pharr
2a6e3e5fea Fix bug in ptr+offset decomposition in GatherScatterFlattenOpt
Given IR that encoded computation like "vec(4) + ptr2int(some pointer)",
we'd report that "int2ptr(4)" was the base pointer and the ptr2int 
value was the offset.  This could lead to incorrect code from LLVM,
since we'd end up with GEP instructions where the first operand was
int2ptr(4) and the offset was the original pointer value.  Since LLVM's
memory read/write analysis assumes that nothing after the first operand
of a GEP is actually a pointer, this sometimes produced incorrect code and
thence a failure on the tests/gs-double-improve-multidim.ispc test.
2011-11-28 15:00:41 -08:00
Matt Pharr
975db80ef6 Add support for pointers to the language.
Pointers can be either uniform or varying, and behave correspondingly.
e.g.: "uniform float * varying" is a varying pointer to uniform float
data in memory, and "float * uniform" is a uniform pointer to varying
data in memory.  Like other types, pointers are varying by default.

Pointer-based expressions, & and *, sizeof, ->, pointer arithmetic,
and the array/pointer duality all behave as in C.  Array arguments
to functions are converted to pointers, also as in C.

There is a built-in NULL for a null pointer value; conversion from
compile-time constant 0 values to NULL still needs to be implemented.

Other changes:
- Syntax for references has been updated to be C++ style; a useful
  warning is now issued if the "reference" keyword is used.
- It is now illegal to pass a varying lvalue as a reference parameter
  to a function; references are essentially uniform pointers.
  This case had previously been handled via special-case call-by-value-return
  code.  That path has been removed, now that varying pointers
  are available to handle this use case (and much more).
- Some stdlib routines have been updated to take pointers as
  arguments where appropriate (e.g. prefetch and the atomics).
  A number of others still need attention.
- All of the examples have been updated
- Many new tests

TODO: documentation
2011-11-27 13:09:59 -08:00
Matt Pharr
068ea3e4c4 Better SourcePos reporting for gathers/scatters 2011-11-21 10:26:53 -08:00
Matt Pharr
f8eb100c60 Use llvm TargetData to find object sizes, offsets.
Previously, to compute the size of objects and the offsets of struct
elements within structs, we used the trick of emitting a getelementptr
with a NULL base pointer and then casting the result to an int32/64.
However, since we actually know the target we're compiling for at
compile time, we can use corresponding methods from TargetData to
get these values directly.

This mostly cleans up code, but may make some of the gather/scatter
lowering to loads/stores optimizations work better in the presence
of structures.
2011-11-06 19:31:19 -08:00
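For reference, a hedged plain-C++ illustration of the two approaches
(illustrative only; inside the compiler the first is a constant-folded GEP
from a NULL pointer and the second is a TargetData query):

    #include <cstddef>
    #include <cstdint>

    struct Example {
        int32_t a;
        double b;
    };

    // The old approach: "index" from a NULL base pointer and reinterpret the
    // resulting address as an integer.  (Formally undefined behavior in C++;
    // the compile-time GEP equivalent is what the compiler had been emitting.)
    static std::size_t offsetOfB_nullTrick() {
        return (std::size_t)&(((Example *)0)->b);
    }

    // The new approach, morally: ask the layout rules directly (TargetData in
    // the compiler, offsetof here).
    static std::size_t offsetOfB_layout() {
        return offsetof(Example, b);
    }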
Matt Pharr
cabe358c0a Workaround change to linker behavior in LLVM 3.1
Now, the Linker::LinkModules() call doesn't link in any functions
marked as 'internal', which is problematic, since we'd like to have
just about all of the builtins marked as internal so that they are
eliminated after they've been inlined when they are in fact used.

This change removes all of the internal qualifiers in the builtins
and adds a lSetInternalFunctions() routine to builtins.cpp that
sets this property on the functions that need it after they've
been linked in by LinkModules().
2011-11-05 16:57:26 -07:00
Matt Pharr
43a2d510bf Incorporate per-lane offsets for varying data in the front-end.
Previously, it was only in the GatherScatterFlattenOpt optimization pass that
we added the per-lane offsets when we were indexing into varying data.
(Specifically, the case of float foo[]; int index; foo[index], where foo
is an array of varying elements rather than uniform elements.)  Now, this
is done in the front-end as we're first emitting code.

In addition to the basic ugliness of doing this in an optimization pass, 
it was also error-prone to do it there, since we no longer have access
to all of the type information that's around in the front-end.

No functionality or performance change.
2011-11-03 13:15:07 -07:00
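A hedged sketch of the per-lane offsets in question, assuming varying values
are stored as gang-width groups of scalars (hypothetical names and an
assumed gang width of 8; not the emitted code): lane L of foo[index] lives
at scalar position index[L]*gangWidth + L.

    #include <cstdint>

    static const int kGangWidth = 8;   // assumed gang (SIMD) width for illustration

    // foo: storage for an array of *varying* floats, flattened to scalars;
    // index: the varying index; result: the gathered varying value.
    static void indexVaryingArray(const float *foo, const int32_t index[],
                                  float result[]) {
        for (int lane = 0; lane < kGangWidth; ++lane) {
            // The "+ lane" term is the per-lane offset the front-end now emits.
            int64_t scalarOffset = (int64_t)index[lane] * kGangWidth + lane;
            result[lane] = foo[scalarOffset];
        }
    }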
Matt Pharr
6084d6aeaf Added disable-handle-pseudo-memory-ops option. 2011-10-31 08:29:13 -07:00
Matt Pharr
d224252b5d Fix bug where multiplying varying array offset by zero would cause crash in optimization passes. 2011-10-31 08:28:51 -07:00
Matt Pharr
8b719e4c4e Fix warnings reported by doxygen 2011-10-20 11:49:54 -07:00
Matt Pharr
074cbc2716 Fix #ifdefs to catch LLVM 3.1svn now as well 2011-10-19 14:01:19 -07:00
Matt Pharr
19087e4761 When casting pointers to ints, choose int32/64 based on target pointer size.
Issue #97.
2011-10-17 06:57:04 -04:00