There are two related optimizations that happen now. (These
currently only apply for gathers where the mask is known to be
all on, and to gathers that are accessing 32-bit sized elements,
but both of these may be generalized in the future.)
First, for any single gather, we are now more flexible in mapping it
to individual memory operations. Previously, we would only either map
it to a general gather (one scalar load per SIMD lane), or an
unaligned vector load (if the program instances could be determined
to be accessing a sequential set of locations in memory.)
Now, we are able to break gathers into scalar, 2-wide (i.e. 64-bit),
4-wide, or 8-wide loads. Further, we now generate code that shuffles
these loads around. Doing fewer, larger loads in this manner, when
possible, can be more efficient.
Second, we can coalesce memory accesses across multiple gathers. If
we have a series of gathers without any memory writes in the middle,
then we try to analyze their reads collectively and choose an efficient
set of loads for them. Not only does this help if different gathers
reuse values from the same location in memory, but it's specifically
helpful when data with AOS layout is being accessed; in this case,
we're often able to generate wide vector loads and appropriate shuffles
automatically.
Previously, we'd pick one lane and generate a regular store for its value.
This was the wrong thing to do, since we also should have been checking
that the mask was on (for the lane that was chosen). This bug didn't
become evident until the scalar target was added, since many stores fall
into this case with that target.
Now, we just leave those as regular scatters.
Fixes most of the failing tests for the scalar target listed in issue #167.
Don't issue warnings about all instances writing to the same location if
there is only one program instance in the gang.
Be sure to report that all values are equal in one-element vectors in
LLVMVectorValuesAllEqual().
Issue #166.
When we're able to turn a general gather/scatter into the "base + offsets"
form, we now try to extract out any constant components of the offsets and
then pass them as a separate parameter to the gather/scatter function
implementation.
We then in turn carefully emit code for the addressing calculation so that
these constant offsets match LLVM's patterns to detect this case, such that
we get the constant offsets directly encoded in the instruction's addressing
calculation in many cases, saving arithmetic instructions to do these
calculations.
Improves performance of stencil by ~15%. Other workloads unchanged.
Effectively, the patterns that detected when given a gather or
scatter in base+offsets form, the offsets were actually a multiple
of 2/4/8, were no longer working.
This change not only fixes this, but also expands the set of
patterns that are matched by this. For example, given offsets of
the form 4*v1 + 16*v2, it identifies a scale of 4 and new offsets
of v1 + 4*v2.
This fix makes the volume renderer run 1.19x faster, and noise 1.54x
faster.
In this case, we now emit calls to potentially-specialized functions for the
left/right shifts that take a single integer value for the shift amount. These
in turn can be matched to the corresponding intrinsics for the SSE target.
Issue #145.
More specifically, we do a proper masked store (rather than a load-
blend-store) unless we can determine that we're accessing a stack-allocated
"varying" variable. This fixes a number of nefarious bugs where given
code like:
uniform float a[21];
foreach (i = 0 … 21)
a[i] = 0;
We'd use a blend and in turn read past the end of a[] in the last
iteration.
Also made slight changes to inlining in aobench; this keeps compiles
to ~5s, versus ~45s without them (with this change).
Fixes issue #160.
If we call a function pointer, CallInst::getCalledFunction() returns NULL; we
need to be careful about this case when we're matching various function calls
in optimization passes.
(Fixes a crash.)
We now recognize patterns like (ptr + offset1 + offset2) as being
cases we can handle with the base_offsets variants of the gather/scatter
functions. (This can come up with multidimensional array indexing,
for example.)
Issue #150.
When flattening chains of insertelement instructions, we didn't
handle the case where the initial insertelement was to a constant
vector (with one value set and the other values undef).
Also generalized the "do all of the instances access the same location"
check to handle the case where some of them are accessing undef
locations; these are ignored in this check, as they should correspond to the
mask being off for that lane anyway.
Fixes issue #149.
The compiler now supports an --emit-c++ option, which generates generic
vector C++ code. To actually compile this code, the user must provide
C++ code that implements a variety of types and operations (e.g. adding
two floating-point vector values together, comparing them, etc).
There are two examples of this required code in examples/intrinsics:
generic-16.h is a "generic" 16-wide implementation that does all required
with scalar math; it's useful for demonstrating the requirements of the
implementation. Then, sse4.h shows a simple implementation of a SSE4
target that maps the emitted function calls to SSE intrinsics.
When using these example implementations with the ispc test suite,
all but one or two tests pass with gcc and clang on Linux and OSX.
There are currently ~10 failures with icc on Linux, and ~50 failures with
MSVC 2010. (To be fixed in coming days.)
Performance varies: when running the examples through the sse4.h
target, some have the same performance as when compiled with --target=sse4
from ispc directly (options), while noise is 12% slower, rt is 26%
slower, and aobench is 2.2x slower. The details of this haven't yet been
carefully investigated, but will be in coming days as well.
Issue #92.
Stop using the PassManagerBuilder but add all of the passes directly in code here.
This currently leads to no different behavior, but was useful with experimenting
with disabling the SROA pass when compiling to generic targets.
This pass handles the "all on" and "all off" mask cases appropriately.
Also renamed load_masked stuff in built-ins to masked_load for consistency with
masked_store.
When used, these targets end up with calls to undefined functions for all
of the various special vector stuff ispc needs to compile ispc programs
(masked store, gather, min/max, sqrt, etc.).
These targets are not yet useful for anything, but are a step toward
having an option to C++ code with calls out to intrinsics.
Reorganized the directory structure a bit and put the LLVM bitcode used
to define target-specific stuff (as well as some generic built-ins stuff)
into a builtins/ directory.
Note that for building on Windows, it's now necessary to set a LLVM_VERSION
environment variable (with values like LLVM_2_9, LLVM_3_0, LLVM_3_1svn, etc.)
When loading from an address that's computed by adding two registers
together, x86 can scale one of them by 2, 4, or 8, for free as part
of the addressing calculation. This change makes the code generated
for gather and scatter use this.
For the cases where gather/scatter is based on a base pointer and
an integer offset vector, the GSImprovementsPass looks to see if the
integer offsets are being computed as 2/4/8 times some other value.
If so, it extracts the 2x/4x/8x part and leaves the rest there as
the the offsets. The {gather,scatter}_base_offsets_* functions take
an i32 scale factor, which is passed to them, and then they carefully
generate IR so that it hits LLVM's pattern matching for these scales.
This is particular win on AVX, since it saves us two 4-wide integer
multiplies.
Noise runs 14% faster with this.
Issue #132.
Specifically, indexing into global arrays sometimes comes in as a big
llvm::ConstantVector, so we need to handle traversing those as well when
we do the corresponding checks in GatherScatterFlattenOpt so that we
still detect cases where we can convert them into the base pointer +
offsets form that's used in later analysis.
Allow <, <=, >, >= comparisons of pointers
Allow explicit type-casting of pointers to and from integers
Fix bug in handling expressions of the form "int + ptr" ("ptr + int"
was fine).
Fix a bug in TypeCastExpr where varying -> uniform typecasts
would be allowed (leading to a crash later)
Given IR that encoded computation like "vec(4) + ptr2int(some pointer)",
we'd report that "int2ptr(4)" was the base pointer and the ptr2int
value was the offset. This in turn could lead to incorrect code
from LLVM, since we'd end up with GEP instructions where the first
operand was int2ptr(4) and the offset was the original pointer value.
This in turn was sometimes leading to incorrect code and thence a
failure on the tests/gs-double-improve-multidim.ispc test since LLVM's
memory read/write analysis assumes that nothing after the first operand
of a GEP is actually a pointer.
Pointers can be either uniform or varying, and behave correspondingly.
e.g.: "uniform float * varying" is a varying pointer to uniform float
data in memory, and "float * uniform" is a uniform pointer to varying
data in memory. Like other types, pointers are varying by default.
Pointer-based expressions, & and *, sizeof, ->, pointer arithmetic,
and the array/pointer duality all bahave as in C. Array arguments
to functions are converted to pointers, also like C.
There is a built-in NULL for a null pointer value; conversion from
compile-time constant 0 values to NULL still needs to be implemented.
Other changes:
- Syntax for references has been updated to be C++ style; a useful
warning is now issued if the "reference" keyword is used.
- It is now illegal to pass a varying lvalue as a reference parameter
to a function; references are essentially uniform pointers.
This case had previously been handled via special case call by value
return code. That path has been removed, now that varying pointers
are available to handle this use case (and much more).
- Some stdlib routines have been updated to take pointers as
arguments where appropriate (e.g. prefetch and the atomics).
A number of others still need attention.
- All of the examples have been updated
- Many new tests
TODO: documentation
Previously, to compute the size of objects and the offsets of struct
elements within structs, we were using the trick of using getelementpointer
with a NULL base pointer and then casting the result to an int32/64.
However, since we actually know the target we're compiling for at
compile time, we can use corresponding methods from TargetData to
get these values directly.
This mostly cleans up code, but may make some of the gather/scatter
lowering to loads/stores optimizations work better in the presence
of structures.
Now, the Linker::LinkModules() call doesn't link in any functions
marked as 'internal', which is problematic, since we'd like to have
just about all of the builtins marked as internal so that they are
eliminated after they've been inlined when they are in fact used.
This change removes all of the internal qualifiers in the builtins
and adds a lSetInternalFunctions() routine to builtins.cpp that
sets this property on the functions that need it after they've
been linked in by LinkModules().
Previously, it was only in the GatherScatterFlattenOpt optimization pass that
we added the per-lane offsets when we were indexing into varying data.
(Specifically, the case of float foo[]; int index; foo[index], where foo
is an array of varying elements rather than uniform elements.) Now, this
is done in the front-end as we're first emitting code.
In addition to the basic ugliness of doing this in an optimization pass,
it was also error-prone to do it there, since we no longer have access
to all of the type information that's around in the front-end.
No functionality or performance change.