FAQ and perf guide updates
docs/faq.txt

Intel® SPMD Program Compiler Frequently Asked Questions (FAQ)
=============================================================

This document includes a number of frequently (and not frequently) asked
questions about ispc, the Intel® SPMD Program Compiler. The source to this
document is in the file ``docs/faq.txt`` in the ``ispc`` source
distribution.

* Understanding ispc's Output

  + `How can I see the assembly language generated by ispc?`_
  + `How can I have the assembly output be printed using Intel assembly syntax?`_
  + `Why are there multiple versions of exported ispc functions in the assembly output?`_
  + `How can I more easily see gathers and scatters in generated assembly?`_

* Interoperability

  + `How can I supply an initial execution mask in the call from the application?`_
  + `How can I generate a single binary executable with support for multiple instruction sets?`_
  + `How can I determine at run-time which vector instruction set's instructions were selected to execute?`_

* Programming Techniques

  + `What primitives are there for communicating between SPMD program instances?`_
  + `How can a gang of program instances generate variable output efficiently?`_
  + `Is it possible to use ispc for explicit vector programming?`_

Understanding ispc's Output
===========================

How can I see the assembly language generated by ispc?
-------------------------------------------------------

The ``--emit-asm`` flag causes assembly output to be generated. If the
``-o`` command-line flag is also supplied, the assembly is stored in the
given file, or printed to standard output if ``-`` is specified for the
filename. For example, given the simple ``ispc`` program:

::

    export uniform int foo(uniform int a, uniform int b) {
        return a+b;
    }

If the SSE4 target is used, then the following assembly is printed:

::

    _foo:                    ## @foo
    ## BB#0:                 ## %allocas
        addl    %esi, %edi
        movl    %edi, %eax
        ret

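Putting those flags together, a command along these lines (``foo.ispc`` here
is just a placeholder filename) prints the SSE4 assembly to standard output:

::

    ispc --emit-asm --target=sse4 foo.ispc -o -
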
How can I have the assembly output be printed using Intel assembly syntax?
----------------------------------------------------------------------------

The ``ispc`` compiler is currently only able to emit assembly with AT&T
syntax, where the destination operand comes last. If you'd prefer Intel
assembly output, one option is to use Agner Fog's ``objconv`` tool: have
``ispc`` emit a native object file and then use ``objconv`` to disassemble
it, specifying the assembler syntax that you prefer. ``objconv``
`is available for download here`_.

.. _is available for download here: http://www.agner.org/optimize/#objconv

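As a sketch of that workflow (the ``-fnasm`` flag is meant to select
NASM-style Intel-syntax output; check the ``objconv`` manual for the exact
option names and the other supported syntaxes):

::

    ispc foo.ispc -o foo.o
    objconv -fnasm foo.o foo.asm
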
Why are there multiple versions of exported ispc functions in the assembly output?
------------------------------------------------------------------------------------

Two versions of each function qualified with ``export`` are generated: one
to be called by other ``ispc`` functions, and the other to be called by the
application. The application-callable function has the original function's
name, while the ``ispc``-callable function has a mangled name that encodes
the types of the function's parameters.

The crucial difference between these two functions is that the
application-callable function doesn't take a parameter encoding the current
execution mask, while ``ispc``-callable functions have a hidden mask
parameter. An implication of this difference is that the ``export``
function starts with the execution mask "all on". This allows a number of
improvements in the generated code, particularly on architectures that
don't have support for masked load and store instructions.

As an example, consider this short function, which loads a vector's worth
of values from two arrays in memory, adds them, and writes the result to an
output array.

::

    export void foo(uniform float a[], uniform float b[],
                    uniform float result[]) {
        float aa = a[programIndex], bb = b[programIndex];
        result[programIndex] = aa+bb;
    }

Here is the assembly code for the application-callable instance of the
function--note that the selected instructions are ideal.

::

    _foo:
        movups  (%rsi), %xmm1
        movups  (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        ret

And here is the assembly code for the ``ispc``-callable instance of the
function. There are a few things to notice in this code.

The current program mask comes in via the %xmm0 register, and the
initial few instructions in the function essentially check to see if the
mask is all-on or all-off. If the mask is all on, the code at the label
LBB0_3 executes; it's the same as the code that was generated for ``_foo``
above. If the mask is all off, then there's nothing to be done, and the
function can return immediately.

In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on. (This code is elided below.) This
general pattern of having two code paths for the "all on" and "mixed" mask
cases is used in the code generated for all but the most simple functions
(where the overhead of the test isn't worthwhile).

::

    "_foo___uptr<Uf>uptr<Uf>uptr<Uf>":
        movmskps    %xmm0, %eax
        cmpl    $15, %eax
        je      LBB0_3
        testl   %eax, %eax
        jne     LBB0_4
        ret
    LBB0_3:
        movups  (%rsi), %xmm1
        movups  (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        ret
    LBB0_4:
        ####
        #### Code elided; handle mixed mask case..
        ####
        ret

How can I more easily see gathers and scatters in generated assembly?
-----------------------------------------------------------------------

FIXME

Interoperability
================

How can I supply an initial execution mask in the call from the application?
------------------------------------------------------------------------------

Recall that when execution transitions from the application code to an
``ispc`` function, all of the program instances are initially executing.
In some cases, it may be desirable for only some of them to be running,
based on a data-dependent condition computed in the application program.
This situation can easily be handled via an additional parameter from the
application.

As a simple example, consider a case where the application code has an
array of ``float`` values and we'd like the ``ispc`` code to update just
specific values in that array, where the set of values to be updated has
been determined by the application. In C++ code, we might have:

::

    int count = ...;
    float *array = new float[count];
    bool *shouldUpdate = new bool[count];
    // initialize array and shouldUpdate
    ispc_func(array, shouldUpdate, count);

Then, the ``ispc`` code could process this update as:

::

    export void ispc_func(uniform float array[], uniform bool update[],
                          uniform int count) {
        foreach (i = 0 ... count) {
            cif (update[i] == true) {
                // update array[i] ...
            }
        }
    }

(In this case a "coherent" if statement is likely to be worthwhile if the
``update`` array will tend to have sections that are either all-true or
all-false.)

How can I generate a single binary executable with support for multiple instruction sets?
--------------------------------------------------------------------------------------------

``ispc`` can generate output that supports multiple target instruction
sets, along with code that chooses the most appropriate one at runtime,
when multiple targets are specified with the ``--target`` command-line
argument.

For example, if you run the command:

::

    ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2

then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
``foo_avx.o``, and ``foo.o``. [#]_ Link all of these into your executable,
and when you call a function in ``foo.ispc`` from your application code,
``ispc`` will determine which instruction sets are supported by the CPU the
code is running on and will call the most appropriate version of the
function available.

.. [#] Similarly, if you choose to generate assembly language output or
   LLVM bitcode output, multiple versions of those files will be created.

In general, the version of the function that runs will be the one for the
most capable instruction set that the system supports. If you only compile
SSE2 and SSE4 variants and run on a system that supports AVX, for example,
then the SSE4 variant will be executed. If the system is not able to run
any of the available variants of the function (for example, trying to run a
function that only has SSE4 and AVX variants on a system that only supports
SSE2), then the standard library ``abort()`` function will be called.

One subtlety is that all non-static global variables (if any) must have the
same size and layout across all of the targets used. For example, if you
have the global variables:

::

    uniform int foo[2*programCount];
    int bar;

and compile to both SSE2 and AVX targets, both of these variables will have
different sizes (the first due to ``programCount`` having the value 4 for
SSE2 and 8 for AVX, and the second due to ``varying`` types having
different numbers of elements with the two targets--essentially the same
issue as the first.) ``ispc`` issues an error in this case.

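As an illustration of the whole build (the application file name and the
compiler driver here are placeholders; use whatever your build system
does), the commands might look roughly like:

::

    ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2
    c++ main.cpp foo.o foo_sse2.o foo_sse4.o foo_avx.o -o app
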
How can I determine at run-time which vector instruction set's instructions were selected to execute?
--------------------------------------------------------------------------------------------------------

``ispc`` doesn't provide any API that allows querying which vector ISA's
instructions are running when multi-target compilation is used. However,
this can be solved in "user space" by writing a small helper function.
Specifically, if you implement a function like this:

::

    export uniform int isa() {
    #if defined(ISPC_TARGET_SSE2)
        return 0;
    #elif defined(ISPC_TARGET_SSE4)
        return 1;
    #elif defined(ISPC_TARGET_AVX)
        return 2;
    #else
        return -1;
    #endif
    }

and then call it from your application code at runtime, it will return 0,
1, or 2, depending on which target's instructions are running.

The way this works is a little surprising, but it's a useful trick. Of
course the preprocessor ``#if`` checks are all compile-time only
operations. What's actually happening is that the function is compiled
multiple times, once for each target, with the appropriate ``ISPC_TARGET``
preprocessor symbol set. Then, a small dispatch function is generated for
the application to actually call. This dispatch function in turn calls the
appropriate version of the function based on the CPU of the system it's
executing on, which in turn returns the appropriate value.

In a similar fashion, it's possible to find out at run-time the value of
``programCount`` for the target that's actually being used:

::

    export uniform int width() { return programCount; }

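As a sketch of the application side (the declarations below are written out
by hand for illustration; in a real build they would normally come from an
``ispc``-generated header), calling these helpers might look like:

::

    #include <cstdio>

    // Hand-written declarations matching the exported ispc functions above.
    extern "C" {
        int isa();
        int width();
    }

    int main() {
        static const char *names[] = { "SSE2", "SSE4", "AVX" };
        int which = isa();
        printf("Running %s instructions, gang size %d\n",
               (which >= 0 && which <= 2) ? names[which] : "unknown", width());
        return 0;
    }
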
Programming Techniques
======================

What primitives are there for communicating between SPMD program instances?
------------------------------------------------------------------------------

The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library
routines provide a variety of mechanisms for the running program instances
to communicate values to each other during execution. Note that there's no
need to synchronize the program instances before communicating between
them, due to the synchronized execution model of gangs of program instances
in ``ispc``.

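For instance, here is a small sketch (not from the original FAQ; it assumes
the ``rotate()`` and ``broadcast()`` signatures described in the standard
library documentation) of passing values between program instances:

::

    float compareWithNeighbors(float x) {
        // value held by the instance at (programIndex + 1), wrapping around
        float next = rotate(x, 1);
        // value held by program instance 0, made available to all instances
        float first = broadcast(x, 0);
        return (next - x) + (x - first);
    }
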
How can a gang of program instances generate variable output efficiently?
----------------------------------------------------------------------------

A useful application of the ``exclusive_scan_add()`` function in the
standard library is when program instances want to generate a variable
amount of output and one would like that output to be densely packed in a
single array. For example, consider the code fragment below:

::

    uniform int func(uniform float outArray[], ...) {
        int numOut = ...;         // figure out how many values to output
        float outLocal[MAX_OUT];  // staging area

        // each program instance in the gang puts its results in
        // outLocal[0], ..., outLocal[numOut-1]

        int startOffset = exclusive_scan_add(numOut);
        for (int i = 0; i < numOut; ++i)
            outArray[startOffset + i] = outLocal[i];
        return reduce_add(numOut);
    }

Here, each program instance has computed a number, ``numOut``, of values to
output, and has stored them in the ``outLocal`` array. Assume that four
program instances are running and that the first one wants to output one
value, the second two values, and the third and fourth three values each.
In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
to the four program instances, respectively.

The first program instance will write its one result to ``outArray[0]``,
the second will write its two values to ``outArray[1]`` and
``outArray[2]``, and so forth. The ``reduce_add`` call at the end returns
the total number of values that all of the program instances have written
to the array.

FIXME: add discussion of foreach_active as an option here once that's in

Is it possible to use ispc for explicit vector programming?
-------------------------------------------------------------

The typical model for programming in ``ispc`` is an *implicit* parallel
model, where one writes a program that is apparently doing scalar
computation on values and the program is then vectorized to run in parallel
across the SIMD lanes of a processor. However, ``ispc`` also has some
support for explicit vector programming, where the vector width is spelled
out directly in the types used. Some computations may be more effectively
described in the explicit model rather than the implicit model.

This support is provided via ``uniform`` instances of short vector types.
Specifically, if this short program

::

    export uniform float<8> madd(uniform float<8> a, uniform float<8> b,
                                 uniform float<8> c) {
        return a + b * c;
    }

is compiled with the AVX target, ``ispc`` generates the following assembly:

::

    _madd:
        vmulps  %ymm2, %ymm1, %ymm1
        vaddps  %ymm0, %ymm1, %ymm0
        ret

(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two
``addps`` instructions are generated, and so forth.)

Note that ``ispc`` doesn't currently support control flow based on
``uniform`` short vector types; it is thus not possible to write code like:

::

    export uniform int<8> count(uniform float<8> a, uniform float<8> b) {
        uniform int<8> sum = 0;
        while (a++ < b)
            ++sum;
        return sum;
    }

docs/perf.txt

Intel® SPMD Program Compiler Performance Guide
==============================================

* `Using ISPC Effectively`_

  + `Gather and Scatter`_
  + `8 and 16-bit Integer Types`_
  + `Low-level Vector Tricks`_
  + `The "Fast math" Option`_
  + `"Inline" Aggressively`_
  + `Small Performance Tricks`_
  + `Instrumenting Your ISPC Programs`_
  + `Choosing A Target Vector Width`_
  + `Implementing Reductions Efficiently`_

* `Disclaimer and Legal Information`_

* `Optimization Notice`_

FIXME: don't use the system math library unless it's absolutely necessary

FIXME: opt=32-bit-addressing

Using ISPC Effectively
======================

Gather and Scatter
------------------

The CPU is a poor fit for SPMD execution in some ways, the worst of which
is the handling of general memory reads and writes from SPMD program
instances. For example, consider a "simple" indexed array access:

::

    int i = ....;
    uniform float x[10] = { ... };
    float f = x[i];

Since the index ``i`` is a varying value, the various SPMD program
instances will in general be reading different locations in the array
``x``. Because the CPU doesn't have a gather instruction, the ``ispc``
compiler has to serialize these memory reads, performing a separate memory
load for each running program instance and packing the results into ``f``.
(And the analogous case applies to a write into ``x[i]``, which requires a
scatter.)

In many cases, gathers like these are unavoidable; the running program
instances just need to access incoherent memory locations. However, if the
array index ``i`` can actually be declared and used as a ``uniform``
variable, the resulting access is substantially more efficient. This is
another case where using ``uniform`` whenever applicable is of benefit.

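As a small illustration of the difference (a sketch, not from the original
guide), the two functions below index the same array; the first needs a
general gather because ``i`` is varying, while the second reads a single
location that is shared by all of the program instances:

::

    // varying index: the compiler must emit a gather
    float lookupVarying(uniform float table[], int i) {
        return table[i];
    }

    // uniform index: a single cheap load suffices
    uniform float lookupUniform(uniform float table[], uniform int i) {
        return table[i];
    }
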
In some cases, the ``ispc`` compiler is able to deduce that the memory
locations accessed are all the same, even when the index isn't declared
``uniform``. For example, given:

::

    uniform int x = ...;
    int y = x;
    return array[y];

the compiler is able to determine that all of the program instances are
loading from the same location, even though ``y`` is not a ``uniform``
variable. In this case, the compiler will issue a single load, rather than
a general gather.

Sometimes the running program instances will access a linear sequence of
memory locations; this happens most frequently when array indexing is done
based on the built-in ``programIndex`` variable. In many of these cases,
the compiler is able to detect the pattern and do a regular vector load.
For example, given:

::

    uniform int x = ...;
    return array[2*x + programIndex];

a regular vector load is done from ``array``, starting at offset ``2*x``.

8 and 16-bit Integer Types
--------------------------

The code generated for 8 and 16-bit integer types is generally not as
efficient as the code generated for 32-bit integer types. It is generally
worthwhile to use 32-bit integer types for intermediate computations, even
if the final result will be stored in a smaller integer type.

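As a sketch of this idea (the function and the scaling factor are made up
for illustration), the intermediate arithmetic below is done with 32-bit
``int`` values, and the 8-bit type is only used for the final store:

::

    export void darken(uniform unsigned int8 pixels[], uniform int count) {
        foreach (i = 0 ... count) {
            // widen to 32 bits for the intermediate math...
            int v = pixels[i];
            v = (v * 200) / 255;
            // ...and narrow back to 8 bits only when storing the result
            pixels[i] = (unsigned int8)v;
        }
    }
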
Low-level Vector Tricks
-----------------------

Many low-level Intel® SSE coding constructs can be implemented in ``ispc``
code. For example, the following code efficiently reverses the sign of the
given values.

::

    float flipsign(float a) {
        unsigned int i = intbits(a);
        i ^= 0x80000000;
        return floatbits(i);
    }

This code compiles down to a single XOR instruction.

The "Fast math" Option
|
||||
----------------------
|
||||
|
||||
``ispc`` has a ``--fast-math`` command-line flag that enables a number of
|
||||
optimizations that may be undesirable in code where numerical preceision is
|
||||
critically important. For many graphics applications, the
|
||||
approximations may be acceptable. The following two optimizations are
|
||||
performed when ``--fast-math`` is used. By default, the ``--fast-math``
|
||||
flag is off.
|
||||
|
||||
* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
|
||||
transformed to ``x * (1./y)``, where the inverse value of ``y`` is
|
||||
precomputed at compile time.
|
||||
|
||||
* Expressions like ``x / y``, where ``y`` is not a compile-time constant,
|
||||
are transformed to ``x * rcp(y)``, where ``rcp()`` maps to the
|
||||
approximate reciprocal instruction from the standard library.
|
||||
|
||||
|
||||
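For example (a sketch of the effect, not actual compiler output), with
``--fast-math`` the divisions below are rewritten roughly as the comments
indicate:

::

    float f(float x, float y) {
        float a = x / 4;  // becomes x * 0.25, inverse computed at compile time
        float b = x / y;  // becomes x * rcp(y), the approximate reciprocal
        return a + b;
    }
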
"Inline" Aggressively
|
||||
---------------------
|
||||
|
||||
Inlining functions aggressively is generally beneficial for performance
|
||||
with ``ispc``. Definitely use the ``inline`` qualifier for any short
|
||||
functions (a few lines long), and experiment with it for longer functions.
|
||||
|
||||
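For instance, a short helper like the following (a made-up example) is a
good candidate for the ``inline`` qualifier:

::

    inline float lerp(float t, float a, float b) {
        return a + t * (b - a);
    }
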
Small Performance Tricks
------------------------

Performance is slightly improved by declaring variables at the same block
scope where they are first used. For example, if the lifetime of ``foo``
is only within the scope of the ``if`` clause, write the code like this:

::

    float func() {
        ....
        if (x < y) {
            float foo;
            ... use foo ...
        }
    }

Try not to write the code as:

::

    float func() {
        float foo;
        ....
        if (x < y) {
            ... use foo ...
        }
    }

Declaring variables at the narrower scope can reduce the number of masked
store instructions that the compiler needs to generate.

Instrumenting Your ISPC Programs
--------------------------------

``ispc`` has an optional instrumentation feature that can help you
understand performance issues. If a program is compiled using the
``--instrument`` flag, the compiler emits calls to a function with the
following signature at various points in the program (for example, at
interesting points in the control flow, or when gathers and scatters
happen).

::

    extern "C" {
        void ISPCInstrument(const char *fn, const char *note,
                            int line, int mask);
    }

This function is passed the file name of the ``ispc`` file running, a short
note indicating what is happening, the line number in the source file, and
the current mask of active SPMD program lanes. You must provide an
implementation of this function and link it in with your application.

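A minimal implementation (just a sketch; the lane-counting helper is
written out by hand rather than relying on any particular compiler
intrinsic) might simply print each event:

::

    #include <cstdio>

    // Count how many SPMD lanes are active in the mask.
    static int countActiveLanes(int mask) {
        int count = 0;
        for (; mask != 0; mask &= (mask - 1))
            ++count;
        return count;
    }

    extern "C" void ISPCInstrument(const char *fn, const char *note,
                                   int line, int mask) {
        printf("%s(%d) - %s: mask 0x%x, %d lanes active\n",
               fn, line, note, mask, countActiveLanes(mask));
    }
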
For example, when the ``ispc`` program runs, this function might be called
as follows:

::

    ISPCInstrument("foo.ispc", "function entry", 55, 0xf);

This call indicates that the currently executing program has just entered
the function defined at line 55 of the file ``foo.ispc``, with a mask of
all lanes currently executing (assuming a four-wide Intel® SSE target
machine).

For a fuller example of the utility of this functionality, see
``examples/aobench_instrumented`` in the ``ispc`` distribution. This
example includes an implementation of the ``ISPCInstrument`` function that
collects aggregate data about the program's execution behavior.

When running this example, you will want to direct the ``ao`` executable
to generate a low-resolution image, because the instrumentation adds
substantial execution overhead. For example:

::

    % ./ao 1 32 32

After the ``ao`` program exits, a summary report along the following lines
will be printed. In the first few lines, you can see how many times a few
functions were called, and the average percentage of SIMD lanes that were
active upon function entry.

::

    ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes
    ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes
    ...

Choosing A Target Vector Width
------------------------------

By default, ``ispc`` compiles to the natural vector width of the target
instruction set. For example, for SSE2 and SSE4, it compiles four-wide,
and for AVX, it compiles eight-wide. For some programs, higher performance
may be seen if the program is compiled to a doubled vector width--8-wide
for SSE and 16-wide for AVX.

For workloads that don't require many registers, this method can lead to
significantly more efficient execution thanks to greater instruction-level
parallelism and amortization of various overheads over more program
instances. For other workloads, it may lead to a slowdown due to higher
register pressure; trying both approaches for key kernels may be
worthwhile.

This doubled width option is available for the SSE2, SSE4, and AVX targets.
It is selected with the ``--target=sse2-x2``, ``--target=sse4-x2``, and
``--target=avx-x2`` options, respectively.

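For example, to compile 8-wide for SSE4 (the file name is a placeholder):

::

    ispc foo.ispc -o foo.o --target=sse4-x2
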
Implementing Reductions Efficiently
-----------------------------------

It's often necessary to compute a "reduction" over a data set--for example,
one might want to add all of the values in an array, compute their minimum,
etc. ``ispc`` provides a few capabilities that make it easy to efficiently
compute reductions like these. However, it's important to use these
capabilities appropriately for best results.

As an example, consider the task of computing the sum of all of the values
in an array. In C code, we might have:

::

    /* C implementation of a sum reduction */
    float sum(const float array[], int count) {
        float sum = 0;
        for (int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

Of course, exactly this computation could also be expressed in ``ispc``,
though without any benefit from vectorization:

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        for (uniform int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

As a first attempt, one might use the ``reduce_add()`` function from the
``ispc`` standard library; it takes a ``varying`` value and returns the sum
of that value across all of the active program instances.

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += reduce_add(array[i+programIndex]);
        return sum;
    }

This implementation loads a set of ``programCount`` values from the array,
one for each of the program instances, and then uses ``reduce_add()`` to
reduce across the program instances and update the sum. Unfortunately,
this approach loses most of the benefit of vectorization, as it does more
work in the cross-program-instance ``reduce_add()`` call than it saves from
the vector load of values.

The most efficient approach is to do the reduction in two phases: rather
than using a ``uniform`` variable to store the sum, we maintain a varying
value, such that each program instance effectively computes a local partial
sum over the subset of array values that it has loaded from the array.
When the loop over array elements concludes, a single call to
``reduce_add()`` computes the final reduction across the program instances'
partial sums. This approach effectively compiles to a single vector load
and a single vector add for each ``programCount``'s worth of values--very
efficient code in the end.

::

    /* good ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += array[i+programIndex];
        return reduce_add(sum);
    }

Disclaimer and Legal Information
================================

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS.
NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE
OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT
OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED
NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD
CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time,
without notice. Designers must not rely on the absence or characteristics of any
features or instructions marked "reserved" or "undefined." Intel reserves these
for future definition and shall have no responsibility whatsoever for conflicts
or incompatibilities arising from future changes to them. The information here
is subject to change without notice. Do not finalize a design with this
information.

The products described in this document may contain design defects or errors
known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest
specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this
document, or other Intel literature, may be obtained by calling 1-800-548-4725,
or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers
differentiate features within each processor family, not across different
processor families. See http://www.intel.com/products/processor_number for
details.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom,
Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile,
i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4,
IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside,
Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst,
Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep,
Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium,
Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside,
skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon,
and Xeon Inside are trademarks of Intel Corporation in the U.S. and other
countries.

* Other names and brands may be claimed as the property of others.

Copyright(C) 2011, Intel Corporation. All rights reserved.

Optimization Notice
===================

Intel compilers, associated libraries and associated development tools may
include or utilize options that optimize for instruction sets that are
available in both Intel and non-Intel microprocessors (for example SIMD
instruction sets), but do not optimize equally for non-Intel
microprocessors. In addition, certain compiler options for Intel
compilers, including some that are not specific to Intel
micro-architecture, are reserved for Intel microprocessors. For a detailed
description of Intel compiler options, including the instruction sets and
specific microprocessors they implicate, please refer to the "Intel
Compiler User and Reference Guides" under "Compiler Options." Many library
routines that are part of Intel compiler products are more highly optimized
for Intel microprocessors than for other microprocessors. While the
compilers and libraries in Intel compiler products offer optimizations for
both Intel and Intel-compatible microprocessors, depending on the options
you select, your code and other factors, you likely will get extra
performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may
or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These
optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2),
Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental
Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other
optimizations. Intel does not guarantee the availability, functionality,
or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to
assist in obtaining the best performance on Intel and non-Intel
microprocessors, Intel recommends that you evaluate other compilers and
libraries to determine which best meet your requirements. We hope to win
your business by striving to offer the best performance of any compiler or
library; please let us know if you find we do not.