Documentation refactoring, initial pass at FAQ
@@ -1,6 +1,8 @@
 #!/bin/bash
 
 rst2html.py ispc.txt > ispc.html
+rst2html.py perf.txt > perf.html
+rst2html.py faq.txt > faq.html
 
 #rst2latex --section-numbering --documentclass=article --documentoptions=DIV=9,10pt,letterpaper ispc.txt > ispc.tex
 #pdflatex ispc.tex
docs/faq.txt | 4 (new file)
@@ -0,0 +1,4 @@
+=============================================================
+Intel® SPMD Program Compiler Frequently Asked Questions (FAQ)
+=============================================================
+
docs/ispc.txt | 536
@@ -58,6 +58,7 @@ Contents:
 + `Basic Command-line Options`_
 + `Selecting The Compilation Target`_
 + `The Preprocessor`_
++ `Debugging`_
 
 * `The ISPC Language`_
 
@@ -106,26 +107,8 @@ Contents:
 + `Interoperability Overview`_
 + `Data Layout`_
 + `Data Alignment and Aliasing`_
 
-* `Using ISPC Effectively`_
-
 + `Restructuring Existing Programs to Use ISPC`_
 + `Understanding How to Interoperate With the Application's Data`_
-+ `Communicating Between SPMD Program Instances`_
-+ `Gather and Scatter`_
-+ `8 and 16-bit Integer Types`_
-+ `Low-level Vector Tricks`_
-+ `Debugging`_
-+ `The "Fast math" Option`_
-+ `"Inline" Aggressively`_
-+ `Small Performance Tricks`_
-+ `Instrumenting Your ISPC Programs`_
-+ `Using Scan Operations For Variable Output`_
-+ `Application-Supplied Execution Masks`_
-+ `Explicit Vector Programming With Uniform Short Vector Types`_
-+ `Choosing A Target Vector Width`_
-+ `Compiling With Support For Multiple Instruction Sets`_
-+ `Implementing Reductions Efficiently`_
 
 * `Disclaimer and Legal Information`_
 
@@ -397,6 +380,23 @@ indicating the target instruction set is defined. With an SSE2 target,
 and ``ISPC_TARGET_AVX`` for AVX. Finally, ``PI`` is defined for
 convenience, having the value 3.1415926535.
 
+ISPC_MAJOR_VERSION, ISPC_MINOR_VERSION
+
+Debugging
+---------
+
+Support for debugging in ``ispc`` is in progress. On Linux\* and Mac
+OS\*, the ``-g`` command-line flag can be supplied to the compiler,
+which causes it to generate debugging symbols. Running ``ispc`` programs
+in the debugger, setting breakpoints, printing out variables, and the like
+all generally work, though there is occasional unexpected behavior.
+
+Another option for debugging (the only current option on Windows\*) is to
+use the ``print`` statement for ``printf()``-style debugging. (See `Output
+Functions`_ for more information.) You can also use the ability to call
+back to application code at particular points in the program, passing a set
+of variable values to be logged or otherwise analyzed from there.
+
 The ISPC Language
 =================
@@ -2762,9 +2762,6 @@ to the compiler's requirement of no aliasing.
 (In the future, ``ispc`` will have a mechanism to indicate that pointers
 may alias.)
 
-Using ISPC Effectively
-======================
-
 Restructuring Existing Programs to Use ISPC
 -------------------------------------------
 
@@ -2786,13 +2783,15 @@ style is often effective.
 
 Carefully choose how to do the exact mapping of computation to SPMD program
 instances. This choice can impact the mix of gather/scatter memory access
-versus coherent memory access, for example. (See more on this in the
-section `Gather and Scatter`_ below.) This decision can also impact the
+versus coherent memory access, for example. (See more on this topic in the
+`ispc Performance Tuning Guide`_.) This decision can also impact the
 coherence of control flow across the running SPMD program instances, which
 can also have a significant effect on performance; in general, creating
 groups of work that will tend to do similar computation across the SPMD
 program instances improves performance.
 
+.. _ispc Performance Tuning Guide: http://ispc.github.com/perf.html
+
 Understanding How to Interoperate With the Application's Data
 -------------------------------------------------------------
 
@@ -2953,497 +2952,6 @@ elements to work with and then proceeds with the computation.
 }
 
 
-Communicating Between SPMD Program Instances
---------------------------------------------
-
-The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library
-routines provide a variety of mechanisms for the running program instances
-to communicate values to each other during execution. See the section
-`Cross-Program Instance Operations`_ for more information about their
-operation.
-
-
-Gather and Scatter
-------------------
-
-The CPU is a poor fit for SPMD execution in some ways, the worst of which
-is handling of general memory reads and writes from SPMD program instances.
-For example, in a "simple" array index:
-
-::
-
-    int i = ....;
-    uniform float x[10] = { ... };
-    float f = x[i];
-
-Since the index ``i`` is a varying value, the various SPMD program
-instances will in general be reading different locations in the array
-``x``. Because the CPU doesn't have a gather instruction, the ``ispc``
-compiler has to serialize these memory reads, performing a separate memory
-load for each running program instance and packing the result into ``f``.
-(And the analogous case would happen for a write into ``x[i]``.)
-
-In many cases, gathers like these are unavoidable; the running program
-instances just need to access incoherent memory locations. However, if the
-array index ``i`` can instead be declared and used as a ``uniform``
-variable, the resulting array access is substantially more
-efficient. This is another case where using ``uniform`` whenever applicable
-is of benefit.
-
-In some cases, the ``ispc`` compiler is able to deduce that the memory
-locations accessed are either all the same or form a linear sequence. For
-example, given:
-
-::
-
-    uniform int x = ...;
-    int y = x;
-    return array[y];
-
-The compiler is able to determine that all of the program instances are
-loading from the same location, even though ``y`` is not a ``uniform``
-variable. In this case, the compiler will transform this load to a regular
-vector load, rather than a general gather.
-
-Sometimes the running program instances will access a
-linear sequence of memory locations; this happens most frequently when
-array indexing is done based on the built-in ``programIndex`` variable. In
-many of these cases, the compiler is also able to detect this case and then
-do a vector load. For example, given:
-
-::
-
-    uniform int x = ...;
-    return array[2*x + programIndex];
-
-A regular vector load is done from ``array``, starting at offset ``2*x``.
-
-
-8 and 16-bit Integer Types
---------------------------
-
-The code generated for 8 and 16-bit integer types is generally not as
-efficient as the code generated for 32-bit integer types. It is generally
-worthwhile to use 32-bit integer types for intermediate computations, even
-if the final result will be stored in a smaller integer type.
-
-Low-level Vector Tricks
------------------------
-
-Many low-level Intel® SSE coding constructs can be implemented in ``ispc``
-code. For example, the following code efficiently reverses the sign of the
-given values.
-
-::
-
-    float flipsign(float a) {
-        unsigned int i = intbits(a);
-        i ^= 0x80000000;
-        return floatbits(i);
-    }
-
-This code compiles down to a single XOR instruction.
-
-Debugging
----------
-
-Support for debugging in ``ispc`` is in progress. On Linux\* and Mac
-OS\*, the ``-g`` command-line flag can be supplied to the compiler,
-which causes it to generate debugging symbols. Running ``ispc`` programs
-in the debugger, setting breakpoints, printing out variables, and the like
-all generally work, though there is occasional unexpected behavior.
-
-Another option for debugging (the only current option on Windows\*) is
-to use the ``print`` statement for ``printf()``-style
-debugging. You can also use the ability to call back to
-application code at particular points in the program, passing a set of
-variable values to be logged or otherwise analyzed from there.
-
-The "Fast math" Option
-----------------------
-
-``ispc`` has a ``--fast-math`` command-line flag that enables a number of
-optimizations that may be undesirable in code where numerical precision is
-critically important. For many graphics applications, the
-approximations may be acceptable. The following two optimizations are
-performed when ``--fast-math`` is used. By default, the ``--fast-math``
-flag is off.
-
-* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
-  transformed to ``x * (1./y)``, where the inverse value of ``y`` is
-  precomputed at compile time.
-
-* Expressions like ``x / y``, where ``y`` is not a compile-time constant,
-  are transformed to ``x * rcp(y)``, where ``rcp()`` maps to the
-  approximate reciprocal instruction from the standard library.
-
-"Inline" Aggressively
----------------------
-
-Inlining functions aggressively is generally beneficial for performance
-with ``ispc``. Definitely use the ``inline`` qualifier for any short
-functions (a few lines long), and experiment with it for longer functions.
-
-Small Performance Tricks
-------------------------
-
-Performance is slightly improved by declaring variables at the same block
-scope where they are first used. For example, in code like the
-following, if the lifetime of ``foo`` is only within the scope of the
-``if`` clause, write the code like this:
-
-::
-
-    float func() {
-        ....
-        if (x < y) {
-            float foo;
-            ... use foo ...
-        }
-    }
-
-Try not to write code as:
-
-::
-
-    float func() {
-        float foo;
-        ....
-        if (x < y) {
-            ... use foo ...
-        }
-    }
-
-Doing so can reduce the number of masked store instructions that the
-compiler needs to generate.
-
-Instrumenting Your ISPC Programs
---------------------------------
-
-``ispc`` has an optional instrumentation feature that can help you
-understand performance issues. If a program is compiled using the
-``--instrument`` flag, the compiler emits calls to a function with the
-following signature at various points in the program (for
-example, at interesting points in the control flow, and when scatters or
-gathers happen):
-
-::
-
-    extern "C" {
-        void ISPCInstrument(const char *fn, const char *note,
-                            int line, int mask);
-    }
-
-This function is passed the file name of the ``ispc`` file running, a short
-note indicating what is happening, the line number in the source file, and
-the current mask of active SPMD program lanes. You must provide an
-implementation of this function and link it in with your application.
-
-For example, when the ``ispc`` program runs, this function might be called
-as follows:
-
-::
-
-    ISPCInstrument("foo.ispc", "function entry", 55, 0xf);
-
-This call indicates that the currently executing program has just
-entered the function defined at line 55 of the file ``foo.ispc``, with a
-mask of all lanes currently executing (assuming a four-wide Intel® SSE
-target machine).
-
-For a fuller example of the utility of this functionality, see
-``examples/aobench_instrumented`` in the ``ispc`` distribution. This
-example includes an implementation of the ``ISPCInstrument`` function that
-collects aggregate data about the program's execution behavior.
-
-When running this example, you will want to direct the ``ao`` executable
-to generate a low-resolution image, because the instrumentation adds
-substantial execution overhead. For example:
-
-::
-
-    % ./ao 1 32 32
-
-After the ``ao`` program exits, a summary report along the following lines
-will be printed. In the first few lines, you can see how many times a few
-functions were called, and the average percentage of SIMD lanes that were
-active upon function entry.
-
-::
-
-    ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
-    ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
-    ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
-    ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
-    ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes
-    ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes
-    ...
-
-
-Using Scan Operations For Variable Output
------------------------------------------
-
-One important application of the ``exclusive_scan_add()`` function in the
-standard library is when program instances want to generate a variable amount
-of output and one would like that output to be densely packed in a
-single array. For example, consider the code fragment below:
-
-::
-
-    uniform int func(uniform float outArray[], ...) {
-        int numOut = ...; // figure out how many to be output
-        float outLocal[MAX_OUT]; // staging area
-        // put results in outLocal[0], ..., outLocal[numOut-1]
-        int startOffset = exclusive_scan_add(numOut);
-        for (int i = 0; i < numOut; ++i)
-            outArray[startOffset + i] = outLocal[i];
-        return reduce_add(numOut);
-    }
-
-Here, each program instance has computed a number, ``numOut``, of values to
-output, and has stored them in the ``outLocal`` array. Assume that four
-program instances are running and that the first one wants to output one
-value, the second two values, and the third and fourth three values each.
-In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
-to the four program instances, respectively. The first program instance
-will write its one result to ``outArray[0]``, the second will write its two
-values to ``outArray[1]`` and ``outArray[2]``, and so forth. The
-``reduce_add()`` call at the end returns the total number of values that the
-program instances have written to the array.
-
-Application-Supplied Execution Masks
-------------------------------------
-
-Recall that when execution transitions from the application code to an
-``ispc`` function, all of the program instances are initially executing.
-In some cases, it may be desirable for only some of them to be running,
-based on a data-dependent condition computed in the application program.
-This situation can easily be handled via an additional parameter from the
-application.
-
-As a simple example, consider a case where the application code has an
-array of ``float`` values and we'd like the ``ispc`` code to update
-just specific values in that array, where which of those values are to be
-updated has been determined by the application. In C++ code, we might
-have:
-
-::
-
-    int count = ...;
-    float *array = new float[count];
-    bool *shouldUpdate = new bool[count];
-    // initialize array and shouldUpdate
-    ispc_func(array, shouldUpdate, count);
-
-Then, the ``ispc`` code could process this update as:
-
-::
-
-    export void ispc_func(uniform float array[], uniform bool update[],
-                          uniform int count) {
-        for (uniform int i = 0; i < count; i += programCount) {
-            cif (update[i+programIndex] == true)
-                // update array[i+programIndex]...
-        }
-    }
-
-(In this case a "coherent" if statement is likely to be worthwhile if the
-``update`` array will tend to have sections that are either all-true or
-all-false.)
-
-Explicit Vector Programming With Uniform Short Vector Types
------------------------------------------------------------
-
-The typical model for programming in ``ispc`` is an *implicit* parallel
-model, where one writes a program that is apparently doing scalar
-computation on values and the program is then vectorized to run in parallel
-across the SIMD lanes of a processor. However, ``ispc`` also has some
-support for explicit vector unit programming, where the vectorization is
-explicit. Some computations may be more effectively described in the
-explicit model rather than the implicit model.
-
-This support is provided via ``uniform`` instances of short vectors
-(as were introduced in the `Short Vector Types`_ section). Specifically,
-if this short program
-
-::
-
-    export uniform float<8> madd(uniform float<8> a,
-                                 uniform float<8> b, uniform float<8> c) {
-        return a + b * c;
-    }
-
-is compiled with the AVX target, ``ispc`` generates the following assembly:
-
-::
-
-    _madd:
-        vmulps  %ymm2, %ymm1, %ymm1
-        vaddps  %ymm0, %ymm1, %ymm0
-        ret
-
-(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two
-``addps`` instructions are generated, and so forth.)
-
-Note that ``ispc`` doesn't currently support control flow based on
-``uniform`` short vector types; it is thus not possible to write code like:
-
-::
-
-    export uniform int<8> count(uniform float<8> a, uniform float<8> b) {
-        uniform int<8> sum = 0;
-        while (a++ < b)
-            ++sum;
-    }
-
-
-Choosing A Target Vector Width
-------------------------------
-
-By default, ``ispc`` compiles to the natural vector width of the target
-instruction set. For example, for SSE2 and SSE4, it compiles four-wide,
-and for AVX, it compiles 8-wide. For some programs, higher performance may
-be seen if the program is compiled to a doubled vector width--8-wide for
-SSE and 16-wide for AVX.
-
-For workloads that don't require many registers, this method can lead to
-significantly more efficient execution thanks to greater instruction-level
-parallelism and amortization of various overhead over more program
-instances. For other workloads, it may lead to a slowdown due to higher
-register pressure; trying both approaches for key kernels may be
-worthwhile.
-
-This option is only available for each of the SSE2, SSE4 and AVX targets.
-It is selected with the ``--target=sse2-x2``, ``--target=sse4-x2`` and
-``--target=avx-x2`` options, respectively.
-
-
-Compiling With Support For Multiple Instruction Sets
-----------------------------------------------------
-
-``ispc`` can also generate output that supports multiple target instruction
-sets, choosing the most appropriate one at runtime. For example, if you
-run the command:
-
-::
-
-    ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2
-
-Then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
-``foo_avx.o``, and ``foo.o``. [#]_ Link all of these into your executable, and
-when you call a function in ``foo.ispc`` from your application code,
-``ispc`` will determine which instruction sets are supported by the CPU the
-code is running on and will call the most appropriate version of the
-function available.
-
-.. [#] Similarly, if you choose to generate assembly language output or
-   LLVM bitcode output, multiple versions of those files will be created.
-
-In general, the version of the function that runs will be the one for the
-most general instruction set that is supported by the system. If you only
-compile SSE2 and SSE4 variants and run on a system that supports AVX, for
-example, then the SSE4 variant will be executed. If the system
-is not able to run any of the available variants of the function (for
-example, trying to run a function that only has SSE4 and AVX variants on a
-system that only supports SSE2), then the standard library ``abort()``
-function will be called.
-
-One subtlety is that all non-static global variables (if any) must have the
-same size and layout with all of the targets used. For example, if you
-have the global variables:
-
-::
-
-    uniform int foo[2*programCount];
-    int bar;
-
-and compile to both SSE2 and AVX targets, both of these variables will have
-different sizes (the first due to ``programCount`` having the value 4 for
-SSE2 and 8 for AVX, and the second due to ``varying`` types having different
-numbers of elements with the two targets--essentially the same issue as the
-first).
-
-Implementing Reductions Efficiently
------------------------------------
-
-It's often necessary to compute a "reduction" over a data set--for example,
-one might want to add all of the values in an array, compute their minimum,
-etc. ``ispc`` provides a few capabilities that make it easy to efficiently
-compute reductions like these. However, it's important to use these
-capabilities appropriately for best results.
-
-As an example, consider the task of computing the sum of all of the values
-in an array. In C code, we might have:
-
-::
-
-    /* C implementation of a sum reduction */
-    float sum(const float array[], int count) {
-        float sum = 0;
-        for (int i = 0; i < count; ++i)
-            sum += array[i];
-        return sum;
-    }
-
-Of course, exactly this computation could also be expressed in ``ispc``,
-though without any benefit from vectorization:
-
-::
-
-    /* inefficient ispc implementation of a sum reduction */
-    uniform float sum(const uniform float array[], uniform int count) {
-        uniform float sum = 0;
-        for (uniform int i = 0; i < count; ++i)
-            sum += array[i];
-        return sum;
-    }
-
-As a first try, one might use the ``reduce_add()`` function from the
-``ispc`` standard library; it takes a ``varying`` value and returns the sum
-of that value across all of the active program instances (see
-`Cross-Program Instance Operations`_ for more details).
-
-::
-
-    /* inefficient ispc implementation of a sum reduction */
-    uniform float sum(const uniform float array[], uniform int count) {
-        uniform float sum = 0;
-        // Assumes programCount evenly divides count
-        for (uniform int i = 0; i < count; i += programCount)
-            sum += reduce_add(array[i+programIndex]);
-        return sum;
-    }
-
-This implementation loads a set of ``programCount`` values from the array,
-one for each of the program instances, and then uses ``reduce_add()`` to
-reduce across the program instances and update the sum. Unfortunately,
-this approach loses most of the benefit of vectorization, as it does more
-work in the cross-program instance ``reduce_add()`` call than it saves from
-the vector load of values.
-
-The most efficient approach is to do the reduction in two phases: rather
-than using a ``uniform`` variable to store the sum, we maintain a varying
-value, such that each program instance is effectively computing a local
-partial sum on the subset of array values that it has loaded from the
-array. When the loop over array elements concludes, a single call to
-``reduce_add()`` computes the final reduction across each of the program
-instances' elements of ``sum``. This approach effectively compiles to a
-single vector load and a single vector add for each ``programCount`` worth
-of values--very efficient code in the end.
-
-::
-
-    /* good ispc implementation of a sum reduction */
-    uniform float sum(const uniform float array[], uniform int count) {
-        float sum = 0;
-        // Assumes programCount evenly divides count
-        for (uniform int i = 0; i < count; i += programCount)
-            sum += array[i+programIndex];
-        return reduce_add(sum);
-    }
-
 Disclaimer and Legal Information
 ================================
 
docs/perf.txt | 4 (new file)
@@ -0,0 +1,4 @@
+==============================================
+Intel® SPMD Program Compiler Performance Guide
+==============================================
+