==============================================
Intel® SPMD Program Compiler Performance Guide
==============================================

* `Using ISPC Effectively`_

  + `Gather and Scatter`_
  + `8 and 16-bit Integer Types`_
  + `Low-level Vector Tricks`_
  + `The "Fast math" Option`_
  + `"Inline" Aggressively`_
  + `Small Performance Tricks`_
  + `Instrumenting Your ISPC Programs`_
  + `Choosing A Target Vector Width`_
  + `Implementing Reductions Efficiently`_

* `Disclaimer and Legal Information`_

* `Optimization Notice`_

Using ISPC Effectively
======================

Gather and Scatter
------------------

In some ways, the CPU is a poor fit for SPMD execution; the worst of these
is its handling of general memory reads and writes from SPMD program
instances. For example, consider a "simple" array index:

::

    int i = ....;
    uniform float x[10] = { ... };
    float f = x[i];

Since the index ``i`` is a varying value, the SPMD program instances will
in general read different locations in the array ``x``. Because the CPU
doesn't have a gather instruction, the ``ispc`` compiler must serialize
these memory reads, performing a separate memory load for each running
program instance and packing the results into ``f``. (The analogous case
happens for a write to ``x[i]``.)

In many cases, gathers like these are unavoidable; the running program
instances simply need to access incoherent memory locations. However, if
the array index ``i`` can be declared and used as a ``uniform`` variable,
the resulting array access is substantially more efficient. This is
another case where using ``uniform`` whenever applicable pays off.

In some cases, the ``ispc`` compiler can deduce that the memory locations
accessed are either all the same or are uniform. For example, given:

::

    uniform int x = ...;
    int y = x;
    return array[y];

the compiler can determine that all of the program instances load from the
same location, even though ``y`` is not a ``uniform`` variable. In this
case, the compiler transforms the load into a regular vector load rather
than a general gather.

Sometimes the running program instances access a linear sequence of memory
locations; this happens most frequently when array indexing is based on
the built-in ``programIndex`` variable. In many of these cases, the
compiler is also able to detect the pattern and issue a vector load. For
example, given:

::

    uniform int x = ...;
    return array[2*x + programIndex];

a regular vector load is done from ``array``, starting at offset ``2*x``.

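To make these cases concrete, the following sketch shows the same table
lookup written three ways; the function and array names are illustrative,
not from this guide. The first form forces a serialized gather, while the
other two compile to efficient scalar or vector loads:

::

    uniform float table[64] = { ... };

    float lookup_gather(int i) {
        return table[i];               // varying index: serialized gather
    }

    float lookup_broadcast(uniform int i) {
        return table[i];               // uniform index: single scalar load
    }

    float lookup_sequential(uniform int base) {
        return table[base + programIndex];  // linear indices: vector load
    }
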
8 and 16-bit Integer Types
--------------------------

The code generated for 8 and 16-bit integer types is generally not as
efficient as the code generated for 32-bit integer types. It is therefore
usually worthwhile to use 32-bit integer types for intermediate
computations, even if the final result will be stored in a smaller integer
type.

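As a hedged sketch of that pattern, the following routine widens 8-bit
values to 32-bit integers for its intermediate arithmetic and narrows
back only when storing; the function name and the particular computation
are illustrative:

::

    void scale_bytes(const uniform unsigned int8 in[],
                     uniform unsigned int8 out[], uniform int count) {
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount) {
            int v = in[i + programIndex];      // widen to int32
            v = (v * 200) >> 8;                // do the math at 32 bits
            out[i + programIndex] = (unsigned int8)v;  // narrow on store
        }
    }
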
Low-level Vector Tricks
-----------------------

Many low-level Intel® SSE coding constructs can be implemented in ``ispc``
code. For example, the following code efficiently reverses the sign of the
given values:

::

    float flipsign(float a) {
        unsigned int i = intbits(a);
        i ^= 0x80000000;
        return floatbits(i);
    }

This code compiles down to a single XOR instruction.

The "Fast math" Option
----------------------

``ispc`` has a ``--fast-math`` command-line flag that enables a number of
optimizations that may be undesirable in code where numerical precision is
critically important. For many graphics applications, the approximations
may be acceptable. The flag is off by default. The following two
optimizations are performed when ``--fast-math`` is used:

* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
  transformed to ``x * (1./y)``, where the inverse value of ``y`` is
  precomputed at compile time.

* Expressions like ``x / y``, where ``y`` is not a compile-time constant,
  are transformed to ``x * rcp(y)``, where ``rcp()`` maps to the
  approximate reciprocal instruction from the standard library.

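To make the two transformations concrete, here is a hedged sketch of code
they would apply to (the functions themselves are illustrative, not from
this guide):

::

    float scale(float x) {
        return x / 3.;     // constant divisor: becomes x * (1./3.)
    }

    float ratio(float x, float y) {
        return x / y;      // non-constant divisor: becomes x * rcp(y)
    }
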
"Inline" Aggressively
---------------------

Inlining functions aggressively is generally beneficial for performance
with ``ispc``. Definitely use the ``inline`` qualifier for any short
functions (a few lines long), and experiment with it for longer functions.

Small Performance Tricks
------------------------

Performance is slightly improved by declaring variables at the same block
scope where they are first used. For example, if the lifetime of ``foo``
is only within the scope of the ``if`` clause, write the code like this:

::

    float func() {
        ....
        if (x < y) {
            float foo;
            ... use foo ...
        }
    }

Try not to write the code as:

::

    float func() {
        float foo;
        ....
        if (x < y) {
            ... use foo ...
        }
    }

Following this guideline can reduce the number of masked store
instructions that the compiler needs to generate.

Instrumenting Your ISPC Programs
--------------------------------

``ispc`` has an optional instrumentation feature that can help you
understand performance issues. If a program is compiled using the
``--instrument`` flag, the compiler emits calls to a function with the
following signature at various points in the program (for example, at
interesting points in the control flow, and when scatters or gathers
happen):

::

    extern "C" {
        void ISPCInstrument(const char *fn, const char *note,
                            int line, int mask);
    }

This function is passed the file name of the ``ispc`` file running, a
short note indicating what is happening, the line number in the source
file, and the current mask of active SPMD program lanes. You must provide
an implementation of this function and link it in with your application.

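As a sketch of what such an implementation might look like, here is a
minimal ``ISPCInstrument`` in C that tallies calls and active lanes; the
helper function, the counters, and the assumed four-wide target are all
illustrative rather than part of the ``ispc`` distribution:

::

    /* Assumed four-wide SSE target; adjust for your vector width. */
    static const int kLanesPerCall = 4;
    static long long num_calls = 0;
    static long long num_active_lanes = 0;

    /* Count the set bits in the lane mask. */
    static int count_bits(int mask) {
        int n = 0;
        for (; mask != 0; mask >>= 1)
            n += mask & 1;
        return n;
    }

    void ISPCInstrument(const char *fn, const char *note,
                        int line, int mask) {
        (void)fn; (void)note; (void)line;
        ++num_calls;
        num_active_lanes += count_bits(mask);
    }

    /* Call at program exit to report average lane utilization. */
    double ISPCAverageActiveLanes(void) {
        if (num_calls == 0)
            return 0.0;
        return (double)num_active_lanes /
               (double)(num_calls * kLanesPerCall);
    }
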
For example, when the ``ispc`` program runs, this function might be called
as follows:

::

    ISPCInstrument("foo.ispc", "function entry", 55, 0xf);

This call indicates that the currently executing program has just entered
the function defined at line 55 of the file ``foo.ispc``, with a mask of
all lanes currently executing (assuming a four-wide Intel® SSE target
machine).

For a fuller example of the utility of this functionality, see
``examples/aobench_instrumented`` in the ``ispc`` distribution. This
example includes an implementation of the ``ISPCInstrument`` function that
collects aggregate data about the program's execution behavior.

When running this example, you will want to direct the ``ao`` executable
to generate a low-resolution image, because the instrumentation adds
substantial execution overhead. For example:

::

    % ./ao 1 32 32

After the ``ao`` program exits, a summary report along the following lines
will be printed. In the first few lines, you can see how many times a few
functions were called, and the average percentage of SIMD lanes that were
active upon function entry.

::

    ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes
    ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes
    ...

Choosing A Target Vector Width
------------------------------

By default, ``ispc`` compiles to the natural vector width of the target
instruction set. For example, for SSE2 and SSE4, it compiles four-wide,
and for AVX, it compiles eight-wide. For some programs, higher performance
may be seen if the program is compiled to a doubled vector
width--eight-wide for SSE and sixteen-wide for AVX.

For workloads that don't require many registers, this approach can lead to
significantly more efficient execution thanks to greater instruction-level
parallelism and the amortization of various overhead over more program
instances. For other workloads, it may lead to a slowdown due to higher
register pressure; trying both approaches for key kernels may be
worthwhile.

This option is available for each of the SSE2, SSE4, and AVX targets. It
is selected with the ``--target=sse2-x2``, ``--target=sse4-x2``, and
``--target=avx-x2`` options, respectively.

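For example, a hypothetical build invocation using a doubled-width target
(the file names here are illustrative):

::

    % ispc --target=sse4-x2 foo.ispc -o foo.o -h foo.h
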
Implementing Reductions Efficiently
-----------------------------------

It's often necessary to compute a "reduction" over a data set--for
example, one might want to add all of the values in an array, compute
their minimum, and so forth. ``ispc`` provides a few capabilities that
make it easy to compute reductions like these efficiently. However, it's
important to use these capabilities appropriately for best results.

As an example, consider the task of computing the sum of all of the values
in an array. In C code, we might have:

::

    /* C implementation of a sum reduction */
    float sum(const float array[], int count) {
        float sum = 0;
        for (int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

Of course, exactly this computation could also be expressed in ``ispc``,
though without any benefit from vectorization:

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        for (uniform int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

As a first attempt at vectorizing this, one might use the ``reduce_add()``
function from the ``ispc`` standard library; it takes a ``varying`` value
and returns the sum of that value across all of the active program
instances:

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += reduce_add(array[i+programIndex]);
        return sum;
    }

This implementation loads a set of ``programCount`` values from the array,
one for each of the program instances, and then uses ``reduce_add()`` to
reduce across the program instances and update the sum. Unfortunately,
this approach loses most of the benefit of vectorization, as it spends
more work on the cross-program-instance ``reduce_add()`` call than it
saves with the vector load of values.

The most efficient approach is to do the reduction in two phases: rather
than using a ``uniform`` variable to store the sum, we maintain a varying
value, such that each program instance effectively computes a local
partial sum over the subset of array values that it has loaded from the
array. When the loop over array elements concludes, a single call to
``reduce_add()`` computes the final reduction across each of the program
instances' elements of ``sum``. This approach effectively compiles to a
single vector load and a single vector add for each ``programCount``'s
worth of values--very efficient code in the end.

::

    /* good ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += array[i+programIndex];
        return reduce_add(sum);
    }

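The "assumes ``programCount`` evenly divides ``count``" caveat can be
lifted by masking off the out-of-bounds lanes on the final loop iteration.
A hedged sketch of one way to do this by hand:

::

    /* sum reduction handling counts not divisible by programCount */
    uniform float sum(const uniform float array[], uniform int count) {
        float sum = 0;
        for (uniform int i = 0; i < count; i += programCount)
            if (i + programIndex < count)   // mask off extra lanes
                sum += array[i+programIndex];
        return reduce_add(sum);
    }
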
Disclaimer and Legal Information
================================

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS.
NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE
OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT
OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED
NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD
CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time,
without notice. Designers must not rely on the absence or characteristics of any
features or instructions marked "reserved" or "undefined." Intel reserves these
for future definition and shall have no responsibility whatsoever for conflicts
or incompatibilities arising from future changes to them. The information here
is subject to change without notice. Do not finalize a design with this
information.

The products described in this document may contain design defects or errors
known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest
specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this
document, or other Intel literature, may be obtained by calling 1-800-548-4725,
or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers
differentiate features within each processor family, not across different
processor families. See http://www.intel.com/products/processor_number for
details.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom,
Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile,
i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4,
IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside,
Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst,
Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep,
Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium,
Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside,
skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon,
and Xeon Inside are trademarks of Intel Corporation in the U.S. and other
countries.

* Other names and brands may be claimed as the property of others.

Copyright(C) 2011, Intel Corporation. All rights reserved.

Optimization Notice
===================

Intel compilers, associated libraries and associated development tools may
include or utilize options that optimize for instruction sets that are
available in both Intel and non-Intel microprocessors (for example SIMD
instruction sets), but do not optimize equally for non-Intel
microprocessors. In addition, certain compiler options for Intel
compilers, including some that are not specific to Intel
micro-architecture, are reserved for Intel microprocessors. For a detailed
description of Intel compiler options, including the instruction sets and
specific microprocessors they implicate, please refer to the "Intel
Compiler User and Reference Guides" under "Compiler Options." Many library
routines that are part of Intel compiler products are more highly optimized
for Intel microprocessors than for other microprocessors. While the
compilers and libraries in Intel compiler products offer optimizations for
both Intel and Intel-compatible microprocessors, depending on the options
you select, your code and other factors, you likely will get extra
performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may
or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These
optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2),
Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental
Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other
optimizations. Intel does not guarantee the availability, functionality,
or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to
assist in obtaining the best performance on Intel and non-Intel
microprocessors, Intel recommends that you evaluate other compilers and
libraries to determine which best meet your requirements. We hope to win
your business by striving to offer the best performance of any compiler or
library; please let us know if you find we do not.