Documentation work; first pass perf guide complete
docs/faq.txt | 124
@@ -23,7 +23,7 @@ distribution.
 * Programming Techniques

   + `What primitives are there for communicating between SPMD program instances?`_
-  + `How can a gang of program instances generate variable output efficiently?`_
+  + `How can a gang of program instances generate variable amounts of output efficiently?`_
   + `Is it possible to use ispc for explicit vector programming?`_

@@ -48,8 +48,7 @@ If the SSE4 target is used, then the following assembly is printed:

 ::

-  _foo: ## @foo
-  ## BB#0: ## %allocas
+  _foo:
   addl %esi, %edi
   movl %edi, %eax
   ret
@@ -98,7 +97,7 @@ output array.
 }

 Here is the assembly code for the application-callable instance of the
-function--note that the selected instructions are ideal.
+function.

 ::

@@ -111,21 +110,7 @@ function--note that the selected instructions are ideal.


 And here is the assembly code for the ``ispc``-callable instance of the
-function. There are a few things to notice in this code.
-
-The current program mask is coming in via the %xmm0 register and the
-initial few instructions in the function essentially check to see if the
-mask is all-on or all-off. If the mask is all on, the code at the label
-LBB0_3 executes; it's the same as the code that was generated for ``_foo``
-above. If the mask is all off, then there's nothing to be done, and the
-function can return immediately.
-
-In the case of a mixed mask, a substantial amount of code is generated to
-load from and then store to only the array elements that correspond to
-program instances where the mask is on. (This code is elided below). This
-general pattern of having two-code paths for the "all on" and "mixed" mask
-cases is used in the code generated for almost all but the most simple
-functions (where the overhead of the test isn't worthwhile.)
+function.

 ::

@@ -148,11 +133,84 @@ functions (where the overhead of the test isn't worthwhile.)
   ####
   ret

+There are a few things to notice in this code. First, the current program
+mask is coming in via the ``%xmm0`` register, and the initial few
+instructions in the function essentially check to see if the mask is all on
+or all off. If the mask is all on, the code at the label LBB0_3 executes;
+it's the same as the code that was generated for ``_foo`` above. If the
+mask is all off, then there's nothing to be done, and the function can
+return immediately.
+
+In the case of a mixed mask, a substantial amount of code is generated to
+load from and then store to only the array elements that correspond to
+program instances where the mask is on. (This code is elided below.) This
+general pattern of having two code paths for the "all on" and "mixed" mask
+cases is used in the code generated for all but the most simple functions
+(where the overhead of the test isn't worthwhile.)
+
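The two-path pattern described above can be modeled in scalar C. This is only a sketch of the dispatch the compiler emits, not actual ``ispc`` output; the 4-wide gang, the bitmask representation of the execution mask, and the ``add_arrays`` function are illustrative assumptions.

```c
#define GANG 4  /* hypothetical 4-wide gang, as on the SSE4 target */

/* Scalar model of the dispatch around a masked ispc function body: a
   straight-line fast path when the mask is all on, an immediate return
   when it is all off, and per-lane masked stores for a mixed mask. */
void add_arrays(const int *a, const int *b, int *out, unsigned mask) {
    if (mask == (1u << GANG) - 1) {        /* all on: no masking needed */
        for (int i = 0; i < GANG; ++i)
            out[i] = a[i] + b[i];
    } else if (mask == 0) {                /* all off: nothing to do */
        return;
    } else {                               /* mixed: honor the mask per lane */
        for (int i = 0; i < GANG; ++i)
            if (mask & (1u << i))
                out[i] = a[i] + b[i];
    }
}
```

In real generated code the "all on" branch is a single unmasked vector add; the mixed branch is where the bulk of the extra instructions come from.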
 How can I more easily see gathers and scatters in generated assembly?
 ---------------------------------------------------------------------

-FIXME
+Because CPU vector ISAs don't have native gather and scatter instructions,
+these memory operations are turned into sequences of simpler instructions
+in the code that ``ispc`` generates. In some cases, it can be useful to
+see where gathers and scatters actually happen in code; there is an
+otherwise undocumented command-line flag that provides this information.
+
+Consider this simple program:
+
+::
+
+  void set(uniform int a[], int value, int index) {
+      a[index] = value;
+  }
+
+When compiled normally to the SSE4 target, this program generates this
+extensive code sequence, which makes it more difficult to see what the
+program is actually doing.
+
+::
+
+  "_set___uptr<Ui>ii":
+  pmulld LCPI0_0(%rip), %xmm1
+  movmskps %xmm2, %eax
+  testb $1, %al
+  je LBB0_2
+  movd %xmm1, %ecx
+  movd %xmm0, (%rcx,%rdi)
+  LBB0_2:
+  testb $2, %al
+  je LBB0_4
+  pextrd $1, %xmm1, %ecx
+  pextrd $1, %xmm0, (%rcx,%rdi)
+  LBB0_4:
+  testb $4, %al
+  je LBB0_6
+  pextrd $2, %xmm1, %ecx
+  pextrd $2, %xmm0, (%rcx,%rdi)
+  LBB0_6:
+  testb $8, %al
+  je LBB0_8
+  pextrd $3, %xmm1, %eax
+  pextrd $3, %xmm0, (%rax,%rdi)
+  LBB0_8:
+  ret
+
+If this program is compiled with the
+``--opt=disable-handle-pseudo-memory-ops`` command-line flag, then the
+scatter is left as an unresolved function call. The resulting program
+won't link, due to the unresolved symbol, but the assembly output is much
+easier to understand:
+
+::
+
+  "_set___uptr<Ui>ii":
+  movaps %xmm0, %xmm3
+  pmulld LCPI0_0(%rip), %xmm1
+  movdqa %xmm1, %xmm0
+  movaps %xmm3, %xmm1
+  jmp ___pseudo_scatter_base_offsets32_32 ## TAILCALL
+
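The per-lane test/extract/store blocks in the SSE4 output above follow a simple pattern, sketched here in scalar C. This is a model of the lowering, not the compiler's implementation; the 4-wide gang and the ``scatter4`` helper (which takes element indices rather than the byte offsets the real code computes with ``pmulld``) are assumptions.

```c
/* Scalar model of how a scatter is lowered on CPU targets without a
   native scatter instruction: one test/extract/store sequence per lane,
   mirroring the testb/pextrd/movd blocks in the assembly above (the
   real sequence is fully unrolled rather than a loop). */
void scatter4(int *base, const int offsets[4], const int values[4],
              unsigned mask) {
    for (int lane = 0; lane < 4; ++lane)
        if (mask & (1u << lane))             /* testb $1/$2/$4/$8, %al */
            base[offsets[lane]] = values[lane];  /* extract + scalar store */
}
```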
 Interoperability
 ================
@@ -301,13 +359,17 @@ need to synchronize the program instances before communicating between
 them, due to the synchronized execution model of gangs of program instances
 in ``ispc``.

-How can a gang of program instances generate variable output efficiently?
--------------------------------------------------------------------------
+How can a gang of program instances generate variable amounts of output efficiently?
+------------------------------------------------------------------------------------

-A useful application of the ``exclusive_scan_add()`` function in the
-standard library is when program instances want to generate a variable
-amount of output and when one would like that output to be densely packed
-in a single array. For example, consider the code fragment below:
+It's not unusual to have a gang of program instances where each program
+instance generates a variable amount of output (perhaps some generate no
+output, some generate one output value, some generate many output values,
+and so forth), and where one would like to have the output densely packed
+in an output array. The ``exclusive_scan_add()`` function from the
+standard library is quite useful in this situation.
+
+Consider the following function:

 ::

@@ -331,11 +393,11 @@ value, the second two values, and the third and fourth three values each.
 In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
 to the four program instances, respectively.

-The first program instance will write its one result to ``outArray[0]``,
-the second will write its two values to ``outArray[1]`` and
-``outArray[2]``, and so forth. The ``reduce_add`` call at the end returns
-the total number of values that all of the program instances have written
-to the array.
+The first program instance will then write its one result to
+``outArray[0]``, the second will write its two values to ``outArray[1]``
+and ``outArray[2]``, and so forth. The ``reduce_add()`` call at the end
+returns the total number of values that all of the program instances have
+written to the array.

 FIXME: add discussion of foreach_active as an option here once that's in

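The packing pattern that ``exclusive_scan_add()`` and ``reduce_add()`` enable can be sketched in scalar C. This is a model of the semantics, not the ``ispc`` standard library; the 4-wide gang and the ``pack_offsets`` helper are assumptions, chosen to match the (1, 2, 3, 3) example in the text.

```c
#define GANG 4  /* stands in for the gang width / programCount */

/* Scalar model of the packing setup described above: each "program
   instance" will produce count[i] values; an exclusive prefix sum gives
   each one its starting slot in the densely packed output array, and the
   grand total (reduce_add's role) is the number of values written. */
int pack_offsets(const int count[GANG], int offset[GANG]) {
    int total = 0;
    for (int i = 0; i < GANG; ++i) {  /* exclusive_scan_add() */
        offset[i] = total;
        total += count[i];
    }
    return total;                     /* reduce_add() of the counts */
}
```

With counts (1, 2, 3, 3) this yields the offsets (0, 1, 3, 6) quoted above.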
docs/ispc.txt | 157
@@ -48,6 +48,8 @@ Contents:

 * `Recent Changes to ISPC`_

+  + `Updating ISPC Programs For Changes In ISPC 1.1`_
+
 * `Getting Started with ISPC`_

   + `Installing ISPC`_
@@ -62,18 +64,27 @@ Contents:

 * `The ISPC Language`_

+  + `Relationship To The C Programming Language`_
   + `Lexical Structure`_
-  + `Basic Types and Type Qualifiers`_
-  + `Function Pointer Types`_
-  + `Enumeration Types`_
-  + `Short Vector Types`_
-  + `Struct and Array Types`_
+  + `Types`_
+
+    * `Basic Types and Type Qualifiers`_
+    * `Function Pointer Types`_
+    * `Enumeration Types`_
+    * `Short Vector Types`_
+    * `Struct and Array Types`_
+
   + `Declarations and Initializers`_
-  + `Function Declarations`_
   + `Expressions`_
   + `Control Flow`_
-  + `Functions`_
-  + `C Constructs not in ISPC`_
+
+    * `Conditional Statements: "if"`_
+    * `Basic Iteration Statements: "for", "while", and "do"`_
+    * `Parallel Iteration Statements: "foreach" and "foreach_tiled"`_
+    * `Functions and Function Calls`_
+
+      + `Function Declarations`_
+      + `Function Overloading`_

 * `Parallel Execution Model in ISPC`_

@@ -110,6 +121,8 @@ Contents:
   + `Restructuring Existing Programs to Use ISPC`_
   + `Understanding How to Interoperate With the Application's Data`_

+* `Related Languages`_
+
 * `Disclaimer and Legal Information`_

 * `Optimization Notice`_
@@ -120,6 +133,9 @@ Recent Changes to ISPC
 See the file ``ReleaseNotes.txt`` in the ``ispc`` distribution for a list
 of recent changes to the compiler.

+Updating ISPC Programs For Changes In ISPC 1.1
+----------------------------------------------
+
 Getting Started with ISPC
 =========================

@@ -407,6 +423,9 @@ parallel execution model (versus C's serial model), C code is not directly
 portable to ``ispc``, although starting with working C code and porting it
 to ``ispc`` can be an efficient way to write ``ispc`` programs.

+Relationship To The C Programming Language
+------------------------------------------
+
 Lexical Structure
 -----------------

@@ -541,6 +560,9 @@ A number of tokens are used for grouping in ``ispc``:
 - Compound statements


+Types
+-----
+
 Basic Types and Type Qualifiers
 -------------------------------

@@ -906,31 +928,6 @@ Structures can also be initialized only with element values in braces:
   Color d = { 0.5, .75, 1.0 }; // r = 0.5, ...


-Function Declarations
----------------------
-
-Functions can be declared with a number of qualifiers that affect their
-visibility and capabilities. As in C/C++, functions have global visibility
-by default. If a function is declared with a ``static`` qualifier, then it
-is only visible in the file in which it was declared.
-
-Any function that can be launched with the ``launch`` construct in ``ispc``
-must have a ``task`` qualifier; see `Task Parallelism: Language Syntax`_
-for more discussion of launching tasks in ``ispc``.
-
-Functions that are intended to be called from C/C++ application code must
-have the ``export`` qualifier. This causes them to have regular C linkage
-and to have their declarations included in header files, if the ``ispc``
-compiler is directed to generated a C/C++ header file for the file it
-compiled.
-
-Finally, any function defined with an ``inline`` qualifier will always be
-inlined by ``ispc``; ``inline`` is not a hint, but forces inlining. The
-compiler will opportunistically inline short functions depending on their
-complexity, but any function that should always be inlined should have the
-``inline`` qualifier.
-
 Expressions
 -----------

@@ -959,6 +956,46 @@ Structure member access and array indexing also work as in C.
 Control Flow
 ------------

+Conditional Statements: "if"
+----------------------------
+
+Basic Iteration Statements: "for", "while", and "do"
+----------------------------------------------------
+
+Parallel Iteration Statements: "foreach" and "foreach_tiled"
+------------------------------------------------------------
+
+Functions and Function Calls
+----------------------------
+
+Function Declarations
+---------------------
+
+Functions can be declared with a number of qualifiers that affect their
+visibility and capabilities. As in C/C++, functions have global visibility
+by default. If a function is declared with a ``static`` qualifier, then it
+is only visible in the file in which it was declared.
+
+Any function that can be launched with the ``launch`` construct in ``ispc``
+must have a ``task`` qualifier; see `Task Parallelism: Language Syntax`_
+for more discussion of launching tasks in ``ispc``.
+
+Functions that are intended to be called from C/C++ application code must
+have the ``export`` qualifier. This causes them to have regular C linkage
+and to have their declarations included in header files, if the ``ispc``
+compiler is directed to generate a C/C++ header file for the file it
+compiled.
+
+Finally, any function defined with an ``inline`` qualifier will always be
+inlined by ``ispc``; ``inline`` is not a hint, but forces inlining. The
+compiler will opportunistically inline short functions depending on their
+complexity, but any function that should always be inlined should have the
+``inline`` qualifier.
+
+Function Overloading
+--------------------
+
 ``ispc`` supports most of C's control flow constructs, including ``if``,
 ``for``, ``while``, ``do``. You can use ``break`` and ``continue``
 statements in ``for``, ``while``, and ``do`` loops.
@@ -1335,13 +1372,8 @@ at run-time, in which case it can jump to a simpler code path or otherwise
 save work.

 The first of these statements is ``cif``, indicating an ``if`` statement
-that is expected to be coherent. Recall from the `The
-SPMD-on-SIMD Execution Model`_ section that ``if`` statements with a
-``uniform`` test compile to more efficient code than ``if`` tests with
-varying tests. ``cif`` can provide many benefits of ``if`` with a
-uniform test in the case where the test is actually varying.
-
-The usage of ``cif`` in code is just the same as ``if``:
+that is expected to be coherent. The usage of ``cif`` in code is just the
+same as ``if``:

 ::

@@ -1353,47 +1385,7 @@ The usage of ``cif`` in code is just the same as ``if``:

 ``cif`` provides a hint to the compiler that you expect that most of the
 executing SPMD programs will all have the same result for the ``if``
-condition. In this case, the code the compiler generates for the ``if``
-test is along the lines of the following pseudo-code:
-
-::
-
-  bool expr = /* evaluate cif condition */
-  if (all(expr)) {
-      // run "true" case of if test only
-  } else if (!any(expr)) {
-      // run "false" case of if test only
-  } else {
-      // run both true and false cases, updating mask appropriately
-  }
-
-(For comparison, see the discussion of how regular ``if`` statements are
-executed from the `The SPMD-on-SIMD Execution Model`_
-section.)
-
-For ``if`` statements where the different running SPMD program instances
-don't have coherent values for the boolean ``if`` test, using ``cif``
-introduces some additional overhead from the ``all`` and ``any`` tests as
-well as the corresponding branches. For cases where the program
-instances often do compute the same boolean value, this overhead is
-worthwhile. If the control flow is in fact usually incoherent, this
-overhead only costs performance.
-
-In a similar fashion, ``ispc`` provides ``cfor``, ``cwhile``, ``cdo``,
-``cbreak``, ``ccontinue``, and ``creturn`` statements. These statements
-are semantically the same as the corresponding non-"c"-prefixed functions.
-
-For example, when ``ispc`` encounters a regular ``continue`` statement in
-the middle of loop, it disables the mask bits for the program instances
-that executed the ``continue`` and then executes the remainder of the loop
-body, under the expectation that other executing program instances will
-still need to run those instructions. If you expect that all running
-program instances will often execute ``continue`` together, then
-``ccontinue`` provides the compiler a hint to do extra work to check if
-every running program instance continued, in which case it can jump to the
-end of the loop, saving the work of executing the otherwise meaningless
-instructions.
-
+condition.

 Program Instance Convergence
 ----------------------------
@@ -2952,6 +2944,9 @@ elements to work with and then proceeds with the computation.
   }


+Related Languages
+=================
+
 Disclaimer and Legal Information
 ================================

docs/perf.txt | 587
@@ -2,38 +2,275 @@
 Intel® SPMD Program Compiler Performance Guide
 ==============================================

+The SPMD programming model provided by ``ispc`` naturally delivers
+excellent performance for many workloads thanks to efficient use of CPU
+SIMD vector hardware. This guide provides more details about how to get
+the most out of ``ispc`` in practice.
+
-* `Using ISPC Effectively`_
+* `Key Concepts`_

-  + `Gather and Scatter`_
-  + `8 and 16-bit Integer Types`_
-  + `Low-level Vector Tricks`_
-  + `The "Fast math" Option`_
-  + `"Inline" Aggressively`_
-  + `Small Performance Tricks`_
-  + `Instrumenting Your ISPC Programs`_
-  + `Choosing A Target Vector Width`_
-  + `Implementing Reductions Efficiently`_
+  + `Efficient Iteration With "foreach"`_
+  + `Improving Control Flow Coherence With "foreach_tiled"`_
+  + `Using Coherent Control Flow Constructs`_
+  + `Use "uniform" Whenever Appropriate`_
+
+* `Tips and Techniques`_
+
+  + `Understanding Gather and Scatter`_
+  + `Avoid 64-bit Addressing Calculations When Possible`_
+  + `Avoid Computation With 8 and 16-bit Integer Types`_
+  + `Implementing Reductions Efficiently`_
+  + `Using Low-level Vector Tricks`_
+  + `The "Fast math" Option`_
+  + `"inline" Aggressively`_
+  + `Avoid The System Math Library`_
+  + `Declare Variables In The Scope Where They're Used`_
+  + `Instrumenting ISPC Programs To Understand Runtime Behavior`_
+  + `Choosing A Target Vector Width`_

 * `Disclaimer and Legal Information`_

 * `Optimization Notice`_

+Key Concepts
+============
+
-don't use the system math library unless it's absolutely necessary
-
-opt=32-bit-addressing
+This section describes the four most important concepts to understand and
+keep in mind when writing high-performance ``ispc`` programs. It assumes
+good familiarity with the topics covered in the ``ispc`` `Users Guide`_.
+
+.. _Users Guide: ispc.html

-Using ISPC Effectively
-======================
+Efficient Iteration With "foreach"
+----------------------------------

+The ``foreach`` parallel iteration construct is semantically equivalent to
+a regular ``for()`` loop, though it offers meaningful performance benefits.
+(See the `documentation on "foreach" in the Users Guide`_ for a review of
+its syntax and semantics.) As an example, consider this simple function
+that iterates over some number of elements in an array, doing computation
+on each one:
+
+.. _documentation on "foreach" in the Users Guide: ispc.html#parallel-iteration-statements-foreach-and-foreach-tiled
+
+::
+
+  export void foo(uniform int a[], uniform int count) {
+      for (int i = programIndex; i < count; i += programCount) {
+          // do some computation on a[i]
+      }
+  }
+
+Depending on the specifics of the computation being performed, the code
+generated for this function could likely be improved by modifying the code
+so that the loop only goes as far through the data as is possible to pack
+an entire gang of program instances with computation each time through the
+loop. Doing so enables the ``ispc`` compiler to generate more efficient
+code for cases where it knows that the execution mask is "all on". Then,
+an ``if`` statement at the end handles processing the ragged extra bits of
+data that didn't fully fill a gang.
+
+::
+
+  export void foo(uniform int a[], uniform int count) {
+      // First, just loop up to the point where all program instances
+      // in the gang will be active at the loop iteration start
+      uniform int countBase = count & ~(programCount-1);
+      for (uniform int i = 0; i < countBase; i += programCount) {
+          int index = i + programIndex;
+          // do some computation on a[index]
+      }
+      // Now handle the ragged extra bits at the end
+      if (countBase < count) {
+          int index = countBase + programIndex;
+          // do some computation on a[index]
+      }
+  }
+
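The loop split shown above can be modeled in scalar C. This is a sketch under assumptions, not generated code: the inner lane loop stands in for one gang-wide SIMD operation, ``GANG`` stands in for ``programCount``, and the doubling computation is a placeholder for the real loop body.

```c
#define GANG 4  /* stands in for programCount */

/* Scalar model of the split above: round count down to a multiple of the
   gang width so the main loop always runs with a full gang ("all on"
   mask), then handle the ragged tail once at the end. */
void foo(int a[], int count) {
    int countBase = count & ~(GANG - 1);  /* largest multiple of GANG <= count */
    for (int i = 0; i < countBase; i += GANG)
        for (int lane = 0; lane < GANG; ++lane)
            a[i + lane] *= 2;             /* full gang: no mask needed */
    for (int index = countBase; index < count; ++index)
        a[index] *= 2;                    /* ragged extra bits at the end */
}
```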
+While the performance of the above code will likely be better than the
+first version of the function, the loop body code has been duplicated (or
+has been forced to move into a separate utility function).
+
+Using the ``foreach`` looping construct as below provides all of the
+performance benefits of the second version of this function, with the
+compactness of the first.
+
+::
+
+  export void foo(uniform int a[], uniform int count) {
+      foreach (i = 0 ... count) {
+          // do some computation on a[i]
+      }
+  }
+
+Improving Control Flow Coherence With "foreach_tiled"
+-----------------------------------------------------
+
+Depending on the computation being performed, ``foreach_tiled`` may give
+better performance than ``foreach``. (See the `documentation in the Users
+Guide`_ for the syntax and semantics of ``foreach_tiled``.) Given a
+multi-dimensional iteration like:
+
+.. _documentation in the Users Guide: ispc.html#parallel-iteration-statements-foreach-and-foreach-tiled
+
+::
+
+  foreach (i = 0 ... width, j = 0 ... height) {
+      // do computation on element (i,j)
+  }
+
+if the ``foreach`` statement is used, elements in the gang of program
+instances will be mapped to values of ``i`` and ``j`` by taking spans of
+``programCount`` elements across ``i`` with a single value of ``j``. For
+example, the ``foreach`` statement above roughly corresponds to:
+
+::
+
+  for (uniform int j = 0; j < height; ++j)
+      for (int i = 0; i < width; i += programCount) {
+          // do computation
+      }
+
+When a multi-dimensional domain is being iterated over, the
+``foreach_tiled`` statement maps program instances to data in a way that
+tries to select square n-dimensional segments of the domain. For example,
+on a compilation target with 8-wide gangs of program instances, it
+generates code that iterates over the domain the same way as the following
+code (though more efficiently):
+
+::
+
+  for (int j = programIndex/4; j < height; j += 2)
+      for (int i = programIndex%4; i < width; i += 4) {
+          // do computation
+      }
+
+Thus, each gang of program instances operates on a 2x4 tile of the domain.
+With higher-dimensional iteration and different gang sizes, a similar
+mapping is performed--e.g. for 2D iteration with a 16-wide gang size, 4x4
+tiles are iterated over; for 4D iteration with an 8-wide gang, 1x2x2x2
+tiles are processed, and so forth.
+
+Performance benefit can come from using ``foreach_tiled`` in that it
+essentially optimizes for the benefit of iterating over *compact* regions
+of the domain (while ``foreach`` iterates over the domain in a way that
+generally allows linear memory access.) There are two benefits from
+processing compact regions of the domain.
+
+First, it's often the case that the control flow coherence of the program
+instances in the gang is improved; if data-dependent control flow decisions
+are related to the values of the data in the domain being processed, and if
+the data values have some coherence, iterating with compact regions will
+improve control flow coherence.
+
+Second, processing compact regions may mean that the data accessed by
+program instances in the gang is more coherent, leading to performance
+benefits from better cache hit rates.
+
+As a concrete example, for the ray tracer example in the ``ispc``
+distribution (in the ``examples/rt`` directory), performance is 20% better
+when the pixels are iterated over using ``foreach_tiled`` than ``foreach``,
+because more coherent regions of the scene are accessed by the set of rays
+in the gang of program instances.
+
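The 8-wide tile mapping sketched above (``programIndex/4`` and ``programIndex%4``) assigns each lane a fixed position inside a 2x4 tile. A minimal C sketch of that per-lane coordinate computation, with the ``tile_coords`` helper as an illustrative assumption:

```c
/* Scalar model of the 8-wide foreach_tiled mapping above: lane k of the
   gang starts at column k%4 and row k/4, so the eight lanes together
   cover a tile that is 2 rows high and 4 columns wide. */
void tile_coords(int lane, int *i, int *j) {
    *i = lane % 4;   /* column within the 2x4 tile */
    *j = lane / 4;   /* row within the 2x4 tile */
}
```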
Using Coherent Control Flow Constructs
--------------------------------------

Recall from the `SPMD-on-SIMD Execution Model section`_ of the ``ispc``
Users Guide that ``if`` statements with a ``uniform`` test compile to more
efficient code than ``if`` statements with varying tests. The coherent
``cif`` statement can provide many of the benefits of an ``if`` with a
uniform test in cases where the test is actually varying.

.. _SPMD-on-SIMD Execution Model section: ispc.html#the-spmd-on-simd-execution-model

In this case, the code the compiler generates for the ``cif`` test is
along the lines of the following pseudo-code:

::

    bool expr = /* evaluate cif condition */
    if (all(expr)) {
        // run "true" case of if test only
    } else if (!any(expr)) {
        // run "false" case of if test only
    } else {
        // run both true and false cases, updating mask appropriately
    }

For ``if`` statements where the different running SPMD program instances
don't have coherent values for the boolean ``if`` test, using ``cif``
introduces some additional overhead from the ``all`` and ``any`` tests as
well as the corresponding branches. For cases where the program
instances often do compute the same boolean value, this overhead is
worthwhile. If the control flow is in fact usually incoherent, this
overhead only costs performance.

In a similar fashion, ``ispc`` provides ``cfor``, ``cwhile``, ``cdo``,
``cbreak``, ``ccontinue``, and ``creturn`` statements. These statements
are semantically the same as the corresponding non-"c"-prefixed statements.

For example, when ``ispc`` encounters a regular ``continue`` statement in
the middle of a loop, it disables the mask bits for the program instances
that executed the ``continue`` and then executes the remainder of the loop
body, under the expectation that other executing program instances will
still need to run those instructions. If you expect that all running
program instances will often execute ``continue`` together, then
``ccontinue`` provides the compiler a hint to do extra work to check
whether every running program instance continued, in which case it can
jump to the end of the loop, saving the work of executing the otherwise
meaningless instructions.
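The mask-and-check behavior that ``ccontinue`` hints at can be emulated in
plain C. The sketch below is illustrative only (a hypothetical four-wide
gang, not the compiler's actual output): if every active lane "continued",
the remainder of the loop body is skipped entirely.

```c
#include <stdbool.h>

#define GANG 4  /* hypothetical gang width for illustration */

/* Run one masked "loop body": lanes with odd values "continue"; if every
 * active lane continued, skip the rest of the body entirely (the check
 * that ccontinue asks the compiler to emit). Returns how many lanes ran
 * the remainder of the body. */
int run_body(const bool mask[GANG], const int values[GANG], int out[GANG]) {
    bool active[GANG];
    bool any_active = false;
    for (int lane = 0; lane < GANG; ++lane) {
        active[lane] = mask[lane] && (values[lane] % 2 == 0);
        any_active |= active[lane];
    }
    if (!any_active)
        return 0;  /* all lanes continued: jump to end of loop body */
    int executed = 0;
    for (int lane = 0; lane < GANG; ++lane) {
        if (active[lane]) {  /* remainder of body, under the mask */
            out[lane] = 2 * values[lane];
            ++executed;
        }
    }
    return executed;
}
```

Without the up-front ``any_active`` check, the masked body would be
executed even when no lane needs it, which is exactly the wasted work
``ccontinue`` avoids.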

Use "uniform" Whenever Appropriate
----------------------------------

For any variable that will always have the same value across all of the
program instances in a gang, declare the variable with the ``uniform``
qualifier. Doing so enables the ``ispc`` compiler to emit better code in
many different ways.

As a simple example, consider a ``for`` loop that always does the same
number of iterations:

::

    for (int i = 0; i < 10; ++i)
        // do something ten times

If this is written with ``i`` as a ``varying`` variable, as above, there's
additional overhead in the code generated for the loop, as the compiler
emits instructions to handle the possibility of not all program instances
following the same control flow path (as might be the case if the loop
limit, 10, were itself a ``varying`` value).

If the above loop is instead written with ``i`` ``uniform``, as:

::

    for (uniform int i = 0; i < 10; ++i)
        // do something ten times

then better code can be generated (and the loop possibly unrolled).

In some cases, the compiler may be able to detect simple cases like these,
but it's always best to provide the compiler with as much help as possible
to understand the actual form of your computation.


Tips and Techniques
===================

This section introduces a number of additional techniques that are worth
keeping in mind when writing ``ispc`` programs.

Understanding Gather and Scatter
--------------------------------

Memory reads and writes from the program instances in a gang that access
irregular memory locations (rather than a consecutive set of locations, or
a single location) can be relatively inefficient. As an example, consider
the "simple" array indexing calculation below:

::

    uniform float x[10] = { ... };
    float f = x[i];

Since the index ``i`` is a varying value, the program instances in the gang
will in general be reading different locations in the array ``x``. Because
most current CPUs lack a "gather" instruction, the ``ispc`` compiler has to
serialize these memory reads, performing a separate memory load for each
running program instance, packing the result into ``f``. (The analogous
case happens for a write into ``x[i]``.)
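In C terms, the serialized gather corresponds to a sketch like the
following (a hypothetical four-wide gang; illustrative, not the compiler's
actual output):

```c
#define GANG 4  /* hypothetical gang width for illustration */

/* Serialized gather: one scalar load per program instance, packing the
 * per-lane results into the varying value f. */
void gather(const float x[], const int i[GANG], float f[GANG]) {
    for (int lane = 0; lane < GANG; ++lane)
        f[lane] = x[i[lane]];  /* separate memory load for each lane */
}
```

A scatter is the mirror image: one scalar store per lane, which is why
irregular varying indices cost several instructions where a consecutive
access costs one vector load or store.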
In many cases, gathers like these are unavoidable; the program instances
just need to access incoherent memory locations. However, if the array
index ``i`` actually has the same value for all of the program instances or
if it represents an access to a consecutive set of array locations, much
more efficient load and store instructions can be generated instead of
gathers and scatters, respectively.

In many cases, the ``ispc`` compiler is able to deduce that the memory
locations accessed by a varying index are either all the same or form a
consecutive sequence. For example, given an index expression that uses a
varying value ``y`` but nevertheless evaluates to the same result in every
program instance, the compiler is able to determine that all of the
program instances are loading from the same location, even though ``y`` is
not a ``uniform`` variable. In this case, the compiler will transform this
load to a regular vector load, rather than a general gather.

Sometimes the running program instances will access a linear sequence of
memory locations; this happens most frequently when array indexing is done
based on the built-in ``programIndex`` variable. In many of these cases,
the compiler is also able to detect this and do a vector load. For
example, given:

::

    for (int i = programIndex; i < count; i += programCount)
        // process array[i];

regular vector loads and stores are issued for accesses to ``array[i]``.

Both of these cases are ones where the compiler can determine at compile
time that the index is the same in each program instance or accesses a
linear sequence of locations. Often this determination can't be made at
compile time, even though the values do turn out to be equal at run time.
The ``reduce_equal()`` function from the standard library can be used in
this case; it checks whether the given value is the same across all of the
running program instances, returning true and its ``uniform`` value if so.

The following function shows the use of ``reduce_equal()`` to check for an
equal index at execution time and then either do a scalar load and
broadcast, or a general gather:

::

    uniform float array[..] = { ... };
    float value;
    int i = ...;
    uniform int ui;
    if (reduce_equal(i, &ui) == true)
        value = array[ui];   // scalar load + broadcast
    else
        value = array[i];    // gather

For a simple case like the one above, the overhead of doing the
``reduce_equal()`` check is likely not worthwhile compared to just always
doing a gather. In more complex cases, where a number of accesses are done
based on the index, it can be worth doing. See the example in
``examples/volume_rendering`` in the ``ispc`` distribution for the use of
this technique in an instance where it is beneficial to performance.
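The semantics of ``reduce_equal()`` are easy to model in C; this sketch
(illustrative only, with a hypothetical four-wide gang) returns whether
all lanes hold the same value and, if so, writes out the common value:

```c
#include <stdbool.h>

#define GANG 4  /* hypothetical gang width for illustration */

/* Model of reduce_equal(): true iff every lane has the same value;
 * on success the common value is stored in *result. */
bool reduce_equal_model(const int value[GANG], int *result) {
    for (int lane = 1; lane < GANG; ++lane)
        if (value[lane] != value[0])
            return false;
    *result = value[0];
    return true;
}
```

This is why the check has a cost: it is a cross-lane comparison plus a
branch, which only pays off when it guards several subsequent accesses.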

Avoid 64-bit Addressing Calculations When Possible
--------------------------------------------------

Even when compiling to a 64-bit architecture target, ``ispc`` does many of
the addressing calculations in 32-bit precision by default--this behavior
can be overridden with the ``--addressing=64`` command-line argument. This
option should only be used if it's necessary to be able to address over 4GB
of memory in the ``ispc`` code, as it essentially doubles the cost of
memory addressing calculations in the generated code.

Avoid Computation With 8 and 16-bit Integer Types
-------------------------------------------------

The code generated for 8 and 16-bit integer types is generally not as
efficient as the code generated for 32-bit integer types. It is generally
worthwhile to use 32-bit integer types for intermediate computations, even
if the final result will be stored in a smaller integer type.
Implementing Reductions Efficiently
-----------------------------------

It's often necessary to compute a reduction over a data set--for example,
one might want to add all of the values in an array, compute their minimum,
etc. ``ispc`` provides a few capabilities that make it easy to efficiently
compute reductions like these. However, it's important to use these
capabilities appropriately for best results.

As an example, consider the task of computing the sum of all of the values
in an array. In C code, we might have:

::

    /* C implementation of a sum reduction */
    float sum(const float array[], int count) {
        float sum = 0;
        for (int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

Exactly this computation could also be expressed as a purely uniform
computation in ``ispc``, though without any benefit from vectorization:

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        for (uniform int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

As a first try, one might use the ``reduce_add()`` function from the
``ispc`` standard library; it takes a ``varying`` value and returns the sum
of that value across all of the active program instances.

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        foreach (i = 0 ... count)
            sum += reduce_add(array[i]);
        return sum;
    }

This implementation loads a gang's worth of values from the array, one for
each of the program instances, and then uses ``reduce_add()`` to reduce
across the program instances and update the sum. Unfortunately this
approach loses most of the benefit of vectorization, as it does more work
in the cross-program-instance ``reduce_add()`` call than it saves from the
vector load of values.

The most efficient approach is to do the reduction in two phases: rather
than using a ``uniform`` variable to store the sum, we maintain a varying
value, such that each program instance is effectively computing a local
partial sum over the subset of array values that it has loaded from the
array. When the loop over array elements concludes, a single call to
``reduce_add()`` computes the final reduction across the program
instances' elements of ``sum``. This approach effectively compiles down to
a single vector load and a single vector add for each loop iteration's
worth of values--very efficient code in the end.

::

    /* good ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        float sum = 0;
        foreach (i = 0 ... count)
            sum += array[i];
        return reduce_add(sum);
    }
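The two-phase structure is easy to see in a C analogue (a sketch with a
hypothetical four-wide gang; the inner per-lane loop is what the vector
hardware does in a single instruction):

```c
#define GANG 4  /* hypothetical gang width for illustration */

/* Two-phase sum reduction: per-lane partial sums inside the loop, then
 * one final cross-lane reduction, mirroring varying sum + reduce_add(). */
float sum_two_phase(const float array[], int count) {
    float partial[GANG] = { 0 };
    int i = 0;
    for (; i + GANG <= count; i += GANG)
        for (int lane = 0; lane < GANG; ++lane)
            partial[lane] += array[i + lane];  /* one vector load + add */
    float total = 0;
    for (int lane = 0; lane < GANG; ++lane)    /* the final reduce_add() */
        total += partial[lane];
    for (; i < count; ++i)                     /* leftover elements */
        total += array[i];
    return total;
}
```

Note that the expensive cross-lane reduction happens exactly once, after
the loop, rather than once per iteration.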

Using Low-level Vector Tricks
-----------------------------

Many low-level Intel® SSE and AVX coding constructs can be implemented in
``ispc`` code. The ``ispc`` standard library functions ``intbits()`` and
``floatbits()`` are often useful in this context. Recall that
``intbits()`` takes a ``float`` value and returns it as an integer where
the bits of the integer are the same as the bit representation in memory of
the ``float``. (In other words, it does *not* perform a floating-point to
integer value conversion.) ``floatbits()``, then, performs the inverse
computation.

As an example of the use of these functions, the following code efficiently
reverses the sign of the given values.

::

    float flipsign(float a) {
        unsigned int i = intbits(a);
        i ^= 0x80000000;
        return floatbits(i);
    }
This code compiles down to a single XOR instruction.

The "Fast math" Option
----------------------

``ispc`` has a ``--opt=fast-math`` command-line flag that enables a number
of optimizations that may be undesirable in code where numerical precision
is critically important. For many graphics applications, for example, the
approximations introduced may be acceptable. The following two
optimizations are performed when ``--opt=fast-math`` is used. By default,
the ``--opt=fast-math`` flag is off.

* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
  transformed to ``x * (1./y)``, where the inverse value of ``y`` is
  computed at compile time.

* Expressions like ``x / y``, where ``y`` is not a compile-time constant,
  are transformed to ``x * rcp(y)``, where ``rcp()`` maps to the
  approximate reciprocal instruction from the ``ispc`` standard library.
|
"inline" Aggressively
|
||||||
---------------------
|
---------------------
|
||||||
|
|
||||||
Inlining functions aggressively is generally beneficial for performance
|
Inlining functions aggressively is generally beneficial for performance
|
||||||
with ``ispc``. Definitely use the ``inline`` qualifier for any short
|
with ``ispc``. Definitely use the ``inline`` qualifier for any short
|
||||||
functions (a few lines long), and experiment with it for longer functions.
|
functions (a few lines long), and experiment with it for longer functions.
|
||||||
|
|
||||||

Avoid The System Math Library
-----------------------------

The default math library that ``ispc`` uses for transcendentals and the
like has higher error than the system's math library, though it is much
more efficient, due to being vectorized across the program instances and
due to the fact that the functions can be inlined in the final code. (It
generally has errors in the range of 10 ulps, while the system math library
generally has no more than 1 ulp of error for transcendentals.)

If the ``--math-lib=system`` command-line option is used when compiling an
``ispc`` program, then calls to the system math library will be generated
instead. This option should only be used if the higher precision is
absolutely required, as the performance impact of using it can be
significant.
Declare Variables In The Scope Where They're Used
-------------------------------------------------

Performance is slightly improved by declaring variables at the same block
scope where they are first used. For example, in code like the
Doing so can reduce the number of masked store instructions that the
compiler needs to generate.

Instrumenting ISPC Programs To Understand Runtime Behavior
----------------------------------------------------------

``ispc`` has an optional instrumentation feature that can help you
understand performance issues. If a program is compiled using the
``--instrument`` command-line flag, the compiler emits calls to a special
``ISPCInstrument()`` function at a number of interesting points in program
execution (for example, at points where gathers happen.)

This function is passed the file name of the ``ispc`` file running, a short
note indicating what is happening, the line number in the source file, and
the current mask of active program instances in the gang. You must provide
an implementation of this function and link it in with your application.
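A minimal implementation might look like the C sketch below. The parameter
list follows the description above (file name, note, line number, lane
mask); treat the exact signature as an assumption and check it against the
declaration that ``ispc`` documents.

```c
#include <stdint.h>
#include <stdio.h>

/* Count the set bits in the lane mask, i.e. the number of active
 * program instances. */
static int active_lanes(uint64_t mask) {
    int n = 0;
    for (; mask != 0; mask >>= 1)
        n += (int)(mask & 1);
    return n;
}

/* Minimal ISPCInstrument() implementation: print one line per event with
 * the number of active lanes. (Signature assumed from the description in
 * the text.) */
void ISPCInstrument(const char *fn, const char *note, int line,
                    uint64_t mask) {
    printf("%s(%d) %s: %d lane(s) active\n", fn, line, note,
           active_lanes(mask));
}
```

Aggregating these counts per call site (rather than printing each event)
keeps the overhead manageable on longer runs.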

For example, when the ``ispc`` program runs, this function might be called
as follows:

::

    ISPCInstrument("foo.ispc", "function entry", 55, 0xfull);

This call indicates that the currently executing program has just
entered the function defined at line 55 of the file ``foo.ispc``, with a
mask of all lanes currently executing (assuming a target machine with a
four-wide gang size).
For a fuller example of the utility of this functionality, see
``examples/aobench_instrumented`` in the ``ispc`` distribution. This
example includes an implementation of the ``ISPCInstrument()`` function
that collects aggregate data about the program's execution behavior.

When running this example, you will want to direct the ``ao`` executable
to generate a low resolution image, because the instrumentation adds
It is selected with the ``--target=sse2-x2``, ``--target=sse4-x2`` and
``--target=avx-x2`` options, respectively.

Disclaimer and Legal Information
================================