=====================================
Frequently Asked Questions About ispc
=====================================

This document includes a number of frequently (and not frequently) asked
questions about ispc, the Intel® SPMD Program Compiler. The source to this
document is in the file ``docs/faq.rst`` in the ``ispc`` source
distribution.
|
|
|
|
* Understanding ispc's Output

  + `How can I see the assembly language generated by ispc?`_
  + `How can I have the assembly output be printed using Intel assembly syntax?`_
  + `Why are there multiple versions of exported ispc functions in the assembly output?`_
  + `How can I more easily see gathers and scatters in generated assembly?`_

* Running The Compiler

  + `Why is it required to use one of the "generic" targets with C++ output?`_
  + `Why won't the compiler generate an object file or assembly output with the "generic" targets?`_

* Language Details

  + `What is the difference between "int *foo" and "int foo[]"?`_
  + `Why are pointed-to types "uniform" by default?`_
  + `What am I getting an error about assigning a varying lvalue to a reference type?`_

* Interoperability

  + `How can I supply an initial execution mask in the call from the application?`_
  + `How can I generate a single binary executable with support for multiple instruction sets?`_
  + `How can I determine at run-time which vector instruction set's instructions were selected to execute?`_
  + `Is it possible to inline ispc functions in C/C++ code?`_
  + `Why is it illegal to pass "varying" values from C/C++ to ispc functions?`_

* Programming Techniques

  + `What primitives are there for communicating between SPMD program instances?`_
  + `How can a gang of program instances generate variable amounts of output efficiently?`_
  + `Is it possible to use ispc for explicit vector programming?`_
  + `How can I debug my ispc programs using Valgrind?`_
  + `foreach statements generate more complex assembly than I'd expect; what's going on?`_
  + `How do I launch an individual task for each active program instance?`_
|
|
|
|
Understanding ispc's Output
|
|
===========================
|
|
|
|
How can I see the assembly language generated by ispc?
|
|
------------------------------------------------------
|
|
|
|
The ``--emit-asm`` flag causes assembly output to be generated. If the
|
|
``-o`` command-line flag is also supplied, the assembly is stored in the
|
|
given file, or printed to standard output if ``-`` is specified for the
|
|
filename. For example, given the simple ``ispc`` program:
|
|
|
|
::
|
|
|
|
export uniform int foo(uniform int a, uniform int b) {
|
|
return a+b;
|
|
}
|
|
|
|
If the SSE4 target is used, then the following assembly is printed:
|
|
|
|
::
|
|
|
|
_foo:
|
|
addl %esi, %edi
|
|
movl %edi, %eax
|
|
ret
|
|
|
|
|
|
How can I have the assembly output be printed using Intel assembly syntax?
|
|
--------------------------------------------------------------------------
|
|
|
|
The ``ispc`` compiler is currently only able to emit assembly with AT&T
|
|
syntax, where the destination operand is the last operand after an
|
|
instruction. If you'd prefer Intel assembly output, one option is to use
|
|
Agner Fog's ``objconv`` tool: have ``ispc`` emit a native object file and
|
|
then use ``objconv`` to disassemble it, specifying the assembler syntax
|
|
that you prefer. ``objconv`` `is available for download here`_.
|
|
|
|
.. _is available for download here: http://www.agner.org/optimize/#objconv
|
|
|
|
Why are there multiple versions of exported ispc functions in the assembly output?
|
|
----------------------------------------------------------------------------------
|
|
|
|
Two versions of every function qualified with ``export`` are generated:
one is called by other ``ispc`` functions, and the other is called by the
application. The application-callable function has the original
function's name, while the ``ispc``-callable function has a mangled name
that encodes the types of the function's parameters.
|
|
|
|
The crucial difference between these two functions is that the
|
|
application-callable function doesn't take a parameter encoding the current
|
|
execution mask, while ``ispc``-callable functions have a hidden mask
|
|
parameter. An implication of this difference is that the ``export``
|
|
function starts with the execution mask "all on". This allows a number of
|
|
improvements in the generated code, particularly on architectures that
|
|
don't have support for masked load and store instructions.
|
|
|
|
As an example, consider this short function, which loads a vector's worth
of values from each of two arrays in memory, adds them, and writes the
result to an output array.
|
|
|
|
::
|
|
|
|
export void foo(uniform float a[], uniform float b[],
|
|
uniform float result[]) {
|
|
float aa = a[programIndex], bb = b[programIndex];
|
|
result[programIndex] = aa+bb;
|
|
}
|
|
|
|
Here is the assembly code for the application-callable instance of the
|
|
function.
|
|
|
|
::
|
|
|
|
_foo:
|
|
movups (%rsi), %xmm1
|
|
movups (%rdi), %xmm0
|
|
addps %xmm1, %xmm0
|
|
movups %xmm0, (%rdx)
|
|
ret
|
|
|
|
|
|
And here is the assembly code for the ``ispc``-callable instance of the
|
|
function.
|
|
|
|
::
|
|
|
|
"_foo___uptr<Uf>uptr<Uf>uptr<Uf>":
|
|
movmskps %xmm0, %eax
|
|
cmpl $15, %eax
|
|
je LBB0_3
|
|
testl %eax, %eax
|
|
jne LBB0_4
|
|
ret
|
|
LBB0_3:
|
|
movups (%rsi), %xmm1
|
|
movups (%rdi), %xmm0
|
|
addps %xmm1, %xmm0
|
|
movups %xmm0, (%rdx)
|
|
ret
|
|
LBB0_4:
|
|
####
|
|
#### Code elided; handle mixed mask case..
|
|
####
|
|
ret
|
|
|
|
There are a few things to notice in this code. First, the current program
|
|
mask is coming in via the ``%xmm0`` register and the initial few
|
|
instructions in the function essentially check to see if the mask is all on
|
|
or all off. If the mask is all on, the code at the label LBB0_3 executes;
|
|
it's the same as the code that was generated for ``_foo`` above. If the
|
|
mask is all off, then there's nothing to be done, and the function can
|
|
return immediately.
|
|
|
|
In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on. (This code is elided in the
listing above.) This general pattern of having two code paths for the
"all on" and "mixed" mask cases is used in the code generated for all but
the simplest functions (where the overhead of the test isn't worthwhile).
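The three-way dispatch described above can be modeled in scalar C++. This is only a sketch under stated assumptions: it assumes a 4-wide gang whose execution mask has been reduced to a 4-bit integer (as ``movmskps`` does in the listing), and ``fooMasked`` is a hypothetical name rather than anything ``ispc`` emits.

```cpp
#include <cstdint>

constexpr int kGangSize = 4;           // assumed: 4-wide gang, as on SSE4
constexpr std::uint32_t kAllOn = 0xF;  // one mask bit per program instance

// Hypothetical scalar model of the ispc-callable version of foo().
void fooMasked(const float* a, const float* b, float* result,
               std::uint32_t mask) {
    if (mask == kAllOn) {
        // "All on": unconditional loads, add, and store (label LBB0_3).
        for (int lane = 0; lane < kGangSize; ++lane)
            result[lane] = a[lane] + b[lane];
        return;
    }
    if (mask == 0)
        return;  // "All off": nothing to do.

    // Mixed mask: only touch elements whose program instance is active.
    for (int lane = 0; lane < kGangSize; ++lane)
        if (mask & (1u << lane))
            result[lane] = a[lane] + b[lane];
}
```

The application-callable version corresponds to calling this model with the mask fixed at ``kAllOn``, which is why only the first branch survives in ``_foo``.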
|
|
|
|
How can I more easily see gathers and scatters in generated assembly?
|
|
---------------------------------------------------------------------
|
|
|
|
On CPU vector ISAs without native gather and scatter instructions (SSE4,
for example), these memory operations are turned into sequences of
instructions in the code that ``ispc`` generates. In some cases, it can be
|
|
useful to see where gathers and scatters actually happen in code; there is
|
|
an otherwise undocumented command-line flag that provides this information.
|
|
|
|
Consider this simple program:
|
|
|
|
::
|
|
|
|
void set(uniform int a[], int value, int index) {
|
|
a[index] = value;
|
|
}
|
|
|
|
When compiled normally to the SSE4 target, this program generates this
|
|
extensive code sequence, which makes it more difficult to see what the
|
|
program is actually doing.
|
|
|
|
::
|
|
|
|
"_set___uptr<Ui>ii":
|
|
pmulld LCPI0_0(%rip), %xmm1
|
|
movmskps %xmm2, %eax
|
|
testb $1, %al
|
|
je LBB0_2
|
|
movd %xmm1, %ecx
|
|
movd %xmm0, (%rcx,%rdi)
|
|
LBB0_2:
|
|
testb $2, %al
|
|
je LBB0_4
|
|
pextrd $1, %xmm1, %ecx
|
|
pextrd $1, %xmm0, (%rcx,%rdi)
|
|
LBB0_4:
|
|
testb $4, %al
|
|
je LBB0_6
|
|
pextrd $2, %xmm1, %ecx
|
|
pextrd $2, %xmm0, (%rcx,%rdi)
|
|
LBB0_6:
|
|
testb $8, %al
|
|
je LBB0_8
|
|
pextrd $3, %xmm1, %eax
|
|
pextrd $3, %xmm0, (%rax,%rdi)
|
|
LBB0_8:
|
|
ret
|
|
|
|
If this program is compiled with the
|
|
``--opt=disable-handle-pseudo-memory-ops`` command-line flag, then the
|
|
scatter is left as an unresolved function call. The resulting program
won't link because of the unresolved symbol, but the assembly output is much
|
|
easier to understand:
|
|
|
|
::
|
|
|
|
"_set___uptr<Ui>ii":
|
|
movaps %xmm0, %xmm3
|
|
pmulld LCPI0_0(%rip), %xmm1
|
|
movdqa %xmm1, %xmm0
|
|
movaps %xmm3, %xmm1
|
|
jmp ___pseudo_scatter_base_offsets32_32 ## TAILCALL
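For reference, the per-lane test-and-store sequence in the first listing can be read as the following scalar loop. This is a sketch rather than generated code: it assumes a 4-wide gang and a 4-bit integer mask, and ``scatterMasked`` is a hypothetical name.

```cpp
#include <array>
#include <cstdint>

constexpr int kGangSize = 4;  // assumed: 4-wide gang, as on SSE4

// Scalar model of the masked scatter "a[index] = value".
void scatterMasked(int* a,
                   const std::array<int, kGangSize>& value,
                   const std::array<int, kGangSize>& index,
                   std::uint32_t mask) {
    for (int lane = 0; lane < kGangSize; ++lane) {
        // Mirrors the "testb $1 / $2 / $4 / $8" checks in the assembly:
        // inactive program instances must not write to memory.
        if (mask & (1u << lane))
            a[index[lane]] = value[lane];
    }
}
```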
|
|
|
|
|
|
Running The Compiler
|
|
====================
|
|
|
|
Why is it required to use one of the "generic" targets with C++ output?
|
|
-----------------------------------------------------------------------
|
|
|
|
The C++ output option transforms the provided ``ispc`` program source into
|
|
C++ code where each basic operation in the program (addition, comparison,
|
|
etc.) is represented as a function call to an as-yet-undefined function,
|
|
chaining the results of these calls together to perform the required
|
|
computations. It is then expected that the user will provide the
|
|
implementation of these functions via a header file with ``inline``
|
|
functions defined for each of these functions and then use a C++ compiler
|
|
to generate a final object file. (Examples of these headers include
|
|
``examples/intrinsics/sse4.h`` and ``examples/intrinsics/knc.h`` in the
|
|
``ispc`` distribution.)
|
|
|
|
If a target other than one of the "generic" ones is used with C++ output,
|
|
then the compiler will transform certain operations into particular code
|
|
sequences that may not be desired for the actual final target; for example,
|
|
SSE targets that don't have hardware "gather" instructions will transform a
|
|
gather into a sequence of scalar load instructions. When this in turn is
|
|
transformed to C++ code, the fact that the loads were originally a gather
|
|
is lost, and the header file of function definitions wouldn't have a chance
|
|
to map the "gather" to a target-specific operation, as the ``knc.h`` header
|
|
does, for example. Thus, the "generic" targets exist to provide basic
|
|
targets of various vector widths, without imposing any limitations on the
|
|
final target's capabilities.
|
|
|
|
Why won't the compiler generate an object file or assembly output with the "generic" targets?
|
|
---------------------------------------------------------------------------------------------
|
|
|
|
As described in the above FAQ entry, when compiling to the "generic"
|
|
targets, ``ispc`` generates vector code for the source program that
|
|
transforms every basic operation in the program (addition, comparison,
|
|
etc.) into a separate function call.
|
|
|
|
While there is no fundamental reason that the compiler couldn't generate
|
|
target-specific object code with a function call to an undefined function
|
|
for each primitive operation, doing so wouldn't actually be useful in
|
|
practice--providing definitions of these functions in a separate object
|
|
file and actually performing function calls for each of them (versus
|
|
turning them into inline function calls) would be a highly inefficient way
|
|
to run the program.
|
|
|
|
Therefore, in the interest of encouraging efficient use of the system,
these types of output are disallowed.
|
|
|
|
|
|
Language Details
|
|
================
|
|
|
|
What is the difference between "int \*foo" and "int foo[]"?
|
|
-----------------------------------------------------------
|
|
|
|
In C and C++, declaring a function parameter as ``int *foo`` or as
``int foo[]`` results in the same type for the parameter. Both are
|
|
pointers to integers. In ``ispc``, these are different types. The first
|
|
one is a varying pointer to a uniform integer value in memory, while the
|
|
second results in a uniform pointer to the start of an array of varying
|
|
integer values in memory.
|
|
|
|
To understand why the first is a varying pointer to a uniform integer,
|
|
first recall that types without explicit rate qualifiers (``uniform``,
|
|
``varying``, or ``soa<>``) are ``varying`` by default. Second, recall from
|
|
the `discussion of pointer types in the ispc User's Guide`_ that pointed-to
|
|
types without rate qualifiers are ``uniform`` by default. (This second
|
|
rule is discussed further below, in `Why are pointed-to types "uniform" by
|
|
default?`_.) The type of ``int *foo`` follows from these.
|
|
|
|
.. _discussion of pointer types in the ispc User's Guide: ispc.html#pointer-types
|
|
|
|
Conversely, in a function body, ``int foo[10]`` represents a declaration of
|
|
a 10-element array of varying ``int`` values. Since we'd certainly like
to be able to pass such an array to a function that takes an ``int []``
parameter, the natural type for an ``int []`` parameter is a uniform
pointer to varying integer values.
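One way to picture the two parameter types is to model a varying value in C++ as a small array with one element per program instance. The aliases below are purely illustrative: they assume a gang size of 4, the names are hypothetical, and they say nothing about the memory layout ``ispc`` actually uses.

```cpp
#include <array>
#include <type_traits>

constexpr int kGangSize = 4;  // assumed gang size, for illustration only

// Conceptual model: a varying T holds one T per program instance.
template <typename T>
using Varying = std::array<T, kGangSize>;

using VaryingInt    = Varying<int>;    // ispc "int x"
using IntStarParam  = Varying<int*>;   // ispc "int *foo": varying pointer to uniform int
using IntArrayParam = Varying<int>*;   // ispc "int foo[]": uniform pointer to varying int

static_assert(!std::is_same<IntStarParam, IntArrayParam>::value,
              "the two parameter forms are different types in this model");

// Dereferencing a varying pointer: each instance follows its own pointer.
VaryingInt loadThrough(const IntStarParam& p) {
    VaryingInt result{};
    for (int lane = 0; lane < kGangSize; ++lane)
        result[lane] = *p[lane];
    return result;
}

// Indexing an "int foo[]" parameter with a uniform index: one pointer,
// and each element is a whole gang's worth of values.
VaryingInt element(IntArrayParam foo, int i) {
    return foo[i];
}
```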
|
|
|
|
In terms of compatibility with C/C++, it's unfortunate that this
|
|
distinction exists, though any other set of rules seems to introduce more
|
|
awkwardness than this one. (That said, we're interested in hearing ideas
for improving these rules.)
|
|
|
|
Why are pointed-to types "uniform" by default?
|
|
----------------------------------------------
|
|
|
|
In ``ispc``, types without rate qualifiers are "varying" by default, but
|
|
types pointed to by pointers without rate qualifiers are "uniform" by
|
|
default. Why this difference?
|
|
|
|
::
|
|
|
|
int foo; // no rate qualifier, "varying int".
|
|
uniform int *foo; // pointer type has no rate qualifier, pointed-to does.
|
|
// "varying pointer to uniform int".
|
|
int *foo; // neither pointer type nor pointed-to type ("int") have
|
|
// rate qualifiers. Pointer type is varying by default,
|
|
// pointed-to is uniform. "varying pointer to uniform int".
|
|
varying int *foo; // varying pointer to varying int
|
|
|
|
The first rule, having types without rate qualifiers be varying by default,
|
|
is a default that keeps the number of "uniform" or "varying" qualifiers in
|
|
``ispc`` programs low. Most ``ispc`` programs use mostly "varying"
|
|
variables, so this rule allows most variables to be declared without also
|
|
requiring rate qualifiers.
|
|
|
|
On a related note, this rule allows many C/C++ functions to be used to
|
|
define equivalent functions in the SPMD execution model that ``ispc``
|
|
provides with little or no modification:
|
|
|
|
::
|
|
|
|
// scalar add in C/C++, SPMD/vector add in ispc
|
|
int add(int a, int b) { return a + b; }
|
|
|
|
This motivation also explains why ``uniform int *foo`` represents a varying
|
|
pointer; having pointers be varying by default if they don't have rate
|
|
qualifiers similarly helps with porting code from C/C++ to ``ispc``.
|
|
|
|
The trickier issue is why pointed-to types are "uniform" by default. In our
|
|
experience, data in memory that is accessed via pointers is most often
|
|
uniform; this generally includes all data that has been allocated and
|
|
initialized by the C/C++ application code. In practice, "varying" types are
|
|
more generally (but not exclusively) used for local data in ``ispc``
|
|
functions. Thus, making the pointed-to type uniform by default leads to
|
|
more concise code for the most common cases.
|
|
|
|
|
|
What am I getting an error about assigning a varying lvalue to a reference type?
|
|
--------------------------------------------------------------------------------
|
|
|
|
Given code like the following:
|
|
|
|
::
|
|
|
|
uniform float a[...];
|
|
int index = ...;
|
|
float &r = a[index];
|
|
|
|
``ispc`` issues the error "Initializer for reference-type variable "r" must
|
|
have a uniform lvalue type.". The underlying issue stems from how
|
|
references are represented in the code generated by ``ispc``. Recall that
|
|
``ispc`` supports both uniform and varying pointer types--a uniform pointer
|
|
points to the same location in memory for all program instances in the
|
|
gang, while a varying pointer allows each program instance to have its own
|
|
pointer value.
|
|
|
|
References are represented as pointers in the code generated by ``ispc``,
|
|
though this is generally opaque to the user; in ``ispc``, they are
|
|
specifically uniform pointers. This design decision was made so that given
|
|
code like this:
|
|
|
|
::
|
|
|
|
extern void func(float &val);
|
|
float foo = ...;
|
|
func(foo);
|
|
|
|
the reference is handled efficiently as a single pointer, rather than
unnecessarily being turned into one pointer per program instance in the
gang.
|
|
|
|
However, an implication of this decision is that it's not possible for
|
|
references to refer to completely different things for each of the program
|
|
instances. (And hence the error that is issued). In cases where a unique
|
|
per-program-instance pointer is needed, a varying pointer should be used
|
|
instead of a reference.
|
|
|
|
|
|
Interoperability
|
|
================
|
|
|
|
How can I supply an initial execution mask in the call from the application?
|
|
----------------------------------------------------------------------------
|
|
|
|
Recall that when execution transitions from the application code to an
|
|
``ispc`` function, all of the program instances are initially executing.
|
|
In some cases, it may be desired that only some of them are running, based on
|
|
a data-dependent condition computed in the application program. This
|
|
situation can easily be handled via an additional parameter from the
|
|
application.
|
|
|
|
As a simple example, consider a case where the application code has an
array of ``float`` values and we'd like the ``ispc`` code to update only
some of the values in that array, with the application determining which
ones. In C++ code, we might have:
|
|
|
|
::
|
|
|
|
int count = ...;
|
|
float *array = new float[count];
|
|
bool *shouldUpdate = new bool[count];
|
|
// initialize array and shouldUpdate
|
|
ispc_func(array, shouldUpdate, count);
|
|
|
|
Then, the ``ispc`` code could process this update as:
|
|
|
|
::
|
|
|
|
    export void ispc_func(uniform float array[], uniform bool update[],
                          uniform int count) {
        foreach (i = 0 ... count) {
            // i is already a varying index, so no programIndex offset is needed
            cif (update[i] == true) {
                // update array[i]...
            }
        }
    }
|
|
|
|
(In this case a "coherent" if statement is likely to be worthwhile if the
|
|
``update`` array will tend to have sections that are either all-true or
|
|
all-false.)
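Whether the coherent ``cif`` is likely to pay off can be estimated on the application side by counting how many gang-sized runs of the flags are uniform. The helper below is a hypothetical diagnostic, not part of ``ispc``; the gang size is passed in explicitly because the application does not otherwise know it.

```cpp
// Counts the gang-sized chunks of update[] that are all-true or all-false.
// A trailing partial chunk is judged on the elements it actually contains.
int countCoherentChunks(const bool* update, int count, int gangSize) {
    int coherent = 0;
    for (int start = 0; start < count; start += gangSize) {
        int end = (start + gangSize < count) ? start + gangSize : count;
        int numTrue = 0;
        for (int i = start; i < end; ++i)
            numTrue += update[i] ? 1 : 0;
        if (numTrue == 0 || numTrue == end - start)
            ++coherent;
    }
    return coherent;
}
```

If most chunks are coherent, the all-on and all-off fast paths that ``cif`` adds will be taken often; if few are, a regular ``if`` is probably the better choice.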
|
|
|
|
How can I generate a single binary executable with support for multiple instruction sets?
|
|
-----------------------------------------------------------------------------------------
|
|
|
|
If multiple targets are specified with the ``--target`` command-line
argument, ``ispc`` generates output that supports each of those
instruction sets, along with dispatch code that chooses the most
appropriate one at run time.
|
|
|
|
For example, if you run the command:
|
|
|
|
::
|
|
|
|
ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2
|
|
|
|
Then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
|
|
``foo_avx.o``, and ``foo.o``.[#]_ Link all of these into your executable, and
|
|
when you call a function in ``foo.ispc`` from your application code,
|
|
``ispc`` will determine which instruction sets are supported by the CPU the
|
|
code is running on and will call the most appropriate version of the
|
|
function available.
|
|
|
|
.. [#] Similarly, if you choose to generate assembly language output or
|
|
LLVM bitcode output, multiple versions of those files will be created.
|
|
|
|
In general, the version of the function that runs will be the one compiled
for the most capable instruction set that the system supports. If you only
compile SSE2 and SSE4 variants and run on a system that supports AVX, for
example, then the SSE4 variant will be executed. If the system
is not able to run any of the available variants of the function (for
|
|
example, trying to run a function that only has SSE4 and AVX variants on a
|
|
system that only supports SSE2), then the standard library ``abort()``
|
|
function will be called.
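The selection rule described above can be sketched as follows. This is a simplified model, not the dispatch code ``ispc`` generates: it assumes the instruction sets form a simple linear order (SSE2, then SSE4, then AVX), and ``Isa`` and ``selectVariant`` are hypothetical names.

```cpp
#include <vector>

// Assumed ordering, from least to most capable.
enum class Isa { SSE2 = 0, SSE4 = 1, AVX = 2 };

// Picks the most capable compiled variant that the CPU can run.
// Returns false if none is runnable; the generated dispatch code
// would call abort() in that situation.
bool selectVariant(const std::vector<Isa>& compiled, Isa cpuSupports,
                   Isa& chosen) {
    bool found = false;
    for (Isa candidate : compiled) {
        if (candidate > cpuSupports)
            continue;  // the CPU cannot execute this variant
        if (!found || candidate > chosen) {
            chosen = candidate;
            found = true;
        }
    }
    return found;
}
```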
|
|
|
|
One subtlety is that all non-static global variables (if any) must have the
|
|
same size and layout with all of the targets used. For example, if you
|
|
have the global variables:
|
|
|
|
::
|
|
|
|
uniform int foo[2*programCount];
|
|
int bar;
|
|
|
|
and compile to both SSE2 and AVX targets, each of these variables will have
a different size on the two targets (the first because ``programCount`` has
the value 4 for SSE2 and 8 for AVX, and the second because ``varying`` types
have a different number of elements on the two targets, which is
essentially the same issue as the first). ``ispc`` issues an error in this case.
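The size mismatch can be made concrete with a little arithmetic. The C++ below is only a model of the two declarations: ``Globals`` is a hypothetical name, and the gang sizes 4 and 8 are the SSE2 and AVX values quoted above.

```cpp
#include <cstddef>

// Storage needed for the two globals at a given gang size.
template <int ProgramCount>
struct Globals {
    // ispc: uniform int foo[2*programCount];
    static constexpr std::size_t fooBytes = sizeof(int) * 2 * ProgramCount;
    // ispc: int bar;   (varying, so one int per program instance)
    static constexpr std::size_t barBytes = sizeof(int) * ProgramCount;
};

// SSE2 (programCount == 4) versus AVX (programCount == 8).
static_assert(Globals<4>::fooBytes != Globals<8>::fooBytes,
              "foo would have a different size on each target");
static_assert(Globals<4>::barBytes != Globals<8>::barBytes,
              "bar would have a different size on each target");
```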
|
|
|
|
|
|
How can I determine at run-time which vector instruction set's instructions were selected to execute?
|
|
-----------------------------------------------------------------------------------------------------
|
|
|
|
``ispc`` doesn't provide any API that allows querying which vector ISA's
|
|
instructions are running when multi-target compilation was used. However,
|
|
this can be solved in "user space" by writing a small helper function.
|
|
Specifically, if you implement a function like this
|
|
|
|
::
|
|
|
|
export uniform int isa() {
|
|
#if defined(ISPC_TARGET_SSE2)
|
|
return 0;
|
|
#elif defined(ISPC_TARGET_SSE4)
|
|
return 1;
|
|
#elif defined(ISPC_TARGET_AVX)
|
|
return 2;
|
|
#else
|
|
return -1;
|
|
#endif
|
|
}
|
|
|
|
and then call it from your application code at run time, it will return 0,
|
|
1, or 2, depending on which target's instructions are running.
|
|
|
|
The way this works is a little surprising, but it's a useful trick. Of
|
|
course the preprocessor ``#if`` checks are all compile-time only
|
|
operations. What's actually happening is that the function is compiled
|
|
multiple times, once for each target, with the appropriate ``ISPC_TARGET``
|
|
preprocessor symbol set. Then, a small dispatch function is generated for
|
|
the application to actually call. This dispatch function in turn calls the
|
|
appropriate version of the function based on the CPU of the system it's
|
|
executing on, which in turn returns the appropriate value.
|
|
|
|
In a similar fashion, it's possible to find out at run-time the value of
|
|
``programCount`` for the target that's actually being used.
|
|
|
|
::
|
|
|
|
export uniform int width() { return programCount; }
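On the application side, the integer codes returned by the ``isa()`` helper above can be turned into readable names for logging. The mapping below simply mirrors the return values chosen in that listing; ``isaName`` is a hypothetical helper.

```cpp
#include <string>

// Translates the code returned by the exported isa() helper into a name.
// The codes 0, 1, 2 and -1 match the #if chain in the ispc listing above.
std::string isaName(int code) {
    switch (code) {
    case 0:  return "SSE2";
    case 1:  return "SSE4";
    case 2:  return "AVX";
    default: return "unknown";
    }
}
```

An application would typically call this once at startup with the result of ``isa()`` and report the name alongside the value returned by ``width()``.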
|
|
|
|
|
|
Is it possible to inline ispc functions in C/C++ code?
|
|
------------------------------------------------------
|
|
|
|
If you're willing to use the ``clang`` C/C++ compiler that's part of the
|
|
LLVM tool suite, then it is possible to inline ``ispc`` code with C/C++
|
|
(and conversely, to inline C/C++ calls in ``ispc``). Doing so can provide
|
|
performance advantages when calling out to short functions written in the
|
|
"other" language. Note that you don't need to use ``clang`` to compile all
|
|
of your C/C++ code, but only for the files where you want to be able to
|
|
inline. In order to do this, you must have a full installation of LLVM
|
|
version 3.0 or later, including the ``clang`` compiler.
|
|
|
|
The basic approach is to have the various compilers emit LLVM intermediate
|
|
representation (IR) code and to then use tools from LLVM to link together
|
|
the IR from the compilers and then re-optimize it, which gives the LLVM
|
|
optimizer the opportunity to do additional inlining and cross-function
|
|
optimizations. If you have source files ``foo.ispc`` and ``foo.cpp``,
|
|
first emit LLVM IR:
|
|
|
|
::
|
|
|
|
ispc --emit-llvm -o foo_ispc.bc foo.ispc
|
|
clang -O2 -c -emit-llvm -o foo_cpp.bc foo.cpp
|
|
|
|
Next, link the two IR files into a single file and run the LLVM optimizer
|
|
on the result:
|
|
|
|
::
|
|
|
|
llvm-link foo_ispc.bc foo_cpp.bc -o - | opt -O3 -o foo_opt.bc
|
|
|
|
And finally, generate a native object file:
|
|
|
|
::
|
|
|
|
llc -filetype=obj foo_opt.bc -o foo.o
|
|
|
|
This file can in turn be linked in with the rest of your object files when
|
|
linking your application.
|
|
|
|
(Note that if you're using the AVX instruction set, you must provide the
|
|
``-mattr=+avx`` flag to ``llc``.)
|
|
|
|
|
|
Why is it illegal to pass "varying" values from C/C++ to ispc functions?
|
|
------------------------------------------------------------------------
|
|
|
|
If any of the types in the parameter list to an exported function is
|
|
"varying" (including recursively, and members of structure types, etc.),
|
|
then ``ispc`` will issue an error and refuse to compile the function:
|
|
|
|
::
|
|
|
|
% echo "export int add(int x) { return ++x; }" | ispc
|
|
<stdin>:1:12: Error: Illegal to return a "varying" type from exported function "add"
|
|
<stdin>:1:20: Error: Varying parameter "x" is illegal in an exported function.
|
|
|
|
While there's no fundamental reason why this isn't possible, recall the
|
|
definition of "varying" variables: they have one value for each program
|
|
instance in the gang. As such, the number of values and amount of storage
|
|
required to represent a varying variable depends on the gang size
|
|
(i.e. ``programCount``), which can have different values depending on the
|
|
compilation target.
|
|
|
|
``ispc`` therefore prohibits passing "varying" values between the
application and the ``ispc`` program. This prevents the application-side
code from depending on a particular gang size and so encourages
portability across gang sizes, which is a generally desirable programming
practice.
|
|
|
|
For cases where the size of data is actually fixed from the application
|
|
side, the value can be passed via a pointer to a short ``uniform`` array,
|
|
as follows:
|
|
|
|
::
|
|
|
|
export void add4(uniform int ptr[4]) {
|
|
foreach (i = 0 ... 4)
|
|
ptr[i]++;
|
|
}
|
|
|
|
On the 4-wide SSE instruction set, this compiles to a single vector add
|
|
instruction (and associated move instructions), while it still also
|
|
efficiently computes the correct result on 8-wide AVX targets.
|
|
|
|
|
|
Programming Techniques
|
|
======================
|
|
|
|
What primitives are there for communicating between SPMD program instances?
|
|
---------------------------------------------------------------------------
|
|
|
|
The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library
|
|
routines provide a variety of mechanisms for the running program instances
|
|
to communicate values to each other during execution. Note that there's no
|
|
need to synchronize the program instances before communicating between
|
|
them, due to the synchronized execution model of gangs of program instances
|
|
in ``ispc``.
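As a rough mental model, the three routines can be emulated in scalar C++ by treating a varying value as an array with one element per program instance. This sketch assumes a 4-wide gang with every instance active, in-range indices, and that ``rotate()`` gives instance ``i`` the value held by instance ``i + offset`` (wrapping around); check the User's Guide for the authoritative definitions. The ``model`` prefix marks these as hypothetical stand-ins, not the standard library functions themselves.

```cpp
#include <array>

constexpr int kGangSize = 4;  // assumed gang size
using Gang = std::array<int, kGangSize>;

// Every instance receives the value held by instance `index`.
Gang modelBroadcast(const Gang& value, int index) {
    Gang result{};
    result.fill(value[index]);
    return result;
}

// Instance i receives the value held by instance (i + offset), wrapping.
Gang modelRotate(const Gang& value, int offset) {
    Gang result{};
    for (int i = 0; i < kGangSize; ++i) {
        int source = ((i + offset) % kGangSize + kGangSize) % kGangSize;
        result[i] = value[source];
    }
    return result;
}

// Instance i receives the value held by instance permutation[i].
Gang modelShuffle(const Gang& value, const Gang& permutation) {
    Gang result{};
    for (int i = 0; i < kGangSize; ++i)
        result[i] = value[permutation[i]];
    return result;
}
```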
|
|
|
|
How can a gang of program instances generate variable amounts of output efficiently?
|
|
------------------------------------------------------------------------------------
|
|
|
|
It's not unusual to have a gang of program instances where each program
|
|
instance generates a variable amount of output (perhaps some generate no
|
|
output, some generate one output value, some generate many output values
|
|
and so forth), and where one would like to have the output densely packed
|
|
in an output array. The ``exclusive_scan_add()`` function from the
|
|
standard library is quite useful in this situation.
|
|
|
|
Consider the following function:
|
|
|
|
::
|
|
|
|
uniform int func(uniform float outArray[], ...) {
|
|
int numOut = ...; // figure out how many to be output
|
|
float outLocal[MAX_OUT]; // staging area
|
|
|
|
// each program instance in the gang puts its results in
|
|
// outLocal[0], ..., outLocal[numOut-1]
|
|
|
|
int startOffset = exclusive_scan_add(numOut);
|
|
for (int i = 0; i < numOut; ++i)
|
|
outArray[startOffset + i] = outLocal[i];
|
|
return reduce_add(numOut);
|
|
}
|
|
|
|
Here, each program instance has computed a number, ``numOut``, of values to
|
|
output, and has stored them in the ``outLocal`` array. Assume that four
|
|
program instances are running and that the first one wants to output one
|
|
value, the second two values, and the third and fourth three values each.
|
|
In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
|
|
to the four program instances, respectively.
|
|
|
|
The first program instance will then write its one result to
|
|
``outArray[0]``, the second will write its two values to ``outArray[1]``
|
|
and ``outArray[2]``, and so forth. The ``reduce_add()`` call at the end
|
|
returns the total number of values that all of the program instances have
|
|
written to the array.
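The packing scheme can be checked with a scalar C++ model that processes the program instances one after another. This is a sketch only: it assumes every instance is active, and ``exclusiveScanAdd`` and ``packOutputs`` are hypothetical stand-ins for the ``ispc`` routines and the function above.

```cpp
#include <cstddef>
#include <vector>

// Element i of the result is the sum of all inputs before element i.
std::vector<int> exclusiveScanAdd(const std::vector<int>& values) {
    std::vector<int> result(values.size());
    int running = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        result[i] = running;
        running += values[i];
    }
    return result;
}

// outLocal[lane] holds the values produced by one program instance.
// Packs them densely into outArray and returns the total count,
// which plays the role of reduce_add(numOut).
int packOutputs(const std::vector<std::vector<float>>& outLocal,
                std::vector<float>& outArray) {
    std::vector<int> numOut;
    for (const std::vector<float>& lane : outLocal)
        numOut.push_back(static_cast<int>(lane.size()));

    std::vector<int> startOffset = exclusiveScanAdd(numOut);

    int total = 0;
    for (int n : numOut)
        total += n;
    outArray.resize(total);

    for (std::size_t lane = 0; lane < outLocal.size(); ++lane)
        for (int i = 0; i < numOut[lane]; ++i)
            outArray[startOffset[lane] + i] = outLocal[lane][i];
    return total;
}
```

With per-instance counts of 1, 2, 3 and 3, the scan yields the offsets 0, 1, 3 and 6 from the example, and the total is 9.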
|
|
|
|
FIXME: add discussion of foreach_active as an option here once that's in
|
|
|
|
Is it possible to use ispc for explicit vector programming?
|
|
-----------------------------------------------------------
|
|
|
|
The typical model for programming in ``ispc`` is an *implicit* parallel
|
|
model, where one writes a program that is apparently doing scalar
|
|
computation on values and the program is then vectorized to run in parallel
|
|
across the SIMD lanes of a processor. However, ``ispc`` also has some
|
|
support for explicit vector unit programming, where the vectorization is
|
|
explicit. Some computations may be more effectively described in the
|
|
explicit model rather than the implicit model.
|
|
|
|
This support is provided via ``uniform`` instances of short vector types.
Specifically, if this short program
|
|
|
|
::
|
|
|
|
export uniform float<8> madd(uniform float<8> a, uniform float<8> b,
|
|
uniform float<8> c) {
|
|
return a + b * c;
|
|
}
|
|
|
|
is compiled with the AVX target, ``ispc`` generates the following assembly:
|
|
|
|
::
|
|
|
|
_madd:
|
|
vmulps %ymm2, %ymm1, %ymm1
|
|
vaddps %ymm0, %ymm1, %ymm0
|
|
ret
|
|
|
|
(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two
|
|
``addps`` instructions are generated, and so forth.)
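When validating an explicit-vector routine like this one from the application, a scalar reference is handy for comparison. The version below is a plain C++ sketch, with ``Float8`` and ``maddReference`` as hypothetical names; it computes the same element-wise ``a + b * c`` without any SIMD.

```cpp
#include <array>

using Float8 = std::array<float, 8>;  // models ispc's uniform float<8>

// Scalar reference for the madd() example: r[i] = a[i] + b[i] * c[i].
Float8 maddReference(const Float8& a, const Float8& b, const Float8& c) {
    Float8 result{};
    for (std::size_t i = 0; i < result.size(); ++i)
        result[i] = a[i] + b[i] * c[i];
    return result;
}
```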
|
|
|
|
Note that ``ispc`` doesn't currently support control-flow based on
|
|
``uniform`` short vector types; it is thus not possible to write code like:
|
|
|
|
::
|
|
|
|
export uniform int<8> count(uniform float<8> a, uniform float<8> b) {
|
|
uniform int<8> sum = 0;
|
|
while (a++ < b)
|
|
++sum;
|
|
}
|
|
|
|
|
|
How can I debug my ispc programs using Valgrind?
|
|
------------------------------------------------
|
|
|
|
`valgrind`_ is an extremely useful memory checker for
Linux and OSX; it detects a range of memory errors, including accessing
|
|
memory after it has been freed, accessing memory beyond the end of an
|
|
array, accessing uninitialized stack variables, and so forth.
|
|
In general, applications that use ``ispc`` code run with ``valgrind``
|
|
without modification and ``valgrind`` will detect the same range of memory
|
|
errors in ``ispc`` code that it does in C/C++ code.
|
|
|
|
.. _valgrind: http://valgrind.org
|
|
|
|
One issue to be aware of is that older versions of ``valgrind`` only
supported the SSE2 vector instructions; if you are using a version of
``valgrind`` older than the 3.7.0 release (5 November 2011), you should
|
|
compile your ``ispc`` programs with ``--target=sse2`` before running them
|
|
through ``valgrind``. (Note that if no target is specified, then ``ispc``
|
|
chooses a target based on the capabilities of the system you're running
|
|
``ispc`` on.) If you run an ``ispc`` program that uses instructions that
|
|
``valgrind`` doesn't support, you'll see an error message like:
|
|
|
|
::
|
|
|
|
vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x10 0x0 0xC5 0xFA 0x11 0x84
|
|
==46059== valgrind: Unrecognised instruction at address 0x100002707.
|
|
|
|
The ``valgrind`` 3.7.0 release adds support for the SSE4.2 instruction
set; if you're using that version (and your system supports SSE4.2), then
you can use ``--target=sse4`` when compiling to run with ``valgrind``.

Note that ``valgrind`` 3.7.0 does not support programs that use the AVX
instruction set.
|
|
|
|
foreach statements generate more complex assembly than I'd expect; what's going on?
|
|
-----------------------------------------------------------------------------------
|
|
|
|
Given a simple ``foreach`` loop like the following:
|
|
|
|
::
|
|
|
|
void foo(uniform float a[], uniform int count) {
|
|
foreach (i = 0 ... count)
|
|
a[i] *= 2;
|
|
}
|
|
|
|
|
|
the ``ispc`` compiler generates approximately 40 instructions--why isn't
|
|
the generated code simpler?
|
|
|
|
There are two main components to the code: one handles
|
|
``programCount``-sized chunks of elements of the array, and the other
|
|
handles any excess elements at the end of the array that don't completely
|
|
fill a gang. The code for the main loop is essentially what one would
|
|
expect: a vector of values is loaded from the array, the multiply is done,
|
|
and the result is stored.
|
|
|
|
::
|
|
|
|
LBB0_2: ## %foreach_full_body
|
|
movslq %edx, %rdx
|
|
vmovups (%rdi,%rdx), %ymm1
|
|
vmulps %ymm0, %ymm1, %ymm1
|
|
vmovups %ymm1, (%rdi,%rdx)
|
|
addl $32, %edx
|
|
addl $8, %eax
|
|
cmpl %ecx, %eax
|
|
jl LBB0_2
|
|
|
|
|
|
Then, there is a sequence of instructions that handles any additional
|
|
elements at the end of the array. (These instructions don't execute if
|
|
there aren't any left-over values to process, but they do lengthen the
|
|
amount of generated code.)
|
|
|
|
::
|
|
|
|
## BB#4: ## %partial_inner_only
|
|
vmovd %eax, %xmm0
|
|
vinsertf128 $1, %xmm0, %ymm0, %ymm0
|
|
vpermilps $0, %ymm0, %ymm0 ## ymm0 = ymm0[0,0,0,0,4,4,4,4]
|
|
vextractf128 $1, %ymm0, %xmm3
|
|
vmovd %esi, %xmm2
|
|
vmovaps LCPI0_1(%rip), %ymm1
|
|
vextractf128 $1, %ymm1, %xmm4
|
|
vpaddd %xmm4, %xmm3, %xmm3
|
|
# ....
|
|
vmulps LCPI0_0(%rip), %ymm1, %ymm1
|
|
vmaskmovps %ymm1, %ymm0, (%rdi,%rax)
|
|
|
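
To make the two-part structure concrete, here is a minimal scalar sketch in
C (not ``ispc``) of how the loop is split. It is only a model of the logic,
not what the compiler emits: the name ``scale_by_two`` and the fixed
``GANG_WIDTH`` of 8 are assumptions chosen to match the 8-wide AVX listing
above, and the per-lane ``if`` stands in for the mask passed to
``vmaskmovps``.

```c
#include <assert.h>

/* Assumed gang width; 8 matches the 8-wide AVX code shown above. */
enum { GANG_WIDTH = 8 };

/* Scalar model of "foreach (i = 0 ... count) a[i] *= 2;".
   Returns the number of elements handled by the partial (masked) step. */
int scale_by_two(float *a, int count) {
    assert(count >= 0);

    /* Largest multiple of the gang width that is <= count. */
    int full_end = count - (count % GANG_WIDTH);

    /* Part 1: full gangs. Every lane is active, so no mask is needed. */
    for (int base = 0; base < full_end; base += GANG_WIDTH)
        for (int lane = 0; lane < GANG_WIDTH; ++lane)
            a[base + lane] *= 2.0f;

    /* Part 2: at most one partial gang. Only lanes whose index is still
       below count are active; this test plays the role of the mask. */
    int remainder = 0;
    if (full_end < count) {
        for (int lane = 0; lane < GANG_WIDTH; ++lane) {
            if (full_end + lane < count) {
                a[full_end + lane] *= 2.0f;
                ++remainder;
            }
        }
    }
    return remainder;
}
```

With ``count`` equal to 11, the first part covers elements 0 through 7 and
the second part covers the remaining 3; with ``count`` equal to 16, the
second part does nothing.
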
|
|
|
If you know that the number of elements to be processed will always be an
|
|
exact multiple of 8, 16, etc., then adding a simple assignment to
|
|
``count`` like the one below gives the compiler enough information to be
|
|
able to eliminate the code for the additional array elements.
|
|
|
|
::
|
|
|
|
void foo(uniform float a[], uniform int count) {
|
|
// This assignment doesn't change the value of count
|
|
// if it's a multiple of 16, but it gives the compiler
|
|
// insight into this fact, allowing for simpler code to
|
|
// be generated for the foreach loop.
|
|
count = (count & ~(16-1));
|
|
foreach (i = 0 ... count)
|
|
a[i] *= 2;
|
|
}
|
|
|
|
With this new version of ``foo()``, only the code for the first loop above
|
|
is generated.
|
|
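
The masking expression is worth checking in isolation. The following C
sketch (the helper name ``round_down_to_multiple`` is hypothetical, and the
power-of-two requirement is an assumption of the bit trick itself) shows
that ``count & ~(width-1)`` leaves exact multiples unchanged and rounds
everything else down.

```c
#include <assert.h>

/* Round count down to a multiple of width, which must be a power of two.
   Mirrors the expression "count & ~(16-1)" from the example above. */
int round_down_to_multiple(int count, int width) {
    assert(count >= 0);
    assert(width > 0 && (width & (width - 1)) == 0);
    return count & ~(width - 1);
}
```

Note the consequence: if ``count`` were 37 rather than a multiple of 16,
the assignment would change it to 32 and the last 5 elements would not be
processed at all. The technique is therefore only appropriate when the
caller really does guarantee a multiple.
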
|
|
|
|
How do I launch an individual task for each active program instance?
|
|
--------------------------------------------------------------------
|
|
|
|
Recall from the `discussion of "launch" in the ispc User's Guide`_ that a
|
|
``launch`` statement launches a single task corresponding to a single gang
|
|
of executing program instances, where the indices of the active program
|
|
instances are the same as were active when the ``launch`` statement
|
|
executed.
|
|
|
|
.. _discussion of "launch" in the ispc User's Guide: ispc.html#task-parallelism-launch-and-sync-statements
|
|
|
|
In some situations, it's desirable to be able to launch an individual task
|
|
for each executing program instance. For example, we might be performing
|
|
an iterative computation where a subset of the program instances determine
|
|
that an item they are responsible for requires additional processing.
|
|
|
|
::
|
|
|
|
bool itemNeedsMoreProcessing(int);
|
|
int itemNum = ...;
|
|
if (itemNeedsMoreProcessing(itemNum)) {
|
|
// do additional work
|
|
}
|
|
|
|
For performance reasons, it may be desirable to apply an entire gang's
|
|
worth of computation to each item that needs additional processing;
|
|
there may be available parallelism in this computation such that we'd like
|
|
to process each of the items with SPMD computation.
|
|
|
|
In this case, the ``foreach_active`` and ``unmasked`` constructs can be
|
|
applied together to accomplish this goal.
|
|
|
|
::
|
|
|
|
// do additional work
|
|
task void doWork(uniform int index);
|
|
foreach_active (index) {
|
|
unmasked {
|
|
launch doWork(extract(itemNum, index));
|
|
}
|
|
}
|
|
|
|
Recall that the body of the ``foreach_active`` loop runs once for each
|
|
active program instance, with each active program instance's
|
|
``programIndex`` value available in ``index`` in the above. In the loop,
|
|
we can re-establish an "all on" execution mask, enabling execution in all
|
|
of the program instances in the gang, such that execution in ``doWork()``
|
|
starts with all instances running. (Alternatively, the ``unmasked`` block
|
|
could be in the definition of ``doWork()``.)
|
|
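
As a rough mental model, the control flow of the ``foreach_active`` loop
can be sketched in scalar C. Everything here is an illustrative assumption
rather than ``ispc`` runtime API: the execution mask is modeled as a
bitmask, ``GANG_WIDTH`` is fixed at 8, and ``launch_do_work`` merely
records which item would have been handed to a task.

```c
#include <assert.h>

/* Assumed gang width for this sketch. */
enum { GANG_WIDTH = 8 };

/* Hypothetical stand-in for the task system: logs each launched item. */
typedef struct {
    int launched[GANG_WIDTH];
    int count;
} TaskLog;

static void launch_do_work(TaskLog *log, int item) {
    log->launched[log->count++] = item;
}

/* Scalar model of:
     foreach_active (index) {
         unmasked { launch doWork(extract(itemNum, index)); }
     }
   Bit i of active_mask is set when program instance i is active.
   Returns the number of tasks launched. */
int launch_per_active_instance(unsigned active_mask,
                               const int item_num[GANG_WIDTH],
                               TaskLog *log) {
    log->count = 0;
    for (int index = 0; index < GANG_WIDTH; ++index) {
        if (active_mask & (1u << index)) {
            /* item_num[index] corresponds to extract(itemNum, index). */
            launch_do_work(log, item_num[index]);
        }
    }
    return log->count;
}
```

With instances 0, 2, and 5 active (mask ``0x25``), three tasks are
launched, one per active instance, each receiving that instance's item
number as a ``uniform`` value. What the model does not show is the effect
of ``unmasked``: in ``ispc``, each launched task then begins with a full
gang of running program instances.
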
|