1.1 Users guide final (for now)

Matt Pharr
2011-12-02 17:04:39 -08:00
parent 3e4d69cbd3
commit e07ef6d46a


@@ -2,18 +2,19 @@
 Intel® SPMD Program Compiler User's Guide
 =========================================
 
-``ispc`` is a compiler for writing SPMD (single program multiple data)
-programs to run on the CPU. The SPMD programming approach is widely known
-to graphics and GPGPU programmers; it is used for GPU shaders and CUDA\* and
-OpenCL\* kernels, for example. The main idea behind SPMD is that one writes
-programs as if they were operating on a single data element (a pixel for a
-pixel shader, for example), but then the underlying hardware and runtime
-system executes multiple invocations of the program in parallel with
-different inputs (the values for different pixels, for example).
+The Intel® SPMD Program Compiler (``ispc``) is a compiler for writing SPMD
+(single program multiple data) programs to run on the CPU. The SPMD
+programming approach is widely known to graphics and GPGPU programmers; it
+is used for GPU shaders and CUDA\* and OpenCL\* kernels, for example. The
+main idea behind SPMD is that one writes programs as if they were operating
+on a single data element (a pixel for a pixel shader, for example), but
+then the underlying hardware and runtime system executes multiple
+invocations of the program in parallel with different inputs (the values
+for different pixels, for example).
 
 The main goals behind ``ispc`` are to:
 
-* Build a small variant of the C programming language that delivers good
+* Build a variant of the C programming language that delivers good
   performance to performance-oriented programmers who want to run SPMD
   programs on CPUs.
 
 * Provide a thin abstraction layer between the programmer and the
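
For a flavor of what this looks like in practice, here is a minimal sketch
of an ``ispc`` SPMD function (the names are hypothetical, in the spirit of
the examples distributed with the compiler):

::

    export void simple(uniform float vin[], uniform float vout[],
                       uniform int count) {
        // The iteration domain is mapped across the gang: each program
        // instance processes its own elements of the arrays.
        foreach (i = 0 ... count) {
            float v = vin[i];
            vout[i] = v * v;
        }
    }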
@@ -162,10 +163,11 @@ of recent changes to the compiler.
 Updating ISPC Programs For Changes In ISPC 1.1
 ----------------------------------------------
 
-The 1.1 release of ``ispc`` features first-class support for pointers in
-the language. Adding this functionality led to a number of syntactic
-changes to the language. These should generally require only
-straightforward modification of existing programs.
+The major changes introduced in the 1.1 release of ``ispc`` are first-class
+support for pointers in the language and new parallel loop constructs.
+Adding this functionality required a number of syntactic changes to the
+language. These changes should generally lead to straightforward minor
+modifications of existing ``ispc`` programs.
 
 These are the relevant changes to the language:
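
A small sketch of the first-class pointer support mentioned above (names
are hypothetical; pointer types in ``ispc`` carry a uniform/varying rate
for both the pointer and the pointee):

::

    void pointerSketch(uniform float data[]) {
        // One pointer value shared by the whole gang:
        uniform float * uniform base = &data[0];
        // A distinct pointer per program instance (a "varying" pointer):
        uniform float * varying perInstance = &data[programIndex];
        // Reading through a varying pointer gathers one value per program
        // instance; writing through it scatters.
        *perInstance = *base + 1.;
    }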
@@ -179,11 +181,6 @@ These are the relevant changes to the language:
   qualifier should just have ``reference`` removed: ``void foo(reference
   float bar[])`` can just be ``void foo(float bar[])``.
 
-* It is no longer legal to pass a varying lvalue to a function that takes a
-  reference parameter; references can only be to uniform lvalue types. In
-  this case, the function should be rewritten to take a varying pointer
-  parameter.
-
 * It is now a compile-time error to assign an entire array to another
   array.
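
A short sketch of the whole-array assignment rule from the item above
(hypothetical declarations):

::

    void arrayAssignmentSketch() {
        uniform float a[16];
        uniform float b[16];
        // ... assume b has been given values ...
        // a = b;            // error: whole-array assignment is not allowed
        for (uniform int i = 0; i < 16; ++i)
            a[i] = b[i];     // copy the elements individually instead
    }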
@@ -196,6 +193,11 @@ These are the relevant changes to the language:
   as its first parameter rather than taking a ``uniform unsigned int[]`` as
   its first parameter and a ``uniform int`` offset as its second parameter.
 
+* It is no longer legal to pass a varying lvalue to a function that takes a
+  reference parameter; references can only be to uniform lvalue types. In
+  this case, the function should be rewritten to take a varying pointer
+  parameter.
+
 * There are new iteration constructs for looping over computation domains,
   ``foreach`` and ``foreach_tiled``. In addition to being syntactically
   cleaner than regular ``for`` loops, these can provide performance
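
A sketch of the reference-to-pointer rewrite described in the added item
above (hypothetical function, not taken from the guide):

::

    // Pre-1.1 style, no longer legal when called with a varying lvalue:
    //     void increment(reference float v) { v += 1; }

    // 1.1 style: take a varying pointer and dereference it instead.
    void increment(float * v) {
        *v += 1;
    }

    void caller() {
        float x = programIndex;   // a varying lvalue
        increment(&x);
    }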
@@ -505,14 +507,14 @@ parallel computation. Understanding the details of ``ispc``'s parallel
 execution model that are introduced in this section is critical for writing
 efficient and correct programs in ``ispc``.
 
-``ispc`` supports two types of parallelism: both task parallelism to
-parallelize across multiple processor cores and SPMD parallelism to
-parallelize across the SIMD vector lanes on a single core. Most of this
-section focuses on SPMD parallelism, but see `Tasking Model`_ at the end of
-this section for discussion of task parallelism in ``ispc``.
+``ispc`` supports two types of parallelism: task parallelism to parallelize
+across multiple processor cores and SPMD parallelism to parallelize across
+the SIMD vector lanes on a single core. Most of this section focuses on
+SPMD parallelism, but see `Tasking Model`_ at the end of this section for
+discussion of task parallelism in ``ispc``.
 
 This section will use some snippets of ``ispc`` code to illustrate various
-concepts. Given ``ispc``'s relationship to C, these should generally be
+concepts. Given ``ispc``'s relationship to C, these should be
 understandable on their own, but you may want to refer to the `The ISPC
 Language`_ section for details on language syntax.
@@ -523,15 +525,15 @@ Basic Concepts: Program Instances and Gangs of Program Instances
 Upon entry to a ``ispc`` function called from C/C++ code, the execution
 model switches from the application's serial model to ``ispc``'s execution
 model. Conceptually, a number of ``ispc`` *program instances* start
-running in parallel. The group of running program instances is a called
-*gang* (harkening to "gang scheduling", since ``ispc`` provides certain
-guarantees about the control flow coherence of program instances running
-in a gang.) An ``ispc`` program instance is thus roughly similar to a
-CUDA* "thread" or an OpenCL* "work-item", and an ``ispc`` gang is roughly
-similar to a CUDA* "warp".
+running concurrently. The group of running program instances is called a
+*gang* (harkening to "gang scheduling", since ``ispc`` provides certain
+guarantees about the control flow coherence of program instances running
+in a gang, detailed in `Gang Convergence Guarantees`_.) An ``ispc``
+program instance is thus similar to a CUDA* "thread" or an OpenCL*
+"work-item", and an ``ispc`` gang is similar to a CUDA* "warp".
 
-An ``ispc`` program, then, expresses the computation performed by a gang of
-program instances, using an "implicitly parallel" model, where the ``ispc``
+An ``ispc`` program expresses the computation performed by a gang of
+program instances, using an "implicit parallel" model, where the ``ispc``
 program generally describes the behavior of a single program instance, even
 though a gang of them is actually executing together. This implicit model
 is the same that is used for shaders in programmable graphics pipelines,
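
As a tiny sketch of this model (hypothetical function): when a gang runs
the code below, there is one ``x`` value per program instance but a single
``scale`` shared by all of them.

::

    float scaled(float x, uniform float scale) {
        // Executed once by the gang as a whole; 'x' holds a different
        // value in each program instance, while 'scale' is shared.
        return x * scale;
    }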
@@ -592,7 +594,7 @@ the same results for each program instance in a gang as would have been
 computed if the equivalent code ran serially in C to compute each program
 instance's result individually. However, here we will more precisely
 define the execution model for control flow in order to be able to
-precisely define the language's behavior.
+precisely define the language's behavior in specific situations.
 
 We will specify the notion of a *program counter* and how it is updated to
 step through the program, and an *execution mask* that indicates which
@@ -615,13 +617,14 @@ of an ``ispc`` function.
    conservative execution path through the function, wherein if *any*
    program instance wants to execute a statement, the program counter will
    pass through that statement.
 
 2. At each statement the program counter passes through, the execution
    mask will be set such that its value for a particular program instance is
-   "on" if the program instance wants to execute that statements.
+   "on" if and only if the program instance wants to execute that statement.
 
 Note that these definitions provide the compiler some latitude; for example,
-the program counter is allowed pass through a series of statements with the
-execution mask "all off" as long as doing so has no observable side-effects.
+the program counter is allowed to pass through a series of statements with
+the execution mask "all off" because doing so has no observable
+side-effects.
 
 Elsewhere, we will speak informally of the *control flow coherence* of a
 program; this notion describes the degree to which the program instances in
@@ -638,7 +641,7 @@ Control Flow Example: If Statements
 As a concrete example of the interplay between program counter and
 execution mask, one way that an ``if`` statement like the one in the
 previous section can be represented is shown by the following pseudo-code
-``ispc`` compiler output:
+compiler output:
 
 ::
@@ -654,7 +657,8 @@ previous section can be represented is shown by the following pseudo-code
 In other words, the program counter steps through the statements for both
 the "true" case and the "false" case, with the execution mask set so that
 no side-effects from the true statements affect the program instances that
-want to run the false statements, and vice versa.
+want to run the false statements, and vice versa. The execution mask is
+then restored to the value it had before the ``if`` statement.
 
 However, the compiler is free to generate different code for an ``if``
 test, such as:
@@ -681,8 +685,8 @@ code for the "true" and "false" statements is undefined.
 In most cases, there is no programmer-visible difference between these two
 ways of compiling ``if``, though see the `Uniform Variables and Varying
-Control Flow`_ section for a case where it causes undefined behavior in a
-specific situation.
+Control Flow`_ section for a case where it causes undefined behavior in one
+particular situation.
 
 
 Control Flow Example: Loops
@@ -701,12 +705,13 @@ Therefore, if we have a loop like the following:
         ...
     }
 
-where ``limit`` has the value 1 for all of the program instances but
-one, and has value 1000 for the other one, the program counter will step
-through the loop body 1000 times. The first time, the execution mask will
-be all on (assuming it is all on going into the ``for`` loop), and the
-remaining 999 times, the mask will be off except for the program instance
-with 1000 in ``limit``. (This would be a loop with poor control flow
-coherence!)
+where ``limit`` has the value 1 for all of the program instances but one,
+and has value 1000 for the other one, the program counter will step through
+the loop body 1000 times. The first time, the execution mask will be all
+on (assuming it is all on going into the ``for`` loop), and the remaining
+999 times, the mask will be off except for the program instance with a
+``limit`` value of 1000. (This would be a loop with poor control flow
+coherence!)
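
A concrete sketch of the kind of divergent loop described above (the values
are chosen purely for illustration):

::

    int divergentLoopSketch() {
        int limit = (programIndex == 0) ? 1000 : 1;  // divergent trip counts
        int sum = 0;
        for (int i = 0; i < limit; ++i) {
            // The gang's program counter passes through this body 1000
            // times; after the first iteration, the execution mask is on
            // only for the program instance with limit == 1000.
            sum += i;
        }
        return sum;
    }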
 
 A ``continue`` statement in a loop may be handled either by disabling the
 execution mask for the program instances that execute the ``continue`` and
@@ -716,11 +721,6 @@ disabled after the ``continue`` has executed. ``break`` statements are
 handled in a similar fashion.
 
-
-Control Flow Example: Function Pointers
----------------------------------------
-
 
 Gang Convergence Guarantees
 ---------------------------
@@ -728,13 +728,16 @@ The ``ispc`` execution model provides an important guarantee about the
 behavior of the program counter and execution mask: the execution of
 program instances is *maximally converged*. Maximal convergence means that
 if two program instances follow the same control path, they are guaranteed
-to execute each program statement concurrently. [#]_
+to execute each program statement concurrently. If two program instances
+follow diverging control paths, it is guaranteed that they will reconverge
+as soon as possible (if they do later reconverge). [#]_
 
 .. [#] This is another significant difference between the ``ispc``
-   execution model and the one implemented by OpenCL* and CUDA*.
+   execution model and the one implemented by OpenCL* and CUDA*, which
+   doesn't provide this guarantee.
 
-Furthermore, maximal convergence means that in the presence of divergent
-control flow such as the following:
+Maximal convergence means that in the presence of divergent control flow
+such as the following:
 
 ::
@@ -751,9 +754,9 @@ It is guaranteed that all program instances that were running before the
 for the gang of program instances, rather than the concept of a unique
 program counter for each program instance.)
 
-Thus, it is illegal to execute a function with an 8-wide gang by running it
-two times, with a 4-wide gang representing half of the original 8-wide gang
-each time.
+Another implication of this property is that it is illegal to execute a
+function with an 8-wide gang by running it two times, with a 4-wide gang
+representing half of the original 8-wide gang each time.
 
 The way that "varying" function pointers are handled in ``ispc`` is also
 affected by this guarantee: if a function pointer is ``varying``, then it
@@ -772,16 +775,16 @@ Uniform Data
 A variable that is declared with the ``uniform`` qualifier represents a
 single value that is shared across the entire gang. (In contrast, the
-default qualifier for variables in ``ispc``, ``varying``, represents a
-variable that has a distinct storage location for each program instance in
-the gang.)
+default variability qualifier for variables in ``ispc``, ``varying``,
+represents a variable that has a distinct storage location for each program
+instance in the gang.)
 
 It is an error to try to assign a ``varying`` value to a ``uniform``
 variable, though ``uniform`` values can be assigned to ``uniform``
 variables. Assignments to ``uniform`` variables are not affected by the
 execution mask (there's no unambiguous way that they could be); rather,
-they always apply if the block of code that has the the uniform assignment
-is executed.
+they always apply if the program counter executes a statement that is a
+uniform assignment.
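
A small sketch of these assignment rules (hypothetical variable names):

::

    void uniformSketch() {
        uniform int gangShared = 0;      // one value for the whole gang
        int perInstance = programIndex;  // "varying" by default
        perInstance = gangShared;        // ok: uniform converts to varying
        gangShared = programCount;       // ok: uniform = uniform
        // gangShared = perInstance;     // error: varying value assigned to uniform
    }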
 
 Uniform Control Flow
@@ -811,14 +814,13 @@ over pixels adjacent to the given (x,y) coordinates:
         return sum / 9.;
     }
 
-Under the ``ispc`` SPMD model, we have a gang of program instances this
-function in parallel, where in general each program instance has different
-values for ``x`` and ``y``.) For the box filtering algorithm here, all of
+In general each program instance in the gang has different values for ``x``
+and ``y`` in this function. For the box filtering algorithm here, all of
 the program instances will actually want to execute the same number of
 iterations of the ``for`` loops, with all of them having the same values
 for ``dx`` and ``dy`` each time through. If these loops are instead
 implemented with ``dx`` and ``dy`` declared as ``uniform`` variables, then
 the ``ispc`` compiler can generate more efficient code for the loops. [#]_
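
For instance, the inner loops might look like the following sketch, with
``uniform`` iteration variables (the ``image`` indexing here is
hypothetical, standing in for the guide's example):

::

    float sum = 0;
    for (uniform int dy = -1; dy <= 1; ++dy)
        for (uniform int dx = -1; dx <= 1; ++dx)
            sum += image[y + dy][x + dx];  // hypothetical image access
    return sum / 9.;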
 
 .. [#] In this case, a sufficiently smart compiler could determine that
    ``dx`` and ``dy`` have the same value for all program instances and thus
@@ -833,7 +835,7 @@ the ``ispc`` compiler can generate more efficient code for the loops. [#]_
 In particular, ``ispc`` can avoid the overhead of checking to see if any of
 the running program instances wants to do another loop iteration. Instead,
-``ispc`` can generate code where all instances always do the same
+the compiler can generate code where all instances always do the same
 iterations.
 
 The analogous benefit comes when using ``if`` statements--if the test in an
@@ -854,9 +856,9 @@ instances that are supposed to be executing the corresponding clause.
 Under this model, we must define the effect of modifying ``uniform``
 variables in the context of varying control flow.
 
-In general, modifying ``uniform`` variables under varying control flow
-leads to the ``uniform`` variable having an undefined value, except
-within a block where the ``uniform`` value had a value assigned to it.
+In most cases, modifying ``uniform`` variables under varying control flow
+leads to the ``uniform`` variable having an undefined value, except within
+a block where the ``uniform`` value had a value assigned to it.
 
 Consider the following example, which illustrates three cases.
@@ -1331,9 +1333,9 @@ instance of that variable shared by all program instances in a gang. (In
 other words, it necessarily has the same value across all of the program
 instances.) In addition to requiring less storage than varying values,
 ``uniform`` variables lead to a number of performance advantages when they
-are applicable (see `Uniform Variables and Varying Control Flow`_, for
-example.) Varying variables may be qualified with ``varying``, though
-doing so has no effect, as ``varying`` is the default.
+are applicable (see `Uniform Control Flow`_, for example.) Varying
+variables may be qualified with ``varying``, though doing so has no effect,
+as ``varying`` is the default.
 
 ``uniform`` variables can be modified as the program executes, but only in
 ways that preserve the property that they have a single value for the
@@ -1938,7 +1940,10 @@ more details on why regular ``if`` statements may sometimes do this.)
 Along similar lines, ``cfor``, ``cdo``, and ``cwhile`` check to see if all
 program instances are running at the start of each loop iteration; if so,
 they can run a specialized code path that has been optimized for the "all
-on" execution mask case.
+on" execution mask case. It is already the case for the regular looping
+constructs in ``ispc`` that a loop will never be executed with an "all off"
+execution mask.
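
A sketch of how these coherent control flow variants are written
(hypothetical function):

::

    float clampNegativeToZero(float x) {
        // 'cif' hints that the test is expected to be coherent across the
        // gang, so a fast path for the "all on" / "all off" cases pays off.
        cif (x < 0.)
            x = 0.;
        return x;
    }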
 
 Parallel Iteration Statements: "foreach" and "foreach_tiled"
 ------------------------------------------------------------
@@ -1966,8 +1971,8 @@ As a specific example, consider the following ``foreach`` statement:
     }
 
 It specifies a loop over a 2D domain, where the ``j`` variable goes from 0
-to ``height`` and ``i`` goes from 0 to ``width``. Within the loop, the
-variables ``i`` and ``j`` are available.
+to ``height-1`` and ``i`` goes from 0 to ``width-1``. Within the loop, the
+variables ``i`` and ``j`` are available and initialized accordingly.
 
 ``foreach`` loops actually cause the given iteration domain to be
 automatically mapped to the program instances in the gang, so that all of
@@ -1981,27 +1986,28 @@ the gang size is 8:
         // perform computation on element i
     }
 
-At the CPU hardware level, the body of this loop will only execute
+The program counter will step through the statements of this loop just
 ``16/8==2`` times; the first time through, the ``varying int32`` variable
 ``i`` will have the values (0,1,2,3,4,5,6,7) over the program instances,
 and the second time through, ``i`` will have the values
 (8,9,10,11,12,13,14,15), thus mapping the available program instances to
-all of the data by the end of the loop's execution.
+all of the data by the end of the loop's execution. The execution mask
+starts out "all on" at the start of each ``foreach`` loop iteration, but
+may be changed by control flow constructs within the loop.
 
-The ``foreach`` statement subdivides the iteration domain by mapping a
-gang-size worth of values in the innermost dimension to the gang, only
+The basic ``foreach`` statement subdivides the iteration domain by mapping
+only a gang-size worth of values in the innermost dimension to the gang,
 spanning a single value in each of the outer dimensions. This
 decomposition generally leads to coherent memory reads and writes, but may
 lead to worse control flow coherence than other decompositions.
 
-``foreach_tiled`` decomposes the iteration domain in a way that tries to
-map locations in the domain to program instances in a way that is compact
-across all of the dimensions.
-
-For example, on a target with an 8-wide gang size, the following
-``foreach_tiled`` statement processes the iteration domain in chunks of 2
-elements in ``j`` and 4 elements in ``i`` each time. (The trade-offs
-between these two constructs are discussed in more detail in the `ispc
-Performance Guide`_.)
+Therefore, ``foreach_tiled`` decomposes the iteration domain in a way that
+tries to map locations in the domain to program instances in a way that is
+compact across all of the dimensions. For example, on a target with an
+8-wide gang size, the following ``foreach_tiled`` statement processes the
+iteration domain in chunks of 2 elements in ``j`` and 4 elements in ``i``
+each time. (The trade-offs between these two constructs are discussed in
+more detail in the `ispc Performance Guide`_.)
 
 .. _ispc Performance Guide: perf.html#improving-control-flow-coherence-with-foreach-tiled
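
A sketch of the kind of ``foreach_tiled`` loop being described (hypothetical
names, not the guide's exact example):

::

    export void process(uniform float out[], uniform int width,
                        uniform int height) {
        foreach_tiled (j = 0 ... height, i = 0 ... width) {
            // On an 8-wide target, each gang iteration covers a compact
            // 2 (in j) by 4 (in i) tile of the iteration domain.
            out[j * width + i] = i + j;  // placeholder computation
        }
    }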
@@ -2024,7 +2030,10 @@ each program instance. (In other words, it's a varying integer value that
 has value zero for the first program instance, and so forth.) The
 ``programCount`` builtin gives the total number of instances in the gang.
 Together, these can be used to uniquely map executing program instances to
-input data.
+input data. [#]_
+
+.. [#] ``programIndex`` is analogous to ``get_global_id()`` in OpenCL* and
+   ``threadIdx`` in CUDA*.
 
 As a specific example, consider an ``ispc`` function that needs to perform
 some computation on an array of data.
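
A sketch along these lines, using ``programIndex`` and ``programCount`` to
map gang members onto an array (hypothetical names; the guide develops its
own version of this example):

::

    export void doubleAll(uniform float data[], uniform int count) {
        // Each trip through the loop processes programCount contiguous
        // elements, one per program instance.
        for (uniform int i = 0; i < count; i += programCount) {
            int index = i + programIndex;
            if (index < count)
                data[index] *= 2;
        }
    }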