1.1 Users guide final (for now)
187
docs/ispc.txt
@@ -2,18 +2,19 @@
|
|||||||
Intel® SPMD Program Compiler User's Guide
|
Intel® SPMD Program Compiler User's Guide
|
||||||
=========================================
|
=========================================
|
||||||
|
|
||||||
``ispc`` is a compiler for writing SPMD (single program multiple data)
|
The Intel® SPMD Program Compiler (``ispc``) is a compiler for writing SPMD
|
||||||
programs to run on the CPU. The SPMD programming approach is widely known
|
(single program multiple data) programs to run on the CPU. The SPMD
|
||||||
to graphics and GPGPU programmers; it is used for GPU shaders and CUDA\* and
|
programming approach is widely known to graphics and GPGPU programmers; it
|
||||||
OpenCL\* kernels, for example. The main idea behind SPMD is that one writes
|
is used for GPU shaders and CUDA\* and OpenCL\* kernels, for example. The
|
||||||
programs as if they were operating on a single data element (a pixel for a
|
main idea behind SPMD is that one writes programs as if they were operating
|
||||||
pixel shader, for example), but then the underlying hardware and runtime
|
on a single data element (a pixel for a pixel shader, for example), but
|
||||||
system executes multiple invocations of the program in parallel with
|
then the underlying hardware and runtime system executes multiple
|
||||||
different inputs (the values for different pixels, for example).
|
invocations of the program in parallel with different inputs (the values
|
||||||
|
for different pixels, for example).
|
||||||
|
|
||||||
The main goals behind ``ispc`` are to:
|
The main goals behind ``ispc`` are to:
|
||||||
|
|
||||||
* Build a small variant of the C programming language that delivers good
|
* Build a variant of the C programming language that delivers good
|
||||||
performance to performance-oriented programmers who want to run SPMD
|
performance to performance-oriented programmers who want to run SPMD
|
||||||
programs on CPUs.
|
programs on CPUs.
|
||||||
* Provide a thin abstraction layer between the programmer and the
|
* Provide a thin abstraction layer between the programmer and the
|
||||||
@@ -162,10 +163,11 @@ of recent changes to the compiler.
|
|||||||
Updating ISPC Programs For Changes In ISPC 1.1
|
Updating ISPC Programs For Changes In ISPC 1.1
|
||||||
----------------------------------------------
|
----------------------------------------------
|
||||||
|
|
||||||
The 1.1 release of ``ispc`` features first-class support for pointers in
|
The major changes introduced in the 1.1 release of ``ispc`` are first-class
|
||||||
the language. Adding this functionality led to a number of syntactic
|
support for pointers in the language and new parallel loop constructs.
|
||||||
changes to the language. These should generally require only
|
Adding this functionality required a number of syntactic changes to the
|
||||||
straightforward modification of existing programs.
|
language. These changes should generally lead to straightforward minor
|
||||||
|
modifications of existing ``ispc`` programs.
|
||||||
|
|
||||||
These are the relevant changes to the language:
|
These are the relevant changes to the language:
|
||||||
|
|
||||||
@@ -179,11 +181,6 @@ These are the relevant changes to the language:
|
|||||||
qualifier should just have ``reference`` removed: ``void foo(reference
|
qualifier should just have ``reference`` removed: ``void foo(reference
|
||||||
float bar[])`` can just be ``void foo(float bar[])``.
|
float bar[])`` can just be ``void foo(float bar[])``.
|
||||||
|
|
||||||
* It is no longer legal to pass a varying lvalue to a function that takes a
|
|
||||||
reference parameter; references can only be to uniform lvalue types. In
|
|
||||||
this case, the function should be rewritten to take a varying pointer
|
|
||||||
parameter.
|
|
||||||
|
|
||||||
* It is now a compile-time error to assign an entire array to another
|
* It is now a compile-time error to assign an entire array to another
|
||||||
array.
|
array.
|
||||||
|
|
||||||
@@ -196,6 +193,11 @@ These are the relevant changes to the language:
|
|||||||
as its first parameter rather than taking a ``uniform unsigned int[]`` as
|
as its first parameter rather than taking a ``uniform unsigned int[]`` as
|
||||||
its first parameter and a ``uniform int`` offset as its second parameter.
|
its first parameter and a ``uniform int`` offset as its second parameter.
|
||||||
|
|
||||||
|
* It is no longer legal to pass a varying lvalue to a function that takes a
|
||||||
|
reference parameter; references can only be to uniform lvalue types. In
|
||||||
|
this case, the function should be rewritten to take a varying pointer
|
||||||
|
parameter.
|
||||||
|
|
||||||
* There are new iteration constructs for looping over computation domains,
|
* There are new iteration constructs for looping over computation domains,
|
||||||
``foreach`` and ``foreach_tiled``. In addition to being syntactically
|
``foreach`` and ``foreach_tiled``. In addition to being syntactically
|
||||||
cleaner than regular ``for`` loops, these can provide performance
|
cleaner than regular ``for`` loops, these can provide performance
|
||||||
@@ -505,14 +507,14 @@ parallel computation. Understanding the details of ``ispc``'s parallel
|
|||||||
execution model that are introduced in this section is critical for writing
|
execution model that are introduced in this section is critical for writing
|
||||||
efficient and correct programs in ``ispc``.
|
efficient and correct programs in ``ispc``.
|
||||||
|
|
||||||
``ispc`` supports two types of parallelism: both task parallelism to
|
``ispc`` supports two types of parallelism: task parallelism to parallelize
|
||||||
parallelize across multiple processor cores and SPMD parallelism to
|
across multiple processor cores and SPMD parallelism to parallelize across
|
||||||
parallelize across the SIMD vector lanes on a single core. Most of this
|
the SIMD vector lanes on a single core. Most of this section focuses on
|
||||||
section focuses on SPMD parallelism, but see `Tasking Model`_ at the end of
|
SPMD parallelism, but see `Tasking Model`_ at the end of this section for
|
||||||
this section for discussion of task parallelism in ``ispc``.
|
discussion of task parallelism in ``ispc``.
|
||||||
|
|
||||||
This section will use some snippets of ``ispc`` code to illustrate various
|
This section will use some snippets of ``ispc`` code to illustrate various
|
||||||
concepts. Given ``ispc``'s relationship to C, these should generally be
|
concepts. Given ``ispc``'s relationship to C, these should be
|
||||||
understandable on their own, but you may want to refer to the `The ISPC
|
understandable on their own, but you may want to refer to the `The ISPC
|
||||||
Language`_ section for details on language syntax.
|
Language`_ section for details on language syntax.
|
||||||
|
|
||||||
@@ -523,15 +525,15 @@ Basic Concepts: Program Instances and Gangs of Program Instances
|
|||||||
Upon entry to a ``ispc`` function called from C/C++ code, the execution
|
Upon entry to a ``ispc`` function called from C/C++ code, the execution
|
||||||
model switches from the application's serial model to ``ispc``'s execution
|
model switches from the application's serial model to ``ispc``'s execution
|
||||||
model. Conceptually, a number of ``ispc`` *program instances* start
|
model. Conceptually, a number of ``ispc`` *program instances* start
|
||||||
running in parallel. The group of running program instances is a called
|
running concurrently. The group of running program instances is
|
||||||
*gang* (harkening to "gang scheduling", since ``ispc`` provides certain
|
called a *gang* (harkening to "gang scheduling", since ``ispc`` provides
|
||||||
guarantees about the control flow coherence of program instances running
|
certain guarantees about the control flow coherence of program instances
|
||||||
in a gang.) An ``ispc`` program instance is thus roughly similar to a
|
running in a gang, detailed in `Gang Convergence Guarantees`_.) An
|
||||||
CUDA* "thread" or an OpenCL* "work-item", and an ``ispc`` gang is roughly
|
``ispc`` program instance is thus similar to a CUDA* "thread" or an OpenCL*
|
||||||
similar to a CUDA* "warp".
|
"work-item", and an ``ispc`` gang is similar to a CUDA* "warp".
|
||||||
|
|
||||||
An ``ispc`` program, then, expresses the computation performed by a gang of
|
An ``ispc`` program expresses the computation performed by a gang of
|
||||||
program instances, using an "implicitly parallel" model, where the ``ispc``
|
program instances, using an "implicit parallel" model, where the ``ispc``
|
||||||
program generally describes the behavior of a single program instance, even
|
program generally describes the behavior of a single program instance, even
|
||||||
though a gang of them is actually executing together. This implicit model
|
though a gang of them is actually executing together. This implicit model
|
||||||
is the same that is used for shaders in programmable graphics pipelines,
|
is the same that is used for shaders in programmable graphics pipelines,
|
||||||
@@ -592,7 +594,7 @@ the same results for each program instance in a gang as would have been
|
|||||||
computed if the equivalent code ran serially in C to compute each program
|
computed if the equivalent code ran serially in C to compute each program
|
||||||
instance's result individually. However, here we will more precisely
|
instance's result individually. However, here we will more precisely
|
||||||
define the execution model for control flow in order to be able to
|
define the execution model for control flow in order to be able to
|
||||||
precisely define the language's behavior.
|
precisely define the language's behavior in specific situations.
|
||||||
|
|
||||||
We will specify the notion of a *program counter* and how it is updated to
|
We will specify the notion of a *program counter* and how it is updated to
|
||||||
step through the program, and an *execution mask* that indicates which
|
step through the program, and an *execution mask* that indicates which
|
||||||
@@ -615,13 +617,14 @@ of an ``ispc`` function.
|
|||||||
conservative execution path through the function, wherein if *any*
|
conservative execution path through the function, wherein if *any*
|
||||||
program instance wants to execute a statement, the program counter will
|
program instance wants to execute a statement, the program counter will
|
||||||
pass through that statement.
|
pass through that statement.
|
||||||
|
|
||||||
2. At each statement the program counter passes through, the execution
|
2. At each statement the program counter passes through, the execution
|
||||||
mask will be set such that its value for a particular program instance is
|
mask will be set such that its value for a particular program instance is
|
||||||
"on" if the program instance wants to execute that statements.
|
"on" if and only if the program instance wants to execute that statement.
|
||||||
|
|
||||||
Note that these definitions provide the compiler some latitude; for example,
|
Note that these definitions provide the compiler some latitude; for example,
|
||||||
the program counter is allowed to pass through a series of statements with the
|
the program counter is allowed to pass through a series of statements with the
|
||||||
execution mask "all off" as long as doing so has no observable side-effects.
|
execution mask "all off" because doing so has no observable side-effects.
|
||||||
|
|
||||||
Elsewhere, we will speak informally of the *control flow coherence* of a
|
Elsewhere, we will speak informally of the *control flow coherence* of a
|
||||||
program; this notion describes the degree to which the program instances in
|
program; this notion describes the degree to which the program instances in
|
||||||
@@ -638,7 +641,7 @@ Control Flow Example: If Statements
|
|||||||
As a concrete example of the interplay between program counter and
|
As a concrete example of the interplay between program counter and
|
||||||
execution mask, one way that an ``if`` statement like the one in the
|
execution mask, one way that an ``if`` statement like the one in the
|
||||||
previous section can be represented is shown by the following pseudo-code
|
previous section can be represented is shown by the following pseudo-code
|
||||||
``ispc`` compiler output:
|
compiler output:
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
@@ -654,7 +657,8 @@ previous section can be represented is shown by the following pseudo-code
|
|||||||
In other words, the program counter steps through the statements for both
|
In other words, the program counter steps through the statements for both
|
||||||
the "true" case and the "false" case, with the execution mask set so that
|
the "true" case and the "false" case, with the execution mask set so that
|
||||||
no side-effects from the true statements affect the program instances that
|
no side-effects from the true statements affect the program instances that
|
||||||
want to run the false statements, and vice versa.
|
want to run the false statements, and vice versa. The execution mask is
|
||||||
|
then restored to the value it had before the ``if`` statement.
|
||||||
|
|
||||||
However, the compiler is free to generate different code for an ``if``
|
However, the compiler is free to generate different code for an ``if``
|
||||||
test, such as:
|
test, such as:
|
||||||
@@ -681,8 +685,8 @@ code for the "true" and "false" statements is undefined.
|
|||||||
|
|
||||||
In most cases, there is no programmer-visible difference between these two
|
In most cases, there is no programmer-visible difference between these two
|
||||||
ways of compiling ``if``, though see the `Uniform Variables and Varying
|
ways of compiling ``if``, though see the `Uniform Variables and Varying
|
||||||
Control Flow`_ section for a case where it causes undefined behavior in a
|
Control Flow`_ section for a case where it causes undefined behavior in one
|
||||||
specific situation.
|
particular situation.
|
||||||
|
|
||||||
|
|
||||||
Control Flow Example: Loops
|
Control Flow Example: Loops
|
||||||
@@ -701,12 +705,13 @@ Therefore, if we have a loop like the following:
|
|||||||
...
|
...
|
||||||
}
|
}
|
||||||
|
|
||||||
where ``limit`` has the value 1 for all of the program instances but
|
where ``limit`` has the value 1 for all of the program instances but one,
|
||||||
one, and has value 1000 for the other one, the program counter will step
|
and has value 1000 for the other one, the program counter will step through
|
||||||
through the loop body 1000 times. The first time, the execution mask will be all on
|
the loop body 1000 times. The first time, the execution mask will be all
|
||||||
(assuming it is all on going into the ``for`` loop), and the remaining 999
|
on (assuming it is all on going into the ``for`` loop), and the remaining
|
||||||
times, the mask will be off except for the program instance with 1000 in
|
999 times, the mask will be off except for the program instance with a
|
||||||
``limit``. (This would be a loop with poor control flow coherence!)
|
``limit`` value of 1000. (This would be a loop with poor control flow
|
||||||
|
coherence!)
|
||||||
|
|
||||||
A ``continue`` statement in a loop may be handled either by disabling the
|
A ``continue`` statement in a loop may be handled either by disabling the
|
||||||
execution mask for the program instances that execute the ``continue`` and
|
execution mask for the program instances that execute the ``continue`` and
|
||||||
@@ -716,11 +721,6 @@ disabled after the ``continue`` has executed. ``break`` statements are
|
|||||||
handled in a similar fashion.
|
handled in a similar fashion.
|
||||||
|
|
||||||
|
|
||||||
Control Flow Example: Function Pointers
|
|
||||||
---------------------------------------
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Gang Convergence Guarantees
|
Gang Convergence Guarantees
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
||||||
@@ -728,13 +728,16 @@ The ``ispc`` execution model provides an important guarantee about the
|
|||||||
behavior of the program counter and execution mask: the execution of
|
behavior of the program counter and execution mask: the execution of
|
||||||
program instances is *maximally converged*. Maximal convergence means that
|
program instances is *maximally converged*. Maximal convergence means that
|
||||||
if two program instances follow the same control path, they are guaranteed
|
if two program instances follow the same control path, they are guaranteed
|
||||||
to execute each program statement concurrently. [#]_
|
to execute each program statement concurrently. If two program instances
|
||||||
|
follow diverging control paths, it is guaranteed that they will reconverge
|
||||||
|
as soon as possible (if they do later reconverge). [#]_
|
||||||
|
|
||||||
.. [#] This is another significant difference between the ``ispc``
|
.. [#] This is another significant difference between the ``ispc``
|
||||||
execution model and the one implemented by OpenCL* and CUDA*.
|
execution model and the one implemented by OpenCL* and CUDA*, which
|
||||||
|
doesn't provide this guarantee.
|
||||||
|
|
||||||
Furthermore, maximal convergence means that in the presence of divergent
|
Maximal convergence means that in the presence of divergent control flow
|
||||||
control flow such as the following:
|
such as the following:
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
@@ -751,9 +754,9 @@ It is guaranteed that all program instances that were running before the
|
|||||||
for the gang of program instances, rather than the concept of a unique
|
for the gang of program instances, rather than the concept of a unique
|
||||||
program counter for each program instance.)
|
program counter for each program instance.)
|
||||||
|
|
||||||
Thus, it is illegal to execute a function with an 8-wide gang by running it
|
Another implication of this property is that it is illegal to execute a
|
||||||
two times, with a 4-wide gang representing half of the original 8-wide gang
|
function with an 8-wide gang by running it two times, with a 4-wide gang
|
||||||
each time.
|
representing half of the original 8-wide gang each time.
|
||||||
|
|
||||||
The way that "varying" function pointers are handled in ``ispc`` is also
|
The way that "varying" function pointers are handled in ``ispc`` is also
|
||||||
affected by this guarantee: if a function pointer is ``varying``, then it
|
affected by this guarantee: if a function pointer is ``varying``, then it
|
||||||
@@ -772,16 +775,16 @@ Uniform Data
|
|||||||
|
|
||||||
A variable that is declared with the ``uniform`` qualifier represents a
|
A variable that is declared with the ``uniform`` qualifier represents a
|
||||||
single value that is shared across the entire gang. (In contrast, the
|
single value that is shared across the entire gang. (In contrast, the
|
||||||
default qualifier for variables in ``ispc``, ``varying``, represents a
|
default variability qualifier for variables in ``ispc``, ``varying``,
|
||||||
variable that has a distinct storage location for each program instance in
|
represents a variable that has a distinct storage location for each program
|
||||||
the gang.)
|
instance in the gang.)
|
||||||
|
|
||||||
It is an error to try to assign a ``varying`` value to a ``uniform``
|
It is an error to try to assign a ``varying`` value to a ``uniform``
|
||||||
variable, though ``uniform`` values can be assigned to ``uniform``
|
variable, though ``uniform`` values can be assigned to ``uniform``
|
||||||
variables. Assignments to ``uniform`` variables are not affected by the
|
variables. Assignments to ``uniform`` variables are not affected by the
|
||||||
execution mask (there's no unambiguous way that they could be); rather,
|
execution mask (there's no unambiguous way that they could be); rather,
|
||||||
they always apply if the block of code that has the the uniform assignment
|
they always apply if the program counter executes a statement that is a
|
||||||
is executed.
|
uniform assignment.
|
||||||
|
|
||||||
|
|
||||||
Uniform Control Flow
|
Uniform Control Flow
|
||||||
@@ -811,14 +814,13 @@ over pixels adjacent to the given (x,y) coordiantes:
|
|||||||
return sum / 9.;
|
return sum / 9.;
|
||||||
}
|
}
|
||||||
|
|
||||||
Under the ``ispc`` SPMD model, we have a gang of program instances this
|
In general each program instance in the gang has different values for ``x``
|
||||||
function in parallel, where in general each program instance has different
|
and ``y`` in this function. For the box filtering algorithm here, all of
|
||||||
values for ``x`` and ``y``.) For the box filtering algorithm here, all of
|
|
||||||
the program instances will actually want to execute the same number of
|
the program instances will actually want to execute the same number of
|
||||||
iterations of the ``for`` loops, with all of them having the same values
|
iterations of the ``for`` loops, with all of them having the same values
|
||||||
for ``dx`` and ``dy`` each time through. If these loops are instead
|
for ``dx`` and ``dy`` each time through. If these loops are instead
|
||||||
implemented with ``dx`` and ``dy`` declared as ``uniform`` variables, then
|
implemented with ``dx`` and ``dy`` declared as ``uniform`` variables, then
|
||||||
the ``ispc`` compiler can generate more efficient code for the loops. [#]_
|
the ``ispc`` compiler can generate more efficient code for the loops. [#]_
|
||||||
|
|
||||||
.. [#] In this case, a sufficiently smart compiler could determine that
|
.. [#] In this case, a sufficiently smart compiler could determine that
|
||||||
``dx`` and ``dy`` have the same value for all program instances and thus
|
``dx`` and ``dy`` have the same value for all program instances and thus
|
||||||
@@ -833,7 +835,7 @@ the ``ispc`` compiler can generate more efficient code for the loops. [#]_
|
|||||||
|
|
||||||
In particular, ``ispc`` can avoid the overhead of checking to see if any of
|
In particular, ``ispc`` can avoid the overhead of checking to see if any of
|
||||||
the running program instances wants to do another loop iteration. Instead,
|
the running program instances wants to do another loop iteration. Instead,
|
||||||
``ispc`` can generate code where all instances always do the same
|
the compiler can generate code where all instances always do the same
|
||||||
iterations.
|
iterations.
|
||||||
|
|
||||||
The analogous benefit comes when using ``if`` statements--if the test in an
|
The analogous benefit comes when using ``if`` statements--if the test in an
|
||||||
@@ -854,9 +856,9 @@ instances that are supposed to be executing the corresponding clause.
|
|||||||
Under this model, we must define the effect of modifying ``uniform``
|
Under this model, we must define the effect of modifying ``uniform``
|
||||||
variables in the context of varying control flow.
|
variables in the context of varying control flow.
|
||||||
|
|
||||||
In general, modifying ``uniform`` variables under varying control flow
|
In most cases, modifying ``uniform`` variables under varying control flow
|
||||||
leads to the ``uniform`` variable having an undefined value, except
|
leads to the ``uniform`` variable having an undefined value, except within
|
||||||
within a block where the ``uniform`` value had a value assigned to it.
|
a block where the ``uniform`` value had a value assigned to it.
|
||||||
|
|
||||||
Consider the following example, which illustrates three cases.
|
Consider the following example, which illustrates three cases.
|
||||||
|
|
||||||
@@ -1331,9 +1333,9 @@ instance of that variable shared by all program instances in a gang. (In
|
|||||||
other words, it necessarily has the same value across all of the program
|
other words, it necessarily has the same value across all of the program
|
||||||
instances.) In addition to requiring less storage than varying values,
|
instances.) In addition to requiring less storage than varying values,
|
||||||
``uniform`` variables lead to a number of performance advantages when they
|
``uniform`` variables lead to a number of performance advantages when they
|
||||||
are applicable (see `Uniform Variables and Varying Control Flow`_, for
|
are applicable (see `Uniform Control Flow`_, for example.) Varying
|
||||||
example.) Varying variables may be qualified with ``varying``, though
|
variables may be qualified with ``varying``, though doing so has no effect,
|
||||||
doing so has no effect, as ``varying`` is the default.
|
as ``varying`` is the default.
|
||||||
|
|
||||||
``uniform`` variables can be modified as the program executes, but only in
|
``uniform`` variables can be modified as the program executes, but only in
|
||||||
ways that preserve the property that they have a single value for the
|
ways that preserve the property that they have a single value for the
|
||||||
@@ -1938,7 +1940,10 @@ more details on why regular ``if`` statements may sometimes do this.)
|
|||||||
Along similar lines, ``cfor``, ``cdo``, and ``cwhile`` check to see if all
|
Along similar lines, ``cfor``, ``cdo``, and ``cwhile`` check to see if all
|
||||||
program instances are running at the start of each loop iteration; if so,
|
program instances are running at the start of each loop iteration; if so,
|
||||||
they can run a specialized code path that has been optimized for the "all
|
they can run a specialized code path that has been optimized for the "all
|
||||||
on" execution mask case.
|
on" execution mask case. It is already the case for the regular looping
|
||||||
|
constructs in ``ispc`` that a loop will never be executed with an "all off"
|
||||||
|
execution mask.
|
||||||
|
|
||||||
|
|
||||||
Parallel Iteration Statements: "foreach" and "foreach_tiled"
|
Parallel Iteration Statements: "foreach" and "foreach_tiled"
|
||||||
------------------------------------------------------------
|
------------------------------------------------------------
|
||||||
@@ -1966,8 +1971,8 @@ As a specific example, consdier the following ``foreach`` statement:
|
|||||||
}
|
}
|
||||||
|
|
||||||
It specifies a loop over a 2D domain, where the ``j`` variable goes from 0
|
It specifies a loop over a 2D domain, where the ``j`` variable goes from 0
|
||||||
to ``height`` and ``i`` goes from 0 to ``width``. Within the loop, the
|
to ``height-1`` and ``i`` goes from 0 to ``width-1``. Within the loop, the
|
||||||
variables ``i`` and ``j`` are available.
|
variables ``i`` and ``j`` are available and initialized accordingly.
|
||||||
|
|
||||||
``foreach`` loops actually cause the given iteration domain to be
|
``foreach`` loops actually cause the given iteration domain to be
|
||||||
automatically mapped to the program instances in the gang, so that all of
|
automatically mapped to the program instances in the gang, so that all of
|
||||||
@@ -1981,27 +1986,28 @@ the gang size is 8:
|
|||||||
// perform computation on element i
|
// perform computation on element i
|
||||||
}
|
}
|
||||||
|
|
||||||
At the CPU hardware level, the body of this loop will only execute
|
The program counter will step through the statements of this loop just
|
||||||
``16/8==2`` times; the first time through, the ``varying int32`` variable
|
``16/8==2`` times; the first time through, the ``varying int32`` variable
|
||||||
``i`` will have the values (0,1,2,3,4,5,6,7) over the program instances,
|
``i`` will have the values (0,1,2,3,4,5,6,7) over the program instances,
|
||||||
and the second time through, ``i`` will have the values
|
and the second time through, ``i`` will have the values
|
||||||
(8,9,10,11,12,13,14,15), thus mapping the available program instances to
|
(8,9,10,11,12,13,14,15), thus mapping the available program instances to
|
||||||
all of the data by the end of the loop's execution.
|
all of the data by the end of the loop's execution. The execution mask
|
||||||
|
starts out "all on" at the start of each ``foreach`` loop iteration, but
|
||||||
|
may be changed by control flow constructs within the loop.
|
||||||
|
|
||||||
The ``foreach`` statement subdivides the iteration domain by mapping a
|
The basic ``foreach`` statement subdivides the iteration domain by mapping
|
||||||
gang-size worth of values in the innermost dimension to the gang, only
|
a gang-size worth of values in the innermost dimension to the gang, only
|
||||||
spanning a single value in each of the outer dimensions. This
|
spanning a single value in each of the outer dimensions. This
|
||||||
decomposition generally leads to coherent memory reads and writes, but may
|
decomposition generally leads to coherent memory reads and writes, but may
|
||||||
lead to worse control flow coherence than other decompositions.
|
lead to worse control flow coherence than other decompositions.
|
||||||
``foreach_tiled`` decomposes the iteration domain in a way that tries to
|
|
||||||
map locations in the domain to program instances in a way that is compact
|
|
||||||
across all of the dimensions.
|
|
||||||
|
|
||||||
For example, on a target with an 8-wide gang size, the following
|
Therefore, ``foreach_tiled`` decomposes the iteration domain in a way that
|
||||||
``foreach_tiled`` statement processes the iteration domain in chunks of 2
|
tries to map locations in the domain to program instances in a way that is
|
||||||
elements in ``j`` and 4 elements in ``i`` each time. (The trade-offs
|
compact across all of the dimensions. For example, on a target with an
|
||||||
between these two constructs are discussed in more detail in the `ispc
|
8-wide gang size, the following ``foreach_tiled`` statement processes the
|
||||||
Performance Guide`_.)
|
iteration domain in chunks of 2 elements in ``j`` and 4 elements in ``i``
|
||||||
|
each time. (The trade-offs between these two constructs are discussed in
|
||||||
|
more detail in the `ispc Performance Guide`_.)
|
||||||
|
|
||||||
.. _ispc Performance Guide: perf.html#improving-control-flow-coherence-with-foreach-tiled
|
.. _ispc Performance Guide: perf.html#improving-control-flow-coherence-with-foreach-tiled
|
||||||
|
|
||||||
@@ -2024,7 +2030,10 @@ each program instance. (In other words, it's a varying integer value that
|
|||||||
has value zero for the first program instance, and so forth.) The
|
has value zero for the first program instance, and so forth.) The
|
||||||
``programCount`` builtin gives the total number of instances in the gang.
|
``programCount`` builtin gives the total number of instances in the gang.
|
||||||
Together, these can be used to uniquely map executing program instances to
|
Together, these can be used to uniquely map executing program instances to
|
||||||
input data.
|
input data. [#]_
|
||||||
|
|
||||||
|
.. [#] ``programIndex`` is analogous to ``get_global_id()`` in OpenCL* and
|
||||||
|
``threadIdx`` in CUDA*.
|
||||||
|
|
||||||
As a specific example, consider an ``ispc`` function that needs to perform
|
As a specific example, consider an ``ispc`` function that needs to perform
|
||||||
some computation on an array of data.
|
some computation on an array of data.
|
||||||
|
|||||||