Add foreach_active iteration statement.

Issue #298.
This commit is contained in:
Matt Pharr
2012-06-22 10:35:43 -07:00
parent ed13dd066b
commit b4a078e2f6
15 changed files with 644 additions and 279 deletions

View File

@@ -105,12 +105,16 @@ Contents:
* `Conditional Statements: "if"`_
* `Conditional Statements: "switch"`_
* `Basic Iteration Statements: "for", "while", and "do"`_
* `Iteration over unique elements: "foreach_unique"`_
* `Iteration Statements`_
+ `Basic Iteration Statements: "for", "while", and "do"`_
+ `Iteration over active program instances: "foreach_active"`_
+ `Iteration over unique elements: "foreach_unique"`_
+ `Parallel Iteration Statements: "foreach" and "foreach_tiled"`_
+ `Parallel Iteration with "programIndex" and "programCount"`_
* `Unstructured Control Flow: "goto"`_
* `"Coherent" Control Flow Statements: "cif" and Friends`_
* `Parallel Iteration Statements: "foreach" and "foreach_tiled"`_
* `Parallel Iteration with "programIndex" and "programCount"`_
* `Functions and Function Calls`_
+ `Function Overloading`_
@@ -1984,7 +1988,7 @@ format in memory; the benefits from SOA layout are discussed in more detail
in the `Use "Structure of Arrays" Layout When Possible`_ section in the
ispc Performance Guide.
.. _Use "Structure of Arrays" Layout When Possible: perf.html#use-structure-of-arrays-layout-when-possible
.. _Use "Structure of Arrays" Layout When Possible: perfguide.html#use-structure-of-arrays-layout-when-possible
``ispc`` provides two key language-level capabilities for laying out and
accessing data in SOA format:
@@ -2348,11 +2352,19 @@ code below.
x *= x;
}
Iteration Statements
--------------------
In addition to the standard iteration statements ``for``, ``while``, and
``do``, inherited from C/C++, ``ispc`` provides a number of additional
specialized ways to iterate over data.
Basic Iteration Statements: "for", "while", and "do"
----------------------------------------------------
``ispc`` supports ``for``, ``while``, and ``do`` loops, with the same
specification as in C. Like C++, variables can be declared in the ``for``
specification as in C. As in C++, variables can be declared in the ``for``
statement itself:
::
@@ -2374,6 +2386,58 @@ executing code in the loop body that didn't execute the ``continue`` will
be unaffected by it.
Iteration over active program instances: "foreach_active"
---------------------------------------------------------
The ``foreach_active`` construct specifies a loop that serializes over the
active program instances: the loop body executes once for each active
program instance, with only that one program instance executing each time.
As an example of the use of this construct, consider an application where
each program instance independently computes an offset into a shared array
that is being updated:
::
uniform float array[...] = { ... };
int index = ...;
++array[index];
If more than one active program instance computes the same value for
``index``, the above code has undefined behavior (see the section `Data
Races Within a Gang`_ for details). The increment of ``array[index]``
could instead be written inside a ``foreach_active`` statement:
::
foreach_active (i) {
++array[index];
}
The variable name provided in parentheses after the ``foreach_active``
keyword (here, ``i``) causes a ``const uniform int64`` local variable of
that name to be declared; this variable takes the ``programIndex`` value of
the program instance that is executing in each loop iteration.
In the code above, because only one program instance is executing at a time
when the loop body executes, the update to ``array`` is well-defined.
(Note that for this particular example, the "local atomic" operations in
the standard library could be used instead to safely update ``array``.
However, local atomic functions aren't always available or appropriate for
more complex cases.)
``continue`` statements may be used inside ``foreach_active`` loops, though
``break`` and ``return`` are prohibited. The order in which the active
program instances are processed in the loop is not defined.
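For instance, a ``continue`` can be used to skip the update for some of the
program instances. Here is a minimal sketch, reusing ``array`` and ``index``
from the example above and assuming (hypothetically) that negative ``index``
values should be ignored:

::

    foreach_active (i) {
        // Copy this program instance's index into a uniform variable.
        uniform int unifIndex = extract(index, i);
        if (unifIndex < 0)
            continue;   // skip this program instance
        ++array[unifIndex];
    }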
See the `Using "foreach_active" Effectively`_ Section in the ispc
Performance Guide for more details about ``foreach_active``.
.. _Using "foreach_active" Effectively: perfguide.html#using-foreach-active-effectively
Iteration over unique elements: "foreach_unique"
------------------------------------------------
@@ -2408,7 +2472,144 @@ evaluated once, and it must be of an atomic type (``float``, ``int``,
etc.), an ``enum`` type, or a pointer type. The iteration variable ``val``
is a variable of ``const uniform`` type of the iteration type; it can't be
modified within the loop. Finally, ``break`` and ``return`` statements are
illegal within the loop body, but ``continue`` statemetns are allowed.
illegal within the loop body, but ``continue`` statements are allowed.
Parallel Iteration Statements: "foreach" and "foreach_tiled"
------------------------------------------------------------
The ``foreach`` and ``foreach_tiled`` constructs specify loops over a
possibly multi-dimensional domain of integer ranges. Their role goes
beyond "syntactic sugar"; they provides one of the two key ways of
expressing parallel computation in ``ispc``.
In general, a ``foreach`` or ``foreach_tiled`` statement takes one or more
dimension specifiers separated by commas. Each dimension is specified by
``identifier = start ... end``, where ``start`` is a signed integer value
less than or equal to ``end``; the loop iterates over all integer values
from ``start`` up to and including ``end-1``. An arbitrary number
of iteration dimensions may be specified, with each one spanning a
different range of values. Within the ``foreach`` loop, the given
identifiers are available as ``const varying int32`` variables. The
execution mask starts out "all on" at the start of each ``foreach`` loop
iteration, but may be changed by control flow constructs within the loop.
It is illegal to have a ``break`` statement or a ``return`` statement
within a ``foreach`` loop; a compile-time error will be issued in this
case. (It is legal to have a ``break`` in a regular ``for`` loop that's
nested inside a ``foreach`` loop.) ``continue`` statements are legal in
``foreach`` loops; they have the same effect as in regular ``for`` loops:
a program instance that executes a ``continue`` statement effectively
skips over the rest of the loop body for the current iteration.
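As an illustrative sketch (the ``data`` array and the work done on it here
are hypothetical), a ``continue`` in the ``foreach`` body and a ``break`` in
a nested ``for`` loop are both accepted, while a ``break`` directly in the
``foreach`` body would be a compile-time error:

::

    uniform float data[128];
    foreach (i = 0 ... 128) {
        if (data[i] < 0)
            continue;            // legal: this instance skips the rest of the body
        for (uniform int j = 0; j < 8; ++j) {
            if (data[i] > 100)
                break;           // legal: exits only the inner "for" loop
            data[i] += data[i];  // arbitrary stand-in work
        }
        // A "break" here, directly in the foreach body, would not compile.
    }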
It is also currently illegal to have nested ``foreach`` statements; this
limitation will be removed in a future release of ``ispc``.
As a specific example, consider the following ``foreach`` statement:
::
foreach (j = 0 ... height, i = 0 ... width) {
// loop body--process data element (i,j)
}
It specifies a loop over a 2D domain, where the ``j`` variable goes from 0
to ``height-1`` and ``i`` goes from 0 to ``width-1``. Within the loop, the
variables ``i`` and ``j`` are available and initialized accordingly.
``foreach`` loops actually cause the given iteration domain to be
automatically mapped to the program instances in the gang, so that all of
the data can be processed in gang-sized chunks. As a specific example,
consider a simple ``foreach`` loop like the following, on a target where
the gang size is 8:
::
foreach (i = 0 ... 16) {
// perform computation on element i
}
One possible valid execution path of this loop would be for the program
counter to step through the statements of this loop just ``16/8==2``
times; the first time through, with the ``varying int32`` variable ``i``
having the values (0,1,2,3,4,5,6,7) over the program instances, and the
second time through, having the values (8,9,10,11,12,13,14,15), thus
mapping the available program instances to all of the data by the end of
the loop's execution.
In general, however, you shouldn't make any assumptions about the order in
which elements of the iteration domain will be processed by a ``foreach``
loop. For example, the following code exhibits undefined behavior:
::
uniform float a[10][100];
foreach (i = 0 ... 10, j = 0 ... 100) {
if (i == 0)
a[i][j] = j;
else
// Error: can't assume that a[i-1][j] has been set yet
a[i][j] = a[i-1][j];
}
The ``foreach`` statement generally subdivides the iteration domain by
selecting sets of contiguous elements in the inner-most dimension of the
iteration domain. This decomposition approach generally leads to coherent
memory reads and writes, but may lead to worse control flow coherence than
other decompositions.
The ``foreach_tiled`` statement therefore decomposes the iteration domain
so that the locations mapped to the program instances in a gang are compact
across all of the dimensions. For example, on a target with an
8-wide gang size, the following ``foreach_tiled`` statement might process
the iteration domain in chunks of 2 elements in ``j`` and 4 elements in
``i`` each time. (The trade-offs between these two constructs are
discussed in more detail in the `ispc Performance Guide`_.)
.. _ispc Performance Guide: perfguide.html#improving-control-flow-coherence-with-foreach-tiled
::
foreach_tiled (j = 0 ... height, i = 0 ... width) {
// loop body--process data element (i,j)
}
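To make the difference concrete, here is one possible (but not guaranteed;
the assignment of domain locations to program instances is not specified)
set of values the first gang-sized chunk might see on an 8-wide target for
the two loops above:

::

    // Illustration only; the decomposition is chosen by the implementation.
    // foreach:        j = (0,0,0,0,0,0,0,0)   i = (0,1,2,3,4,5,6,7)
    // foreach_tiled:  j = (0,0,0,0,1,1,1,1)   i = (0,1,2,3,0,1,2,3)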
Parallel Iteration with "programIndex" and "programCount"
---------------------------------------------------------
In addition to ``foreach`` and ``foreach_tiled``, ``ispc`` provides a
lower-level mechanism for mapping SPMD program instances to data to operate
on via the built-in ``programIndex`` and ``programCount`` variables.
``programIndex`` gives the index of the SIMD-lane being used for running
each program instance. (In other words, it's a varying integer value that
has value zero for the first program instance, and so forth.) The
``programCount`` builtin gives the total number of instances in the gang.
Together, these can be used to uniquely map executing program instances to
input data. [#]_
.. [#] ``programIndex`` is analogous to ``get_global_id()`` in OpenCL* and
``threadIdx`` in CUDA*.
As a specific example, consider an ``ispc`` function that needs to perform
some computation on an array of data.
::
for (uniform int i = 0; i < count; i += programCount) {
float d = data[i + programIndex];
float r = ....
result[i + programIndex] = r;
}
Here, we've written a loop that explicitly loops over the data in chunks of
``programCount`` elements. In each loop iteration, the running program
instances effectively collude amongst themselves using ``programIndex`` to
determine which elements to work on in a way that ensures that all of the
data elements will be processed. In this particular case, a ``foreach``
loop would be preferable, as ``foreach`` naturally handles the case where
``programCount`` doesn't evenly divide the number of elements to be
processed, while the loop above implicitly assumes that it does.
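For illustration, the following sketch (using the same hypothetical ``data``,
``result``, and ``count`` variables as above, with a squaring operation
standing in for the real computation) shows the extra masking the explicit
loop needs when ``count`` isn't a multiple of ``programCount``:

::

    for (uniform int i = 0; i < count; i += programCount) {
        // Mask off the extra program instances in the final chunk.
        if (i + programIndex < count) {
            float d = data[i + programIndex];
            float r = d * d;    // stand-in for the actual computation
            result[i + programIndex] = r;
        }
    }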
Unstructured Control Flow: "goto"
@@ -2479,139 +2680,6 @@ constructs in ``ispc`` that a loop will never be executed with an "all off"
execution mask.
Parallel Iteration Statements: "foreach" and "foreach_tiled"
------------------------------------------------------------
The ``foreach`` and ``foreach_tiled`` constructs specify loops over a
possibly multi-dimensional domain of integer ranges. Their role goes
beyond "syntactic sugar"; they provides one of the two key ways of
expressing parallel computation in ``ispc``.
In general, a ``foreach`` or ``foreach_tiled`` statement takes one or more
dimension specifiers separated by commas. Each dimension is specified by
``identifier = start ... end``, where ``start`` is a signed integer value
less than or equal to ``end``; the loop iterates over all integer values
from ``start`` up to and including ``end-1``. An arbitrary number
of iteration dimensions may be specified, with each one spanning a
different range of values. Within the ``foreach`` loop, the given
identifiers are available as ``const varying int32`` variables. The
execution mask starts out "all on" at the start of each ``foreach`` loop
iteration, but may be changed by control flow constructs within the loop.
It is illegal to have a ``break`` statement or a ``return`` statement
within a ``foreach`` loop; a compile-time error will be issued in this
case. (It is legal to have a ``break`` in a regular ``for`` loop that's
nested inside a ``foreach`` loop.) ``continue`` statements are legal in
``foreach`` loops; they have the same effect as in regular ``for`` loops:
a program instance that executes a ``continue`` statement effectively
skips over the rest of the loop body for the current iteration.
As a specific example, consider the following ``foreach`` statement:
::
foreach (j = 0 ... height, i = 0 ... width) {
// loop body--process data element (i,j)
}
It specifies a loop over a 2D domain, where the ``j`` variable goes from 0
to ``height-1`` and ``i`` goes from 0 to ``width-1``. Within the loop, the
variables ``i`` and ``j`` are available and initialized accordingly.
``foreach`` loops actually cause the given iteration domain to be
automatically mapped to the program instances in the gang, so that all of
the data can be processed in gang-sized chunks. As a specific example,
consider a simple ``foreach`` loop like the following, on a target where
the gang size is 8:
::
foreach (i = 0 ... 16) {
// perform computation on element i
}
One possible valid execution path of this loop would be for the program
counter to step through the statements of this loop just ``16/8==2``
times; the first time through, with the ``varying int32`` variable ``i``
having the values (0,1,2,3,4,5,6,7) over the program instances, and the
second time through, having the values (8,9,10,11,12,13,14,15), thus
mapping the available program instances to all of the data by the end of
the loop's execution.
In general, however, you shouldn't make any assumptions about the order in
which elements of the iteration domain will be processed by a ``foreach``
loop. For example, the following code exhibits undefined behavior:
::
uniform float a[10][100];
foreach (i = 0 ... 10, j = 0 ... 100) {
if (i == 0)
a[i][j] = j;
else
// Error: can't assume that a[i-1][j] has been set yet
a[i][j] = a[i-1][j];
}
The ``foreach`` statement generally subdivides the iteration domain by
selecting sets of contiguous elements in the inner-most dimension of the
iteration domain. This decomposition approach generally leads to coherent
memory reads and writes, but may lead to worse control flow coherence than
other decompositions.
The ``foreach_tiled`` statement therefore decomposes the iteration domain
so that the locations mapped to the program instances in a gang are compact
across all of the dimensions. For example, on a target with an
8-wide gang size, the following ``foreach_tiled`` statement might process
the iteration domain in chunks of 2 elements in ``j`` and 4 elements in
``i`` each time. (The trade-offs between these two constructs are
discussed in more detail in the `ispc Performance Guide`_.)
.. _ispc Performance Guide: perf.html#improving-control-flow-coherence-with-foreach-tiled
::
foreach_tiled (j = 0 ... height, i = 0 ... width) {
// loop body--process data element (i,j)
}
Parallel Iteration with "programIndex" and "programCount"
---------------------------------------------------------
In addition to ``foreach`` and ``foreach_tiled``, ``ispc`` provides a
lower-level mechanism for mapping SPMD program instances to data to operate
on via the built-in ``programIndex`` and ``programCount`` variables.
``programIndex`` gives the index of the SIMD-lane being used for running
each program instance. (In other words, it's a varying integer value that
has value zero for the first program instance, and so forth.) The
``programCount`` builtin gives the total number of instances in the gang.
Together, these can be used to uniquely map executing program instances to
input data. [#]_
.. [#] ``programIndex`` is analogous to ``get_global_id()`` in OpenCL* and
``threadIdx`` in CUDA*.
As a specific example, consider an ``ispc`` function that needs to perform
some computation on an array of data.
::
for (uniform int i = 0; i < count; i += programCount) {
float d = data[i + programIndex];
float r = ....
result[i + programIndex] = r;
}
Here, we've written a loop that explicitly loops over the data in chunks of
``programCount`` elements. In each loop iteration, the running program
instances effectively collude amongst themselves using ``programIndex`` to
determine which elements to work on in a way that ensures that all of the
data elements will be processed. In this particular case, a ``foreach``
loop would be preferable, as ``foreach`` naturally handles the case where
``programCount`` doesn't evenly divide the number of elements to be
processed, while the loop above implicitly assumes that it does.
Functions and Function Calls
----------------------------
@@ -3452,7 +3520,7 @@ There are also variants of these functions that return the value as a
discussion of an application of this variant to improve memory access
performance in the `Performance Guide`_.
.. _Performance Guide: perf.html#understanding-gather-and-scatter
.. _Performance Guide: perfguide.html#understanding-gather-and-scatter
::
@@ -4130,8 +4198,10 @@ from ``ispc`` must be declared as follows:
It is illegal to overload functions declared with ``extern "C"`` linkage;
``ispc`` issues an error in this case.
Function calls back to C/C++ are not made if none of the program instances
want to make the call. For example, given code like:
**Only a single function call is made back to C++ for the entire gang of
running program instances**. Furthermore, function calls back to C/C++ are not
made if none of the program instances want to make the call. For example,
given code like:
::
@@ -4174,6 +4244,24 @@ Application code can thus be written as:
}
}
In some cases, it can be desirable to generate a single call for each
executing program instance, rather than one call for a gang. For example,
the code below shows how one might call an existing math library routine
that takes a scalar parameter.
::
extern "C" uniform double erf(uniform double);
double v = ...;
double result;
foreach_active (instance) {
uniform double r = erf(extract(v, instance));
result = insert(result, instance, r);
}
This code calls ``erf()`` once for each active program instance, passing it
the program instance's value of ``v`` and storing the result in the
instance's ``result`` value.
Data Layout
-----------
@@ -4309,7 +4397,7 @@ can also have a significant effect on performance; in general, creating
groups of work that will tend to do similar computation across the SPMD
program instances improves performance.
.. _ispc Performance Tuning Guide: http://ispc.github.com/perf.html
.. _ispc Performance Tuning Guide: http://ispc.github.com/perfguide.html
Disclaimer and Legal Information

View File

@@ -21,6 +21,7 @@ the most out of ``ispc`` in practice.
+ `Avoid 64-bit Addressing Calculations When Possible`_
+ `Avoid Computation With 8 and 16-bit Integer Types`_
+ `Implementing Reductions Efficiently`_
+ `Using "foreach_active" Effectively`_
+ `Using Low-level Vector Tricks`_
+ `The "Fast math" Option`_
+ `"inline" Aggressively`_
@@ -510,6 +511,43 @@ values--very efficient code in the end.
return reduce_add(sum);
}
Using "foreach_active" Effectively
----------------------------------
For high-performance code, it is worth paying attention to what the
compiler can generate for the body of a ``foreach_active`` loop; a small
change can allow it to emit scalar (``uniform``) memory accesses instead of
general gathers and scatters.
For example, consider this segment of code, from the introduction of
``foreach_active`` in the ispc User's Guide:
::
uniform float array[...] = { ... };
int index = ...;
foreach_active (i) {
++array[index];
}
Here, ``index`` may have the same value for multiple program instances, so
the updates to ``array[index]`` are serialized by the ``foreach_active``
statement to avoid undefined results when ``index`` values do collide.
The code generated by the compiler can be improved in this case by making
it clear that only a single element of the array is accessed by
``array[index]`` in each iteration, and thus that a general gather or
scatter isn't required. Specifically, using the ``extract()`` function from
the standard library to copy the current program instance's value of
``index`` into a ``uniform`` variable and then using that to index into
``array``, as below, generates more efficient code.
::
foreach_active (instanceNum) {
uniform int unifIndex = extract(index, instanceNum);
++array[unifIndex];
}
Using Low-level Vector Tricks
-----------------------------