From e07ef6d46a111dae34dc8e5b1aecd82975917b8c Mon Sep 17 00:00:00 2001
From: Matt Pharr
Date: Fri, 2 Dec 2011 17:04:39 -0800
Subject: [PATCH] 1.1 Users guide final (for now)

---
 docs/ispc.txt | 187 ++++++++++++++++++++++++++------------------
 1 file changed, 98 insertions(+), 89 deletions(-)

diff --git a/docs/ispc.txt b/docs/ispc.txt
index 6452d6ac..13f7b833 100644
--- a/docs/ispc.txt
+++ b/docs/ispc.txt
@@ -2,18 +2,19 @@ Intel® SPMD Program Compiler User's Guide
 =========================================
 
-``ispc`` is a compiler for writing SPMD (single program multiple data)
-programs to run on the CPU. The SPMD programming approach is widely known
-to graphics and GPGPU programmers; it is used for GPU shaders and CUDA\* and
-OpenCL\* kernels, for example. The main idea behind SPMD is that one writes
-programs as if they were operating on a single data element (a pixel for a
-pixel shader, for example), but then the underlying hardware and runtime
-system executes multiple invocations of the program in parallel with
-different inputs (the values for different pixels, for example).
+The Intel® SPMD Program Compiler (``ispc``) is a compiler for writing SPMD
+(single program multiple data) programs to run on the CPU. The SPMD
+programming approach is widely known to graphics and GPGPU programmers; it
+is used for GPU shaders and CUDA\* and OpenCL\* kernels, for example. The
+main idea behind SPMD is that one writes programs as if they were operating
+on a single data element (a pixel for a pixel shader, for example), but
+then the underlying hardware and runtime system executes multiple
+invocations of the program in parallel with different inputs (the values
+for different pixels, for example).
 
 The main goals behind ``ispc`` are to:
 
-* Build a small variant of the C programming language that delivers good
+* Build a variant of the C programming language that delivers good
   performance to performance-oriented programmers who want to run SPMD
   programs on CPUs.
 * Provide a thin abstraction layer between the programmer and the
@@ -162,10 +163,11 @@ of recent changes to the compiler.
 Updating ISPC Programs For Changes In ISPC 1.1
 ----------------------------------------------
 
-The 1.1 release of ``ispc`` features first-class support for pointers in
-the language. Adding this functionality led to a number of syntactic
-changes to the language. These should generally require only
-straightforward modification of existing programs.
+The major changes introduced in the 1.1 release of ``ispc`` are first-class
+support for pointers in the language and new parallel loop constructs.
+Adding this functionality required a number of syntactic changes to the
+language. These changes should generally lead to straightforward minor
+modifications of existing ``ispc`` programs.
 
 These are the relevant changes to the language:
 
@@ -179,11 +181,6 @@ These are the relevant changes to the language:
   qualifier should just have ``reference`` removed: ``void foo(reference
   float bar[])`` can just be ``void foo(float bar[])``.
 
-* It is no longer legal to pass a varying lvalue to a function that takes a
-  reference parameter; references can only be to uniform lvalue types. In
-  this case, the function should be rewritten to take a varying pointer
-  parameter.
-
 * It is now a compile-time error to assign an entire array to another
   array.
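+
+  For example (a hypothetical sketch; the arrays here are made up for
+  illustration), code that previously assigned one array to another can
+  copy the elements explicitly instead:
+
+  ::
+
+      uniform float a[4], b[4];
+      // a = b;  // now a compile-time error in ispc 1.1
+      for (uniform int i = 0; i < 4; ++i)
+          a[i] = b[i];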
@@ -196,6 +193,11 @@ These are the relevant changes to the language:
   as its first parameter rather than taking a ``uniform unsigned int[]``
   as its first parameter and a ``uniform int`` offset as its second
   parameter.
 
+* It is no longer legal to pass a varying lvalue to a function that takes a
+  reference parameter; references can only be to uniform lvalue types. In
+  this case, the function should be rewritten to take a varying pointer
+  parameter.
+
 * There are new iteration constructs for looping over computation domains,
   ``foreach`` and ``foreach_tiled``. In addition to being syntactically
   cleaner than regular ``for`` loops, these can provide performance
@@ -505,14 +507,14 @@ parallel computation. Understanding the details of ``ispc``'s parallel
 execution model that are introduced in this section is critical for
 writing efficient and correct programs in ``ispc``.
 
-``ispc`` supports two types of parallelism: both task parallelism to
-parallelize across multiple processor cores and SPMD parallelism to
-parallelize across the SIMD vector lanes on a single core. Most of this
-section focuses on SPMD parallelism, but see `Tasking Model`_ at the end of
-this section for discussion of task parallelism in ``ispc``.
+``ispc`` supports two types of parallelism: task parallelism to parallelize
+across multiple processor cores and SPMD parallelism to parallelize across
+the SIMD vector lanes on a single core. Most of this section focuses on
+SPMD parallelism, but see `Tasking Model`_ at the end of this section for
+discussion of task parallelism in ``ispc``.
 
 This section will use some snippets of ``ispc`` code to illustrate various
-concepts. Given ``ispc``'s relationship to C, these should generally be
+concepts. Given ``ispc``'s relationship to C, these should be
 understandable on their own, but you may want to refer to the `The ISPC
 Language`_ section for details on language syntax.
 
@@ -523,15 +525,15 @@ Basic Concepts: Program Instances and Gangs of Program Instances
 Upon entry to an ``ispc`` function called from C/C++ code, the execution
 model switches from the application's serial model to ``ispc``'s execution
 model. Conceptually, a number of ``ispc`` *program instances* start
-running in parallel. The group of running program instances is a called
-*gang* (harkening to "gang scheduling", since ``ispc`` provides certain
-guarantees about the control flow coherence of program instances running
-in a gang.) An ``ispc`` program instance is thus roughly similar to a
-CUDA* "thread" or an OpenCL* "work-item", and an ``ispc`` gang is roughly
-similar to a CUDA* "warp".
+running concurrently. The group of running program instances is called a
+*gang* (harkening to "gang scheduling", since ``ispc`` provides certain
+guarantees about the control flow coherence of program instances running
+in a gang, detailed in `Gang Convergence Guarantees`_.) An ``ispc``
+program instance is thus similar to a CUDA* "thread" or an OpenCL*
+"work-item", and an ``ispc`` gang is similar to a CUDA* "warp".
 
-An ``ispc`` program, then, expresses the computation performed by a gang of
-program instances, using an "implicitly parallel" model, where the ``ispc``
+An ``ispc`` program expresses the computation performed by a gang of
+program instances, using an "implicitly parallel" model, where the ``ispc``
 program generally describes the behavior of a single program instance,
 even though a gang of them is actually executing together.
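+
+As a small example (a hypothetical sketch; the function is made up for
+illustration), the following function is written as if it computed a
+single result:
+
+::
+
+    float scaled_sum(float a, float b) {
+        // A gang of program instances executes this body concurrently,
+        // each instance with its own values of a and b.
+        return 2. * (a + b);
+    }
+
+Each program instance in the gang has its own values for ``a`` and ``b``
+and computes its own result.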
 This implicit model is the same as the one used for shaders in programmable graphics pipelines,
@@ -592,7 +594,7 @@ the same results for each program instance in a gang as would have been
 computed if the equivalent code ran serially in C to compute each program
 instance's result individually. However, here we will more precisely
 define the execution model for control flow in order to be able to
-precisely define the language's behavior.
+precisely define the language's behavior in specific situations.
 
 We will specify the notion of a *program counter* and how it is updated to
 step through the program, and an *execution mask* that indicates which
@@ -615,13 +617,14 @@ of an ``ispc`` function.
    conservative execution path through the function, wherein if *any*
    program instance wants to execute a statement, the program counter will
    pass through that statement.
+
 2. At each statement the program counter passes through, the execution mask
    will be set such that its value for a particular program instance is
-   "on" if the program instance wants to execute that statements.
+   "on" if and only if the program instance wants to execute that statement.
 
 Note that these definitions provide the compiler some latitude; for example,
 the program counter is allowed to pass through a series of statements with the
-execution mask "all off" as long as doing so has no observable side-effects.
+execution mask "all off" because doing so has no observable side-effects.
 
 Elsewhere, we will speak informally of the *control flow coherence* of a
 program; this notion describes the degree to which the program instances in
@@ -638,7 +641,7 @@ Control Flow Example: If Statements
 As a concrete example of the interplay between program counter and
 execution mask, one way that an ``if`` statement like the one in the
 previous section can be represented is shown by the following pseudo-code
-``ispc`` compiler output:
+compiler output:
 
 ::
 
@@ -654,7 +657,8 @@ previous section can be represented is shown by the following pseudo-code
 In other words, the program counter steps through the statements for both
 the "true" case and the "false" case, with the execution mask set so that
 no side-effects from the true statements affect the program instances that
-want to run the false statements, and vice versa.
+want to run the false statements, and vice versa. The execution mask is
+then restored to the value it had before the ``if`` statement.
 
 However, the compiler is free to generate different code for an ``if``
 test, such as:
@@ -681,8 +685,8 @@ code for the "true" and "false" statements is undefined.
 
 In most cases, there is no programmer-visible difference between these two
 ways of compiling ``if``, though see the `Uniform Variables and Varying
-Control Flow`_ section for a case where it causes undefined behavior in a
-specific situation.
+Control Flow`_ section for a case where it causes undefined behavior in one
+particular situation.
 
 
 Control Flow Example: Loops
@@ -701,12 +705,13 @@ Therefore, if we have a loop like the following:
     ...
     }
 
-where ``limit`` has the value 1 for all of the program instances but
-one, and has value 1000 for the other one, the program counter will step
-through the loop body 1000 times. The first time, the execution mask will be all on
-(assuming it is all on going into the ``for`` loop), and the remaining 999
-times, the mask will be off except for the program instance with 1000 in
-``limit``. (This would be a loop with poor control flow coherence!)
+where ``limit`` has the value 1 for all of the program instances but one,
+and has value 1000 for the other one, the program counter will step through
+the loop body 1000 times. The first time, the execution mask will be all
+on (assuming it is all on going into the ``for`` loop), and the remaining
+999 times, the mask will be off except for the program instance with a
+``limit`` value of 1000. (This would be a loop with poor control flow
+coherence!)
 
 A ``continue`` statement in a loop may be handled either by disabling the
 execution mask for the program instances that execute the ``continue`` and
@@ -716,11 +721,6 @@ disabled after the ``continue`` has executed.
 
 ``break`` statements are handled in a similar fashion.
 
-Control Flow Example: Function Pointers
----------------------------------------
-
-
-
 Gang Convergence Guarantees
 ---------------------------
 
@@ -728,13 +728,16 @@ The ``ispc`` execution model provides an important guarantee about the
 behavior of the program counter and execution mask: the execution of
 program instances is *maximally converged*. Maximal convergence means that
 if two program instances follow the same control path, they are guaranteed
-to execute each program statement concurrently. [#]_
+to execute each program statement concurrently. If two program instances
+follow diverging control paths, it is guaranteed that they will reconverge
+as soon as possible (if they reconverge at all). [#]_
 
 .. [#] This is another significant difference between the ``ispc``
-   execution model and the one implemented by OpenCL* and CUDA*.
+   execution model and the one implemented by OpenCL* and CUDA*, which
+   doesn't provide this guarantee.
 
-Furthermore, maximal convergence means that in the presence of divergent
-control flow such as the following:
+Maximal convergence means that in the presence of divergent control flow
+such as the following:
 
 ::
 
@@ -751,9 +754,9 @@ It is guaranteed that all program instances that were running before the
 for the gang of program instances, rather than the concept of a unique
 program counter for each program instance.)
 
-Thus, it is illegal to execute a function with an 8-wide gang by running it
-two times, with a 4-wide gang representing half of the original 8-wide gang
-each time.
+Another implication of this property is that it is illegal to execute a
+function with an 8-wide gang by running it two times, with a 4-wide gang
+representing half of the original 8-wide gang each time.
 
 The way that "varying" function pointers are handled in ``ispc`` is also
 affected by this guarantee: if a function pointer is ``varying``, then it
@@ -772,16 +775,16 @@ Uniform Data
 
 A variable that is declared with the ``uniform`` qualifier represents a
 single value that is shared across the entire gang. (In contrast, the
-default qualifier for variables in ``ispc``, ``varying``, represents a
-variable that has a distinct storage location for each program instance in
-the gang.)
+default variability qualifier for variables in ``ispc``, ``varying``,
+represents a variable that has a distinct storage location for each program
+instance in the gang.)
 
 It is an error to try to assign a ``varying`` value to a ``uniform``
 variable, though ``uniform`` values can be assigned to ``uniform``
 variables. Assignments to ``uniform`` variables are not affected by the
 execution mask (there's no unambiguous way that they could be); rather,
-they always apply if the block of code that has the the uniform assignment
-is executed.
+they always apply if the program counter passes through the statement with
+the uniform assignment.
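+
+For example (a hypothetical sketch; the variable names are made up for
+illustration):
+
+::
+
+    uniform float u = 1.;  // a single value, shared by the whole gang
+    float v = u;           // ok: a uniform value assigned to a varying
+    u = 2.;                // ok: a uniform value assigned to a uniform
+    // u = v;              // error: varying can't be assigned to uniform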
 
 
 Uniform Control Flow
@@ -811,14 +814,13 @@ over pixels adjacent to the given (x,y) coordinates:
         return sum / 9.;
     }
 
-Under the ``ispc`` SPMD model, we have a gang of program instances this
-function in parallel, where in general each program instance has different
-values for ``x`` and ``y``.) For the box filtering algorithm here, all of
+In general each program instance in the gang has different values for ``x``
+and ``y`` in this function. For the box filtering algorithm here, all of
 the program instances will actually want to execute the same number of
 iterations of the ``for`` loops, with all of them having the same values
 for ``dx`` and ``dy`` each time through. If these loops are instead
 implemented with ``dx`` and ``dy`` declared as ``uniform`` variables, then
-the ``ispc`` compiler can generate more efficient code for the loops.  [#]_
+the ``ispc`` compiler can generate more efficient code for the loops. [#]_
 
 .. [#] In this case, a sufficiently smart compiler could determine that
    ``dx`` and ``dy`` have the same value for all program instances and thus
@@ -833,7 +835,7 @@ the ``ispc`` compiler can generate more efficient code for the loops. [#]_
 
 In particular, ``ispc`` can avoid the overhead of checking to see if any of
 the running program instances wants to do another loop iteration. Instead,
-``ispc`` can generate code where all instances always do the same
+the compiler can generate code where all instances always do the same
 iterations.
 
 The analogous benefit comes when using ``if`` statements--if the test in an
@@ -854,9 +856,9 @@ instances that are supposed to be executing the corresponding clause.
 Under this model, we must define the effect of modifying ``uniform``
 variables in the context of varying control flow.
 
-In general, modifying ``uniform`` variables under varying control flow
-leads to the ``uniform`` variable having an undefined value, except
-within a block where the ``uniform`` value had a value assigned to it.
+In most cases, modifying ``uniform`` variables under varying control flow
+leads to the ``uniform`` variable having an undefined value, except within
+a block where the ``uniform`` variable has been assigned a value.
 
 Consider the following example, which illustrates three cases.
 
@@ -1331,9 +1333,9 @@ instance of that variable shared by all program instances in a gang. (In
 other words, it necessarily has the same value across all of the program
 instances.) In addition to requiring less storage than varying values,
 ``uniform`` variables lead to a number of performance advantages when they
-are applicable (see `Uniform Variables and Varying Control Flow`_, for
-example.) Varying variables may be qualified with ``varying``, though
-doing so has no effect, as ``varying`` is the default.
+are applicable (see `Uniform Control Flow`_, for example.) Varying
+variables may be qualified with ``varying``, though doing so has no effect,
+as ``varying`` is the default.
 
 ``uniform`` variables can be modified as the program executes, but only in
 ways that preserve the property that they have a single value for the
@@ -1938,7 +1940,10 @@ more details on why regular ``if`` statements may sometimes do this.)
 Along similar lines, ``cfor``, ``cdo``, and ``cwhile`` check to see if all
 program instances are running at the start of each loop iteration; if so,
 they can run a specialized code path that has been optimized for the "all
-on" execution mask case.
+on" execution mask case. Note that the regular looping constructs in
+``ispc`` already guarantee that a loop will never be executed with an "all
+off" execution mask.
+
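+As an illustration (a hypothetical sketch; the function is made up for
+this example), a coherent ``if`` statement is written just like a regular
+``if``, while also telling the compiler that its test is expected to be
+coherent across the gang:
+
+::
+
+    float safe_sqrt(float x) {
+        cif (x < 0.)  // expected to be all-true or all-false in practice
+            return 0.;
+        return sqrt(x);
+    }
+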
 
 Parallel Iteration Statements: "foreach" and "foreach_tiled"
 ------------------------------------------------------------
 
@@ -1966,8 +1971,8 @@ As a specific example, consider the following ``foreach`` statement:
     }
 
 It specifies a loop over a 2D domain, where the ``j`` variable goes from 0
-to ``height`` and ``i`` goes from 0 to ``width``. Within the loop, the
-variables ``i`` and ``j`` are available.
+to ``height-1`` and ``i`` goes from 0 to ``width-1``. Within the loop, the
+variables ``i`` and ``j`` are available and initialized accordingly.
 
 ``foreach`` loops actually cause the given iteration domain to be
 automatically mapped to the program instances in the gang, so that all of
@@ -1981,27 +1986,28 @@ the gang size is 8:
         // perform computation on element i
     }
 
-At the CPU hardware level, the body of this loop will only execute
+The program counter will step through the statements of this loop just
 ``16/8==2`` times; the first time through, the ``varying int32`` variable
 ``i`` will have the values (0,1,2,3,4,5,6,7) over the program instances,
 and the second time through, ``i`` will have the values
-(8,9,10,11,12,13,14,15), thus mapping the available program instances to
-all of the data by the end of the loop's execution.
+(8,9,10,11,12,13,14,15), thus mapping the available program instances to
+all of the data by the end of the loop's execution. The execution mask is
+"all on" at the start of each ``foreach`` loop iteration, but may be
+changed by control flow constructs within the loop.
 
-The ``foreach`` statement subdivides the iteration domain by mapping a
-gang-size worth of values in the innermost dimension to the gang, only
+The basic ``foreach`` statement subdivides the iteration domain by mapping
+a gang-size worth of values in the innermost dimension to the gang, only
 spanning a single value in each of the outer dimensions. This
 decomposition generally leads to coherent memory reads and writes, but may
 lead to worse control flow coherence than other decompositions.
 
-``foreach_tiled`` decomposes the iteration domain in a way that tries to
-map locations in the domain to program instances in a way that is compact
-across all of the dimensions.
-For example, on a target with an 8-wide gang size, the following
-``foreach_tiled`` statement processes the iteration domain in chunks of 2
-elements in ``j`` and 4 elements in ``i`` each time. (The trade-offs
-between these two constructs are discussed in more detail in the `ispc
-Performance Guide`_.)
+Therefore, ``foreach_tiled`` decomposes the iteration domain so that the
+locations it maps to the program instances are compact across all of the
+dimensions. For example, on a target with an 8-wide gang size, the
+following ``foreach_tiled`` statement processes the iteration domain in
+chunks of 2 elements in ``j`` and 4 elements in ``i`` each time. (The
+trade-offs between these two constructs are discussed in more detail in
+the `ispc Performance Guide`_.)
 
 .. _ispc Performance Guide: perf.html#improving-control-flow-coherence-with-foreach-tiled
 
@@ -2024,7 +2030,10 @@ each program instance. (In other words, it's a varying integer value that
 has value zero for the first program instance, and so forth.) The
 ``programCount`` builtin gives the total number of instances in the gang.
 Together, these can be used to uniquely map executing program instances to
-input data.
+input data. [#]_
+
+.. [#] ``programIndex`` is analogous to ``get_global_id()`` in OpenCL* and
+   ``threadIdx`` in CUDA*.
[#] ``programIndex`` is analogous to ``get_global_id()`` in OpenCL* and + ``threadIdx`` in CUDA*. As a specific example, consider an ``ispc`` function that needs to perform some computation on an array of data.