Documentation work; first pass perf guide complete

Matt Pharr
2011-12-01 09:42:56 -08:00
parent a2f118a14e
commit f90aa172a6
3 changed files with 611 additions and 257 deletions


@@ -23,7 +23,7 @@ distribution.
* Programming Techniques
+ `What primitives are there for communicating between SPMD program instances?`_
+ `How can a gang of program instances generate variable output efficiently?`_
+ `How can a gang of program instances generate variable amounts of output efficiently?`_
+ `Is it possible to use ispc for explicit vector programming?`_
@@ -48,8 +48,7 @@ If the SSE4 target is used, then the following assembly is printed:
::
_foo: ## @foo
## BB#0: ## %allocas
_foo:
addl %esi, %edi
movl %edi, %eax
ret
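For reference, an ``ispc`` function along the following lines (a sketch; the
exact source isn't shown in this excerpt) compiles down to the scalar add
above, since the parameters and return value are all ``uniform``:

::

    export uniform int foo(uniform int a, uniform int b) {
        return a + b;    // scalar add: a single addl, no vector code needed
    }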
@@ -98,7 +97,7 @@ output array.
}
Here is the assembly code for the application-callable instance of the
function--note that the selected instructions are ideal.
function.
::
@@ -111,21 +110,7 @@ function--note that the selected instructions are ideal.
And here is the assembly code for the ``ispc``-callable instance of the
function. There are a few things to notice in this code.
The current program mask comes in via the %xmm0 register, and the initial
few instructions in the function check whether the mask is all on or all
off. If the mask is all on, the code at the label LBB0_3 executes; it's
the same as the code that was generated for ``_foo`` above. If the mask is
all off, there's nothing to be done, and the function can return
immediately.

In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on. (This code is elided below.) This
general pattern of having two code paths for the "all on" and "mixed" mask
cases is used in the code generated for all but the simplest functions
(where the overhead of the test isn't worthwhile).
function.
::
@@ -148,11 +133,84 @@ functions (where the overhead of the test isn't worthwhile.)
####
ret
There are a few things to notice in this code. First, the current program
mask comes in via the ``%xmm0`` register, and the initial few instructions
in the function check whether the mask is all on or all off. If the mask is
all on, the code at the label LBB0_3 executes; it's the same as the code
that was generated for ``_foo`` above. If the mask is all off, there's
nothing to be done, and the function can return immediately.

In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on. (This code is elided from the
listing above.) This general pattern of having two code paths for the
"all on" and "mixed" mask cases is used in the code generated for all but
the simplest functions (where the overhead of the test isn't worthwhile).
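At the source level, this dispatch pattern can be sketched roughly as
follows. This is a purely illustrative ``ispc``-style function (hypothetical
name, parameters, and element type; it is neither the guide's example nor
what the compiler literally emits):

::

    // Illustrative sketch of the "all on" / "mixed" / "all off" dispatch
    // that the compiler generates for the ispc-callable instance.
    export void add_arrays_sketch(uniform float a[], uniform float b[],
                                  uniform float result[],
                                  uniform int maskBits) {
        if (maskBits == (1 << programCount) - 1) {
            // "all on" path: plain vector loads, add, and store
            result[programIndex] = a[programIndex] + b[programIndex];
        } else if (maskBits != 0) {
            // "mixed" path: touch only elements of active program instances
            if (((maskBits >> programIndex) & 1) != 0)
                result[programIndex] = a[programIndex] + b[programIndex];
        }
        // "all off": nothing to do; return immediately
    }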
How can I more easily see gathers and scatters in generated assembly?
---------------------------------------------------------------------
FIXME
Because CPU vector ISAs don't have native gather and scatter instructions,
these memory operations are turned into sequences of regular instructions
in the code that ``ispc`` generates. In some cases, it can be useful to see
where gathers and scatters actually happen in the code; there is an
otherwise undocumented command-line flag that provides this information.
Consider this simple program:
::
void set(uniform int a[], int value, int index) {
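    // 'index' (and 'value') are varying, so this store is a scatter: each
    // program instance may write a different value to a different element.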
a[index] = value;
}
When compiled normally for the SSE4 target, this program generates the
following lengthy code sequence, which makes it harder to see what the
program is actually doing.
::
"_set___uptr<Ui>ii":
pmulld LCPI0_0(%rip), %xmm1
movmskps %xmm2, %eax
testb $1, %al
je LBB0_2
movd %xmm1, %ecx
movd %xmm0, (%rcx,%rdi)
LBB0_2:
testb $2, %al
je LBB0_4
pextrd $1, %xmm1, %ecx
pextrd $1, %xmm0, (%rcx,%rdi)
LBB0_4:
testb $4, %al
je LBB0_6
pextrd $2, %xmm1, %ecx
pextrd $2, %xmm0, (%rcx,%rdi)
LBB0_6:
testb $8, %al
je LBB0_8
pextrd $3, %xmm1, %eax
pextrd $3, %xmm0, (%rax,%rdi)
LBB0_8:
ret
If this program is compiled with the
``--opt=disable-handle-pseudo-memory-ops`` command-line flag, then the
scatter is left as an unresolved function call. The resulting program
won't link, due to the unresolved symbols, but the assembly output is much
easier to understand:
::
"_set___uptr<Ui>ii":
movaps %xmm0, %xmm3
pmulld LCPI0_0(%rip), %xmm1
movdqa %xmm1, %xmm0
movaps %xmm3, %xmm1
jmp ___pseudo_scatter_base_offsets32_32 ## TAILCALL
Interoperability
================
@@ -301,13 +359,17 @@ need to synchronize the program instances before communicating between
them, due to the synchronized execution model of gangs of program instances
in ``ispc``.
How can a gang of program instances generate variable output efficiently?
-------------------------------------------------------------------------
How can a gang of program instances generate variable amounts of output efficiently?
------------------------------------------------------------------------------------
A useful application of the ``exclusive_scan_add()`` function in the
standard library is when program instances want to generate a variable
amount of output and when one would like that output to be densely packed
in a single array. For example, consider the code fragment below:
It's not unusual to have a gang of program instances where each program
instance generates a variable amount of output (perhaps some generate no
output, some generate one output value, some generate many output values
and so forth), and where one would like to have the output densely packed
in an output array. The ``exclusive_scan_add()`` function from the
standard library is quite useful in this situation.
Consider the following function:
::
@@ -331,11 +393,11 @@ value, the second two values, and the third and fourth three values each.
In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
to the four program instances, respectively.
The first program instance will write its one result to ``outArray[0]``,
the second will write its two values to ``outArray[1]`` and
``outArray[2]``, and so forth. The ``reduce_add`` call at the end returns
the total number of values that all of the program instances have written
to the array.
The first program instance will then write its one result to
``outArray[0]``, the second will write its two values to ``outArray[1]``
and ``outArray[2]``, and so forth. The ``reduce_add()`` call at the end
returns the total number of values that all of the program instances have
written to the array.
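For concreteness, here is a minimal sketch of this pattern (the function
name, parameters, and the use of a single repeated ``value`` per program
instance are illustrative assumptions, not the guide's actual code):

::

    // Illustrative sketch: each program instance writes 'count' copies of
    // 'value', packed densely into outArray; returns the total written.
    uniform int emit_packed(uniform float outArray[], int count, float value) {
        // Starting offset for each instance = sum of the counts of the
        // program instances before it in the gang.
        int offset = exclusive_scan_add(count);
        for (int i = 0; i < count; ++i)
            outArray[offset + i] = value;
        // Total number of values the whole gang wrote to outArray
        return (uniform int)reduce_add(count);
    }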
FIXME: add discussion of foreach_active as an option here once that's in