Documentation refactoring, initial pass at FAQ
@@ -1,6 +1,8 @@
 #!/bin/bash
 
 rst2html.py ispc.txt > ispc.html
+rst2html.py perf.txt > perf.html
+rst2html.py faq.txt > faq.html
 
 #rst2latex --section-numbering --documentclass=article --documentoptions=DIV=9,10pt,letterpaper ispc.txt > ispc.tex
 #pdflatex ispc.tex
docs/faq.txt | 4 (new file)
@@ -0,0 +1,4 @@
+=============================================================
+Intel® SPMD Program Compiler Frequently Asked Questions (FAQ)
+=============================================================
+
docs/ispc.txt | 536
@@ -58,6 +58,7 @@ Contents:
 + `Basic Command-line Options`_
 + `Selecting The Compilation Target`_
 + `The Preprocessor`_
++ `Debugging`_
 
 * `The ISPC Language`_
 
@@ -106,26 +107,8 @@ Contents:
 + `Interoperability Overview`_
 + `Data Layout`_
 + `Data Alignment and Aliasing`_
 
-* `Using ISPC Effectively`_
-
 + `Restructuring Existing Programs to Use ISPC`_
 + `Understanding How to Interoperate With the Application's Data`_
-+ `Communicating Between SPMD Program Instances`_
-+ `Gather and Scatter`_
-+ `8 and 16-bit Integer Types`_
-+ `Low-level Vector Tricks`_
-+ `Debugging`_
-+ `The "Fast math" Option`_
-+ `"Inline" Aggressively`_
-+ `Small Performance Tricks`_
-+ `Instrumenting Your ISPC Programs`_
-+ `Using Scan Operations For Variable Output`_
-+ `Application-Supplied Execution Masks`_
-+ `Explicit Vector Programming With Uniform Short Vector Types`_
-+ `Choosing A Target Vector Width`_
-+ `Compiling With Support For Multiple Instruction Sets`_
-+ `Implementing Reductions Efficiently`_
 
 * `Disclaimer and Legal Information`_
 
@@ -397,6 +380,23 @@ indicating the target instruction set is defined. With an SSE2 target,
 and ``ISPC_TARGET_AVX`` for AVX. Finally, ``PI`` is defined for
 convenience, having the value 3.1415926535.
 
+ISPC_MAJOR_VERSION, ISPC_MINOR_VERSION
+
+Debugging
+---------
+
+Support for debugging in ``ispc`` is in progress. On Linux\* and Mac
+OS\*, the ``-g`` command-line flag can be supplied to the compiler,
+which causes it to generate debugging symbols. Running ``ispc`` programs
+in the debugger, setting breakpoints, printing out variables, and the like
+all generally work, though there is occasional unexpected behavior.
+
+Another option for debugging (the only current option on Windows\*) is to
+use the ``print`` statement for ``printf()``-style debugging. (See `Output
+Functions`_ for more information.) You can also use the ability to call
+back to application code at particular points in the program, passing a set
+of variable values to be logged or otherwise analyzed from there.
+
 The ISPC Language
 =================
@@ -2762,9 +2762,6 @@ to the compiler's requirement of no aliasing.
 (In the future, ``ispc`` will have a mechanism to indicate that pointers
 may alias.)
 
-Using ISPC Effectively
-======================
-
 Restructuring Existing Programs to Use ISPC
 -------------------------------------------
 
@@ -2786,13 +2783,15 @@ style is often effective.
 
 Carefully choose how to do the exact mapping of computation to SPMD program
 instances. This choice can impact the mix of gather/scatter memory access
-versus coherent memory access, for example. (See more on this in the
-section `Gather and Scatter`_ below.) This decision can also impact the
+versus coherent memory access, for example. (See more on this topic in the
+`ispc Performance Tuning Guide`_.) This decision can also impact the
 coherence of control flow across the running SPMD program instances, which
 can also have a significant effect on performance; in general, creating
 groups of work that will tend to do similar computation across the SPMD
 program instances improves performance.
 
+.. _ispc Performance Tuning Guide: http://ispc.github.com/perf.html
+
 Understanding How to Interoperate With the Application's Data
 -------------------------------------------------------------
 
@@ -2953,497 +2952,6 @@ elements to work with and then proceeds with the computation.
 }
 
 
-Communicating Between SPMD Program Instances
---------------------------------------------
-
-The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library
-routines provide a variety of mechanisms for the running program instances
-to communicate values to each other during execution. See the section
-`Cross-Program Instance Operations`_ for more information about their
-operation.
-
-
-Gather and Scatter
-------------------
-
-The CPU is a poor fit for SPMD execution in some ways, the worst of which
-is handling of general memory reads and writes from SPMD program instances.
-For example, in a "simple" array index:
-
-::
-
-    int i = ....;
-    uniform float x[10] = { ... };
-    float f = x[i];
-
-Since the index ``i`` is a varying value, the various SPMD program
-instances will in general be reading different locations in the array
-``x``. Because the CPU doesn't have a gather instruction, the ``ispc``
-compiler has to serialize these memory reads, performing a separate memory
-load for each running program instance and packing the result into ``f``.
-(And the analogous case would happen for a write into ``x[i]``.)
-
-In many cases, gathers like these are unavoidable; the running program
-instances just need to access incoherent memory locations. However, if the
-array index ``i`` can instead be declared and used as a ``uniform``
-variable, the resulting array access is substantially more
-efficient. This is another case where using ``uniform`` whenever applicable
-is of benefit.
-
-In some cases, the ``ispc`` compiler is able to deduce that the memory
-locations accessed are either all the same or form a linear sequence. For
-example, given:
-
-::
-
-    uniform int x = ...;
-    int y = x;
-    return array[y];
-
-The compiler is able to determine that all of the program instances are
-loading from the same location, even though ``y`` is not a ``uniform``
-variable. In this case, the compiler will transform this load to a regular
-vector load, rather than a general gather.
-
-Sometimes the running program instances will access a
-linear sequence of memory locations; this happens most frequently when
-array indexing is done based on the built-in ``programIndex`` variable. In
-many of these cases, the compiler is also able to detect this case and then
-do a vector load. For example, given:
-
-::
-
-    uniform int x = ...;
-    return array[2*x + programIndex];
-
-A regular vector load is done from ``array``, starting at offset ``2*x``.
-
-
-8 and 16-bit Integer Types
---------------------------
-
-The code generated for 8 and 16-bit integer types is generally not as
-efficient as the code generated for 32-bit integer types. It is generally
-worthwhile to use 32-bit integer types for intermediate computations, even
-if the final result will be stored in a smaller integer type.
-
-Low-level Vector Tricks
------------------------
-
-Many low-level Intel® SSE coding constructs can be implemented in ``ispc``
-code. For example, the following code efficiently reverses the sign of the
-given values.
-
-::
-
-    float flipsign(float a) {
-        unsigned int i = intbits(a);
-        i ^= 0x80000000;
-        return floatbits(i);
-    }
-
-This code compiles down to a single XOR instruction.
-
-Debugging
----------
-
-Support for debugging in ``ispc`` is in progress. On Linux\* and Mac
-OS\*, the ``-g`` command-line flag can be supplied to the compiler,
-which causes it to generate debugging symbols. Running ``ispc`` programs
-in the debugger, setting breakpoints, printing out variables, and the like
-all generally work, though there is occasional unexpected behavior.
-
-Another option for debugging (the only current option on Windows\*) is
-to use the ``print`` statement for ``printf()``-style
-debugging. You can also use the ability to call back to
-application code at particular points in the program, passing a set of
-variable values to be logged or otherwise analyzed from there.
-
-The "Fast math" Option
-----------------------
-
-``ispc`` has a ``--fast-math`` command-line flag that enables a number of
-optimizations that may be undesirable in code where numerical precision is
-critically important. For many graphics applications, the
-approximations may be acceptable. The following two optimizations are
-performed when ``--fast-math`` is used. By default, the ``--fast-math``
-flag is off.
-
-* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
-  transformed to ``x * (1./y)``, where the inverse value of ``y`` is
-  precomputed at compile time.
-
-* Expressions like ``x / y``, where ``y`` is not a compile-time constant,
-  are transformed to ``x * rcp(y)``, where ``rcp()`` maps to the
-  approximate reciprocal instruction from the standard library.
-
-"Inline" Aggressively
----------------------
-
-Inlining functions aggressively is generally beneficial for performance
-with ``ispc``. Definitely use the ``inline`` qualifier for any short
-functions (a few lines long), and experiment with it for longer functions.
-
-Small Performance Tricks
-------------------------
-
-Performance is slightly improved by declaring variables at the same block
-scope where they are first used. For example, in code like the
-following, if the lifetime of ``foo`` is only within the scope of the
-``if`` clause, write the code like this:
-
-::
-
-    float func() {
-        ....
-        if (x < y) {
-            float foo;
-            ... use foo ...
-        }
-    }
-
-Try not to write code as:
-
-::
-
-    float func() {
-        float foo;
-        ....
-        if (x < y) {
-            ... use foo ...
-        }
-    }
-
-Doing so can reduce the number of masked store instructions that the
-compiler needs to generate.
-
-Instrumenting Your ISPC Programs
---------------------------------
-
-``ispc`` has an optional instrumentation feature that can help you
-understand performance issues. If a program is compiled using the
-``--instrument`` flag, the compiler emits calls to a function with the
-following signature at various points in the program (for
-example, at interesting points in the control flow, and when scatters or
-gathers happen):
-
-::
-
-    extern "C" {
-        void ISPCInstrument(const char *fn, const char *note,
-                            int line, int mask);
-    }
-
-This function is passed the file name of the ``ispc`` file running, a short
-note indicating what is happening, the line number in the source file, and
-the current mask of active SPMD program lanes. You must provide an
-implementation of this function and link it in with your application.
-
-For example, when the ``ispc`` program runs, this function might be called
-as follows:
-
-::
-
-    ISPCInstrument("foo.ispc", "function entry", 55, 0xf);
-
-This call indicates that the currently executing program has just
-entered the function defined at line 55 of the file ``foo.ispc``, with a
-mask of all lanes currently executing (assuming a four-wide Intel® SSE
-target machine).
-
-For a fuller example of the utility of this functionality, see
-``examples/aobench_instrumented`` in the ``ispc`` distribution. This
-example includes an implementation of the ``ISPCInstrument`` function that
-collects aggregate data about the program's execution behavior.
-
-When running this example, you will want to direct the ``ao`` executable
-to generate a low-resolution image, because the instrumentation adds
-substantial execution overhead. For example:
-
-::
-
-    % ./ao 1 32 32
-
-After the ``ao`` program exits, a summary report along the following lines
-will be printed. In the first few lines, you can see how many times a few
-functions were called, and the average percentage of SIMD lanes that were
-active upon function entry.
-
-::
-
-    ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
-    ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
-    ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
-    ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
-    ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes
-    ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes
-    ...
-
-
-Using Scan Operations For Variable Output
------------------------------------------
-
-One important application of the ``exclusive_scan_add()`` function in the
-standard library is when program instances want to generate a variable amount
-of output and one would like that output to be densely packed in a
-single array. For example, consider the code fragment below:
-
-::
-
-    uniform int func(uniform float outArray[], ...) {
-        int numOut = ...; // figure out how many to be output
-        float outLocal[MAX_OUT]; // staging area
-        // put results in outLocal[0], ..., outLocal[numOut-1]
-        int startOffset = exclusive_scan_add(numOut);
-        for (int i = 0; i < numOut; ++i)
-            outArray[startOffset + i] = outLocal[i];
-        return reduce_add(numOut);
-    }
-
-Here, each program instance has computed a number, ``numOut``, of values to
-output, and has stored them in the ``outLocal`` array. Assume that four
-program instances are running and that the first one wants to output one
-value, the second two values, and the third and fourth three values each.
-In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
-to the four program instances, respectively. The first program instance
-will write its one result to ``outArray[0]``, the second will write its two
-values to ``outArray[1]`` and ``outArray[2]``, and so forth. The
-``reduce_add()`` call at the end returns the total number of values that the
-program instances have written to the array.
-
-Application-Supplied Execution Masks
-------------------------------------
-
-Recall that when execution transitions from the application code to an
-``ispc`` function, all of the program instances are initially executing.
-In some cases, it may be desirable for only some of them to be running,
-based on a data-dependent condition computed in the application program.
-This situation can easily be handled via an additional parameter from the
-application.
-
-As a simple example, consider a case where the application code has an
-array of ``float`` values and we'd like the ``ispc`` code to update
-just specific values in that array, where which of those values are to be
-updated has been determined by the application. In C++ code, we might
-have:
-
-::
-
-    int count = ...;
-    float *array = new float[count];
-    bool *shouldUpdate = new bool[count];
-    // initialize array and shouldUpdate
-    ispc_func(array, shouldUpdate, count);
-
-Then, the ``ispc`` code could process this update as:
-
-::
-
-    export void ispc_func(uniform float array[], uniform bool update[],
-                          uniform int count) {
-        for (uniform int i = 0; i < count; i += programCount) {
-            cif (update[i+programIndex] == true)
-                // update array[i+programIndex]...
-        }
-    }
-
-(In this case a "coherent" if statement is likely to be worthwhile if the
-``update`` array will tend to have sections that are either all-true or
-all-false.)
-
-Explicit Vector Programming With Uniform Short Vector Types
------------------------------------------------------------
-
-The typical model for programming in ``ispc`` is an *implicit* parallel
-model, where one writes a program that is apparently doing scalar
-computation on values and the program is then vectorized to run in parallel
-across the SIMD lanes of a processor. However, ``ispc`` also has some
-support for explicit vector unit programming, where the vectorization is
-explicit. Some computations may be more effectively described in the
-explicit model rather than the implicit model.
-
-This support is provided via ``uniform`` instances of short vectors
-(as were introduced in the `Short Vector Types`_ section). Specifically,
-if this short program
-
-::
-
-    export uniform float<8> madd(uniform float<8> a,
-                                 uniform float<8> b, uniform float<8> c) {
-        return a + b * c;
-    }
-
-is compiled with the AVX target, ``ispc`` generates the following assembly:
-
-::
-
-    _madd:
-        vmulps  %ymm2, %ymm1, %ymm1
-        vaddps  %ymm0, %ymm1, %ymm0
-        ret
-
-(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two
-``addps`` instructions are generated, and so forth.)
-
-Note that ``ispc`` doesn't currently support control flow based on
-``uniform`` short vector types; it is thus not possible to write code like:
-
-::
-
-    export uniform int<8> count(uniform float<8> a, uniform float<8> b) {
-        uniform int<8> sum = 0;
-        while (a++ < b)
-            ++sum;
-    }
-
-
-Choosing A Target Vector Width
-------------------------------
-
-By default, ``ispc`` compiles to the natural vector width of the target
-instruction set. For example, for SSE2 and SSE4, it compiles four-wide,
-and for AVX, it compiles 8-wide. For some programs, higher performance may
-be seen if the program is compiled to a doubled vector width--8-wide for
-SSE and 16-wide for AVX.
-
-For workloads that don't require many registers, this method can lead to
-significantly more efficient execution thanks to greater instruction-level
-parallelism and amortization of various overhead over more program
-instances. For other workloads, it may lead to a slowdown due to higher
-register pressure; trying both approaches for key kernels may be
-worthwhile.
-
-This option is only available for each of the SSE2, SSE4 and AVX targets.
-It is selected with the ``--target=sse2-x2``, ``--target=sse4-x2`` and
-``--target=avx-x2`` options, respectively.
-
-
-Compiling With Support For Multiple Instruction Sets
-----------------------------------------------------
-
-``ispc`` can also generate output that supports multiple target instruction
-sets, choosing the most appropriate one at runtime. For example, if you
-run the command:
-
-::
-
-    ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2
-
-Then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
-``foo_avx.o``, and ``foo.o``. [#]_ Link all of these into your executable, and
-when you call a function in ``foo.ispc`` from your application code,
-``ispc`` will determine which instruction sets are supported by the CPU the
-code is running on and will call the most appropriate version of the
-function available.
-
-.. [#] Similarly, if you choose to generate assembly language output or
-   LLVM bitcode output, multiple versions of those files will be created.
-
-In general, the version of the function that runs will be the one for the
-most general instruction set that is supported by the system. If you only
-compile SSE2 and SSE4 variants and run on a system that supports AVX, for
-example, then the SSE4 variant will be executed. If the system
-is not able to run any of the available variants of the function (for
-example, trying to run a function that only has SSE4 and AVX variants on a
-system that only supports SSE2), then the standard library ``abort()``
-function will be called.
-
-One subtlety is that all non-static global variables (if any) must have the
-same size and layout with all of the targets used. For example, if you
-have the global variables:
-
-::
-
-    uniform int foo[2*programCount];
-    int bar;
-
-and compile to both SSE2 and AVX targets, both of these variables will have
-different sizes (the first due to ``programCount`` having the value 4 for
-SSE2 and 8 for AVX, and the second due to ``varying`` types having different
-numbers of elements with the two targets--essentially the same issue as the
-first).
-
-Implementing Reductions Efficiently
------------------------------------
-
-It's often necessary to compute a "reduction" over a data set--for example,
-one might want to add all of the values in an array, compute their minimum,
-etc. ``ispc`` provides a few capabilities that make it easy to efficiently
-compute reductions like these. However, it's important to use these
-capabilities appropriately for best results.
-
-As an example, consider the task of computing the sum of all of the values
-in an array. In C code, we might have:
-
-::
-
-    /* C implementation of a sum reduction */
-    float sum(const float array[], int count) {
-        float sum = 0;
-        for (int i = 0; i < count; ++i)
-            sum += array[i];
-        return sum;
-    }
-
-Of course, exactly this computation could also be expressed in ``ispc``,
-though without any benefit from vectorization:
-
-::
-
-    /* inefficient ispc implementation of a sum reduction */
-    uniform float sum(const uniform float array[], uniform int count) {
-        uniform float sum = 0;
-        for (uniform int i = 0; i < count; ++i)
-            sum += array[i];
-        return sum;
-    }
-
-As a first try, one might use the ``reduce_add()`` function from the
-``ispc`` standard library; it takes a ``varying`` value and returns the sum
-of that value across all of the active program instances (see
-`Cross-Program Instance Operations`_ for more details).
-
-::
-
-    /* inefficient ispc implementation of a sum reduction */
-    uniform float sum(const uniform float array[], uniform int count) {
-        uniform float sum = 0;
-        // Assumes programCount evenly divides count
-        for (uniform int i = 0; i < count; i += programCount)
-            sum += reduce_add(array[i+programIndex]);
-        return sum;
-    }
-
-This implementation loads a set of ``programCount`` values from the array,
-one for each of the program instances, and then uses ``reduce_add()`` to
-reduce across the program instances and update the sum. Unfortunately,
-this approach loses most of the benefit of vectorization, as it does more
-work in the cross-program instance ``reduce_add()`` call than it saves from
-the vector load of values.
-
-The most efficient approach is to do the reduction in two phases: rather
-than using a ``uniform`` variable to store the sum, we maintain a varying
-value, such that each program instance is effectively computing a local
-partial sum on the subset of array values that it has loaded from the
-array. When the loop over array elements concludes, a single call to
-``reduce_add()`` computes the final reduction across each of the program
-instances' elements of ``sum``. This approach effectively compiles to a
-single vector load and a single vector add for each ``programCount`` worth
-of values--very efficient code in the end.
-
-::
-
-    /* good ispc implementation of a sum reduction */
-    uniform float sum(const uniform float array[], uniform int count) {
-        float sum = 0;
-        // Assumes programCount evenly divides count
-        for (uniform int i = 0; i < count; i += programCount)
-            sum += array[i+programIndex];
-        return reduce_add(sum);
-    }
-
 Disclaimer and Legal Information
 ================================
 
docs/perf.txt | 4 (new file)
@@ -0,0 +1,4 @@
+==============================================
+Intel® SPMD Program Compiler Performance Guide
+==============================================
+