diff --git a/docs/build.sh b/docs/build.sh index cca3bee6..a087da61 100755 --- a/docs/build.sh +++ b/docs/build.sh @@ -1,6 +1,8 @@ #!/bin/bash rst2html.py ispc.txt > ispc.html +rst2html.py perf.txt > perf.html +rst2html.py faq.txt > faq.html #rst2latex --section-numbering --documentclass=article --documentoptions=DIV=9,10pt,letterpaper ispc.txt > ispc.tex #pdflatex ispc.tex diff --git a/docs/faq.txt b/docs/faq.txt new file mode 100644 index 00000000..409e8bb9 --- /dev/null +++ b/docs/faq.txt @@ -0,0 +1,4 @@ +============================================================= +Intel® SPMD Program Compiler Frequently Asked Questions (FAQ) +============================================================= + diff --git a/docs/ispc.txt b/docs/ispc.txt index 2eac37af..848c8dd7 100644 --- a/docs/ispc.txt +++ b/docs/ispc.txt @@ -58,6 +58,7 @@ Contents: + `Basic Command-line Options`_ + `Selecting The Compilation Target`_ + `The Preprocessor`_ + + `Debugging`_ * `The ISPC Language`_ @@ -106,26 +107,8 @@ Contents: + `Interoperability Overview`_ + `Data Layout`_ + `Data Alignment and Aliasing`_ - -* `Using ISPC Effectively`_ - + `Restructuring Existing Programs to Use ISPC`_ + `Understanding How to Interoperate With the Application's Data`_ - + `Communicating Between SPMD Program Instances`_ - + `Gather and Scatter`_ - + `8 and 16-bit Integer Types`_ - + `Low-level Vector Tricks`_ - + `Debugging`_ - + `The "Fast math" Option`_ - + `"Inline" Aggressively`_ - + `Small Performance Tricks`_ - + `Instrumenting Your ISPC Programs`_ - + `Using Scan Operations For Variable Output`_ - + `Application-Supplied Execution Masks`_ - + `Explicit Vector Programming With Uniform Short Vector Types`_ - + `Choosing A Target Vector Width`_ - + `Compiling With Support For Multiple Instruction Sets`_ - + `Implementing Reductions Efficiently`_ * `Disclaimer and Legal Information`_ @@ -397,6 +380,23 @@ indicating the target instruction set is defined. With an SSE2 target, and ``ISPC_TARGET_AVX`` for AVX. 
Finally, ``PI`` is defined for convenience, having the value 3.1415926535. +``ISPC_MAJOR_VERSION`` and ``ISPC_MINOR_VERSION`` are also defined, giving the compiler's major and minor version numbers. + +Debugging +--------- + +Support for debugging in ``ispc`` is in progress. On Linux\* and Mac +OS\*, the ``-g`` command-line flag can be supplied to the compiler, +which causes it to generate debugging symbols. Running ``ispc`` programs +in the debugger, setting breakpoints, printing out variables and the like +all generally work, though there is occasional unexpected behavior. + +Another option for debugging (the only current option on Windows\*) is to +use the ``print`` statement for ``printf()`` style debugging. (See `Output +Functions`_ for more information.) You can also use the ability to call +back to application code at particular points in the program, passing a set +of variable values to be logged or otherwise analyzed from there. + The ISPC Language ================= @@ -2762,9 +2762,6 @@ to the compiler's requirement of no aliasing. (In the future, ``ispc`` will have a mechanism to indicate that pointers may alias.) -Using ISPC Effectively -====================== - Restructuring Existing Programs to Use ISPC ------------------------------------------- @@ -2786,13 +2783,15 @@ style is often effective. Carefully choose how to do the exact mapping of computation to SPMD program instances. This choice can impact the mix of gather/scatter memory access -versus coherent memory access, for example. (See more on this in the -section `Gather and Scatter`_ below.) This decision can also impact the +versus coherent memory access, for example. (See more on this topic in the +`ispc Performance Tuning Guide`_.) This decision can also impact the coherence of control flow across the running SPMD program instances, which can also have a significant effect on performance; in general, creating groups of work that will tend to do similar computation across the SPMD program instances improves performance. +.. 
_ispc Performance Tuning Guide: http://ispc.github.com/perf.html + Understanding How to Interoperate With the Application's Data ------------------------------------------------------------- @@ -2953,497 +2952,6 @@ elements to work with and then proceeds with the computation. } -Communicating Between SPMD Program Instances --------------------------------------------- - -The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library -routines provide a variety of mechanisms for the running program instances -to communicate values to each other during execution. See the section -`Cross-Program Instance Operations`_ for more information about their -operation. - - -Gather and Scatter ------------------- - -The CPU is a poor fit for SPMD execution in some ways, the worst of which -is handling of general memory reads and writes from SPMD program instances. -For example, in a "simple" array index: - -:: - - int i = ....; - uniform float x[10] = { ... }; - float f = x[i]; - -Since the index ``i`` is a varying value, the various SPMD program -instances will in general be reading different locations in the array -``x``. Because the CPU doesn't have a gather instruction, the ``ispc`` -compiler has to serialize these memory reads, performing a separate memory -load for each running program instance, packing the result into ``f``. -(And the analogous case would happen for a write into ``x[i]``.) - -In many cases, gathers like these are unavoidable; the running program -instances just need to access incoherent memory locations. However, if the -array index ``i`` could actually be declared and used as a ``uniform`` -variable, the resulting array index is substantially more -efficient. This is another case where using ``uniform`` whenever applicable -is of benefit. - -In some cases, the ``ispc`` compiler is able to deduce that the memory -locations accessed are either all the same or are uniform. 
For example, -given: - -:: - - uniform int x = ...; - int y = x; - return array[y]; - -The compiler is able to determine that all of the program instances are -loading from the same location, even though ``y`` is not a ``uniform`` -variable. In this case, the compiler will transform this load to a regular vector -load, rather than a general gather. - -Sometimes the running program instances will access a -linear sequence of memory locations; this happens most frequently when -array indexing is done based on the built-in ``programIndex`` variable. In -many of these cases, the compiler is also able to detect this case and then -do a vector load. For example, given: - -:: - - uniform int x = ...; - return array[2*x + programIndex]; - -A regular vector load is done from array, starting at offset ``2*x``. - - -8 and 16-bit Integer Types --------------------------- - -The code generated for 8 and 16-bit integer types is generally not as -efficient as the code generated for 32-bit integer types. It is generally -worthwhile to use 32-bit integer types for intermediate computations, even -if the final result will be stored in a smaller integer type. - -Low-level Vector Tricks ------------------------ - -Many low-level Intel® SSE coding constructs can be implemented in ``ispc`` -code. For example, the following code efficiently reverses the sign of the -given values. - -:: - - float flipsign(float a) { - unsigned int i = intbits(a); - i ^= 0x80000000; - return floatbits(i); - } - -This code compiles down to a single XOR instruction. - -Debugging ---------- - -Support for debugging in ``ispc`` is in progress. On Linux\* and Mac -OS\*, the ``-g`` command-line flag can be supplied to the compiler, -which causes it to generate debugging symbols. Running ``ispc`` programs -in the debugger, setting breakpoints, printing out variables and the like -all generally works, though there is occasional unexpected behavior. 
- -Another option for debugging (the only current option on Windows\*) is -to use the ``print`` statement for ``printf()`` -style debugging. You can also use the ability to call back to -application code at particular points in the program, passing a set of -variable values to be logged or otherwise analyzed from there. - -The "Fast math" Option ----------------------- - -``ispc`` has a ``--fast-math`` command-line flag that enables a number of -optimizations that may be undesirable in code where numerical preceision is -critically important. For many graphics applications, the -approximations may be acceptable. The following two optimizations are -performed when ``--fast-math`` is used. By default, the ``--fast-math`` -flag is off. - -* Expressions like ``x / y``, where ``y`` is a compile-time constant, are - transformed to ``x * (1./y)``, where the inverse value of ``y`` is - precomputed at compile time. - -* Expressions like ``x / y``, where ``y`` is not a compile-time constant, - are transformed to ``x * rcp(y)``, where ``rcp()`` maps to the - approximate reciprocal instruction from the standard library. - - -"Inline" Aggressively ---------------------- - -Inlining functions aggressively is generally beneficial for performance -with ``ispc``. Definitely use the ``inline`` qualifier for any short -functions (a few lines long), and experiment with it for longer functions. - -Small Performance Tricks ------------------------- - -Performance is slightly improved by declaring variables at the same block -scope where they are first used. For example, in code like the -following, if the lifetime of ``foo`` is only within the scope of the -``if`` clause, write the code like this: - -:: - - float func() { - .... - if (x < y) { - float foo; - ... use foo ... - } - } - -Try not to write code as: - -:: - - float func() { - float foo; - .... - if (x < y) { - ... use foo ... 
- } - } - -Doing so can reduce the amount of masked store instructions that the -compiler needs to generate. - -Instrumenting Your ISPC Programs --------------------------------- - -``ispc`` has an optional instrumentation feature that can help you -understand performance issues. If a program is compiled using the -``--instrument`` flag, the compiler emits calls to a function with the -following signature at various points in the program (for -example, at interesting points in the control flow, when scatters or -gathers happen.) - -:: - - extern "C" { - void ISPCInstrument(const char *fn, const char *note, - int line, int mask); - } - -This function is passed the file name of the ``ispc`` file running, a short -note indicating what is happening, the line number in the source file, and -the current mask of active SPMD program lanes. You must provide an -implementation of this function and link it in with your application. - -For example, when the ``ispc`` program runs, this function might be called -as follows: - -:: - - ISPCInstrument("foo.ispc", "function entry", 55, 0xf); - -This call indicates that at the currently executing program has just -entered the function defined at line 55 of the file ``foo.ispc``, with a -mask of all lanes currently executing (assuming a four-wide Intel® SSE -target machine). - -For a fuller example of the utility of this functionality, see -``examples/aobench_instrumented`` in the ``ispc`` distribution. Ths -example includes an implementation of the ``ISPCInstrument`` function that -collects aggregate data about the program's execution behavior. - -When running this example, you will want to direct to the ``ao`` executable -to generate a low resolution image, because the instrumentation adds -substantial execution overhead. For example: - -:: - - % ./ao 1 32 32 - -After the ``ao`` program exits, a summary report along the following lines -will be printed. 
In the first few lines, you can see how many times a few -functions were called, and the average percentage of SIMD lanes that were -active upon function entry. - -:: - - ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes - ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes - ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes - ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes - ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes - ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes - ... - - -Using Scan Operations For Variable Output ------------------------------------------ - -One important application of the ``exclusive_scan_add()`` function in the -standard library is when program instances want to generate a variable amount -of output and when one would like that output to be densely packed in a -single array. For example, consider the code fragment below: - -:: - - uniform int func(uniform float outArray[], ...) { - int numOut = ...; // figure out how many to be output - float outLocal[MAX_OUT]; // staging area - // put results in outLocal[0], ..., outLocal[numOut-1] - int startOffset = exclusive_scan_add(numOut); - for (int i = 0; i < numOut; ++i) - outArray[startOffset + i] = outLocal[i]; - return reduce_add(numOut); - } - -Here, each program instance has computed a number, ``numOut``, of values to -output, and has stored them in the ``outLocal`` array. Assume that four -program instances are running and that the first one wants to output one -value, the second two values, and the third and fourth three values each. -In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6) -to the four program instances, respectively. 
The first program instance -will write its one result to ``outArray[0]``, the second will write its two -values to ``outArray[1]`` and ``outArray[2]``, and so forth. The -``reduce_add`` call at the end returns the total number of values that the -program instances have written to the array. - -Application-Supplied Execution Masks ------------------------------------- - -Recall that when execution transitions from the application code to an -``ispc`` function, all of the program instances are initially executing. -In some cases, it may desired that only some of them are running, based on -a data-dependent condition computed in the application program. This -situation can easily be handled via an additional parameter from the -application. - -As a simple example, consider a case where the application code has an -array of ``float`` values and we'd like the ``ispc`` code to update -just specific values in that array, where which of those values to be -updated has been determined by the application. In C++ code, we might -have: - -:: - - int count = ...; - float *array = new float[count]; - bool *shouldUpdate = new bool[count]; - // initialize array and shouldUpdate - ispc_func(array, shouldUpdate, count); - -Then, the ``ispc`` code could process this update as: - -:: - - export void ispc_func(uniform float array[], uniform bool update[], - uniform int count) { - for (uniform int i = 0; i < count; i += programCount) { - cif (update[i+programIndex] == true) - // update array[i+programIndex]... - } - } - -(In this case a "coherent" if statement is likely to be worthwhile if the -``update`` array will tend to have sections that are either all-true or -all-false.) 
- -Explicit Vector Programming With Uniform Short Vector Types ------------------------------------------------------------ - -The typical model for programming in ``ispc`` is an *implicit* parallel -model, where one writes a program that is apparently doing scalar -computation on values and the program is then vectorized to run in parallel -across the SIMD lanes of a processor. However, ``ispc`` also has some -support for explicit vector unit programming, where the vectorization is -explicit. Some computations may be more effectively described in the -explicit model rather than the implicit model. - -This support is provided via ``uniform`` instances of short vectors -(as were introduced in the `Short Vector Types`_ section). Specifically, -if this short program - -:: - - export uniform float<8> madd(uniform float<8> a, - uniform float<8> b, uniform float<8> c) { - return a + b * c; - } - -is compiled with the AVX target, ``ispc`` generates the following assembly: - -:: - _madd: - vmulps %ymm2, %ymm1, %ymm1 - vaddps %ymm0, %ymm1, %ymm0 - ret - -(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two -``addps`` instructions are generated, and so forth.) - -Note that ``ispc`` doesn't currently support control-flow based on -``uniform`` short vector types; it is thus not possible to write code like: - -:: - - export uniform int<8> count(uniform float<8> a, uniform float<8> b) { - uniform int<8> sum = 0; - while (a++ < b) - ++sum; - } - - -Choosing A Target Vector Width ------------------------------- - -By default, ``ispc`` compiles to the natural vector width of the target -instruction set. For example, for SSE2 and SSE4, it compiles four-wide, -and for AVX, it complies 8-wide. For some programs, higher performance may -be seen if the program is compiled to a doubled vector width--8-wide for -SSE and 16-wide for AVX. 
- -For workloads that don't require many of registers, this method can lead to -significantly more efficient execution thanks to greater instruction level -parallelism and amortization of various overhead over more program -instances. For other workloads, it may lead to a slowdown due to higher -register pressure; trying both approaches for key kernels may be -worthwhile. - -This option is only available for each of the SSE2, SSE4 and AVX targets. -It is selected with the ``--target=sse2-x2``, ``--target=sse4-x2`` and -``--target=avx-x2`` options, respectively. - - -Compiling With Support For Multiple Instruction Sets ----------------------------------------------------- - -``ispc`` can also generate output that supports multiple target instruction -sets, choosing the most appropriate one at runtime. For example, if you -run the command: - -:: - - ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2 - -Then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``, -``foo_avx.o``, and ``foo.o``.[#]_ Link all of these into your executable, and -when you call a function in ``foo.ispc`` from your application code, -``ispc`` will determine which instruction sets are supported by the CPU the -code is running on and will call the most appropraite version of the -function available. - -.. [#] Similarly, if you choose to generate assembly langauage output or - LLVM bitcode output, multiple versions of those files will be created. - -In general, the version of the function that runs will be the one in the -most general instruction set that is supported by the system. If you only -compile SSE2 and SSE4 variants and run on a system that supports AVX, for -example, then the SSE4 variant will be executed. If the system doesn't -is not able to run any of the available variants of the function (for -example, trying to run a function that only has SSE4 and AVX variants on a -system that only supports SSE2), then the standard library ``abort()`` -function will be called. 
- -One subtlety is that all non-static global variables (if any) must have the -same size and layout with all of the targets used. For example, if you -have the global variables: - -:: - - uniform int foo[2*programCount]; - int bar; - -and compile to both SSE2 and AVX targets, both of these variables will have -different sizes (the first due to program count having the value 4 for SSE2 -and 8 for AVX, and the second due to ``varying`` types having different -numbers of elements with the two targets--essentially the same issue as the -first.) - - -Implementing Reductions Efficiently ------------------------------------ - -It's often necessary to compute a "reduction" over a data set--for example, -one might want to add all of the values in an array, compute their minimum, -etc. ``ispc`` provides a few capabilities that make it easy to efficiently -compute reductions like these. However, it's important to use these -capabilities appropriately for best results. - -As an example, consider the task of computing the sum of all of the values -in an array. In C code, we might have: - -:: - - /* C implementation of a sum reduction */ - float sum(const float array[], int count) { - float sum = 0; - for (int i = 0; i < count; ++i) - sum += array[i]; - return sum; - } - -Of course, exactly this computation could also be expressed in ``ispc``, -though without any benefit from vectorization: - -:: - - /* inefficient ispc implementation of a sum reduction */ - uniform float sum(const uniform float array[], uniform int count) { - uniform float sum = 0; - for (uniform int i = 0; i < count; ++i) - sum += array[i]; - return sum; - } - -As a first try, one might try using the ``reduce_add()`` function from the -``ispc`` standard library; it takes a ``varying`` value and returns the sum -of that value across all of the active program instances (see -`Cross-Program Instance Operations`_ for more details). 
- -:: - - /* inefficient ispc implementation of a sum reduction */ - uniform float sum(const uniform float array[], uniform int count) { - uniform float sum = 0; - // Assumes programCount evenly divides count - for (uniform int i = 0; i < count; i += programCount) - sum += reduce_add(array[i+programIndex]); - return sum; - } - -This implementation loads a set of ``programCount`` values from the array, -one for each of the program instances, and then uses ``reduce_add`` to -reduce across the program instances and then update the sum. Unfortunately -this approach loses most benefit from vectorization, as it does more work -on the cross-program instance ``reduce_add()`` call than it saves from the -vector load of values. - -The most efficient approach is to do the reduction in two phases: rather -than using a ``uniform`` variable to store the sum, we maintain a varying -value, such that each program instance is effectively computing a local -partial sum on the subset of array values that it has loaded from the -array. When the loop over array elements concludes, a single call to -``reduce_add()`` computes the final reduction across each of the program -instances' elements of ``sum``. This approach effectively compiles to a -single vector load and a single vector add for each ``programCount`` worth -of values--very efficient code in the end. 
- -:: - - /* good ispc implementation of a sum reduction */ - uniform float sum(const uniform float array[], uniform int count) { - float sum = 0; - // Assumes programCount evenly divides count - for (uniform int i = 0; i < count; i += programCount) - sum += array[i+programIndex]; - return reduce_add(sum); - } - - Disclaimer and Legal Information ================================ diff --git a/docs/perf.txt b/docs/perf.txt new file mode 100644 index 00000000..89be9cd9 --- /dev/null +++ b/docs/perf.txt @@ -0,0 +1,4 @@ +============================================== +Intel® SPMD Program Compiler Performance Guide +============================================== +