==============================================
Intel® SPMD Program Compiler Performance Guide
==============================================

* `Using ISPC Effectively`_

  + `Gather and Scatter`_
  + `8 and 16-bit Integer Types`_
  + `Low-level Vector Tricks`_
  + `The "Fast math" Option`_
  + `"Inline" Aggressively`_
  + `Small Performance Tricks`_
  + `Instrumenting Your ISPC Programs`_
  + `Choosing A Target Vector Width`_
  + `Implementing Reductions Efficiently`_

* `Disclaimer and Legal Information`_
* `Optimization Notice`_

Using ISPC Effectively
======================

Gather and Scatter
------------------

The CPU is a poor fit for SPMD execution in some ways, the worst of which is
the handling of general memory reads and writes from SPMD program instances.
For example, consider a simple array index:

::

    int i = ....;
    uniform float x[10] = { ... };
    float f = x[i];

Since the index ``i`` is a varying value, the various SPMD program instances
will in general be reading different locations in the array ``x``.  Because
the CPU doesn't have a gather instruction, the ``ispc`` compiler has to
serialize these memory reads, performing a separate memory load for each
running program instance and packing the results into ``f``.  (The analogous
case happens for a write into ``x[i]``.)

In many cases, gathers like these are unavoidable; the running program
instances simply need to access incoherent memory locations.  However, if the
array index ``i`` can actually be declared and used as a ``uniform``
variable, the resulting array access is substantially more efficient.  This
is another case where using ``uniform`` whenever applicable is of benefit.

In some cases, the ``ispc`` compiler is able to deduce that the memory
locations accessed are either all the same or form a linear sequence.
For example, given:

::

    uniform int x = ...;
    int y = x;
    return array[y];

the compiler is able to determine that all of the program instances are
loading from the same location, even though ``y`` is not a ``uniform``
variable.  In this case, the compiler will transform this load into a regular
vector load, rather than a general gather.

Sometimes the running program instances will access a linear sequence of
memory locations; this happens most frequently when array indexing is done
based on the built-in ``programIndex`` variable.  In many of these cases, the
compiler is able to detect this pattern and issue a vector load.  For
example, given:

::

    uniform int x = ...;
    return array[2*x + programIndex];

a regular vector load is done from ``array``, starting at offset ``2*x``.

8 and 16-bit Integer Types
--------------------------

The code generated for 8 and 16-bit integer types is generally not as
efficient as the code generated for 32-bit integer types.  It is generally
worthwhile to use 32-bit integer types for intermediate computations, even if
the final result will be stored in a smaller integer type.

Low-level Vector Tricks
-----------------------

Many low-level Intel® SSE coding constructs can be implemented in ``ispc``
code.  For example, the following code efficiently flips the sign of the
given values:

::

    float flipsign(float a) {
        unsigned int i = intbits(a);
        i ^= 0x80000000;
        return floatbits(i);
    }

This code compiles down to a single XOR instruction.

The "Fast math" Option
----------------------

``ispc`` has a ``--fast-math`` command-line flag that enables a number of
optimizations that may be undesirable in code where numerical precision is
critically important.  For many graphics applications, however, the
approximations they introduce may be acceptable.  The following two
optimizations are performed when ``--fast-math`` is used.  By default, the
``--fast-math`` flag is off.
* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
  transformed to ``x * (1./y)``, where the inverse value of ``y`` is
  precomputed at compile time.

* Expressions like ``x / y``, where ``y`` is not a compile-time constant, are
  transformed to ``x * rcp(y)``, where ``rcp()`` maps to the approximate
  reciprocal instruction from the standard library.

"Inline" Aggressively
---------------------

Inlining functions aggressively is generally beneficial for performance with
``ispc``.  Definitely use the ``inline`` qualifier for any short functions (a
few lines long), and experiment with it for longer functions.

Small Performance Tricks
------------------------

Performance is slightly improved by declaring variables at the same block
scope where they are first used.  For example, if the lifetime of ``foo`` is
only within the scope of the ``if`` clause, write the code like this:

::

    float func() {
        ....
        if (x < y) {
            float foo;
            ... use foo ...
        }
    }

Try not to write the code as:

::

    float func() {
        float foo;
        ....
        if (x < y) {
            ... use foo ...
        }
    }

Declaring variables at the innermost applicable scope can reduce the number
of masked store instructions that the compiler needs to generate.

Instrumenting Your ISPC Programs
--------------------------------

``ispc`` has an optional instrumentation feature that can help you understand
performance issues.  If a program is compiled using the ``--instrument``
flag, the compiler emits calls to a function with the following signature at
various points in the program (for example, at interesting points in the
control flow, and when scatters or gathers happen):

::

    extern "C" {
        void ISPCInstrument(const char *fn, const char *note,
                            int line, int mask);
    }

This function is passed the file name of the ``ispc`` file running, a short
note indicating what is happening, the line number in the source file, and
the current mask of active SPMD program lanes.  You must provide an
implementation of this function and link it in with your application.
For example, when the ``ispc`` program runs, this function might be called as
follows:

::

    ISPCInstrument("foo.ispc", "function entry", 55, 0xf);

This call indicates that the currently executing program has just entered the
function defined at line 55 of the file ``foo.ispc``, with a mask of all
lanes currently executing (assuming a four-wide Intel® SSE target machine).

For a fuller example of the utility of this functionality, see
``examples/aobench_instrumented`` in the ``ispc`` distribution.  This example
includes an implementation of the ``ISPCInstrument`` function that collects
aggregate data about the program's execution behavior.  When running this
example, you will want to direct the ``ao`` executable to generate a low
resolution image, because the instrumentation adds substantial execution
overhead.  For example:

::

    % ./ao 1 32 32

After the ``ao`` program exits, a summary report along the following lines
will be printed.  In the first few lines, you can see how many times a few
functions were called, and the average percentage of SIMD lanes that were
active upon function entry.

::

    ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes
    ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes
    ...

Choosing A Target Vector Width
------------------------------

By default, ``ispc`` compiles to the natural vector width of the target
instruction set.  For example, for SSE2 and SSE4, it compiles four-wide, and
for AVX, it compiles eight-wide.
For some programs, higher performance may be seen if the program is compiled
to a doubled vector width--8-wide for SSE and 16-wide for AVX.  For workloads
that don't require many registers, this approach can lead to significantly
more efficient execution, thanks to greater instruction-level parallelism and
the amortization of various overheads over more program instances.  For other
workloads, it may lead to a slowdown due to higher register pressure; trying
both approaches for key kernels may be worthwhile.

This option is available for the SSE2, SSE4, and AVX targets.  It is selected
with the ``--target=sse2-x2``, ``--target=sse4-x2``, and ``--target=avx-x2``
options, respectively.

Implementing Reductions Efficiently
-----------------------------------

It's often necessary to compute a "reduction" over a data set--for example,
one might want to add all of the values in an array, compute their minimum,
and so forth.  ``ispc`` provides a few capabilities that make it easy to
compute reductions like these efficiently.  However, it's important to use
these capabilities appropriately for best results.

As an example, consider the task of computing the sum of all of the values in
an array.  In C code, we might have:

::

    /* C implementation of a sum reduction */
    float sum(const float array[], int count) {
        float sum = 0;
        for (int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

Of course, exactly this computation could also be expressed in ``ispc``,
though without any benefit from vectorization:

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        for (uniform int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

As a first attempt, one might use the ``reduce_add()`` function from the
``ispc`` standard library; it takes a ``varying`` value and returns the sum
of that value across all of the active program instances.
::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += reduce_add(array[i+programIndex]);
        return sum;
    }

This implementation loads a set of ``programCount`` values from the array,
one for each of the program instances, and then uses ``reduce_add()`` to
reduce across the program instances and update the sum.  Unfortunately, this
approach loses most of the benefit of vectorization, as it spends more work
in the cross-program-instance ``reduce_add()`` call than it saves from the
vector load of values.

The most efficient approach is to do the reduction in two phases: rather than
using a ``uniform`` variable to store the sum, we maintain a varying value,
so that each program instance effectively computes a local partial sum over
the subset of array values that it has loaded.  When the loop over array
elements concludes, a single call to ``reduce_add()`` computes the final
reduction across each of the program instances' elements of ``sum``.  This
approach effectively compiles to a single vector load and a single vector add
for each ``programCount`` worth of values--very efficient code in the end.

::

    /* good ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += array[i+programIndex];
        return reduce_add(sum);
    }

Disclaimer and Legal Information
================================

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R)
PRODUCTS.  NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS,
INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT
DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL
PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any
time, without notice.  Designers must not rely on the absence or
characteristics of any features or instructions marked "reserved" or
"undefined."  Intel reserves these for future definition and shall have no
responsibility whatsoever for conflicts or incompatibilities arising from
future changes to them.  The information here is subject to change without
notice.  Do not finalize a design with this information.

The products described in this document may contain design defects or errors
known as errata which may cause the product to deviate from published
specifications.  Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the
latest specifications and before placing your product order.  Copies of
documents which have an order number and are referenced in this document, or
other Intel literature, may be obtained by calling 1-800-548-4725, or by
visiting Intel's Web Site.

Intel processor numbers are not a measure of performance.  Processor numbers
differentiate features within each processor family, not across different
processor families.  See http://www.intel.com/products/processor_number for
details.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom
Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, i960,
InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4,
IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel
Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst,
Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep,
Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium
Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool,
Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and
Xeon Inside are trademarks of Intel Corporation in the U.S. and other
countries.

* Other names and brands may be claimed as the property of others.

Copyright(C) 2011, Intel Corporation. All rights reserved.

Optimization Notice
===================

Intel compilers, associated libraries and associated development tools may
include or utilize options that optimize for instruction sets that are
available in both Intel and non-Intel microprocessors (for example SIMD
instruction sets), but do not optimize equally for non-Intel microprocessors.
In addition, certain compiler options for Intel compilers, including some
that are not specific to Intel micro-architecture, are reserved for Intel
microprocessors.  For a detailed description of Intel compiler options,
including the instruction sets and specific microprocessors they implicate,
please refer to the "Intel Compiler User and Reference Guides" under
"Compiler Options."

Many library routines that are part of Intel compiler products are more
highly optimized for Intel microprocessors than for other microprocessors.
While the compilers and libraries in Intel compiler products offer
optimizations for both Intel and Intel-compatible microprocessors, depending
on the options you select, your code and other factors, you likely will get
extra performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may or
may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors.  These
optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2),
Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming
SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of
any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to
assist in obtaining the best performance on Intel and non-Intel
microprocessors, Intel recommends that you evaluate other compilers and
libraries to determine which best meet your requirements.  We hope to win
your business by striving to offer the best performance of any compiler or
library; please let us know if you find we do not.