diff --git a/docs/build.sh b/docs/build.sh index 20802497..904d2d21 100755 --- a/docs/build.sh +++ b/docs/build.sh @@ -1,10 +1,13 @@ #!/bin/bash -for i in ispc perf faq; do +for i in ispc perfguide faq; do rst2html.py --template=template.txt --link-stylesheet \ --stylesheet-path=css/style.css $i.txt > $i.html done +rst2html.py --template=template-perf.txt --link-stylesheet \ + --stylesheet-path=css/style.css perf.txt > perf.html + #rst2latex --section-numbering --documentclass=article --documentoptions=DIV=9,10pt,letterpaper ispc.txt > ispc.tex #pdflatex ispc.tex #/bin/rm -f ispc.aux ispc.log ispc.out ispc.tex diff --git a/docs/perf.txt b/docs/perf.txt index e69de29b..bf56fbc1 100644 --- a/docs/perf.txt +++ b/docs/perf.txt @@ -0,0 +1,85 @@ +=========== +Performance +=========== + +The SPMD programming model that ``ispc`` provides makes it easy to harness the +computational power available in SIMD vector units on modern CPUs, while +its basis in C makes it easy for programmers to adopt and use +productively. This page summarizes the performance of ``ispc`` with the +workloads in the ``examples/`` directory of the ``ispc`` distribution. + +These results were measured on an Apple iMac with a 4-core 3.4GHz +Intel® Core i7 processor using the Intel® AVX instruction set. The basis +for comparison is a reference C++ implementation compiled with gcc 4.2.1, +the version distributed with OS X 10.7.2. (The reference implementation is +also included in the ``examples/`` directory.) + +.. list-table:: Performance of ``ispc`` with a variety of the workloads + from the ``examples/`` directory of the ``ispc`` distribution, compared + to a reference C++ implementation compiled with gcc 4.2.1.
+ + * - Workload + - ``ispc``, 1 core + - ``ispc``, 4 cores + * - `AOBench`_ (512 x 512 resolution) + - 3.99x + - 19.32x + * - `Binomial Options`_ (128k options) + - 7.94x + - 33.43x + * - `Black-Scholes Options`_ (128k options) + - 8.45x + - 32.48x + * - `Deferred Shading`_ (1280p) + - n/a + - 23.06x + * - `Mandelbrot Set`_ + - 6.21x + - 19.90x + * - `Perlin Noise Function`_ + - 5.37x + - n/a + * - `Ray Tracer`_ (Sponza dataset) + - 3.99x + - 19.32x + * - `3D Stencil`_ + - 3.76x + - 13.79x + * - `Volume Rendering`_ + - 3.11x + - 15.80x + + +.. _AOBench: https://github.com/ispc/ispc/tree/master/examples/aobench +.. _Binomial Options: https://github.com/ispc/ispc/tree/master/examples/options +.. _Black-Scholes Options: https://github.com/ispc/ispc/tree/master/examples/options +.. _Deferred Shading: https://github.com/ispc/ispc/tree/master/examples/deferred +.. _Mandelbrot Set: https://github.com/ispc/ispc/tree/master/examples/mandelbrot_tasks +.. _Ray Tracer: https://github.com/ispc/ispc/tree/master/examples/rt +.. _Perlin Noise Function: https://github.com/ispc/ispc/tree/master/examples/noise +.. _3D Stencil: https://github.com/ispc/ispc/tree/master/examples/stencil +.. _Volume Rendering: https://github.com/ispc/ispc/tree/master/examples/volume_rendering + + +The following table shows speedups for a number of the examples on a +2.40GHz, 40-core Intel® Xeon E7-8870 system with the Intel® SSE4 +instruction set, running Microsoft Windows Server 2008 Enterprise. Here, +the serial C/C++ baseline code was compiled with MSVC 2010. + +.. list-table:: Performance of ``ispc`` with a variety of the workloads + from the ``examples/`` directory of the ``ispc`` distribution, on + a system with 40 CPU cores.
+ + * - Workload + - ``ispc``, 40 cores + * - AOBench (2048 x 2048 resolution) + - 182.36x + * - Binomial Options (2m options) + - 63.85x + * - Black-Scholes Options (2m options) + - 83.97x + * - Ray Tracer (Sponza dataset) + - 195.67x + * - Volume Rendering + - 243.18x + diff --git a/docs/perfguide.txt b/docs/perfguide.txt new file mode 100644 index 00000000..e6006012 --- /dev/null +++ b/docs/perfguide.txt @@ -0,0 +1,714 @@ +============================================== +Intel® SPMD Program Compiler Performance Guide +============================================== + +The SPMD programming model provided by ``ispc`` naturally delivers +excellent performance for many workloads thanks to efficient use of CPU +SIMD vector hardware. This guide provides more details about how to get +the most out of ``ispc`` in practice. + +* `Key Concepts`_ + + + `Efficient Iteration With "foreach"`_ + + `Improving Control Flow Coherence With "foreach_tiled"`_ + + `Using Coherent Control Flow Constructs`_ + + `Use "uniform" Whenever Appropriate`_ + +* `Tips and Techniques`_ + + + `Understanding Gather and Scatter`_ + + `Avoid 64-bit Addressing Calculations When Possible`_ + + `Avoid Computation With 8 and 16-bit Integer Types`_ + + `Implementing Reductions Efficiently`_ + + `Using Low-level Vector Tricks`_ + + `The "Fast math" Option`_ + + `"inline" Aggressively`_ + + `Avoid The System Math Library`_ + + `Declare Variables In The Scope Where They're Used`_ + + `Instrumenting ISPC Programs To Understand Runtime Behavior`_ + + `Choosing A Target Vector Width`_ + +* `Disclaimer and Legal Information`_ + +* `Optimization Notice`_ + +Key Concepts +============ + +This section describes the four most important concepts to understand and +keep in mind when writing high-performance ``ispc`` programs. It assumes +good familiarity with the topics covered in the ``ispc`` `Users Guide`_. + +.. 
_Users Guide: ispc.html + +Efficient Iteration With "foreach" +---------------------------------- + +The ``foreach`` parallel iteration construct is semantically equivalent to +a regular ``for()`` loop, though it offers meaningful performance benefits. +(See the `documentation on "foreach" in the Users Guide`_ for a review of +its syntax and semantics.) As an example, consider this simple function +that iterates over some number of elements in an array, doing computation +on each one: + +.. _documentation on "foreach" in the Users Guide: ispc.html#parallel-iteration-statements-foreach-and-foreach-tiled + +:: + + export void foo(uniform int a[], uniform int count) { + for (int i = programIndex; i < count; i += programCount) { + // do some computation on a[i] + } + } + +Depending on the specifics of the computation being performed, the code +generated for this function could likely be improved by modifying the loop +so that it only goes as far through the data as it can while still packing +an entire gang of program instances with computation each time through the +loop. Doing so enables the ``ispc`` compiler to generate more efficient +code for cases where it knows that the execution mask is "all on". Then, +an ``if`` statement at the end handles processing the ragged extra bits of +data that didn't fully fill a gang.
+ +:: + + export void foo(uniform int a[], uniform int count) { + // First, just loop up to the point where all program instances + // in the gang will be active at the loop iteration start + uniform int countBase = count & ~(programCount-1); + for (uniform int i = 0; i < countBase; i += programCount) { + int index = i + programIndex; + // do some computation on a[index] + } + // Now handle the ragged extra bits at the end + if (countBase < count) { + int index = countBase + programIndex; + // do some computation on a[index] + } + } + +While the performance of the above code will likely be better than the +first version of the function, the loop body code has been duplicated (or +has been forced to move into a separate utility function). + +Using the ``foreach`` looping construct as below provides all of the +performance benefits of the second version of this function, with the +compactness of the first. + +:: + + export void foo(uniform int a[], uniform int count) { + foreach (i = 0 ... count) { + // do some computation on a[i] + } + } + +Improving Control Flow Coherence With "foreach_tiled" +----------------------------------------------------- + +Depending on the computation being performed, ``foreach_tiled`` may give +better performance than ``foreach``. (See the `documentation in the Users +Guide`_ for the syntax and semantics of ``foreach_tiled``.) Given a +multi-dimensional iteration like: + +.. _documentation in the Users Guide: ispc.html#parallel-iteration-statements-foreach-and-foreach-tiled + +:: + + foreach (i = 0 ... width, j = 0 ... height) { + // do computation on element (i,j) + } + +if the ``foreach`` statement is used, elements in the gang of program +instances will be mapped to values of ``i`` and ``j`` by taking spans of +``programCount`` elements across ``i`` with a single value of ``j``. 
For +example, the ``foreach`` statement above roughly corresponds to: + +:: + + for (uniform int j = 0; j < height; ++j) + for (int i = 0; i < width; i += programCount) { + // do computation + } + +When a multi-dimensional domain is being iterated over, the ``foreach_tiled`` +statement maps program instances to data in a way that tries to select +square n-dimensional segments of the domain. For example, on a compilation +target with 8-wide gangs of program instances, it generates code that +iterates over the domain the same way as the following code (though more +efficiently): + +:: + + for (int j = programIndex/4; j < height; j += 2) + for (int i = programIndex%4; i < width; i += 4) { + // do computation + } + +Thus, each gang of program instances operates on a 2x4 tile of the domain. +With higher-dimensional iteration and different gang sizes, a similar +mapping is performed--e.g. for 2D iteration with a 16-wide gang size, 4x4 +tiles are iterated over; for 4D iteration with an 8-wide gang, 1x2x2x2 tiles are +processed, and so forth. + +The performance benefit of ``foreach_tiled`` comes from iterating over +*compact* regions of the domain (while ``foreach`` iterates over the domain in a way that +generally allows linear memory access.) There are two benefits from +processing compact regions of the domain. + +First, it's often the case that the control flow coherence of the program +instances in the gang is improved; if data-dependent control flow decisions +are related to the values of the data in the domain being processed, and if +the data values have some coherence, iterating over compact regions will +improve control flow coherence. + +Second, processing compact regions may mean that the data accessed by +program instances in the gang is more coherent, leading to performance +benefits from better cache hit rates.
+ +As a concrete example, for the ray tracer example in the ``ispc`` +distribution (in the ``examples/rt`` directory), performance is 20% better +when the pixels are iterated over using ``foreach_tiled`` than with ``foreach``, +because more coherent regions of the scene are accessed by the set of rays +in the gang of program instances. + + +Using Coherent Control Flow Constructs +-------------------------------------- + +Recall from the `SPMD-on-SIMD Execution Model +section`_ of the ``ispc`` Users Guide that ``if`` statements with a ``uniform`` test compile to more +efficient code than ``if`` statements with varying tests. The coherent ``cif`` +statement can provide many of the benefits of an ``if`` with a uniform test in +cases where the test is actually varying. + +.. _SPMD-on-SIMD Execution Model section: ispc.html#the-spmd-on-simd-execution-model + +When ``cif`` is used, the code the compiler generates for the +test is along the lines of the following pseudo-code: + +:: + + bool expr = /* evaluate cif condition */ + if (all(expr)) { + // run "true" case of if test only + } else if (!any(expr)) { + // run "false" case of if test only + } else { + // run both true and false cases, updating mask appropriately + } + +For ``if`` statements where the different running SPMD program instances +don't have coherent values for the boolean ``if`` test, using ``cif`` +introduces some additional overhead from the ``all`` and ``any`` tests as +well as the corresponding branches. For cases where the program +instances often do compute the same boolean value, this overhead is +worthwhile. If the control flow is in fact usually incoherent, this +overhead only costs performance. + +In a similar fashion, ``ispc`` provides ``cfor``, ``cwhile``, and ``cdo`` +statements. These statements are semantically the same as the +corresponding non-"c"-prefixed statements.
+ +Use "uniform" Whenever Appropriate +---------------------------------- + +For any variable that will always have the same value across all of the +program instances in a gang, declare the variable with the ``uniform`` +qualifier. Doing so enables the ``ispc`` compiler to emit better code in +many different ways. + +As a simple example, consider a ``for`` loop that always does the same +number of iterations: + +:: + + for (int i = 0; i < 10; ++i) + // do something ten times + +If this is written with ``i`` as a ``varying`` variable, as above, there's +additional overhead in the code generated for the loop as the compiler +emits instructions to handle the possibility of not all program instances +following the same control flow path (as might be the case if the loop +limit, 10, was itself a ``varying`` value.) + +If the above loop is instead written with ``i`` declared ``uniform``, as: + +:: + + for (uniform int i = 0; i < 10; ++i) + // do something ten times + +Then better code can be generated (and the loop possibly unrolled). + +In some cases, the compiler may be able to detect simple cases like these, +but it's always best to provide the compiler with as much help as possible +to understand the actual form of your computation. + + +Tips and Techniques +=================== + +This section introduces a number of additional techniques that are worth +keeping in mind when writing ``ispc`` programs. + +Understanding Gather and Scatter +-------------------------------- + +Memory reads and writes from the program instances in a gang that access +irregular memory locations (rather than a consecutive set of locations, or +a single location) can be relatively inefficient. As an example, consider +the "simple" array indexing calculation below: + +:: + + int i = ....; + uniform float x[10] = { ... }; + float f = x[i]; + +Since the index ``i`` is a varying value, the program instances in the gang +will in general be reading different locations in the array ``x``.
Because +current CPUs do not have a "gather" instruction, the ``ispc`` compiler has to +serialize these memory reads, performing a separate memory load for each +running program instance, packing the results into ``f``. (The analogous +case happens for a write into ``x[i]``.) + +In many cases, gathers like these are unavoidable; the program instances +just need to access incoherent memory locations. However, if the array +index ``i`` actually has the same value for all of the program instances or +if it represents an access to a consecutive set of array locations, much +more efficient load and store instructions can be generated instead of +gathers and scatters, respectively. + +In many cases, the ``ispc`` compiler is able to deduce that the memory +locations accessed by a varying index are either all the same or are a +consecutive, linear sequence. For example, given: + +:: + + uniform int x = ...; + int y = x; + return array[y]; + +The compiler is able to determine that all of the program instances are +loading from the same location, even though ``y`` is not a ``uniform`` +variable. In this case, the compiler will transform this load to a regular +vector load, rather than a general gather. + +Sometimes the running program instances will access a linear sequence of +memory locations; this happens most frequently when array indexing is done +based on the built-in ``programIndex`` variable. In many of these cases, +the compiler is also able to detect this case and then do a vector load. +For example, given: + +:: + + for (int i = programIndex; i < count; i += programCount) + // process array[i]; + +Regular vector loads and stores are issued for accesses to ``array[i]``. + +Both of these are cases where the compiler is able to determine the access +pattern statically, at compile time. Often, this determination can't be +made at compile time, even though the indices do frequently turn out to +have the same value at run time.
The ``reduce_equal()`` function from +the standard library can be used in this case; it checks to see if the +given value is the same across all of the running program instances, +returning true and the shared ``uniform`` value if so. + +The following function shows the use of ``reduce_equal()`` to check for an +equal index at execution time and then either do a scalar load and +broadcast or a general gather. + +:: + + uniform float array[..] = { ... }; + float value; + int i = ...; + uniform int ui; + if (reduce_equal(i, &ui) == true) + value = array[ui]; // scalar load + broadcast + else + value = array[i]; // gather + +For a simple case like the one above, the overhead of doing the +``reduce_equal()`` check is likely not worthwhile compared to just always +doing a gather. In more complex cases, where a number of accesses are done +based on the index, it can be worth doing. See the example +``examples/volume_rendering`` in the ``ispc`` distribution for the use of +this technique in an instance where it is beneficial to performance. + +Avoid 64-bit Addressing Calculations When Possible +-------------------------------------------------- + +Even when compiling to a 64-bit architecture target, ``ispc`` does many of +the addressing calculations in 32-bit precision by default--this behavior +can be overridden with the ``--addressing=64`` command-line argument. This +option should only be used if it's necessary to be able to address over 4GB +of memory in the ``ispc`` code, as it essentially doubles the cost of +memory addressing calculations in the generated code. + +Avoid Computation With 8 and 16-bit Integer Types +------------------------------------------------- + +The code generated for 8 and 16-bit integer types is generally not as +efficient as the code generated for 32-bit integer types. It is generally +worthwhile to use 32-bit integer types for intermediate computations, even +if the final result will be stored in a smaller integer type.
+ +Implementing Reductions Efficiently +----------------------------------- + +It's often necessary to compute a reduction over a data set--for example, +one might want to add all of the values in an array, compute their minimum, +etc. ``ispc`` provides a few capabilities that make it easy to efficiently +compute reductions like these. However, it's important to use these +capabilities appropriately for best results. + +As an example, consider the task of computing the sum of all of the values +in an array. In C code, we might have: + +:: + + /* C implementation of a sum reduction */ + float sum(const float array[], int count) { + float sum = 0; + for (int i = 0; i < count; ++i) + sum += array[i]; + return sum; + } + +Exactly this computation could also be expressed as a purely uniform +computation in ``ispc``, though without any benefit from vectorization: + +:: + + /* inefficient ispc implementation of a sum reduction */ + uniform float sum(const uniform float array[], uniform int count) { + uniform float sum = 0; + for (uniform int i = 0; i < count; ++i) + sum += array[i]; + return sum; + } + +As a first attempt, one might use the ``reduce_add()`` function from the +``ispc`` standard library; it takes a ``varying`` value and returns the sum +of that value across all of the active program instances. + +:: + + /* inefficient ispc implementation of a sum reduction */ + uniform float sum(const uniform float array[], uniform int count) { + uniform float sum = 0; + foreach (i = 0 ... count) + sum += reduce_add(array[i]); + return sum; + } + +This implementation loads a gang's worth of values from the array, one for +each of the program instances, and then uses ``reduce_add()`` to reduce +across the program instances and update the sum. Unfortunately this +approach loses most of the benefit of vectorization, as it does more work on the +cross-program instance ``reduce_add()`` call than it saves from the vector +load of values.
The most efficient approach is to do the reduction in two phases: rather +than using a ``uniform`` variable to store the sum, we maintain a varying +value, such that each program instance is effectively computing a local +partial sum on the subset of array values that it has loaded from the +array. When the loop over array elements concludes, a single call to +``reduce_add()`` computes the final reduction across the program +instances' elements of ``sum``. This approach effectively compiles to a +single vector load and a single vector add for each loop iteration's worth of +values--very efficient code in the end. + +:: + + /* good ispc implementation of a sum reduction */ + uniform float sum(const uniform float array[], uniform int count) { + float sum = 0; + foreach (i = 0 ... count) + sum += array[i]; + return reduce_add(sum); + } + +Using Low-level Vector Tricks +----------------------------- + +Many low-level Intel® SSE and AVX coding constructs can be implemented in +``ispc`` code. The ``ispc`` standard library functions ``intbits()`` and +``floatbits()`` are often useful in this context. Recall that +``intbits()`` takes a ``float`` value and returns it as an integer where +the bits of the integer are the same as the bit representation in memory of +the ``float``. (In other words, it does *not* perform a floating-point to +integer value conversion.) ``floatbits()``, then, performs the inverse +computation. + +As an example of the use of these functions, the following code efficiently +reverses the sign of the given values. + +:: + + float flipsign(float a) { + unsigned int i = intbits(a); + i ^= 0x80000000; + return floatbits(i); + } + +This code compiles down to a single XOR instruction. + +The "Fast math" Option +---------------------- + +``ispc`` has a ``--opt=fast-math`` command-line flag that enables a number of +optimizations that may be undesirable in code where numerical precision is +critically important.
For many graphics applications, however, the +approximations introduced may be acceptable. The following two +optimizations are performed when ``--opt=fast-math`` is used. By default, the +``--opt=fast-math`` flag is off. + +* Expressions like ``x / y``, where ``y`` is a compile-time constant, are + transformed to ``x * (1./y)``, where the inverse value of ``y`` is + precomputed at compile time. + +* Expressions like ``x / y``, where ``y`` is not a compile-time constant, + are transformed to ``x * rcp(y)``, where ``rcp()`` is the + approximate reciprocal function from the ``ispc`` standard library. + + +"inline" Aggressively +--------------------- + +Inlining functions aggressively is generally beneficial for performance +with ``ispc``. Definitely use the ``inline`` qualifier for any short +functions (a few lines long), and experiment with it for longer functions. + +Avoid The System Math Library +----------------------------- + +The default math library that ``ispc`` uses for transcendentals and the like has +higher error than the system's math library, though it is much more efficient +due to being vectorized across the program instances and due to the fact +that the functions can be inlined in the final code. (It generally has +errors in the range of 10ulps, while the system math library generally has +no more than 1ulp of error for transcendentals.) + +If the ``--math-lib=system`` command-line option is used when compiling an +``ispc`` program, then calls to the system math library will be generated +instead. This option should only be used if the higher precision is +absolutely required, as the performance impact of using it can be +significant. + +Declare Variables In The Scope Where They're Used +------------------------------------------------- + +Performance is slightly improved by declaring variables at the same block +scope where they are first used.
For example, in code like the +following, if the lifetime of ``foo`` is only within the scope of the +``if`` clause, write the code like this: + +:: + + float func() { + .... + if (x < y) { + float foo; + ... use foo ... + } + } + +Try not to write code as: + +:: + + float func() { + float foo; + .... + if (x < y) { + ... use foo ... + } + } + +Following this guideline can reduce the number of masked store instructions that the +compiler needs to generate. + +Instrumenting ISPC Programs To Understand Runtime Behavior +---------------------------------------------------------- + +``ispc`` has an optional instrumentation feature that can help you +understand performance issues. If a program is compiled using the +``--instrument`` flag, the compiler emits calls to a function with the +following signature at various points in the program (for +example, at interesting points in the control flow, or when scatters or +gathers happen). + +:: + + extern "C" { + void ISPCInstrument(const char *fn, const char *note, + int line, int mask); + } + +This function is passed the file name of the running ``ispc`` file, a short +note indicating what is happening, the line number in the source file, and +the current mask of active program instances in the gang. You must provide an +implementation of this function and link it in with your application. + +For example, when the ``ispc`` program runs, this function might be called +as follows: + +:: + + ISPCInstrument("foo.ispc", "function entry", 55, 0xf); + +This call indicates that the currently executing program has just +entered the function defined at line 55 of the file ``foo.ispc``, with a +mask of all lanes currently executing (assuming a four-wide gang size +target machine). + +For a fuller example of the utility of this functionality, see +``examples/aobench_instrumented`` in the ``ispc`` distribution.
This +example includes an implementation of the ``ISPCInstrument()`` function +that collects aggregate data about the program's execution behavior. + +When running this example, you will want to direct the ``ao`` executable +to generate a low-resolution image, because the instrumentation adds +substantial execution overhead. For example: + +:: + + % ./ao 1 32 32 + +After the ``ao`` program exits, a summary report along the following lines +will be printed. In the first few lines, you can see how many times a few +functions were called, and the average percentage of SIMD lanes that were +active upon function entry. + +:: + + ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes + ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes + ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes + ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes + ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes + ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes + ... + + +Choosing A Target Vector Width +------------------------------ + +By default, ``ispc`` compiles to the natural vector width of the target +instruction set. For example, for SSE2 and SSE4, it compiles four-wide, +and for AVX, it compiles 8-wide. For some programs, higher performance may +be seen if the program is compiled to a doubled vector width--8-wide for +SSE and 16-wide for AVX. + +For workloads that don't require many registers, this method can lead to +significantly more efficient execution thanks to greater instruction level +parallelism and amortization of various overheads over more program +instances. For other workloads, it may lead to a slowdown due to higher +register pressure; trying both approaches for key kernels may be +worthwhile.
+ +This option is only available for each of the SSE2, SSE4 and AVX targets. +It is selected with the ``--target=sse2-x2``, ``--target=sse4-x2`` and +``--target=avx-x2`` options, respectively. + + +Disclaimer and Legal Information +================================ + +INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. +NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL +PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS +AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, +AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE +OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A +PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT +OR OTHER INTELLECTUAL PROPERTY RIGHT. + +UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED +NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD +CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. + +Intel may make changes to specifications and product descriptions at any time, +without notice. Designers must not rely on the absence or characteristics of any +features or instructions marked "reserved" or "undefined." Intel reserves these +for future definition and shall have no responsibility whatsoever for conflicts +or incompatibilities arising from future changes to them. The information here +is subject to change without notice. Do not finalize a design with this +information. + +The products described in this document may contain design defects or errors +known as errata which may cause the product to deviate from published +specifications. Current characterized errata are available on request. + +Contact your local Intel sales office or your distributor to obtain the latest +specifications and before placing your product order. 
+ +Copies of documents which have an order number and are referenced in this +document, or other Intel literature, may be obtained by calling 1-800-548-4725, +or by visiting Intel's Web Site. + +Intel processor numbers are not a measure of performance. Processor numbers +differentiate features within each processor family, not across different +processor families. See http://www.intel.com/products/processor_number for +details. + +BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, +Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, +i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, +IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, +Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, +Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, +Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, +Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, +skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, +and Xeon Inside are trademarks of Intel Corporation in the U.S. and other +countries. + +* Other names and brands may be claimed as the property of others. + +Copyright(C) 2011, Intel Corporation. All rights reserved. + + +Optimization Notice +=================== + +Intel compilers, associated libraries and associated development tools may +include or utilize options that optimize for instruction sets that are +available in both Intel and non-Intel microprocessors (for example SIMD +instruction sets), but do not optimize equally for non-Intel +microprocessors. In addition, certain compiler options for Intel +compilers, including some that are not specific to Intel +micro-architecture, are reserved for Intel microprocessors. 
For a detailed +description of Intel compiler options, including the instruction sets and +specific microprocessors they implicate, please refer to the "Intel +Compiler User and Reference Guides" under "Compiler Options." Many library +routines that are part of Intel compiler products are more highly optimized +for Intel microprocessors than for other microprocessors. While the +compilers and libraries in Intel compiler products offer optimizations for +both Intel and Intel-compatible microprocessors, depending on the options +you select, your code and other factors, you likely will get extra +performance on Intel microprocessors. + +Intel compilers, associated libraries and associated development tools may +or may not optimize to the same degree for non-Intel microprocessors for +optimizations that are not unique to Intel microprocessors. These +optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), +Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental +Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other +optimizations. Intel does not guarantee the availability, functionality, +or effectiveness of any optimization on microprocessors not manufactured by +Intel. Microprocessor-dependent optimizations in this product are intended +for use with Intel microprocessors. + +While Intel believes our compilers and libraries are excellent choices to +assist in obtaining the best performance on Intel and non-Intel +microprocessors, Intel recommends that you evaluate other compilers and +libraries to determine which best meet your requirements. We hope to win +your business by striving to offer the best performance of any compiler or +library; please let us know if you find we do not. + diff --git a/docs/template-perf.txt b/docs/template-perf.txt index 437702f1..e051c2e7 100644 --- a/docs/template-perf.txt +++ b/docs/template-perf.txt @@ -28,8 +28,8 @@