FAQ and perf guide updates
docs/faq.txt

Intel® SPMD Program Compiler Frequently Asked Questions (FAQ)
=============================================================

This document includes a number of frequently (and not frequently) asked
questions about ispc, the Intel® SPMD Program Compiler. The source to this
document is in the file ``docs/faq.txt`` in the ``ispc`` source
distribution.

* Understanding ispc's Output

  + `How can I see the assembly language generated by ispc?`_
  + `How can I have the assembly output be printed using Intel assembly syntax?`_
  + `Why are there multiple versions of exported ispc functions in the assembly output?`_
  + `How can I more easily see gathers and scatters in generated assembly?`_

* Interoperability

  + `How can I supply an initial execution mask in the call from the application?`_
  + `How can I generate a single binary executable with support for multiple instruction sets?`_
  + `How can I determine at run-time which vector instruction set's instructions were selected to execute?`_

* Programming Techniques

  + `What primitives are there for communicating between SPMD program instances?`_
  + `How can a gang of program instances generate variable output efficiently?`_
  + `Is it possible to use ispc for explicit vector programming?`_

Understanding ispc's Output
===========================

How can I see the assembly language generated by ispc?
-------------------------------------------------------

The ``--emit-asm`` flag causes assembly output to be generated. If the
``-o`` command-line flag is also supplied, the assembly is stored in the
given file, or printed to standard output if ``-`` is specified for the
filename. For example, given the simple ``ispc`` program:

::

    export uniform int foo(uniform int a, uniform int b) {
        return a+b;
    }

If the SSE4 target is used, then the following assembly is printed:

::

    _foo:                    ## @foo
    ## BB#0:                 ## %allocas
        addl    %esi, %edi
        movl    %edi, %eax
        ret

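Putting those flags together, a command along these lines (``foo.ispc`` here
is just a placeholder filename) prints the SSE4 assembly to standard output:

::

    ispc --emit-asm --target=sse4 foo.ispc -o -
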
How can I have the assembly output be printed using Intel assembly syntax?
----------------------------------------------------------------------------

The ``ispc`` compiler is currently only able to emit assembly with AT&T
syntax, where the destination operand comes last. If you'd prefer Intel
assembly output, one option is to use Agner Fog's ``objconv`` tool: have
``ispc`` emit a native object file and then use ``objconv`` to disassemble
it, specifying the assembler syntax that you prefer. ``objconv``
`is available for download here`_.

.. _is available for download here: http://www.agner.org/optimize/#objconv

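As a sketch of that workflow (the ``-fnasm`` flag is meant to select
NASM-style Intel-syntax output; check the ``objconv`` manual for the exact
option names and the other supported syntaxes):

::

    ispc foo.ispc -o foo.o
    objconv -fnasm foo.o foo.asm
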
Why are there multiple versions of exported ispc functions in the assembly output?
------------------------------------------------------------------------------------

Two versions of each function qualified with ``export`` are generated: one
to be called by other ``ispc`` functions, and the other to be called by the
application. The application-callable function has the original function's
name, while the ``ispc``-callable function has a mangled name that encodes
the types of the function's parameters.

The crucial difference between these two functions is that the
application-callable function doesn't take a parameter encoding the current
execution mask, while ``ispc``-callable functions have a hidden mask
parameter. An implication of this difference is that the ``export``
function starts with the execution mask "all on". This allows a number of
improvements in the generated code, particularly on architectures that
don't have support for masked load and store instructions.

As an example, consider this short function, which loads a vector's worth
of values from two arrays in memory, adds them, and writes the result to an
output array.

::

    export void foo(uniform float a[], uniform float b[],
                    uniform float result[]) {
        float aa = a[programIndex], bb = b[programIndex];
        result[programIndex] = aa+bb;
    }

Here is the assembly code for the application-callable instance of the
function--note that the selected instructions are ideal.

::

    _foo:
        movups  (%rsi), %xmm1
        movups  (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        ret

And here is the assembly code for the ``ispc``-callable instance of the
function. There are a few things to notice in this code.

The current program mask comes in via the %xmm0 register, and the
initial few instructions in the function essentially check to see if the
mask is all-on or all-off. If the mask is all on, the code at the label
LBB0_3 executes; it's the same as the code that was generated for ``_foo``
above. If the mask is all off, then there's nothing to be done, and the
function can return immediately.

In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on. (This code is elided below.) This
general pattern of having two code paths for the "all on" and "mixed" mask
cases is used in the code generated for all but the most simple functions
(where the overhead of the test isn't worthwhile).

::

    "_foo___uptr<Uf>uptr<Uf>uptr<Uf>":
        movmskps    %xmm0, %eax
        cmpl    $15, %eax
        je      LBB0_3
        testl   %eax, %eax
        jne     LBB0_4
        ret
    LBB0_3:
        movups  (%rsi), %xmm1
        movups  (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        ret
    LBB0_4:
        ####
        #### Code elided; handle mixed mask case..
        ####
        ret

How can I more easily see gathers and scatters in generated assembly?
-----------------------------------------------------------------------

FIXME

Interoperability
================

How can I supply an initial execution mask in the call from the application?
------------------------------------------------------------------------------

Recall that when execution transitions from the application code to an
``ispc`` function, all of the program instances are initially executing.
In some cases, it may be desirable for only some of them to be running,
based on a data-dependent condition computed in the application program.
This situation can easily be handled via an additional parameter from the
application.

As a simple example, consider a case where the application code has an
array of ``float`` values and we'd like the ``ispc`` code to update just
specific values in that array, where the set of values to be updated has
been determined by the application. In C++ code, we might have:

::

    int count = ...;
    float *array = new float[count];
    bool *shouldUpdate = new bool[count];
    // initialize array and shouldUpdate
    ispc_func(array, shouldUpdate, count);

Then, the ``ispc`` code could process this update as:

::

    export void ispc_func(uniform float array[], uniform bool update[],
                          uniform int count) {
        foreach (i = 0 ... count) {
            cif (update[i] == true) {
                // update array[i] ...
            }
        }
    }

(In this case a "coherent" if statement is likely to be worthwhile if the
``update`` array will tend to have sections that are either all-true or
all-false.)

How can I generate a single binary executable with support for multiple instruction sets?
--------------------------------------------------------------------------------------------

``ispc`` can generate output that supports multiple target instruction
sets, along with code that chooses the most appropriate one at runtime,
when multiple targets are specified with the ``--target`` command-line
argument.

For example, if you run the command:

::

    ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2

then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
``foo_avx.o``, and ``foo.o``. [#]_ Link all of these into your executable,
and when you call a function in ``foo.ispc`` from your application code,
``ispc`` will determine which instruction sets are supported by the CPU the
code is running on and will call the most appropriate version of the
function available.

.. [#] Similarly, if you choose to generate assembly language output or
   LLVM bitcode output, multiple versions of those files will be created.

In general, the version of the function that runs will be the one for the
most capable instruction set that the system supports. If you only compile
SSE2 and SSE4 variants and run on a system that supports AVX, for example,
then the SSE4 variant will be executed. If the system is not able to run
any of the available variants of the function (for example, trying to run a
function that only has SSE4 and AVX variants on a system that only supports
SSE2), then the standard library ``abort()`` function will be called.

One subtlety is that all non-static global variables (if any) must have the
same size and layout across all of the targets used. For example, if you
have the global variables:

::

    uniform int foo[2*programCount];
    int bar;

and compile to both SSE2 and AVX targets, both of these variables will have
different sizes (the first due to ``programCount`` having the value 4 for
SSE2 and 8 for AVX, and the second due to ``varying`` types having
different numbers of elements with the two targets--essentially the same
issue as the first.) ``ispc`` issues an error in this case.

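As an illustration of the whole build (the application file name and the
compiler driver here are placeholders; use whatever your build system
does), the commands might look roughly like:

::

    ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2
    c++ main.cpp foo.o foo_sse2.o foo_sse4.o foo_avx.o -o app
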
How can I determine at run-time which vector instruction set's instructions were selected to execute?
--------------------------------------------------------------------------------------------------------

``ispc`` doesn't provide any API that allows querying which vector ISA's
instructions are running when multi-target compilation is used. However,
this can be solved in "user space" by writing a small helper function.
Specifically, if you implement a function like this:

::

    export uniform int isa() {
    #if defined(ISPC_TARGET_SSE2)
        return 0;
    #elif defined(ISPC_TARGET_SSE4)
        return 1;
    #elif defined(ISPC_TARGET_AVX)
        return 2;
    #else
        return -1;
    #endif
    }

and then call it from your application code at runtime, it will return 0,
1, or 2, depending on which target's instructions are running.

The way this works is a little surprising, but it's a useful trick. Of
course the preprocessor ``#if`` checks are all compile-time only
operations. What's actually happening is that the function is compiled
multiple times, once for each target, with the appropriate ``ISPC_TARGET``
preprocessor symbol set. Then, a small dispatch function is generated for
the application to actually call. This dispatch function in turn calls the
appropriate version of the function based on the CPU of the system it's
executing on, which in turn returns the appropriate value.

In a similar fashion, it's possible to find out at run-time the value of
``programCount`` for the target that's actually being used:

::

    export uniform int width() { return programCount; }

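As a sketch of the application side (the declarations below are written out
by hand for illustration; in a real build they would normally come from an
``ispc``-generated header), calling these helpers might look like:

::

    #include <cstdio>

    // Hand-written declarations matching the exported ispc functions above.
    extern "C" {
        int isa();
        int width();
    }

    int main() {
        static const char *names[] = { "SSE2", "SSE4", "AVX" };
        int which = isa();
        printf("Running %s instructions, gang size %d\n",
               (which >= 0 && which <= 2) ? names[which] : "unknown", width());
        return 0;
    }
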
Programming Techniques
======================

What primitives are there for communicating between SPMD program instances?
------------------------------------------------------------------------------

The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library
routines provide a variety of mechanisms for the running program instances
to communicate values to each other during execution. Note that there's no
need to synchronize the program instances before communicating between
them, due to the synchronized execution model of gangs of program instances
in ``ispc``.

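For instance, here is a small sketch (not from the original FAQ; it assumes
the ``rotate()`` and ``broadcast()`` signatures described in the standard
library documentation) of passing values between program instances:

::

    float compareWithNeighbors(float x) {
        // value held by the instance at (programIndex + 1), wrapping around
        float next = rotate(x, 1);
        // value held by program instance 0, made available to all instances
        float first = broadcast(x, 0);
        return (next - x) + (x - first);
    }
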
How can a gang of program instances generate variable output efficiently?
----------------------------------------------------------------------------

A useful application of the ``exclusive_scan_add()`` function in the
standard library is when program instances want to generate a variable
amount of output and one would like that output to be densely packed in a
single array. For example, consider the code fragment below:

::

    uniform int func(uniform float outArray[], ...) {
        int numOut = ...;         // figure out how many values to output
        float outLocal[MAX_OUT];  // staging area

        // each program instance in the gang puts its results in
        // outLocal[0], ..., outLocal[numOut-1]

        int startOffset = exclusive_scan_add(numOut);
        for (int i = 0; i < numOut; ++i)
            outArray[startOffset + i] = outLocal[i];
        return reduce_add(numOut);
    }

Here, each program instance has computed a number, ``numOut``, of values to
output, and has stored them in the ``outLocal`` array. Assume that four
program instances are running and that the first one wants to output one
value, the second two values, and the third and fourth three values each.
In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
to the four program instances, respectively.

The first program instance will write its one result to ``outArray[0]``,
the second will write its two values to ``outArray[1]`` and
``outArray[2]``, and so forth. The ``reduce_add`` call at the end returns
the total number of values that all of the program instances have written
to the array.

FIXME: add discussion of foreach_active as an option here once that's in

Is it possible to use ispc for explicit vector programming?
-------------------------------------------------------------

The typical model for programming in ``ispc`` is an *implicit* parallel
model, where one writes a program that is apparently doing scalar
computation on values and the program is then vectorized to run in parallel
across the SIMD lanes of a processor. However, ``ispc`` also has some
support for explicit vector programming, where the vector width is spelled
out directly in the types used. Some computations may be more effectively
described in the explicit model rather than the implicit model.

This support is provided via ``uniform`` instances of short vector types.
Specifically, if this short program

::

    export uniform float<8> madd(uniform float<8> a, uniform float<8> b,
                                 uniform float<8> c) {
        return a + b * c;
    }

is compiled with the AVX target, ``ispc`` generates the following assembly:

::

    _madd:
        vmulps  %ymm2, %ymm1, %ymm1
        vaddps  %ymm0, %ymm1, %ymm0
        ret

(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two
``addps`` instructions are generated, and so forth.)

Note that ``ispc`` doesn't currently support control flow based on
``uniform`` short vector types; it is thus not possible to write code like:

::

    export uniform int<8> count(uniform float<8> a, uniform float<8> b) {
        uniform int<8> sum = 0;
        while (a++ < b)
            ++sum;
        return sum;
    }

docs/perf.txt

Intel® SPMD Program Compiler Performance Guide
==============================================

* `Using ISPC Effectively`_

  + `Gather and Scatter`_
  + `8 and 16-bit Integer Types`_
  + `Low-level Vector Tricks`_
  + `The "Fast math" Option`_
  + `"Inline" Aggressively`_
  + `Small Performance Tricks`_
  + `Instrumenting Your ISPC Programs`_
  + `Choosing A Target Vector Width`_
  + `Implementing Reductions Efficiently`_

* `Disclaimer and Legal Information`_

* `Optimization Notice`_

FIXME: don't use the system math library unless it's absolutely necessary

FIXME: opt=32-bit-addressing

Using ISPC Effectively
======================

Gather and Scatter
------------------

The CPU is a poor fit for SPMD execution in some ways, the worst of which
is the handling of general memory reads and writes from SPMD program
instances. For example, consider a "simple" indexed array access:

::

    int i = ....;
    uniform float x[10] = { ... };
    float f = x[i];

Since the index ``i`` is a varying value, the various SPMD program
instances will in general be reading different locations in the array
``x``. Because the CPU doesn't have a gather instruction, the ``ispc``
compiler has to serialize these memory reads, performing a separate memory
load for each running program instance and packing the results into ``f``.
(And the analogous case applies to a write into ``x[i]``, which requires a
scatter.)

In many cases, gathers like these are unavoidable; the running program
instances just need to access incoherent memory locations. However, if the
array index ``i`` can actually be declared and used as a ``uniform``
variable, the resulting access is substantially more efficient. This is
another case where using ``uniform`` whenever applicable is of benefit.

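As a small illustration of the difference (a sketch, not from the original
guide), the two functions below index the same array; the first needs a
general gather because ``i`` is varying, while the second reads a single
location that is shared by all of the program instances:

::

    // varying index: the compiler must emit a gather
    float lookupVarying(uniform float table[], int i) {
        return table[i];
    }

    // uniform index: a single cheap load suffices
    uniform float lookupUniform(uniform float table[], uniform int i) {
        return table[i];
    }
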
In some cases, the ``ispc`` compiler is able to deduce that the memory
locations accessed are all the same, even when the index isn't declared
``uniform``. For example, given:

::

    uniform int x = ...;
    int y = x;
    return array[y];

the compiler is able to determine that all of the program instances are
loading from the same location, even though ``y`` is not a ``uniform``
variable. In this case, the compiler will issue a single load, rather than
a general gather.

Sometimes the running program instances will access a linear sequence of
memory locations; this happens most frequently when array indexing is done
based on the built-in ``programIndex`` variable. In many of these cases,
the compiler is able to detect the pattern and do a regular vector load.
For example, given:

::

    uniform int x = ...;
    return array[2*x + programIndex];

a regular vector load is done from ``array``, starting at offset ``2*x``.

8 and 16-bit Integer Types
--------------------------

The code generated for 8 and 16-bit integer types is generally not as
efficient as the code generated for 32-bit integer types. It is generally
worthwhile to use 32-bit integer types for intermediate computations, even
if the final result will be stored in a smaller integer type.

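As a sketch of this idea (the function and the scaling factor are made up
for illustration), the intermediate arithmetic below is done with 32-bit
``int`` values, and the 8-bit type is only used for the final store:

::

    export void darken(uniform unsigned int8 pixels[], uniform int count) {
        foreach (i = 0 ... count) {
            // widen to 32 bits for the intermediate math...
            int v = pixels[i];
            v = (v * 200) / 255;
            // ...and narrow back to 8 bits only when storing the result
            pixels[i] = (unsigned int8)v;
        }
    }
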
Low-level Vector Tricks
-----------------------

Many low-level Intel® SSE coding constructs can be implemented in ``ispc``
code. For example, the following code efficiently reverses the sign of the
given values.

::

    float flipsign(float a) {
        unsigned int i = intbits(a);
        i ^= 0x80000000;
        return floatbits(i);
    }

This code compiles down to a single XOR instruction.

The "Fast math" Option
|
||||
----------------------
|
||||
|
||||
``ispc`` has a ``--fast-math`` command-line flag that enables a number of
|
||||
optimizations that may be undesirable in code where numerical preceision is
|
||||
critically important. For many graphics applications, the
|
||||
approximations may be acceptable. The following two optimizations are
|
||||
performed when ``--fast-math`` is used. By default, the ``--fast-math``
|
||||
flag is off.
|
||||
|
||||
* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
|
||||
transformed to ``x * (1./y)``, where the inverse value of ``y`` is
|
||||
precomputed at compile time.
|
||||
|
||||
* Expressions like ``x / y``, where ``y`` is not a compile-time constant,
|
||||
are transformed to ``x * rcp(y)``, where ``rcp()`` maps to the
|
||||
approximate reciprocal instruction from the standard library.
|
||||
|
||||
|
||||
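For example (a sketch of the effect, not actual compiler output), with
``--fast-math`` the divisions below are rewritten roughly as the comments
indicate:

::

    float f(float x, float y) {
        float a = x / 4;  // becomes x * 0.25, inverse computed at compile time
        float b = x / y;  // becomes x * rcp(y), the approximate reciprocal
        return a + b;
    }
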
"Inline" Aggressively
|
||||
---------------------
|
||||
|
||||
Inlining functions aggressively is generally beneficial for performance
|
||||
with ``ispc``. Definitely use the ``inline`` qualifier for any short
|
||||
functions (a few lines long), and experiment with it for longer functions.
|
||||
|
||||
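For instance, a short helper like the following (a made-up example) is a
good candidate for the ``inline`` qualifier:

::

    inline float lerp(float t, float a, float b) {
        return a + t * (b - a);
    }
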
Small Performance Tricks
------------------------

Performance is slightly improved by declaring variables at the same block
scope where they are first used. For example, if the lifetime of ``foo``
is only within the scope of the ``if`` clause, write the code like this:

::

    float func() {
        ....
        if (x < y) {
            float foo;
            ... use foo ...
        }
    }

Try not to write the code as:

::

    float func() {
        float foo;
        ....
        if (x < y) {
            ... use foo ...
        }
    }

Declaring variables at the narrower scope can reduce the number of masked
store instructions that the compiler needs to generate.

Instrumenting Your ISPC Programs
--------------------------------

``ispc`` has an optional instrumentation feature that can help you
understand performance issues. If a program is compiled using the
``--instrument`` flag, the compiler emits calls to a function with the
following signature at various points in the program (for example, at
interesting points in the control flow, or when gathers and scatters
happen).

::

    extern "C" {
        void ISPCInstrument(const char *fn, const char *note,
                            int line, int mask);
    }

This function is passed the file name of the ``ispc`` file running, a short
note indicating what is happening, the line number in the source file, and
the current mask of active SPMD program lanes. You must provide an
implementation of this function and link it in with your application.

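A minimal implementation (just a sketch; the lane-counting helper is
written out by hand rather than relying on any particular compiler
intrinsic) might simply print each event:

::

    #include <cstdio>

    // Count how many SPMD lanes are active in the mask.
    static int countActiveLanes(int mask) {
        int count = 0;
        for (; mask != 0; mask &= (mask - 1))
            ++count;
        return count;
    }

    extern "C" void ISPCInstrument(const char *fn, const char *note,
                                   int line, int mask) {
        printf("%s(%d) - %s: mask 0x%x, %d lanes active\n",
               fn, line, note, mask, countActiveLanes(mask));
    }
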
For example, when the ``ispc`` program runs, this function might be called
as follows:

::

    ISPCInstrument("foo.ispc", "function entry", 55, 0xf);

This call indicates that the currently executing program has just entered
the function defined at line 55 of the file ``foo.ispc``, with a mask of
all lanes currently executing (assuming a four-wide Intel® SSE target
machine).

For a fuller example of the utility of this functionality, see
``examples/aobench_instrumented`` in the ``ispc`` distribution. This
example includes an implementation of the ``ISPCInstrument`` function that
collects aggregate data about the program's execution behavior.

When running this example, you will want to direct the ``ao`` executable
to generate a low-resolution image, because the instrumentation adds
substantial execution overhead. For example:

::

    % ./ao 1 32 32

After the ``ao`` program exits, a summary report along the following lines
will be printed. In the first few lines, you can see how many times a few
functions were called, and the average percentage of SIMD lanes that were
active upon function entry.

::

    ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes
    ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes
    ...

Choosing A Target Vector Width
------------------------------

By default, ``ispc`` compiles to the natural vector width of the target
instruction set. For example, for SSE2 and SSE4, it compiles four-wide,
and for AVX, it compiles eight-wide. For some programs, higher performance
may be seen if the program is compiled to a doubled vector width--8-wide
for SSE and 16-wide for AVX.

For workloads that don't require many registers, this method can lead to
significantly more efficient execution thanks to greater instruction-level
parallelism and amortization of various overheads over more program
instances. For other workloads, it may lead to a slowdown due to higher
register pressure; trying both approaches for key kernels may be
worthwhile.

This doubled width option is available for the SSE2, SSE4, and AVX targets.
It is selected with the ``--target=sse2-x2``, ``--target=sse4-x2``, and
``--target=avx-x2`` options, respectively.

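For example, to compile 8-wide for SSE4 (the file name is a placeholder):

::

    ispc foo.ispc -o foo.o --target=sse4-x2
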
Implementing Reductions Efficiently
-----------------------------------

It's often necessary to compute a "reduction" over a data set--for example,
one might want to add all of the values in an array, compute their minimum,
etc. ``ispc`` provides a few capabilities that make it easy to efficiently
compute reductions like these. However, it's important to use these
capabilities appropriately for best results.

As an example, consider the task of computing the sum of all of the values
in an array. In C code, we might have:

::

    /* C implementation of a sum reduction */
    float sum(const float array[], int count) {
        float sum = 0;
        for (int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

Of course, exactly this computation could also be expressed in ``ispc``,
though without any benefit from vectorization:

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        for (uniform int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

As a first attempt, one might use the ``reduce_add()`` function from the
``ispc`` standard library; it takes a ``varying`` value and returns the sum
of that value across all of the active program instances.

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += reduce_add(array[i+programIndex]);
        return sum;
    }

This implementation loads a set of ``programCount`` values from the array,
one for each of the program instances, and then uses ``reduce_add()`` to
reduce across the program instances and update the sum. Unfortunately,
this approach loses most of the benefit of vectorization, as it does more
work in the cross-program-instance ``reduce_add()`` call than it saves from
the vector load of values.

The most efficient approach is to do the reduction in two phases: rather
than using a ``uniform`` variable to store the sum, we maintain a varying
value, such that each program instance effectively computes a local partial
sum over the subset of array values that it has loaded from the array.
When the loop over array elements concludes, a single call to
``reduce_add()`` computes the final reduction across the program instances'
partial sums. This approach effectively compiles to a single vector load
and a single vector add for each ``programCount``'s worth of values--very
efficient code in the end.

::

    /* good ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += array[i+programIndex];
        return reduce_add(sum);
    }

Disclaimer and Legal Information
================================

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS.
NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE
OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT
OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED
NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD
CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time,
without notice. Designers must not rely on the absence or characteristics of any
features or instructions marked "reserved" or "undefined." Intel reserves these
for future definition and shall have no responsibility whatsoever for conflicts
or incompatibilities arising from future changes to them. The information here
is subject to change without notice. Do not finalize a design with this
information.

The products described in this document may contain design defects or errors
known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest
specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this
document, or other Intel literature, may be obtained by calling 1-800-548-4725,
or by visiting Intel's Web Site.

Intel processor numbers are not a measure of performance. Processor numbers
differentiate features within each processor family, not across different
processor families. See http://www.intel.com/products/processor_number for
details.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom,
Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile,
i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4,
IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside,
Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst,
Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep,
Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium,
Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside,
skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon,
and Xeon Inside are trademarks of Intel Corporation in the U.S. and other
countries.

* Other names and brands may be claimed as the property of others.

Copyright(C) 2011, Intel Corporation. All rights reserved.

Optimization Notice
===================

Intel compilers, associated libraries and associated development tools may
include or utilize options that optimize for instruction sets that are
available in both Intel and non-Intel microprocessors (for example SIMD
instruction sets), but do not optimize equally for non-Intel
microprocessors. In addition, certain compiler options for Intel
compilers, including some that are not specific to Intel
micro-architecture, are reserved for Intel microprocessors. For a detailed
description of Intel compiler options, including the instruction sets and
specific microprocessors they implicate, please refer to the "Intel
Compiler User and Reference Guides" under "Compiler Options." Many library
routines that are part of Intel compiler products are more highly optimized
for Intel microprocessors than for other microprocessors. While the
compilers and libraries in Intel compiler products offer optimizations for
both Intel and Intel-compatible microprocessors, depending on the options
you select, your code and other factors, you likely will get extra
performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may
or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These
optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2),
Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental
Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other
optimizations. Intel does not guarantee the availability, functionality,
or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to
assist in obtaining the best performance on Intel and non-Intel
microprocessors, Intel recommends that you evaluate other compilers and
libraries to determine which best meet your requirements. We hope to win
your business by striving to offer the best performance of any compiler or
library; please let us know if you find we do not.