diff --git a/docs/faq.txt b/docs/faq.txt index 409e8bb9..9e66d16f 100644 --- a/docs/faq.txt +++ b/docs/faq.txt @@ -2,3 +2,385 @@ Intel® SPMD Program Compiler Frequently Asked Questions (FAQ) ============================================================= +This document includes a number of frequently (and not frequently) asked +questions about ispc, the Intel® SPMD Program Compiler. The source to this +document is in the file ``docs/faq.txt`` in the ``ispc`` source +distribution. + +* Understanding ispc's Output + + + `How can I see the assembly language generated by ispc?`_ + + `How can I have the assembly output be printed using Intel assembly syntax?`_ + + `Why are there multiple versions of exported ispc functions in the assembly output?`_ + + `How can I more easily see gathers and scatters in generated assembly?`_ + +* Interoperability + + + `How can I supply an initial execution mask in the call from the application?`_ + + `How can I generate a single binary executable with support for multiple instruction sets?`_ + + `How can I determine at run-time which vector instruction set's instructions were selected to execute?`_ + +* Programming Techniques + + + `What primitives are there for communicating between SPMD program instances?`_ + + `How can a gang of program instances generate variable output efficiently?`_ + + `Is it possible to use ispc for explicit vector programming?`_ + + +Understanding ispc's Output +=========================== + +How can I see the assembly language generated by ispc? +------------------------------------------------------ + +The ``--emit-asm`` flag causes assembly output to be generated. If the +``-o`` command-line flag is also supplied, the assembly is stored in the +given file, or printed to standard output if ``-`` is specified for the +filename. 
For example, given the simple ``ispc`` program:

::

    export uniform int foo(uniform int a, uniform int b) {
        return a+b;
    }

If the SSE4 target is used, then the following assembly is printed:

::

    _foo:                                   ## @foo
    ## BB#0:                                ## %allocas
        addl    %esi, %edi
        movl    %edi, %eax
        ret


How can I have the assembly output be printed using Intel assembly syntax?
--------------------------------------------------------------------------

The ``ispc`` compiler is currently only able to emit assembly with AT&T
syntax, where the destination operand comes last in each instruction.  If
you'd prefer Intel assembly output, one option is to use Agner Fog's
``objconv`` tool: have ``ispc`` emit a native object file and then use
``objconv`` to disassemble it, specifying the assembler syntax that you
prefer.  ``objconv`` `is available for download here`_.

.. _is available for download here: http://www.agner.org/optimize/#objconv

Why are there multiple versions of exported ispc functions in the assembly output?
----------------------------------------------------------------------------------

Two versions of each function qualified with ``export`` are generated: one
to be called by other ``ispc`` functions, and the other to be called by the
application.  The application-callable function has the original function's
name, while the ``ispc``-callable function has a mangled name that encodes
the types of the function's parameters.

The crucial difference between these two functions is that the
application-callable function doesn't take a parameter encoding the current
execution mask, while ``ispc``-callable functions have a hidden mask
parameter.  An implication of this difference is that the ``export``
function starts with the execution mask "all on".  This allows a number of
improvements in the generated code, particularly on architectures that
don't have support for masked load and store instructions.
As an example, consider this short function, which loads a vector's worth
of values from two arrays in memory, adds them, and writes the result to an
output array.

::

    export void foo(uniform float a[], uniform float b[],
                    uniform float result[]) {
        float aa = a[programIndex], bb = b[programIndex];
        result[programIndex] = aa+bb;
    }

Here is the assembly code for the application-callable instance of the
function--note that the selected instructions are ideal.

::

    _foo:
        movups  (%rsi), %xmm1
        movups  (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        ret


And here is the assembly code for the ``ispc``-callable instance of the
function.  There are a few things to notice in this code.

The current program mask comes in via the %xmm0 register, and the initial
few instructions in the function essentially check to see if the mask is
all-on or all-off.  If the mask is all on, the code at the label LBB0_3
executes; it's the same as the code that was generated for ``_foo`` above.
If the mask is all off, then there's nothing to be done, and the function
can return immediately.

In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on.  (This code is elided below.)  This
general pattern of having two code paths for the "all on" and "mixed" mask
cases is used in the code generated for all but the simplest functions
(where the overhead of the test isn't worthwhile).

::

    "_foo___uptruptruptr":
        movmskps    %xmm0, %eax
        cmpl    $15, %eax
        je  LBB0_3
        testl   %eax, %eax
        jne LBB0_4
        ret
    LBB0_3:
        movups  (%rsi), %xmm1
        movups  (%rdi), %xmm0
        addps   %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        ret
    LBB0_4:
        ####
        #### Code elided; handle the mixed mask case...
        ####
        ret


How can I more easily see gathers and scatters in generated assembly?
---------------------------------------------------------------------

FIXME

Interoperability
================

How can I supply an initial execution mask in the call from the application?
----------------------------------------------------------------------------

Recall that when execution transitions from the application code to an
``ispc`` function, all of the program instances are initially executing.
In some cases, it may be desirable for only some of them to be running,
based on a data-dependent condition computed in the application program.
This situation can easily be handled via an additional parameter from the
application.

As a simple example, consider a case where the application code has an
array of ``float`` values and we'd like the ``ispc`` code to update just
specific values in that array, where the application has determined which
of those values should be updated.  In C++ code, we might have:

::

    int count = ...;
    float *array = new float[count];
    bool *shouldUpdate = new bool[count];
    // initialize array and shouldUpdate
    ispc_func(array, shouldUpdate, count);

Then, the ``ispc`` code could process this update as:

::

    export void ispc_func(uniform float array[], uniform bool update[],
                          uniform int count) {
        foreach (i = 0 ... count) {
            cif (update[i] == true)
                // ... update array[i] ...
        }
    }

(In this case a "coherent" if statement is likely to be worthwhile if the
``update`` array will tend to have sections that are either all-true or
all-false.)

How can I generate a single binary executable with support for multiple instruction sets?
-----------------------------------------------------------------------------------------

``ispc`` can generate output that supports multiple target instruction
sets, along with code that chooses the most appropriate one at runtime, if
multiple targets are specified with the ``--target`` command-line
argument.
For example, if you run the command:

::

    ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2

Then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
``foo_avx.o``, and ``foo.o``. [#]_  Link all of these into your executable,
and when you call a function in ``foo.ispc`` from your application code,
``ispc`` will determine which instruction sets are supported by the CPU the
code is running on and will call the most appropriate version of the
function available.

.. [#] Similarly, if you choose to generate assembly language output or
   LLVM bitcode output, multiple versions of those files will be created.

In general, the version of the function that runs will be the one for the
most general instruction set that is supported by the system.  If you only
compile SSE2 and SSE4 variants and run on a system that supports AVX, for
example, then the SSE4 variant will be executed.  If the system isn't able
to run any of the available variants of the function (for example, trying
to run a function that only has SSE4 and AVX variants on a system that only
supports SSE2), then the standard library ``abort()`` function will be
called.

One subtlety is that all non-static global variables (if any) must have the
same size and layout across all of the targets used.  For example, if you
have the global variables:

::

    uniform int foo[2*programCount];
    int bar;

and compile to both SSE2 and AVX targets, both of these variables will have
different sizes on the two targets (the first because ``programCount`` has
the value 4 for SSE2 and 8 for AVX, and the second because ``varying``
types have different numbers of elements with the two targets--essentially
the same issue as the first).  ``ispc`` issues an error in this case.


How can I determine at run-time which vector instruction set's instructions were selected to execute?
-----------------------------------------------------------------------------------------------------

``ispc`` doesn't provide any API for querying which vector ISA's
instructions are running when multi-target compilation is used.  However,
this can be solved in "user space" by writing a small helper function.
Specifically, if you implement a function like this:

::

    export uniform int isa() {
    #if defined(ISPC_TARGET_SSE2)
        return 0;
    #elif defined(ISPC_TARGET_SSE4)
        return 1;
    #elif defined(ISPC_TARGET_AVX)
        return 2;
    #else
        return -1;
    #endif
    }

and then call it from your application code at runtime, it will return 0,
1, or 2, depending on which target's instructions are running.

The way this works is a little surprising, but it's a useful trick.  Of
course the preprocessor ``#if`` checks are all compile-time-only
operations.  What's actually happening is that the function is compiled
multiple times, once for each target, with the appropriate ``ISPC_TARGET``
preprocessor symbol set.  Then, a small dispatch function is generated for
the application to actually call.  This dispatch function in turn calls the
appropriate version of the function based on the CPU of the system it's
executing on, which in turn returns the appropriate value.

In a similar fashion, it's possible to find out at run-time the value of
``programCount`` for the target that's actually being used:

::

    export uniform int width() { return programCount; }


Programming Techniques
======================

What primitives are there for communicating between SPMD program instances?
---------------------------------------------------------------------------

The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library
routines provide a variety of mechanisms for the running program instances
to communicate values to each other during execution.
Note that there's no need to synchronize the program instances before
communicating between them, due to the synchronized execution model of
gangs of program instances in ``ispc``.

How can a gang of program instances generate variable output efficiently?
-------------------------------------------------------------------------

The ``exclusive_scan_add()`` function in the standard library is useful
when the program instances want to generate variable amounts of output that
should be densely packed into a single array.  For example, consider the
code fragment below:

::

    uniform int func(uniform float outArray[], ...) {
        int numOut = ...;        // figure out how many to be output
        float outLocal[MAX_OUT]; // staging area

        // each program instance in the gang puts its results in
        // outLocal[0], ..., outLocal[numOut-1]

        int startOffset = exclusive_scan_add(numOut);
        for (int i = 0; i < numOut; ++i)
            outArray[startOffset + i] = outLocal[i];
        return reduce_add(numOut);
    }

Here, each program instance has computed a number, ``numOut``, of values to
output, and has stored them in the ``outLocal`` array.  Assume that four
program instances are running and that the first one wants to output one
value, the second two values, and the third and fourth three values each.
In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
to the four program instances, respectively.

The first program instance will write its one result to ``outArray[0]``,
the second will write its two values to ``outArray[1]`` and
``outArray[2]``, and so forth.  The ``reduce_add()`` call at the end
returns the total number of values that all of the program instances have
written to the array.

FIXME: add discussion of foreach_active as an option here once that's in

Is it possible to use ispc for explicit vector programming?
-----------------------------------------------------------

The typical model for programming in ``ispc`` is an *implicit* parallel
model, where one writes a program that is apparently doing scalar
computation on values and the program is then vectorized to run in parallel
across the SIMD lanes of a processor.  However, ``ispc`` also has some
support for explicit vector unit programming, where the vectorization is
written out directly.  Some computations may be more effectively described
in the explicit model than in the implicit model.

This support is provided via ``uniform`` instances of short vector types.
Specifically, if this short program

::

    export uniform float<8> madd(uniform float<8> a, uniform float<8> b,
                                 uniform float<8> c) {
        return a + b * c;
    }

is compiled with the AVX target, ``ispc`` generates the following assembly:

::

    _madd:
        vmulps  %ymm2, %ymm1, %ymm1
        vaddps  %ymm0, %ymm1, %ymm0
        ret

(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two
``addps`` instructions are generated, and so forth.)
Note that ``ispc`` doesn't currently support control flow based on
``uniform`` short vector types; it is thus not possible to write code like:

::

    export uniform int<8> count(uniform float<8> a, uniform float<8> b) {
        uniform int<8> sum = 0;
        while (a++ < b)
            ++sum;
        return sum;
    }


diff --git a/docs/perf.txt b/docs/perf.txt
index 89be9cd9..03f83ba1 100644
--- a/docs/perf.txt
+++ b/docs/perf.txt
@@ -2,3 +2,427 @@
Intel® SPMD Program Compiler Performance Guide
==============================================

* `Using ISPC Effectively`_

  + `Gather and Scatter`_
  + `8 and 16-bit Integer Types`_
  + `Low-level Vector Tricks`_
  + `The "Fast math" Option`_
  + `"Inline" Aggressively`_
  + `Small Performance Tricks`_
  + `Instrumenting Your ISPC Programs`_
  + `Choosing A Target Vector Width`_
  + `Implementing Reductions Efficiently`_

* `Disclaimer and Legal Information`_

* `Optimization Notice`_


FIXME: don't use the system math library unless it's absolutely necessary

FIXME: opt=32-bit-addressing

Using ISPC Effectively
======================


Gather and Scatter
------------------

The CPU is a poor fit for SPMD execution in some ways, the worst of which
is its handling of general memory reads and writes from SPMD program
instances.  For example, consider a "simple" array index:

::

    int i = ...;
    uniform float x[10] = { ... };
    float f = x[i];

Since the index ``i`` is a varying value, the various SPMD program
instances will in general be reading different locations in the array
``x``.  Because the CPU doesn't have a gather instruction, the ``ispc``
compiler has to serialize these memory reads, performing a separate memory
load for each running program instance and packing the results into ``f``.
(The analogous case happens for a write into ``x[i]``.)

In many cases, gathers like these are unavoidable; the running program
instances just need to access incoherent memory locations.
However, if the array index ``i`` could actually be declared and used as a
``uniform`` variable, the resulting memory access would be substantially
more efficient.  This is another case where using ``uniform`` whenever
applicable is of benefit.

In some cases, the ``ispc`` compiler is able to deduce that the memory
locations accessed are all the same, even though the index isn't a
``uniform`` value.  For example, given:

::

    uniform int x = ...;
    int y = x;
    return array[y];

the compiler is able to determine that all of the program instances are
loading from the same location, even though ``y`` is not a ``uniform``
variable.  In this case, the compiler transforms the load into a single
scalar load whose result is broadcast to the program instances, rather than
a general gather.

Sometimes the running program instances will access a linear sequence of
memory locations; this happens most frequently when array indexing is done
based on the built-in ``programIndex`` variable.  In many of these cases,
the compiler is also able to detect this pattern and do a vector load.  For
example, given:

::

    uniform int x = ...;
    return array[2*x + programIndex];

a regular vector load is done from ``array``, starting at offset ``2*x``.


8 and 16-bit Integer Types
--------------------------

The code generated for 8 and 16-bit integer types is generally not as
efficient as the code generated for 32-bit integer types.  It is generally
worthwhile to use 32-bit integer types for intermediate computations, even
if the final result will be stored in a smaller integer type.

Low-level Vector Tricks
-----------------------

Many low-level Intel® SSE coding constructs can be implemented in ``ispc``
code.  For example, the following code efficiently flips the sign of the
given values:

::

    float flipsign(float a) {
        unsigned int i = intbits(a);
        i ^= 0x80000000;
        return floatbits(i);
    }

This code compiles down to a single XOR instruction.
The "Fast math" Option
----------------------

``ispc`` has a ``--fast-math`` command-line flag that enables a number of
optimizations that may be undesirable in code where numerical precision is
critically important.  For many graphics applications, the approximations
they introduce may be acceptable.  The following two optimizations are
performed when ``--fast-math`` is used.  By default, the ``--fast-math``
flag is off.

* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
  transformed to ``x * (1./y)``, where the inverse value of ``y`` is
  precomputed at compile time.

* Expressions like ``x / y``, where ``y`` is not a compile-time constant,
  are transformed to ``x * rcp(y)``, where ``rcp()`` maps to the
  approximate reciprocal instruction from the standard library.


"Inline" Aggressively
---------------------

Inlining functions aggressively is generally beneficial for performance
with ``ispc``.  Definitely use the ``inline`` qualifier for any short
functions (a few lines long), and experiment with it for longer functions.

Small Performance Tricks
------------------------

Performance is slightly improved by declaring variables at the innermost
block scope where they are used.  For example, if the lifetime of ``foo``
is only within the scope of the ``if`` clause, write the code like this:

::

    float func() {
        ....
        if (x < y) {
            float foo;
            ... use foo ...
        }
    }

Try not to write code as:

::

    float func() {
        float foo;
        ....
        if (x < y) {
            ... use foo ...
        }
    }

Declaring variables at the innermost scope can reduce the number of masked
store instructions that the compiler needs to generate.

Instrumenting Your ISPC Programs
--------------------------------

``ispc`` has an optional instrumentation feature that can help you
understand performance issues.
If a program is compiled using the ``--instrument`` flag, the compiler
emits calls to a function with the following signature at various points in
the program (for example, at interesting points in the control flow, or
when scatters or gathers happen):

::

    extern "C" {
        void ISPCInstrument(const char *fn, const char *note,
                            int line, int mask);
    }

This function is passed the file name of the ``ispc`` file running, a short
note indicating what is happening, the line number in the source file, and
the current mask of active SPMD program lanes.  You must provide an
implementation of this function and link it in with your application.

For example, when the ``ispc`` program runs, this function might be called
as follows:

::

    ISPCInstrument("foo.ispc", "function entry", 55, 0xf);

This call indicates that the currently executing program has just entered
the function defined at line 55 of the file ``foo.ispc``, with a mask of
all lanes currently executing (assuming a four-wide Intel® SSE target
machine).

For a fuller example of the utility of this functionality, see
``examples/aobench_instrumented`` in the ``ispc`` distribution.  This
example includes an implementation of the ``ISPCInstrument()`` function
that collects aggregate data about the program's execution behavior.

When running this example, you will want to direct the ``ao`` executable to
generate a low-resolution image, because the instrumentation adds
substantial execution overhead.  For example:

::

    % ./ao 1 32 32

After the ``ao`` program exits, a summary report along the following lines
will be printed.  In the first few lines, you can see how many times a few
functions were called, and the average percentage of SIMD lanes that were
active upon function entry.
::

    ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
    ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
    ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes
    ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes
    ...


Choosing A Target Vector Width
------------------------------

By default, ``ispc`` compiles to the natural vector width of the target
instruction set.  For example, for SSE2 and SSE4, it compiles four-wide,
and for AVX, it compiles 8-wide.  For some programs, higher performance may
be seen if the program is compiled to a doubled vector width--8-wide for
SSE and 16-wide for AVX.

For workloads that don't require many registers, this method can lead to
significantly more efficient execution thanks to greater instruction-level
parallelism and the amortization of various overheads over more program
instances.  For other workloads, it may lead to a slowdown due to higher
register pressure; trying both approaches for key kernels may be
worthwhile.

This option is available for the SSE2, SSE4, and AVX targets.  It is
selected with the ``--target=sse2-x2``, ``--target=sse4-x2``, and
``--target=avx-x2`` options, respectively.


Implementing Reductions Efficiently
-----------------------------------

It's often necessary to compute a "reduction" over a data set--for example,
one might want to add all of the values in an array, compute their minimum,
etc.  ``ispc`` provides a few capabilities that make it easy to efficiently
compute reductions like these.  However, it's important to use these
capabilities appropriately for best results.
As an example, consider the task of computing the sum of all of the values
in an array.  In C code, we might have:

::

    /* C implementation of a sum reduction */
    float sum(const float array[], int count) {
        float sum = 0;
        for (int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

Of course, exactly this computation could also be expressed in ``ispc``,
though without any benefit from vectorization:

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        for (uniform int i = 0; i < count; ++i)
            sum += array[i];
        return sum;
    }

As a first attempt at vectorizing this, one might use the ``reduce_add()``
function from the ``ispc`` standard library; it takes a ``varying`` value
and returns the sum of that value across all of the active program
instances.

::

    /* inefficient ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        uniform float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += reduce_add(array[i+programIndex]);
        return sum;
    }

This implementation loads a set of ``programCount`` values from the array,
one for each of the program instances, and then uses ``reduce_add()`` to
reduce across the program instances and update the sum.  Unfortunately,
this approach loses most of the benefit of vectorization, as it does more
work in the cross-program-instance ``reduce_add()`` call than it saves from
the vector load of values.

The most efficient approach is to do the reduction in two phases: rather
than using a ``uniform`` variable to store the sum, we maintain a varying
value, so that each program instance effectively computes a local partial
sum over the subset of array values that it has loaded from the array.
When the loop over array elements concludes, a single call to
``reduce_add()`` computes the final reduction across the program instances'
partial sums in ``sum``.  This approach effectively compiles to a single
vector load and a single vector add for each ``programCount``'s worth of
values--very efficient code in the end.

::

    /* good ispc implementation of a sum reduction */
    uniform float sum(const uniform float array[], uniform int count) {
        float sum = 0;
        // Assumes programCount evenly divides count
        for (uniform int i = 0; i < count; i += programCount)
            sum += array[i+programIndex];
        return reduce_add(sum);
    }


Disclaimer and Legal Information
================================

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS.
NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE
OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT
OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED
NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD
CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time,
without notice. Designers must not rely on the absence or characteristics of any
features or instructions marked "reserved" or "undefined." Intel reserves these
for future definition and shall have no responsibility whatsoever for conflicts
or incompatibilities arising from future changes to them. The information here
is subject to change without notice.
Do not finalize a design with this +information. + +The products described in this document may contain design defects or errors +known as errata which may cause the product to deviate from published +specifications. Current characterized errata are available on request. + +Contact your local Intel sales office or your distributor to obtain the latest +specifications and before placing your product order. + +Copies of documents which have an order number and are referenced in this +document, or other Intel literature, may be obtained by calling 1-800-548-4725, +or by visiting Intel's Web Site. + +Intel processor numbers are not a measure of performance. Processor numbers +differentiate features within each processor family, not across different +processor families. See http://www.intel.com/products/processor_number for +details. + +BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, +Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, +i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, +IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, +Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, +Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, +Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, +Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, +skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, +and Xeon Inside are trademarks of Intel Corporation in the U.S. and other +countries. + +* Other names and brands may be claimed as the property of others. + +Copyright(C) 2011, Intel Corporation. All rights reserved. 
+ + +Optimization Notice +=================== + +Intel compilers, associated libraries and associated development tools may +include or utilize options that optimize for instruction sets that are +available in both Intel and non-Intel microprocessors (for example SIMD +instruction sets), but do not optimize equally for non-Intel +microprocessors. In addition, certain compiler options for Intel +compilers, including some that are not specific to Intel +micro-architecture, are reserved for Intel microprocessors. For a detailed +description of Intel compiler options, including the instruction sets and +specific microprocessors they implicate, please refer to the "Intel +Compiler User and Reference Guides" under "Compiler Options." Many library +routines that are part of Intel compiler products are more highly optimized +for Intel microprocessors than for other microprocessors. While the +compilers and libraries in Intel compiler products offer optimizations for +both Intel and Intel-compatible microprocessors, depending on the options +you select, your code and other factors, you likely will get extra +performance on Intel microprocessors. + +Intel compilers, associated libraries and associated development tools may +or may not optimize to the same degree for non-Intel microprocessors for +optimizations that are not unique to Intel microprocessors. These +optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), +Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental +Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other +optimizations. Intel does not guarantee the availability, functionality, +or effectiveness of any optimization on microprocessors not manufactured by +Intel. Microprocessor-dependent optimizations in this product are intended +for use with Intel microprocessors. 
+ +While Intel believes our compilers and libraries are excellent choices to +assist in obtaining the best performance on Intel and non-Intel +microprocessors, Intel recommends that you evaluate other compilers and +libraries to determine which best meet your requirements. We hope to win +your business by striving to offer the best performance of any compiler or +library; please let us know if you find we do not. +