FAQ and perf guide updates

Matt Pharr
2011-11-30 19:38:37 -08:00
parent c5aecd51e9
commit a2f118a14e
2 changed files with 806 additions and 0 deletions


@@ -2,3 +2,385 @@
Intel® SPMD Program Compiler Frequently Asked Questions (FAQ)
=============================================================
This document includes a number of frequently (and not frequently) asked
questions about ispc, the Intel® SPMD Program Compiler. The source to this
document is in the file ``docs/faq.txt`` in the ``ispc`` source
distribution.
* Understanding ispc's Output
+ `How can I see the assembly language generated by ispc?`_
+ `How can I have the assembly output be printed using Intel assembly syntax?`_
+ `Why are there multiple versions of exported ispc functions in the assembly output?`_
+ `How can I more easily see gathers and scatters in generated assembly?`_
* Interoperability
+ `How can I supply an initial execution mask in the call from the application?`_
+ `How can I generate a single binary executable with support for multiple instruction sets?`_
+ `How can I determine at run-time which vector instruction set's instructions were selected to execute?`_
* Programming Techniques
+ `What primitives are there for communicating between SPMD program instances?`_
+ `How can a gang of program instances generate variable output efficiently?`_
+ `Is it possible to use ispc for explicit vector programming?`_
Understanding ispc's Output
===========================
How can I see the assembly language generated by ispc?
------------------------------------------------------
The ``--emit-asm`` flag causes assembly output to be generated. If the
``-o`` command-line flag is also supplied, the assembly is stored in the
given file, or printed to standard output if ``-`` is specified for the
filename. For example, given the simple ``ispc`` program:
::
export uniform int foo(uniform int a, uniform int b) {
return a+b;
}
If the SSE4 target is used, then the following assembly is printed:
::
_foo: ## @foo
## BB#0: ## %allocas
addl %esi, %edi
movl %edi, %eax
ret
How can I have the assembly output be printed using Intel assembly syntax?
--------------------------------------------------------------------------
The ``ispc`` compiler is currently only able to emit assembly with AT&T
syntax, where the destination operand is the last operand after an
instruction. If you'd prefer Intel assembly output, one option is to use
Agner Fog's ``objconv`` tool: have ``ispc`` emit a native object file and
then use ``objconv`` to disassemble it, specifying the assembler syntax
that you prefer. ``objconv`` `is available for download here`_.
.. _is available for download here: http://www.agner.org/optimize/#objconv
Why are there multiple versions of exported ispc functions in the assembly output?
----------------------------------------------------------------------------------
Two versions of every function qualified with ``export`` are generated:
one is to be called by other ``ispc`` functions, and the
other is to be called by the application. The application-callable
function has the original function's name, while the ``ispc``-callable
function has a mangled name that encodes the types of the function's
parameters.
The crucial difference between these two functions is that the
application-callable function doesn't take a parameter encoding the current
execution mask, while ``ispc``-callable functions have a hidden mask
parameter. An implication of this difference is that the ``export``
function starts with the execution mask "all on". This allows a number of
improvements in the generated code, particularly on architectures that
don't have support for masked load and store instructions.
As an example, consider this short function, which loads a vector's worth
of values from two arrays in memory, adds them, and writes the result to an
output array.
::
export void foo(uniform float a[], uniform float b[],
uniform float result[]) {
float aa = a[programIndex], bb = b[programIndex];
result[programIndex] = aa+bb;
}
Here is the assembly code for the application-callable instance of the
function--note that the selected instructions are ideal.
::
_foo:
movups (%rsi), %xmm1
movups (%rdi), %xmm0
addps %xmm1, %xmm0
movups %xmm0, (%rdx)
ret
And here is the assembly code for the ``ispc``-callable instance of the
function. There are a few things to notice in this code.
The current program mask is coming in via the %xmm0 register and the
initial few instructions in the function essentially check to see if the
mask is all-on or all-off. If the mask is all on, the code at the label
LBB0_3 executes; it's the same as the code that was generated for ``_foo``
above. If the mask is all off, then there's nothing to be done, and the
function can return immediately.
In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on. (This code is elided below.) This
general pattern of having two code paths for the "all on" and "mixed" mask
cases is used in the code generated for all but the simplest
functions (where the overhead of the test isn't worthwhile).
::
"_foo___uptr<Uf>uptr<Uf>uptr<Uf>":
movmskps %xmm0, %eax
cmpl $15, %eax
je LBB0_3
testl %eax, %eax
jne LBB0_4
ret
LBB0_3:
movups (%rsi), %xmm1
movups (%rdi), %xmm0
addps %xmm1, %xmm0
movups %xmm0, (%rdx)
ret
LBB0_4:
####
#### Code elided; handle mixed mask case..
####
ret
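The control flow of the ``ispc``-callable version can be sketched in scalar
C++. This is only an illustration of the three paths (all on, all off, mixed)
for a 4-wide gang; the function name ``fooMasked`` and the bit-per-lane mask
encoding are assumptions made for this sketch, not something ``ispc`` emits.

```cpp
#include <cstdint>

// Scalar emulation of the ispc-callable foo() for a 4-wide gang.
// Bit i of `mask` is set when program instance i is active, which is
// the form movmskps would produce. Names here are illustrative only.
void fooMasked(std::uint32_t mask, const float *a, const float *b,
               float *result) {
    const std::uint32_t allOn = 0xF;
    if (mask == allOn) {
        // "All on" path: same work as the application-callable version.
        for (int lane = 0; lane < 4; ++lane)
            result[lane] = a[lane] + b[lane];
        return;
    }
    if (mask == 0)
        return;  // "All off" path: nothing to do.
    // Mixed path: touch only the lanes whose mask bit is set.
    for (int lane = 0; lane < 4; ++lane)
        if (mask & (1u << lane))
            result[lane] = a[lane] + b[lane];
}
```

The mixed path is the expensive one on targets without masked loads and
stores, which is why the generated code tests for the "all on" case first.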
How can I more easily see gathers and scatters in generated assembly?
---------------------------------------------------------------------
FIXME
Interoperability
================
How can I supply an initial execution mask in the call from the application?
----------------------------------------------------------------------------
Recall that when execution transitions from the application code to an
``ispc`` function, all of the program instances are initially executing.
In some cases, it may be desired that only some of them are running, based on
a data-dependent condition computed in the application program. This
situation can easily be handled via an additional parameter from the
application.
As a simple example, consider a case where the application code has an
array of ``float`` values and we'd like the ``ispc`` code to update
only certain values in that array, with the application deciding which
ones. In C++ code, we might
have:
::
int count = ...;
float *array = new float[count];
bool *shouldUpdate = new bool[count];
// initialize array and shouldUpdate
ispc_func(array, shouldUpdate, count);
Then, the ``ispc`` code could process this update as:
::
    export void ispc_func(uniform float array[], uniform bool update[],
                          uniform int count) {
        foreach (i = 0 ... count) {
            cif (update[i] == true)
                // update array[i]...
        }
    }
(In this case a "coherent" if statement is likely to be worthwhile if the
``update`` array will tend to have sections that are either all-true or
all-false.)
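When testing a kernel like this, it can help to have a scalar reference on
the application side to compare against. The sketch below is one such
reference; the name ``updateReference`` and the doubling operation are
hypothetical stand-ins, since the actual update is not specified above.

```cpp
// Scalar C++ reference for the selective update. The doubling is a
// hypothetical placeholder for whatever the real kernel computes.
void updateReference(float *array, const bool *shouldUpdate, int count) {
    for (int i = 0; i < count; ++i)
        if (shouldUpdate[i])
            array[i] *= 2.0f;
}
```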
How can I generate a single binary executable with support for multiple instruction sets?
-----------------------------------------------------------------------------------------
If multiple targets are specified with the ``--target`` command-line
argument, ``ispc`` generates output that supports all of the given
instruction sets, along with code that chooses the most appropriate one at
runtime.
For example, if you run the command:
::
ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2
Then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
``foo_avx.o``, and ``foo.o``.[#]_ Link all of these into your executable, and
when you call a function in ``foo.ispc`` from your application code,
``ispc`` will determine which instruction sets are supported by the CPU the
code is running on and will call the most appropriate version of the
function available.
.. [#] Similarly, if you choose to generate assembly language output or
LLVM bitcode output, multiple versions of those files will be created.
In general, the version of the function that runs will be the one in the
most capable instruction set that is supported by the system. If you only
compile SSE2 and SSE4 variants and run on a system that supports AVX, for
example, then the SSE4 variant will be executed. If the system
is not able to run any of the available variants of the function (for
example, trying to run a function that only has SSE4 and AVX variants on a
system that only supports SSE2), then the standard library ``abort()``
function will be called.
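The selection rule can be sketched as follows. This is a simplified model of
the behavior described above, not the dispatch code ``ispc`` generates; the
``Isa`` enum and ``selectVariant`` are names invented for the sketch, and it
returns ``-1`` where the real dispatcher would call ``abort()``.

```cpp
#include <vector>

// Instruction sets ordered from least to most capable (sketch only).
enum Isa { SSE2 = 0, SSE4 = 1, AVX = 2 };

// Returns the most capable compiled variant that the CPU can run,
// or -1 if no compiled variant is runnable on this CPU.
int selectVariant(const std::vector<Isa> &compiled, Isa cpuSupports) {
    int best = -1;
    for (Isa v : compiled)
        if (v <= cpuSupports && static_cast<int>(v) > best)
            best = static_cast<int>(v);
    return best;
}
```

With SSE2 and SSE4 variants on an AVX-capable CPU this picks SSE4; with only
SSE4 and AVX variants on an SSE2-only CPU it finds nothing to run.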
One subtlety is that all non-static global variables (if any) must have the
same size and layout with all of the targets used. For example, if you
have the global variables:
::
uniform int foo[2*programCount];
int bar;
and compile to both SSE2 and AVX targets, each of these variables will have
a different size on the two targets (the first due to ``programCount`` having the value 4 for SSE2
and 8 for AVX, and the second due to ``varying`` types having different
numbers of elements with the two targets--essentially the same issue as the
first.) ``ispc`` issues an error in this case.
How can I determine at run-time which vector instruction set's instructions were selected to execute?
-----------------------------------------------------------------------------------------------------
``ispc`` doesn't provide any API that allows querying which vector ISA's
instructions are running when multi-target compilation was used. However,
this can be solved in "user space" by writing a small helper function.
Specifically, if you implement a function like this
::
export uniform int isa() {
#if defined(ISPC_TARGET_SSE2)
return 0;
#elif defined(ISPC_TARGET_SSE4)
return 1;
#elif defined(ISPC_TARGET_AVX)
return 2;
#else
return -1;
#endif
}
And then call it from your application code at runtime, it will return 0,
1, or 2, depending on which target's instructions are running.
The way this works is a little surprising, but it's a useful trick. Of
course the preprocessor ``#if`` checks are all compile-time only
operations. What's actually happening is that the function is compiled
multiple times, once for each target, with the appropriate ``ISPC_TARGET``
preprocessor symbol set. Then, a small dispatch function is generated for
the application to actually call. This dispatch function in turn calls the
appropriate version of the function based on the CPU of the system it's
executing on, which in turn returns the appropriate value.
In a similar fashion, it's possible to find out at run-time the value of
``programCount`` for the target that's actually being used.
::
export uniform int width() { return programCount; }
Programming Techniques
======================
What primitives are there for communicating between SPMD program instances?
---------------------------------------------------------------------------
The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library
routines provide a variety of mechanisms for the running program instances
to communicate values to each other during execution. Note that there's no
need to synchronize the program instances before communicating between
them, due to the synchronized execution model of gangs of program instances
in ``ispc``.
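To make the data movement concrete, here is a scalar C++ model of the three
routines for a gang of four. It reflects one reading of their semantics:
``broadcast`` gives every instance the value held by one instance, ``rotate``
gives instance ``i`` the value held by instance ``i + offset`` (wrapping
around), and ``shuffle`` gives instance ``i`` the value held by the instance
named in its permutation entry. The ``Gang`` type and function names are
assumptions for this sketch; check the standard library documentation for the
authoritative definitions.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kGang = 4;      // assumed gang size
using Gang = std::array<int, kGang>;  // one value per program instance

// Every instance receives the value held by instance `index`.
Gang broadcastGang(const Gang &v, std::size_t index) {
    Gang out;
    out.fill(v[index]);
    return out;
}

// Instance i receives the value held by instance (i + offset) mod kGang.
Gang rotateGang(const Gang &v, int offset) {
    Gang out;
    const int n = static_cast<int>(kGang);
    for (int i = 0; i < n; ++i)
        out[i] = v[((i + offset) % n + n) % n];
    return out;
}

// Instance i receives the value held by instance perm[i].
Gang shuffleGang(const Gang &v, const Gang &perm) {
    Gang out;
    for (std::size_t i = 0; i < kGang; ++i)
        out[i] = v[static_cast<std::size_t>(perm[i]) % kGang];
    return out;
}
```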
How can a gang of program instances generate variable output efficiently?
-------------------------------------------------------------------------
A useful application of the ``exclusive_scan_add()`` function in the
standard library is when program instances want to generate a variable
amount of output and when one would like that output to be densely packed
in a single array. For example, consider the code fragment below:
::
uniform int func(uniform float outArray[], ...) {
int numOut = ...; // figure out how many to be output
float outLocal[MAX_OUT]; // staging area
// each program instance in the gang puts its results in
// outLocal[0], ..., outLocal[numOut-1]
int startOffset = exclusive_scan_add(numOut);
for (int i = 0; i < numOut; ++i)
outArray[startOffset + i] = outLocal[i];
return reduce_add(numOut);
}
Here, each program instance has computed a number, ``numOut``, of values to
output, and has stored them in the ``outLocal`` array. Assume that four
program instances are running and that the first one wants to output one
value, the second two values, and the third and fourth three values each.
In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
to the four program instances, respectively.
The first program instance will write its one result to ``outArray[0]``,
the second will write its two values to ``outArray[1]`` and
``outArray[2]``, and so forth. The ``reduce_add`` call at the end returns
the total number of values that all of the program instances have written
to the array.
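The packing scheme can be checked with a scalar C++ model in which each
element of a vector stands for one program instance. ``exclusiveScanAdd`` and
``packOutputs`` are names made up for this sketch; they emulate the behavior
described above rather than calling anything in the ``ispc`` standard library.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: offsets[i] is the sum of numOut[0..i-1].
std::vector<int> exclusiveScanAdd(const std::vector<int> &numOut) {
    std::vector<int> offsets(numOut.size());
    int running = 0;
    for (std::size_t i = 0; i < numOut.size(); ++i) {
        offsets[i] = running;
        running += numOut[i];
    }
    return offsets;
}

// Packs each instance's local outputs densely into outArray and
// returns the total number of values written (the reduce_add result).
int packOutputs(const std::vector<std::vector<float>> &outLocal,
                std::vector<float> &outArray) {
    std::vector<int> numOut;
    int total = 0;
    for (const auto &local : outLocal) {
        numOut.push_back(static_cast<int>(local.size()));
        total += static_cast<int>(local.size());
    }
    const std::vector<int> start = exclusiveScanAdd(numOut);
    outArray.assign(static_cast<std::size_t>(total), 0.0f);
    for (std::size_t lane = 0; lane < outLocal.size(); ++lane)
        for (std::size_t i = 0; i < outLocal[lane].size(); ++i)
            outArray[static_cast<std::size_t>(start[lane]) + i] =
                outLocal[lane][i];
    return total;
}
```

For output counts of (1, 2, 3, 3), the scan yields the start offsets
(0, 1, 3, 6) from the example, and nine values are written in total.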
FIXME: add discussion of foreach_active as an option here once that's in
Is it possible to use ispc for explicit vector programming?
-----------------------------------------------------------
The typical model for programming in ``ispc`` is an *implicit* parallel
model, where one writes a program that is apparently doing scalar
computation on values and the program is then vectorized to run in parallel
across the SIMD lanes of a processor. However, ``ispc`` also has some
support for explicit vector unit programming, where the vectorization is
explicit. Some computations may be more effectively described in the
explicit model rather than the implicit model.
This support is provided via ``uniform`` instances of short vectors.
Specifically, if this short program
::
export uniform float<8> madd(uniform float<8> a, uniform float<8> b,
uniform float<8> c) {
return a + b * c;
}
is compiled with the AVX target, ``ispc`` generates the following assembly:
::
_madd:
vmulps %ymm2, %ymm1, %ymm1
vaddps %ymm0, %ymm1, %ymm0
ret
(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two
``addps`` instructions are generated, and so forth.)
Note that ``ispc`` doesn't currently support control-flow based on
``uniform`` short vector types; it is thus not possible to write code like:
::
export uniform int<8> count(uniform float<8> a, uniform float<8> b) {
uniform int<8> sum = 0;
while (a++ < b)
++sum;
}


@@ -2,3 +2,427 @@
Intel® SPMD Program Compiler Performance Guide
==============================================
* `Using ISPC Effectively`_
+ `Gather and Scatter`_
+ `8 and 16-bit Integer Types`_
+ `Low-level Vector Tricks`_
+ `The "Fast math" Option`_
+ `"Inline" Aggressively`_
+ `Small Performance Tricks`_
+ `Instrumenting Your ISPC Programs`_
+ `Choosing A Target Vector Width`_
+ `Implementing Reductions Efficiently`_
* `Disclaimer and Legal Information`_
* `Optimization Notice`_
don't use the system math library unless it's absolutely necessary
opt=32-bit-addressing
Using ISPC Effectively
======================
Gather and Scatter
------------------
The CPU is a poor fit for SPMD execution in some ways, the worst of which
is handling of general memory reads and writes from SPMD program instances.
For example, in a "simple" array index:
::
int i = ....;
uniform float x[10] = { ... };
float f = x[i];
Since the index ``i`` is a varying value, the various SPMD program
instances will in general be reading different locations in the array
``x``. Because the CPU doesn't have a gather instruction, the ``ispc``
compiler has to serialize these memory reads, performing a separate memory
load for each running program instance, packing the result into ``f``.
(And the analogous case would happen for a write into ``x[i]``.)
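The serialized gather amounts to a loop like the following scalar C++ sketch,
written for a 4-wide gang. The name ``gatherFloats``, the ``kLanes`` constant,
and the bit-per-lane mask are assumptions for illustration; the code the
compiler actually emits will differ.

```cpp
#include <cstdint>

constexpr int kLanes = 4;  // assumed gang size

// One scalar load per active program instance, packed into `out`.
// Lanes whose mask bit is clear are left untouched.
void gatherFloats(const float *base, const int index[kLanes],
                  std::uint32_t mask, float out[kLanes]) {
    for (int lane = 0; lane < kLanes; ++lane)
        if (mask & (1u << lane))
            out[lane] = base[index[lane]];
}
```

Compared with a single vector load, this costs one load and one insert per
active lane, which is why avoiding varying indices pays off.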
In many cases, gathers like these are unavoidable; the running program
instances just need to access incoherent memory locations. However, if the
array index ``i`` could actually be declared and used as a ``uniform``
variable, the resulting array access is substantially more
efficient. This is another case where using ``uniform`` whenever applicable
is of benefit.
In some cases, the ``ispc`` compiler is able to deduce that the memory
locations accessed are either all the same or form a linear sequence. For
example, given:
::
uniform int x = ...;
int y = x;
return array[y];
The compiler is able to determine that all of the program instances are
loading from the same location, even though ``y`` is not a ``uniform``
variable. In this case, the compiler will transform this load to a regular vector
load, rather than a general gather.
Sometimes the running program instances will access a
linear sequence of memory locations; this happens most frequently when
array indexing is done based on the built-in ``programIndex`` variable. In
many of these cases, the compiler is also able to detect this case and then
do a vector load. For example, given:
::
uniform int x = ...;
return array[2*x + programIndex];
A regular vector load is done from ``array``, starting at offset ``2*x``.
8 and 16-bit Integer Types
--------------------------
The code generated for 8 and 16-bit integer types is generally not as
efficient as the code generated for 32-bit integer types. It is generally
worthwhile to use 32-bit integer types for intermediate computations, even
if the final result will be stored in a smaller integer type.
Low-level Vector Tricks
-----------------------
Many low-level Intel® SSE coding constructs can be implemented in ``ispc``
code. For example, the following code efficiently reverses the sign of the
given values.
::
float flipsign(float a) {
unsigned int i = intbits(a);
i ^= 0x80000000;
return floatbits(i);
}
This code compiles down to a single XOR instruction.
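For comparison, the same bit trick in scalar C++ looks like this. The sketch
assumes ``float`` is a 32-bit IEEE 754 value, and ``flipSignScalar`` is a name
chosen here; ``std::memcpy`` plays the role of ``intbits()`` and
``floatbits()``.

```cpp
#include <cstdint>
#include <cstring>

// Flips the sign of a float by toggling its top bit.
// Assumes float is a 32-bit IEEE 754 value.
float flipSignScalar(float a) {
    std::uint32_t bits;
    std::memcpy(&bits, &a, sizeof bits);         // reinterpret as integer
    bits ^= 0x80000000u;                         // toggle the sign bit
    float result;
    std::memcpy(&result, &bits, sizeof result);  // reinterpret as float
    return result;
}
```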
The "Fast math" Option
----------------------
``ispc`` has a ``--fast-math`` command-line flag that enables a number of
optimizations that may be undesirable in code where numerical precision is
critically important. For many graphics applications, the
approximations may be acceptable. By default, the ``--fast-math``
flag is off. The following two optimizations are
performed when it is used.
* Expressions like ``x / y``, where ``y`` is a compile-time constant, are
transformed to ``x * (1./y)``, where the inverse value of ``y`` is
precomputed at compile time.
* Expressions like ``x / y``, where ``y`` is not a compile-time constant,
  are transformed to ``x * rcp(y)``, where ``rcp()`` is the standard library
  function that maps to the approximate reciprocal instruction.
"Inline" Aggressively
---------------------
Inlining functions aggressively is generally beneficial for performance
with ``ispc``. Definitely use the ``inline`` qualifier for any short
functions (a few lines long), and experiment with it for longer functions.
Small Performance Tricks
------------------------
Performance is slightly improved by declaring variables at the same block
scope where they are first used. For example, in code like the
following, if the lifetime of ``foo`` is only within the scope of the
``if`` clause, write the code like this:
::
float func() {
....
if (x < y) {
float foo;
... use foo ...
}
}
Try not to write code as:
::
float func() {
float foo;
....
if (x < y) {
... use foo ...
}
}
Doing so can reduce the number of masked store instructions that the
compiler needs to generate.
Instrumenting Your ISPC Programs
--------------------------------
``ispc`` has an optional instrumentation feature that can help you
understand performance issues. If a program is compiled using the
``--instrument`` flag, the compiler emits calls to a function with the
following signature at various points in the program (for
example, at interesting points in the control flow and when scatters or
gathers happen).
::
extern "C" {
void ISPCInstrument(const char *fn, const char *note,
int line, int mask);
}
This function is passed the file name of the ``ispc`` file running, a short
note indicating what is happening, the line number in the source file, and
the current mask of active SPMD program lanes. You must provide an
implementation of this function and link it in with your application.
For example, when the ``ispc`` program runs, this function might be called
as follows:
::
ISPCInstrument("foo.ispc", "function entry", 55, 0xf);
This call indicates that the currently executing program has just
entered the function defined at line 55 of the file ``foo.ispc``, with a
mask of all lanes currently executing (assuming a four-wide Intel® SSE
target machine).
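A minimal implementation of this hook might aggregate counts per call site,
as in the C++ sketch below. Only the ``ISPCInstrument`` signature comes from
the text above; ``SiteStats``, ``SiteKey``, and ``instrumentStats`` are names
invented here, and the sketch is not thread-safe, so it assumes the
instrumented code runs on a single thread.

```cpp
#include <map>
#include <string>
#include <tuple>

// Aggregate counters for one instrumentation site (sketch only).
struct SiteStats {
    long calls = 0;        // times this site was reached
    long allOff = 0;       // calls where no lane was active
    long activeLanes = 0;  // active lanes summed over all calls
};

// A site is identified by file name, line number, and note.
using SiteKey = std::tuple<std::string, int, std::string>;

// Global table of per-site statistics. Not thread-safe.
std::map<SiteKey, SiteStats> &instrumentStats() {
    static std::map<SiteKey, SiteStats> stats;
    return stats;
}

extern "C" void ISPCInstrument(const char *fn, const char *note,
                               int line, int mask) {
    SiteStats &s = instrumentStats()[SiteKey(fn, line, note)];
    ++s.calls;
    if (mask == 0)
        ++s.allOff;
    // Count the set bits in the mask, one per active lane.
    for (unsigned m = static_cast<unsigned>(mask); m != 0; m >>= 1)
        s.activeLanes += m & 1u;
}
```

Dividing ``activeLanes`` by ``calls`` times the gang size gives the average
fraction of active lanes at a site.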
For a fuller example of the utility of this functionality, see
``examples/aobench_instrumented`` in the ``ispc`` distribution. This
example includes an implementation of the ``ISPCInstrument`` function that
collects aggregate data about the program's execution behavior.
When running this example, you will want to direct the ``ao`` executable
to generate a low-resolution image, because the instrumentation adds
substantial execution overhead. For example:
::
% ./ao 1 32 32
After the ``ao`` program exits, a summary report along the following lines
will be printed. In the first few lines, you can see how many times a few
functions were called, and the average percentage of SIMD lanes that were
active upon function entry.
::
ao.ispc(0067) - function entry: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
ao.ispc(0067) - return: uniform control flow: 342424 calls (0 / 0.00% all off!), 95.86% active lanes
ao.ispc(0071) - function entry: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
ao.ispc(0075) - return: uniform control flow: 1122 calls (0 / 0.00% all off!), 97.33% active lanes
ao.ispc(0079) - function entry: 10072 calls (0 / 0.00% all off!), 45.09% active lanes
ao.ispc(0088) - function entry: 36928 calls (0 / 0.00% all off!), 97.40% active lanes
...
Choosing A Target Vector Width
------------------------------
By default, ``ispc`` compiles to the natural vector width of the target
instruction set. For example, for SSE2 and SSE4, it compiles 4-wide,
and for AVX, it compiles 8-wide. For some programs, higher performance may
be seen if the program is compiled to a doubled vector width--8-wide for
SSE and 16-wide for AVX.
For workloads that don't require many registers, this method can lead to
significantly more efficient execution thanks to greater instruction level
parallelism and amortization of various overhead over more program
instances. For other workloads, it may lead to a slowdown due to higher
register pressure; trying both approaches for key kernels may be
worthwhile.
This option is available only for the SSE2, SSE4, and AVX targets.
It is selected with the ``--target=sse2-x2``, ``--target=sse4-x2`` and
``--target=avx-x2`` options, respectively.
Implementing Reductions Efficiently
-----------------------------------
It's often necessary to compute a "reduction" over a data set--for example,
one might want to add all of the values in an array, compute their minimum,
etc. ``ispc`` provides a few capabilities that make it easy to efficiently
compute reductions like these. However, it's important to use these
capabilities appropriately for best results.
As an example, consider the task of computing the sum of all of the values
in an array. In C code, we might have:
::
/* C implementation of a sum reduction */
float sum(const float array[], int count) {
float sum = 0;
for (int i = 0; i < count; ++i)
sum += array[i];
return sum;
}
Of course, exactly this computation could also be expressed in ``ispc``,
though without any benefit from vectorization:
::
/* inefficient ispc implementation of a sum reduction */
uniform float sum(const uniform float array[], uniform int count) {
uniform float sum = 0;
for (uniform int i = 0; i < count; ++i)
sum += array[i];
return sum;
}
As a first attempt, one might use the ``reduce_add()`` function from the
``ispc`` standard library; it takes a ``varying`` value and returns the sum
of that value across all of the active program instances.
::
/* inefficient ispc implementation of a sum reduction */
uniform float sum(const uniform float array[], uniform int count) {
uniform float sum = 0;
// Assumes programCount evenly divides count
for (uniform int i = 0; i < count; i += programCount)
sum += reduce_add(array[i+programIndex]);
return sum;
}
This implementation loads a set of ``programCount`` values from the array,
one for each of the program instances, and then uses ``reduce_add`` to
reduce across the program instances and then update the sum. Unfortunately
this approach loses most benefit from vectorization, as it does more work
on the cross-program instance ``reduce_add()`` call than it saves from the
vector load of values.
The most efficient approach is to do the reduction in two phases: rather
than using a ``uniform`` variable to store the sum, we maintain a varying
value, such that each program instance is effectively computing a local
partial sum on the subset of array values that it has loaded from the
array. When the loop over array elements concludes, a single call to
``reduce_add()`` computes the final reduction across each of the program
instances' elements of ``sum``. This approach effectively compiles to a
single vector load and a single vector add for each ``programCount`` worth
of values--very efficient code in the end.
::
/* good ispc implementation of a sum reduction */
uniform float sum(const uniform float array[], uniform int count) {
float sum = 0;
// Assumes programCount evenly divides count
for (uniform int i = 0; i < count; i += programCount)
sum += array[i+programIndex];
return reduce_add(sum);
}
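The two-phase structure can be mirrored in scalar C++ by keeping one partial
sum per lane and combining them once at the end. This sketch assumes a gang
size of four and, like the ``ispc`` version, that the gang size evenly divides
``count``; ``kProgramCount`` and ``sumTwoPhase`` are names chosen here.

```cpp
#include <array>

constexpr int kProgramCount = 4;  // assumed gang size

// Two-phase sum: per-lane partial sums, then one final reduction.
// Assumes kProgramCount evenly divides count.
float sumTwoPhase(const float *array, int count) {
    std::array<float, kProgramCount> partial{};  // zero-initialized
    // Phase 1: each lane accumulates every kProgramCount-th element.
    for (int i = 0; i < count; i += kProgramCount)
        for (int lane = 0; lane < kProgramCount; ++lane)
            partial[lane] += array[i + lane];
    // Phase 2: a single cross-lane reduction, like reduce_add(sum).
    float total = 0.0f;
    for (float p : partial)
        total += p;
    return total;
}
```

Because the additions happen in a different order than in the sequential C
loop, the floating-point result may differ slightly from it for general
inputs.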
Disclaimer and Legal Information
================================
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS.
NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE
OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT
OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED
NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD
CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time,
without notice. Designers must not rely on the absence or characteristics of any
features or instructions marked "reserved" or "undefined." Intel reserves these
for future definition and shall have no responsibility whatsoever for conflicts
or incompatibilities arising from future changes to them. The information here
is subject to change without notice. Do not finalize a design with this
information.
The products described in this document may contain design defects or errors
known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest
specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this
document, or other Intel literature, may be obtained by calling 1-800-548-4725,
or by visiting Intel's Web Site.
Intel processor numbers are not a measure of performance. Processor numbers
differentiate features within each processor family, not across different
processor families. See http://www.intel.com/products/processor_number for
details.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom,
Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile,
i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4,
IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside,
Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst,
Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep,
Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium,
Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside,
skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon,
and Xeon Inside are trademarks of Intel Corporation in the U.S. and other
countries.
* Other names and brands may be claimed as the property of others.
Copyright(C) 2011, Intel Corporation. All rights reserved.
Optimization Notice
===================
Intel compilers, associated libraries and associated development tools may
include or utilize options that optimize for instruction sets that are
available in both Intel and non-Intel microprocessors (for example SIMD
instruction sets), but do not optimize equally for non-Intel
microprocessors. In addition, certain compiler options for Intel
compilers, including some that are not specific to Intel
micro-architecture, are reserved for Intel microprocessors. For a detailed
description of Intel compiler options, including the instruction sets and
specific microprocessors they implicate, please refer to the "Intel
Compiler User and Reference Guides" under "Compiler Options." Many library
routines that are part of Intel compiler products are more highly optimized
for Intel microprocessors than for other microprocessors. While the
compilers and libraries in Intel compiler products offer optimizations for
both Intel and Intel-compatible microprocessors, depending on the options
you select, your code and other factors, you likely will get extra
performance on Intel microprocessors.
Intel compilers, associated libraries and associated development tools may
or may not optimize to the same degree for non-Intel microprocessors for
optimizations that are not unique to Intel microprocessors. These
optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2),
Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental
Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other
optimizations. Intel does not guarantee the availability, functionality,
or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to
assist in obtaining the best performance on Intel and non-Intel
microprocessors, Intel recommends that you evaluate other compilers and
libraries to determine which best meet your requirements. We hope to win
your business by striving to offer the best performance of any compiler or
library; please let us know if you find we do not.