Documentation update for multi-target compilation.

2011-10-04 15:50:02 -07:00
parent 59caa3d4e1
commit a68d137df6
1 changed files with 129 additions and 33 deletions
--- a/docs/ispc.txt
+++ b/docs/ispc.txt
@@ -55,7 +55,8 @@ Contents:

 * `Using The ISPC Compiler`_

-  + `Command-line Options`_
+  + `Basic Command-line Options`_
+  + `Selecting The Compilation Target`_

 * `The ISPC Language`_

@@ -117,6 +118,8 @@ Contents:
  + `Using Scan Operations For Variable Output`_
  + `Application-Supplied Execution Masks`_
  + `Explicit Vector Programming With Uniform Short Vector Types`_
+  + `Choosing A Target Vector Width`_
+  + `Compiling With Support For Multiple Instruction Sets`_

 * `Disclaimer and Legal Information`_

@@ -288,8 +291,8 @@ with application code, enter the following command
 compiling it.  (This functionality can be disabled with the ``--nocpp``
 command-line argument.)

-Command-line Options
--------------------
+Basic Command-line Options
+--------------------------

 The ``ispc`` executable can be run with ``--help`` to print a list of
 accepted command-line arguments.  By default, the compiler compiles the
@@ -297,56 +300,83 @@ provided program (and issues warnings and errors), but doesn't
 generate any output.  

 If the ``-o`` flag is given, it will generate an output file (a native
-object file by default).  To generate a text assembly file, pass
-``--emit-asm``:
+object file by default).  

 ::

-   ispc foo.ispc -o foo.s --emit-asm
+   ispc foo.ispc -o foo.obj
+
+To generate a text assembly file, pass ``--emit-asm``:
+
+::
+
+   ispc foo.ispc -o foo.asm --emit-asm

 To generate LLVM bitcode, use the ``--emit-llvm`` flag.

-By default, an optimized x86-64 object file tuned for Intel® Core
-processors CPUs is built.  You can use the ``--arch`` command line flag to
-specify a 32-bit x86 target:
-
-::
-
-   ispc foo.ispc -o foo.obj --arch=x86
-
-Optimizations can be turned off with ``-O0``:
+Optimizations are on by default; they can be turned off with ``-O0``:

 ::

   ispc foo.ispc -o foo.obj -O0

-On Mac\* and Linux\*, there is early support for generating debugging
-symbols; this is enabled with the ``-g`` command-line flag.
+On Mac\* and Linux\*, there is basic support for generating debugging
+symbols; this is enabled with the ``-g`` command-line flag.  Using ``-g``
+causes optimizations to be disabled; to compile with debugging symbols and
+optimization, ``-O1`` should be provided as well as the ``-g`` flag.

 The ``-h`` flag can also be used to direct ``ispc`` to generate a C/C++
 header file that includes C/C++ declarations of the C-callable ``ispc``
 functions and the types passed to it.

-On Linux\* and Mac OS\*, ``-D`` can be used to specify definitions to be
-passed along to the C pre-prcessor, which runs over the program input
-before it's compiled.  On Windows®, pre-processor definitions should be
-provided to the ``cl`` call.
-
-By default, the compiler generates x86-64 Intel® SSE4 code.  To generate
-32-bit code, you can use the ``--arch=x86`` command-line flag.  To
-select Intel® SSE2, use ``--target=sse2``.
-
-``ispc`` supports an alternative method for generating Intel® SSE4 code,
-where the program is "doubled up" and eight instances of it run in
-parallel, rather than just four.  For workloads that don't require large
-numbers of registers, this method can lead to significantly more efficient
-execution thanks to greater instruction level parallelism.  This option is
-selected with ``--target=sse4x2``.
+The ``-D`` option can be used to specify definitions to be passed along to
+the pre-processor, which runs over the program input before it's compiled.
+For example, including ``-DTEST=1`` defines the pre-processor symbol
+``TEST`` to have the value ``1`` when the program is compiled.

 The compiler issues a number of performance warnings for code constructs
 that compile to relatively inefficient code.  These warnings can be
 silenced with the ``--wno-perf`` flag (or by using ``--woff``, which turns
-off all warnings.)
+off all compiler warnings.)
+
+Selecting The Compilation Target
+--------------------------------
+
+There are three options that affect the compilation target: ``--arch``,
+which sets the target architecture, ``--cpu``, which sets the target CPU,
+and ``--target``, which sets the target instruction set.
+
+By default, the ``ispc`` compiler generates code for the 64-bit x86-64
+architecture (i.e. ``--arch=x86-64``).  To compile to a 32-bit x86 target,
+supply ``--arch=x86`` on the command line:
+
+::
+
+   ispc foo.ispc -o foo.obj --arch=x86
+
+No other architectures are currently supported.
+
+The target CPU determines both the default instruction set used and
+which CPU architecture the code is tuned for.  ``ispc --help`` lists
+a number of the supported CPUs.  By default, the CPU type of the
+system on which you're running ``ispc`` is used to determine the target
+CPU.
+
+::
+
+   ispc foo.ispc -o foo.obj --cpu=corei7-avx
+
+Finally, ``--target`` selects between the SSE2, SSE4, and AVX instruction
+sets.  (As general context, SSE2 was first introduced in processors that
+shipped in 2001, SSE4 was introduced in 2007, and processors with AVX 
+were introduced in 2010.  Consult your CPU's manual for specifics on which
+vector instruction set it supports.)
+
+By default, the target instruction set is chosen based on which ones are
+supported by the system on which you're running ``ispc``.  You can override
+this choice with the ``--target`` flag; for example, to select Intel® SSE2,
+use ``--target=sse2``.  (As with the other options in this section, see the
+output of ``ispc --help`` for a full list of supported targets.)


 The ISPC Language
@@ -3063,6 +3093,72 @@ Note that ``ispc`` doesn't currently support control-flow based on
    }


+Choosing A Target Vector Width
+------------------------------
+
+By default, ``ispc`` compiles to the natural vector width of the target
+instruction set.  For example, for SSE2 and SSE4, it compiles 4-wide,
+and for AVX, it compiles 8-wide.  For some programs, higher performance may
+be seen if the program is compiled to a doubled vector width--8-wide for
+SSE and 16-wide for AVX.  
+
+For workloads that don't require many registers, this method can lead to
+significantly more efficient execution thanks to greater instruction level
+parallelism and amortization of various overheads over more program
+instances.  For other workloads, it may lead to a slowdown due to higher
+register pressure; trying both approaches for key kernels may be
+worthwhile.
+
+This option is currently only available for the SSE4 and AVX targets, and
+is selected with the ``--target=sse4-x2`` and ``--target=avx-x2`` options,
+respectively.
+
+Compiling With Support For Multiple Instruction Sets
+----------------------------------------------------
+
+``ispc`` can also generate output that supports multiple target instruction
+sets, choosing the most appropriate one at runtime.  For example, if you
+run the command:
+
+::
+
+   ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2
+
+then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
+``foo_avx.o``, and ``foo.o``. [#]_  Link all of these into your executable, and
+when you call a function in ``foo.ispc`` from your application code,
+``ispc`` will determine which instruction sets are supported by the CPU the
+code is running on and will call the most appropriate version of the
+function available.  
+
+.. [#] Similarly, if you choose to generate assembly language output or
+   LLVM bitcode output, multiple versions of those files will be created.
+
+In general, the version of the function that runs will be the one in the
+most capable compiled instruction set that the system supports.  If you only
+compile SSE2 and SSE4 variants and run on a system that supports AVX, for
+example, then the SSE4 variant will be executed.  If the system
+is not able to run any of the available variants of the function (for
+example, trying to run a function that only has SSE4 and AVX variants on a
+system that only supports SSE2), then the standard library ``abort()``
+function will be called.
+
+One subtlety is that all non-static global variables (if any) must have the
+same size and layout across all of the targets used.  For example, if you
+have the global variables:
+
+::
+
+   uniform int foo[2*programCount];
+   int bar;
+
+and compile to both SSE2 and AVX targets, then each of these variables will
+differ in size between the targets (the first due to ``programCount`` being 4 for SSE2
+and 8 for AVX, and the second due to ``varying`` types having different
+numbers of elements with the two targets--essentially the same issue as the
+first.)
+
+
 Disclaimer and Legal Information
 ================================