From 82aa6efd1210c09a889120543e442dfe2408c69b Mon Sep 17 00:00:00 2001 From: Matt Pharr Date: Thu, 1 Dec 2011 13:38:17 -0800 Subject: [PATCH] Checkpoint user's guide edits --- docs/ispc.txt | 1047 ++++++++++++++++++++--------------- examples/simple/simple.ispc | 4 +- 2 files changed, 605 insertions(+), 446 deletions(-) diff --git a/docs/ispc.txt b/docs/ispc.txt index 7dbe1399..495bc8c2 100644 --- a/docs/ispc.txt +++ b/docs/ispc.txt @@ -13,9 +13,9 @@ different inputs (the values for different pixels, for example). The main goals behind ``ispc`` are to: -* Build a small C-like language that can deliver good performance to - performance-oriented programmers who want to run SPMD programs on - CPUs. +* Build a small variant of the C programming language that delivers good + performance to performance-oriented programmers who want to run SPMD + programs on CPUs. * Provide a thin abstraction layer between the programmer and the hardware--in particular, to follow the lesson from C for serial programs of having an execution and data model where the programmer can cleanly @@ -29,10 +29,6 @@ The main goals behind ``ispc`` are to: calls betwen the two languages, sharing data directly via pointers without copying or reformating, etc. -``ispc`` has already successfully delivered significant speedups for a -number of non-trivial workloads that aren't handled well by other -compilation approaches (e.g. loop auto-vectorization.) - **We are very interested in your feedback and comments about ispc and in hearing your experiences using the system. 
We are especially interested in hearing if you try using ispc but see results that are not as you @@ -59,9 +55,19 @@ Contents: + `Basic Command-line Options`_ + `Selecting The Compilation Target`_ + + `Selecting 32 or 64 Bit Addressing`_ + `The Preprocessor`_ + `Debugging`_ +* `Parallel Execution Model in ISPC`_ + + + `Program Instances and Gangs of Program Instances`_ + + `The SPMD-on-SIMD Execution Model`_ + + `Gang Convergence`_ + + `Data Races Within a Gang`_ + + `Uniform Data In A Gang`_ + + `Uniform Variables and Varying Control Flow`_ + * `The ISPC Language`_ + `Relationship To The C Programming Language`_ @@ -69,6 +75,9 @@ Contents: + `Types`_ * `Basic Types and Type Qualifiers`_ + * `"uniform" and "varying" Qualifiers`_ + * `Defining New Names For Types`_ + * `Pointer and Reference Types`_ * `Function Pointer Types`_ * `Enumeration Types`_ * `Short Vector Types`_ @@ -80,24 +89,18 @@ Contents: * `Conditional Statements: "if"`_ * `Basic Iteration Statements: "for", "while", and "do"`_ + * `"Coherent" Control Flow Statements: "cif", "cfor", and Friends`_ * `Parallel Iteration Statements: "foreach" and "foreach_tiled"`_ + * `Parallel Iteration with "programIndex" and "programCount"`_ * `Functions and Function Calls`_ - + `Function Declarations`_ + `Function Overloading`_ + + `Varying Function Pointers`_ -* `Parallel Execution Model in ISPC`_ + * `Task Parallel Execution`_ - + `The SPMD-on-SIMD Execution Model`_ - + `Uniform and Varying Qualifiers`_ - + `Mapping Data to Program Instances`_ - + `"Coherent" Control Flow Statements`_ - + `Program Instance Convergence`_ - + `Data Races`_ - + `Uniform Variables and Varying Control Flow`_ - + `Function Pointers`_ - + `Task Parallelism: Language Syntax`_ - + `Task Parallelism: Runtime Requirements`_ + + `Task Parallelism: "launch" and "sync" Statements`_ + + `Task Parallelism: Runtime Requirements`_ * `The ISPC Standard Library`_ @@ -130,12 +133,55 @@ Contents: Recent Changes to ISPC ====================== -See the 
file ``ReleaseNotes.txt`` in the ``ispc`` distribution for a list
+See the file `ReleaseNotes.txt`_ in the ``ispc`` distribution for a list
 of recent changes to the compiler.
+.. _ReleaseNotes.txt: https://raw.github.com/ispc/ispc/master/docs/ReleaseNotes.txt
+
 Updating ISPC Programs For Changes In ISPC 1.1
 ----------------------------------------------
+The 1.1 release of ``ispc`` features first-class support for pointers in
+the language. Adding this functionality led to a number of syntactic
+changes to the language. These should generally require only
+straightforward modification of existing programs.
+
+These are the relevant changes to the language:
+
+* The syntax for reference types has been changed to match C++'s syntax for
+  references and the ``reference`` keyword has been removed. (A diagnostic
+  message is issued if ``reference`` is used.)
+
+  + Declarations like ``reference float foo`` should be changed to ``float &foo``.
+
+  + Any array parameters in function declarations with a ``reference``
+    qualifier should just have ``reference`` removed: ``void foo(reference
+    float bar[])`` can just be ``void foo(float bar[])``.
+
+* It is no longer legal to pass a varying lvalue to a function that takes a
+  reference parameter; references can only be to uniform lvalue types. In
+  this case, the function should be rewritten to take a varying pointer
+  parameter.
+
+* It is now a compile-time error to assign an entire array to another
+  array.
+
+* A number of standard library routines have been updated to take
+  pointer-typed parameters, rather than references or arrays and index
+  offsets, as appropriate. For example, the ``atomic_add_global()``
+  function previously took a reference to the variable to be updated
+  atomically but now takes a pointer. 
In a similar fashion,
+  ``packed_store_active()`` takes a pointer to a ``uniform unsigned int``
+  as its first parameter rather than taking a ``uniform unsigned int[]`` as
+  its first parameter and a ``uniform int`` offset as its second parameter.
+
+* There are new iteration constructs for looping over computation domains,
+  ``foreach`` and ``foreach_tiled``. In addition to being syntactically
+  cleaner than regular ``for`` loops, these can provide performance
+  benefits in many cases when iterating over data and mapping it to program
+  instances. See the Section `Parallel Iteration Statements: "foreach" and
+  "foreach_tiled"`_ for more information about these.
+
 Getting Started with ISPC
 =========================
@@ -164,8 +210,7 @@ file ``simple.ispc`` in that directory (also reproduced here.)
     export void simple(uniform float vin[], uniform float vout[],
                        uniform int count) {
-        for (uniform int i = 0; i < count; i += programCount) {
-            int index = i + programIndex;
+        foreach (index = 0 ... count) {
             float v = vin[index];
             if (v < 3.)
                 v = v * v;
@@ -183,40 +228,24 @@
 of the value.
 The first thing to notice in this program is the presence of the ``export``
 keyword in the function definition; this indicates that the function should
 be made available to be called from application code. The ``uniform``
-qualifiers on the parameters to ``simple`` as well as for the variable
-``i`` indicate that the correpsonding variables are non-vector
-quantities--they are discussed in detail in the `Uniform and Varying
-Qualifiers`_ section.
+qualifiers on the parameters to ``simple`` indicate that the corresponding
+variables are non-vector quantities--this concept is discussed in detail in the
+`"uniform" and "varying" Qualifiers`_ section.
-Each iteration of the for loop works on a number of input values in
-parallel. The built-in ``programCount`` variable indicates how many
-program instances are running in parallel; it is equal to the SIMD width of
-the machine. 
(For example, the value is four on Intel® SSE, eight on -Intel® AVX, etc.) Thus, we can see that each execution of the loop will -work on that many output values in parallel. There is an implicit -assumption that ``programCount`` divides the ``count`` parameter without -remainder; the more general case case can be handled with a small amount of -additional code. - -To load the ``programCount``-worth of values, the program computes an index -using the sum of ``i``, which gives the first value to work on in this -iteration, and ``programIndex``, which gives a unique integer identifier -for each running program instance, counting from zero. Thus, the load from -``vin`` loads the values at offset ``i+0``, ``i+1``, ``i+2``, ..., from the -``vin`` array into the vector variable ``v``. This general idiom should be -familiar to CUDA\* or OpenCL\* programmers, where thread ids serve a -similar role to ``programIndex`` in ``ispc``. See the section `Mapping -Data to Program Instances`_ for more detail. - -The program can then proceed, doing computation and control flow based on -the values loaded. The result from the running program instances is -written to the ``vout`` array before the next loop iteration runs. +Each iteration of the ``foreach`` loop works on a number of input values in +parallel--depending on the compilation target chosen, it may be 4, 8, or +even 16 elements of the ``vin`` array, processed efficiently with the CPU's +SIMD hardware. Here, the variable ``index`` takes all values from 0 to +``count-1``. After the load from the array to the variable ``v``, the +program can then proceed, doing computation and control flow based on the +values loaded. The result from the running program instances is written to +the ``vout`` array before the next iteration of the ``foreach`` loop runs. For a simple program like this one, the performance difference versus a -regular scalar C/C++ implementation are minimal. 
For more -complex programs that do more substantial amounts of computation, doing -that computation in parallel across the machine's SIMD lanes can have a -substantial performance benefit. +regular scalar C/C++ implementation of the same computation is not likely +to be compelling. For more complex programs that do more substantial +amounts of computation, doing that computation in parallel across the +machine's SIMD lanes can have a substantial performance benefit. On Linux\* and Mac OS\*, the makefile in that directory compiles this program. For Windows\*, open the ``examples/examples.sln`` file in Microsoft Visual @@ -276,9 +305,11 @@ When the executable ``simple`` runs, it generates the expected output: 3: simple(3.000000) = 1.732051 ... -There is also a small example of using ``ispc`` to compute the Mandelbrot -set; see the `Mandelbrot set example`_ page on the ``ispc`` website for a -walkthrough of it. +For a slightly more complex example of using ``ispc``, see the `Mandelbrot +set example`_ page on the ``ispc`` website for a walkthrough of an ``ispc`` +implementation of that algorithm. After reading through that example, you +may want to examine the source code of the various examples in the +``examples/`` directory of the ``ispc`` distribution. .. _Mandelbrot set example: http://ispc.github.com/example.html @@ -292,6 +323,8 @@ with application code, enter the following command ispc foo.ispc -o foo.o +(On Windows, you may want to specify ``foo.obj`` as the output filename.) + Basic Command-line Options -------------------------- @@ -305,7 +338,7 @@ object file by default). :: - ispc foo.ispc -o foo.obj --emit-asm + ispc foo.ispc -o foo.obj To generate a text assembly file, pass ``--emit-asm``: @@ -338,8 +371,12 @@ For example, including ``-DTEST=1`` defines the pre-processor symbol The compiler issues a number of performance warnings for code constructs that compile to relatively inefficient code. 
These warnings can be silenced with the ``--wno-perf`` flag (or by using ``--woff``, which turns -off all compiler warnings.) +off all compiler warnings.) Furthermore, ``--werror`` can be provided to +direct the compiler to treat any warnings as errors. +Position-independent code (for use in shared libraries) is generated if the +``--pic`` command-line argument is provided. + Selecting The Compilation Target -------------------------------- @@ -349,7 +386,7 @@ and ``--target``, which sets the target instruction set. By default, the ``ispc`` compiler generates code for the 64-bit x86-64 architecture (i.e. ``--arch=x86-64`.) To compile to a 32-bit x86 target, -supply ``-arch=x86`` on the command line: +supply ``--arch=x86`` on the command line: :: @@ -373,11 +410,28 @@ shipped in 2001, SSE4 was introduced in 2007, and processors with AVX were introduced in 2010. Consult your CPU's manual for specifics on which vector instruction set it supports.) -By default, the target instruction set is chosen based on which ones are -supported by the system on which you're running ``ispc``. You can override -this choice with the ``--target`` flag; for example, to select Intel® SSE2, -use ``--target=sse2``. (As with the other options in this section, see the -output of ``ispc --help`` for a full list of supported targets.) +By default, the target instruction set is chosen based on the most capable +one supported by the system on which you're running ``ispc``. You can +override this choice with the ``--target`` flag; for example, to select +Intel® SSE2, use ``--target=sse2``. (As with the other options in this +section, see the output of ``ispc --help`` for a full list of supported +targets.) + +Selecting 32 or 64 Bit Addressing +--------------------------------- + +By default, ``ispc`` uses 32-bit arithmetic for performing addressing +calculations, even when using a 64-bit compilation target like x86-64. 
+This decision can provide substantial performance benefits by reducing the +cost of addressing calculations. (Note that pointers themselves are still +maintained as 64-bit quantities for 64-bit targets.) + +If you need to be able to address more than 4GB of memory from your +``ispc`` programs, the ``--addressing=64`` command-line argument can be +provided to cause the compiler to generate 64-bit arithmetic for addressing +calculations. Note that it is safe to mix object files where some were +compiled with the default ``--addressing=32`` and others were compiled with +``--addressing=64``. The Preprocessor @@ -385,7 +439,7 @@ The Preprocessor ``ispc`` automatically runs the C preprocessor on your input program before compiling it. Thus, you can use ``#ifdef``, ``#define``', and so forth in -your ispc programs (This functionality can be disabled with the ``--nocpp`` +your ispc programs. (This functionality can be disabled with the ``--nocpp`` command-line argument.) Three preprocessor symbols are automatically defined before the @@ -393,10 +447,14 @@ preprocessor runs. First, ``ISPC`` is defined, so that it can be detected that the ``ispc`` compiler is running over the program. Next, a symbol indicating the target instruction set is defined. With an SSE2 target, ``ISPC_TARGET_SSE2`` is defined; ``ISPC_TARGET_SSE4`` is defined for SSE4, -and ``ISPC_TARGET_AVX`` for AVX. Finally, ``PI`` is defined for -convenience, having the value 3.1415926535. +and ``ISPC_TARGET_AVX`` for AVX. -ISPC_MAJOR_VERSION, ISPC_MINOR_VERSION +To detect which version of the compiler is being used, the +``ISPC_MAJOR_VERSION`` and ``ISPC_MINOR_VERSION`` symbols are available. +For the 1.0 releases of ``ispc`` these symbols were not defined; starting +with ``ispc`` 1.1, they are defined, both having value 1. + +For convenience, ``PI`` is defined, having the value 3.1415926535. Debugging --------- @@ -413,19 +471,362 @@ Functions`_ for more information.) 
You can also use the ability
to call back to application code at particular points in the program,
passing a set of variable values to be logged or otherwise analyzed from
there.
+Parallel Execution Model in ISPC
+================================
+
+Though ``ispc`` has C-based syntax, it is inherently a language for
+parallel computation. Understanding the details of ``ispc``'s parallel
+execution model is critical for writing efficient and correct programs in
+``ispc``.
+
+``ispc`` supports both task parallelism to parallelize across multiple
+cores and SPMD parallelism to parallelize across the SIMD vector lanes on a
+single core. This section focuses on SPMD parallelism. See the sections
+`Task Parallelism: "launch" and "sync" Statements`_ and `Task Parallelism:
+Runtime Requirements`_ for discussion of task parallelism in ``ispc``.
+(Briefly: there are no guarantees about the order in which different tasks
+execute and no way to synchronize among them while they are running, though
+all launched tasks are guaranteed to have completed before an ``ispc``
+function returns to the application.)
+
+Program Instances and Gangs of Program Instances
+------------------------------------------------
+
+The SPMD-on-SIMD Execution Model
+--------------------------------
+
+In the SPMD model as implemented in ``ispc``, you write programs that
+compute a set of outputs based on a set of inputs. You must write these
+programs so that it is safe to run multiple instances of them in
+parallel--i.e. given a program and a set of inputs, the programs shouldn't
+make any assumptions about the order in which they will be run over the
+inputs, or about whether one program instance will have completed before
+another runs. [#]_
+
+.. [#] This is essentially the same requirement that languages like CUDA\*
+   and OpenCL\* place on the programmer.
+
+Given this guarantee, the ``ispc`` compiler can safely execute multiple
+program instances in parallel, across the SIMD lanes of a single CPU. 
In
+many cases, this execution approach can achieve higher overall performance
+than if the program instances had been run serially.
+
+Upon entry to an ``ispc`` function, the execution model switches from
+the application's serial model to SPMD. Conceptually, a number of
+``ispc`` program instances will start running in parallel. This
+parallelism doesn't involve launching hardware threads. Rather, one
+program instance is mapped to each of the SIMD lanes of the CPU's vector
+unit (Intel® SSE or Intel® AVX).
+
+If an ``ispc`` program is written to do the following computation:
+
+::
+
+    float x = ..., y = ...;
+    return x+y;
+
+and if the ``ispc`` program is running four-wide on a CPU that supports the
+Intel® SSE instructions, then four program instances are running in
+parallel, each adding a pair of scalar values. However, these four program
+instances store their individual scalar values for ``x`` and ``y`` in the
+lanes of an Intel® SSE vector register, so the addition operation for all
+four program instances can be done in parallel with a single ``addps``
+instruction.
+
+Program execution is more complicated in the presence of control flow. The
+details are handled by the ``ispc`` compiler, but you may find it helpful
+to understand what is going on in order to be a more effective ``ispc``
+programmer. In particular, the mapping of SPMD to SIMD lanes can lead to
+reductions in SIMD efficiency as different program instances want to
+perform different computations. For example, consider a simple ``if``
+statement:
+
+::
+
+    float x = ..., y = ...;
+    if (x < y) {
+        ...
+    } else {
+        ...
+    }
+
+In general, the test ``x