From 3e4d69cbd3fdc73c88a6bcaf1a4e6122490a628b Mon Sep 17 00:00:00 2001
From: Matt Pharr
Date: Fri, 2 Dec 2011 16:01:05 -0800
Subject: [PATCH] Checkpoint work on specifying execution model

---
 docs/ispc.txt | 1125 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 790 insertions(+), 335 deletions(-)

diff --git a/docs/ispc.txt b/docs/ispc.txt
index a1d4c277..6452d6ac 100644
--- a/docs/ispc.txt
+++ b/docs/ispc.txt
@@ -59,14 +59,22 @@ Contents:
   + `The Preprocessor`_
   + `Debugging`_

-* `Parallel Execution Model in ISPC`_
+* `The ISPC Parallel Execution Model`_
+
+  + `Basic Concepts: Program Instances and Gangs of Program Instances`_
+  + `Control Flow Within A Gang`_
+
+    * `Control Flow Example: If Statements`_
+    * `Control Flow Example: Loops`_
+    * `Gang Convergence Guarantees`_
+
+  + `Uniform Data`_
+
+    * `Uniform Control Flow`_
+    * `Uniform Variables and Varying Control Flow`_

-  + `Program Instances and Gangs of Program Instances`_
-  + `The SPMD-on-SIMD Execution Model`_
-  + `Gang Convergence`_
   + `Data Races Within a Gang`_
-  + `Uniform Data In A Gang`_
-  + `Uniform Variables and Varying Control Flow`_
+  + `Tasking Model`_

 * `The ISPC Language`_

@@ -254,12 +262,6 @@ program can then proceed, doing computation and control flow based on
 the values loaded.  The result from the running program instances is
 written to the ``vout`` array before the next iteration of the ``foreach``
 loop runs.

-For a simple program like this one, the performance difference versus a
-regular scalar C/C++ implementation of the same computation is not likely
-to be compelling.  For more complex programs that do more substantial
-amounts of computation, doing that computation in parallel across the
-machine's SIMD lanes can have a substantial performance benefit.
-
 On Linux\* and Mac OS\*, the makefile in that directory compiles this
 program.  For Windows\*, open the ``examples/examples.sln`` file in
 Microsoft Visual C++ 2010\* to build this (and the other) examples.  In
 either case,
@@ -435,9 +437,9 @@ Selecting 32 or 64 Bit Addressing

 By default, ``ispc`` uses 32-bit arithmetic for performing addressing
 calculations, even when using a 64-bit compilation target like x86-64.
-This decision can provide substantial performance benefits by reducing the
-cost of addressing calculations.  (Note that pointers themselves are still
-maintained as 64-bit quantities for 64-bit targets.)
+This implementation approach can provide substantial performance benefits
+by reducing the cost of addressing calculations.  (Note that pointers
+themselves are still maintained as 64-bit quantities for 64-bit targets.)

 If you need to be able to address more than 4GB of memory from your
 ``ispc`` programs, the ``--addressing=64`` command-line argument can be
@@ -451,23 +453,33 @@ The Preprocessor
 ----------------

 ``ispc`` automatically runs the C preprocessor on your input program before
-compiling it.  Thus, you can use ``#ifdef``, ``#define``', and so forth in
+compiling it.  Thus, you can use ``#ifdef``, ``#define``, and so forth in
 your ispc programs.  (This functionality can be disabled with the
 ``--nocpp`` command-line argument.)

-Three preprocessor symbols are automatically defined before the
-preprocessor runs.  First, ``ISPC`` is defined, so that it can be detected
-that the ``ispc`` compiler is running over the program.  Next, a symbol
-indicating the target instruction set is defined.  With an SSE2 target,
-``ISPC_TARGET_SSE2`` is defined; ``ISPC_TARGET_SSE4`` is defined for SSE4,
-and ``ISPC_TARGET_AVX`` for AVX.
+A number of preprocessor symbols are automatically defined before the
+preprocessor runs:

-To detect which version of the compiler is being used, the
-``ISPC_MAJOR_VERSION`` and ``ISPC_MINOR_VERSION`` symbols are available.
-For the 1.0 releases of ``ispc`` these symbols were not defined; starting
-with ``ispc`` 1.1, they are defined, both having value 1.
+.. list-table:: Predefined preprocessor symbols and their values
+
-For convenience, ``PI`` is defined, having the value 3.1415926535.
+  * - Symbol name
+    - Value
+    - Use
+  * - ISPC
+    - 1
+    - Detecting that the ``ispc`` compiler is processing the file
+  * - ISPC_TARGET_{SSE2,SSE4,AVX}
+    - 1
+    - One of these will be set, depending on the compilation target.
+  * - ISPC_MAJOR_VERSION
+    - 1
+    - Major version of the ``ispc`` compiler/language
+  * - ISPC_MINOR_VERSION
+    - 1
+    - Minor version of the ``ispc`` compiler/language
+  * - PI
+    - 3.1415926535
+    - Mathematics
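+
+For example, here is a short sketch of how these symbols might be used.
+(The code is illustrative; ``degreesToRadians()`` is a hypothetical
+function, not part of the ispc distribution.)
+
+::
+
+    // Guard ispc-specific declarations in a header that is also
+    // included from C/C++ code:
+    #ifdef ISPC
+
+    #if defined(ISPC_TARGET_AVX)
+        // ... declarations tuned for the 8-wide AVX target ...
+    #else
+        // ... declarations for the SSE2/SSE4 targets ...
+    #endif
+
+    float degreesToRadians(float deg) {
+        return deg * (PI / 180.);  // PI is predefined by the compiler
+    }
+
+    #endif // ISPC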

 Debugging
 ---------

@@ -484,138 +496,245 @@ Functions`_ for more information.)  You can also use the ability to call
 back to application code at particular points in the program, passing a set
 of variable values to be logged or otherwise analyzed from there.

-Parallel Execution Model in ISPC
-================================
-invariant: will never execute with mask "all off" (at least not observably)
+The ISPC Parallel Execution Model
+=================================
-make the notion of a uniform pc + a mask a clear component
-
-define what we mean by control flow coherence here
-
-handwave to point forward to the language reference in the following
-section
-
-mention task parallelism here, basically that there are no guarantees about
-ordering between tasks, no way to synchronize among them, but remidn that
-we sync before returning from functions

-Though ``ispc`` has C-based syntax, it is inherently a language for
+Though ``ispc`` is a C-based language, it is inherently a language for
 parallel computation.  Understanding the details of ``ispc``'s parallel
-execution model is critical for writing efficient and correct programs in
-``ispc``.
+execution model that are introduced in this section is critical for writing
+efficient and correct programs in ``ispc``.

-``ispc`` supports both task parallelism to parallelize across multiple
-cores and SPMD parallelism to parallelize across the SIMD vector lanes on a
-single core.  This section focuses on SPMD parallelism.  See the sections
-`Task Parallelism: "launch" and "sync" Statements`_ and `Task Parallelism:
-Runtime Requirements`_ for discussion of task parallelism in ``ispc``.
+``ispc`` supports two types of parallelism: task parallelism to
+parallelize across multiple processor cores and SPMD parallelism to
+parallelize across the SIMD vector lanes on a single core.  Most of this
+section focuses on SPMD parallelism, but see `Tasking Model`_ at the end of
+this section for discussion of task parallelism in ``ispc``.

-Program Instances and Gangs of Program Instances
-------------------------------------------------
+This section will use some snippets of ``ispc`` code to illustrate various
+concepts.  Given ``ispc``'s relationship to C, these should generally be
+understandable on their own, but you may want to refer to `The ISPC
+Language`_ section for details on language syntax.

-The SPMD-on-SIMD Execution Model
---------------------------------
-In the SPMD model as implemented in ``ispc``, you programs that compute a
-set of outputs based on a set of inputs.  You must write these
-programs so that it is safe to run multiple instances of them in
-parallel--i.e. given a program an a set of inputs, the programs shouldn't
-have any assumptions about the order in which they will be run over the
-inputs, whether one program instances will have completed before another
-runs. [#]_

+Basic Concepts: Program Instances and Gangs of Program Instances
+----------------------------------------------------------------

-.. [#] This is essentially the same requirement that languages like CUDA\*
-   and OpenCL\* place on the programmer.

+Upon entry to an ``ispc`` function called from C/C++ code, the execution
+model switches from the application's serial model to ``ispc``'s execution
+model.  Conceptually, a number of ``ispc`` *program instances* start
+running in parallel.  The group of running program instances is called a
+*gang* (harkening to "gang scheduling", since ``ispc`` provides certain
+guarantees about the control flow coherence of program instances running
+in a gang).  An ``ispc`` program instance is thus roughly similar to a
+CUDA* "thread" or an OpenCL* "work-item", and an ``ispc`` gang is roughly
+similar to a CUDA* "warp".
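+
+For example, each program instance can find its position within its gang
+with ``ispc``'s built-in ``programIndex`` variable, and the gang's size
+with ``programCount``.  (A small illustrative sketch:)
+
+::
+
+    // In a four-wide gang, programIndex is 0 in the first program
+    // instance, 1 in the second, and so forth, while programCount
+    // is 4 in every instance.
+    int index = programIndex;  // different in each program instance
+    int width = programCount;  // the same in each program instance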

-Given this guarantee, the ``ispc`` compiler can safely execute multiple
-program instances in parallel, across the SIMD lanes of a single CPU.  In
-many cases, this execution approach can achieve higher overall performance
-than if the program instances had been run serially.
-
-Upon entry to a ``ispc`` function, the execution model switches from
-the application's serial model to SPMD.  Conceptually, a number of
-``ispc`` program instances will start running in parallel.  This
-parallelism doesn't involve launching hardware threads.  Rather, one
-program instance is mapped to each of the SIMD lanes of the CPU's vector
-unit (Intel® SSE or Intel® AVX).
-
-If a ``ispc`` program is written to do a the following computation:
+An ``ispc`` program, then, expresses the computation performed by a gang of
+program instances, using an "implicitly parallel" model, where the ``ispc``
+program generally describes the behavior of a single program instance, even
+though a gang of them is actually executing together.  This implicit model
+is the same as the one used for shaders in programmable graphics pipelines,
+OpenCL* kernels, and CUDA*.  For example, consider the following ``ispc``
+function:

 ::

-    float x = ..., y = ...;
-    return x+y;
+    float func(float a, float b) {
+        return a + b / 2.;
+    }

-and if the ``ispc`` program is running four-wide on a CPU that supports the
-Intel® SSE instructions, then four program instances are running in
-parallel, each adding a pair of scalar values.  However, these four program
-instances store their individual scalar values for ``x`` and ``y`` in the
-lanes of an Intel® SSE vector register, so the addition operation for all
-four program instances can be done in parallel with a single ``addps``
-instruction.
+In C, this function describes a simple computation on two individual
+floating-point values.  In ``ispc``, this function describes the
+computation to be performed by each program instance in a gang.  Each
+program instance has distinct values for the variables ``a`` and ``b``, and
+thus each program instance generally computes a different result when
+executing this function.
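+
+To illustrate, here is one possible execution of ``func()`` by a
+hypothetical four-wide gang (the particular input values are invented for
+this example):
+
+::
+
+    //  program instance:    0     1     2     3
+    //  a              = {   0.,   1.,   2.,   3. }
+    //  b              = {  10.,  10.,  10.,  10. }
+    //  a + b / 2.     = {   5.,   6.,   7.,   8. }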

-Program execution is more complicated in the presence of control flow.  The
-details are handled by the ``ispc`` compiler, but you may find it helpful
-to understand what is going on in order to be a more effective ``ispc``
-programmer.  In particular, the mapping of SPMD to SIMD lanes can lead to
-reductions in this SIMD efficiency as different program instances want to
-perform different computations.  For example, consider a simple ``if``
-statement:
+The gang of program instances starts executing in the same hardware thread
+and context as the application code that called the ``ispc`` function; no
+thread creation or context switching is done under the covers by ``ispc``.
+Rather, the set of program instances is mapped to the SIMD lanes of the
+current processor, leading to excellent utilization of hardware SIMD units
+and high performance.
+
+The number of program instances in a gang is relatively small; in practice,
+it's no more than twice the native SIMD width of the hardware it is
+executing on.  (Thus, four or eight program instances in a gang on a CPU
+using the 4-wide SSE instruction set, and eight or sixteen on a CPU using
+8-wide AVX.)
+
+Control Flow Within A Gang
+--------------------------
+
+Almost all the standard control-flow constructs are supported by ``ispc``;
+program instances are free to follow different program execution paths from
+the other instances in their gang.  For example, consider a simple ``if``
+statement in ``ispc`` code:

 ::

     float x = ..., y = ...;
     if (x < y) {
-        ...
-    } else {
-        ...
+        // true statements
+    }
+    else {
+        // false statements
     }

 In general, the test ``x