
=====================================
Frequently Asked Questions About ispc
=====================================

This document includes a number of frequently (and not frequently) asked
questions about ispc, the Intel® SPMD Program Compiler.  The source to this
document is in the file ``docs/faq.rst`` in the ``ispc`` source
distribution.

* Understanding ispc's Output

  + `How can I see the assembly language generated by ispc?`_
  + `How can I have the assembly output be printed using Intel assembly syntax?`_
  + `Why are there multiple versions of exported ispc functions in the assembly output?`_
  + `How can I more easily see gathers and scatters in generated assembly?`_

* Interoperability

  + `How can I supply an initial execution mask in the call from the application?`_
  + `How can I generate a single binary executable with support for multiple instruction sets?`_
  + `How can I determine at run-time which vector instruction set's instructions were selected to execute?`_

* Programming Techniques

  + `What primitives are there for communicating between SPMD program instances?`_
  + `How can a gang of program instances generate variable amounts of output efficiently?`_
  + `Is it possible to use ispc for explicit vector programming?`_
  + `How can I debug my ispc programs using Valgrind?`_

Understanding ispc's Output
===========================

How can I see the assembly language generated by ispc?
------------------------------------------------------

The ``--emit-asm`` flag causes assembly output to be generated.  If the
``-o`` command-line flag is also supplied, the assembly is stored in the
given file, or printed to standard output if ``-`` is specified for the
filename.  For example, given the simple ``ispc`` program:

::

    export uniform int foo(uniform int a, uniform int b) {
        return a+b;
    }

If the SSE4 target is used, then the following assembly is printed:

::

    _foo:
            addl    %esi, %edi
            movl    %edi, %eax
            ret


How can I have the assembly output be printed using Intel assembly syntax?
--------------------------------------------------------------------------

The ``ispc`` compiler is currently only able to emit assembly with AT&T
syntax, where the destination operand is the last operand after an
instruction.  If you'd prefer Intel assembly output, one option is to use
Agner Fog's ``objconv`` tool: have ``ispc`` emit a native object file and
then use ``objconv`` to disassemble it, specifying the assembler syntax
that you prefer.  ``objconv`` `is available for download here`_.

.. _is available for download here: http://www.agner.org/optimize/#objconv

Why are there multiple versions of exported ispc functions in the assembly output?
----------------------------------------------------------------------------------

Two versions of every function qualified with ``export`` are generated:
one to be called by other ``ispc`` functions, and the other to be called by
the application.  The application-callable function has the original
function's name, while the ``ispc``-callable function has a mangled name
that encodes the types of the function's parameters.

The crucial difference between these two functions is that the
application-callable function doesn't take a parameter encoding the current
execution mask, while ``ispc``-callable functions have a hidden mask
parameter.  An implication of this difference is that the ``export``
function starts with the execution mask "all on".  This allows a number of
improvements in the generated code, particularly on architectures that
don't have support for masked load and store instructions.

As an example, consider this short function, which loads a vector's worth
of values from two arrays in memory, adds them, and writes the result to an
output array.

::

    export void foo(uniform float a[], uniform float b[],
                    uniform float result[]) {
        float aa = a[programIndex], bb = b[programIndex];
        result[programIndex] = aa+bb;
    }

Here is the assembly code for the application-callable instance of the
function.

::

    _foo:
            movups        (%rsi), %xmm1
            movups        (%rdi), %xmm0
            addps         %xmm1, %xmm0
            movups        %xmm0, (%rdx)
            ret


And here is the assembly code for the ``ispc``-callable instance of the
function.

::

    "_foo___uptr<Uf>uptr<Uf>uptr<Uf>":
            movmskps      %xmm0, %eax
            cmpl          $15, %eax
            je            LBB0_3
            testl         %eax, %eax
            jne           LBB0_4
            ret
    LBB0_3:
            movups        (%rsi), %xmm1
            movups        (%rdi), %xmm0
            addps         %xmm1, %xmm0
            movups        %xmm0, (%rdx)
            ret
    LBB0_4:
    ####
    ####  Code elided; handle mixed mask case..
    ####
            ret

There are a few things to notice in this code.  First, the current program
mask is coming in via the ``%xmm0`` register and the initial few
instructions in the function essentially check to see if the mask is all on
or all off.  If the mask is all on, the code at the label ``LBB0_3`` executes;
it's the same as the code that was generated for ``_foo`` above.  If the
mask is all off, then there's nothing to be done, and the function can
return immediately.

In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on.  (This code is elided above.)  This
general pattern of having two code paths for the "all on" and "mixed" mask
cases is used in the code generated for all but the simplest functions
(where the overhead of the test isn't worthwhile).
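
The three-way split on the mask can be sketched in scalar C++.  This is only
an emulation of the control flow for a 4-wide gang, not code that ``ispc``
emits; the name ``masked_add`` and the representation of the mask as the low
four bits of an integer (the form that ``movmskps`` produces) are
assumptions made for illustration.

::

    #include <cassert>

    constexpr int kGangSize = 4;       // assumed 4-wide SSE gang
    constexpr unsigned kAllOn = 0xFu;  // all four mask bits set

    // Hypothetical scalar stand-in for the ispc-callable version of foo().
    // Bit i of mask is set when program instance i is running.
    void masked_add(const float a[], const float b[], float result[],
                    unsigned mask) {
        assert(mask <= kAllOn);
        if (mask == kAllOn) {
            // "All on": same straight-line work as the exported _foo.
            for (int lane = 0; lane < kGangSize; ++lane)
                result[lane] = a[lane] + b[lane];
            return;
        }
        if (mask == 0)
            return;  // "All off": nothing to do.
        // Mixed mask: touch only the lanes whose mask bit is set.
        for (int lane = 0; lane < kGangSize; ++lane)
            if (mask & (1u << lane))
                result[lane] = a[lane] + b[lane];
    }

With a mask of ``0x5``, for instance, only elements 0 and 2 of ``result``
are written; the other two keep their previous contents.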

How can I more easily see gathers and scatters in generated assembly?
---------------------------------------------------------------------

Because the SSE and AVX vector ISAs don't have native gather and scatter
instructions, these memory operations are turned into sequences of
instructions in the code that ``ispc`` generates.  In some cases, it can be
useful to see where gathers and scatters actually happen in code; there is
an otherwise undocumented command-line flag that provides this information.

Consider this simple program:

::

    void set(uniform int a[], int value, int index) {
        a[index] = value;
    }

When compiled normally to the SSE4 target, this program generates this
extensive code sequence, which makes it more difficult to see what the
program is actually doing.

::

    "_set___uptr<Ui>ii":
            pmulld        LCPI0_0(%rip), %xmm1
            movmskps      %xmm2, %eax
            testb         $1, %al
            je            LBB0_2
            movd          %xmm1, %ecx
            movd          %xmm0, (%rcx,%rdi)
    LBB0_2:
            testb         $2, %al
            je            LBB0_4
            pextrd        $1, %xmm1, %ecx
            pextrd        $1, %xmm0, (%rcx,%rdi)
    LBB0_4:
            testb         $4, %al
            je            LBB0_6
            pextrd        $2, %xmm1, %ecx
            pextrd        $2, %xmm0, (%rcx,%rdi)
    LBB0_6:
            testb        $8, %al
            je            LBB0_8
            pextrd        $3, %xmm1, %eax
            pextrd        $3, %xmm0, (%rax,%rdi)
    LBB0_8:
            ret

If this program is compiled with the
``--opt=disable-handle-pseudo-memory-ops`` command-line flag, then the
scatter is left as an unresolved function call.  The resulting program
won't link because of the unresolved symbol, but the assembly output is much
easier to understand:

::

    "_set___uptr<Ui>ii":
            movaps        %xmm0, %xmm3
            pmulld        LCPI0_0(%rip), %xmm1
            movdqa        %xmm1, %xmm0
            movaps        %xmm3, %xmm1
            jmp        ___pseudo_scatter_base_offsets32_32 ## TAILCALL
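
For comparison, the work that the expanded instruction sequence performs can
be written as a short scalar loop.  The following C++ is a sketch of those
semantics for a 4-wide gang, not the implementation of the pseudo-scatter
routine; ``scatter_store`` and the bit-per-lane ``mask`` argument are
hypothetical names chosen for illustration.

::

    #include <cassert>

    constexpr int kGangSize = 4;  // assumed 4-wide SSE4 gang

    // Hypothetical scalar equivalent of the scatter in set(): each running
    // program instance stores its own value at its own index.
    void scatter_store(int a[], const int value[], const int index[],
                       unsigned mask) {
        assert(mask < (1u << kGangSize));
        for (int lane = 0; lane < kGangSize; ++lane)
            if (mask & (1u << lane))  // corresponds to the testb/je pairs
                a[index[lane]] = value[lane];
    }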


Interoperability
================

How can I supply an initial execution mask in the call from the application?
----------------------------------------------------------------------------

Recall that when execution transitions from the application code to an
``ispc`` function, all of the program instances are initially executing.
In some cases, it may be desired that only some of them are running, based on
a data-dependent condition computed in the application program.  This
situation can easily be handled via an additional parameter from the
application.

As a simple example, consider a case where the application code has an
array of ``float`` values and we'd like the ``ispc`` code to update only
specific values in that array, where the application has determined which
values are to be updated.  In C++ code, we might have:

::

    int count = ...;
    float *array = new float[count];
    bool *shouldUpdate = new bool[count];
    // initialize array and shouldUpdate
    ispc_func(array, shouldUpdate, count);

Then, the ``ispc`` code could process this update as:

::

    export void ispc_func(uniform float array[], uniform bool update[],
                          uniform int count) {
        foreach (i = 0 ... count) {
            cif (update[i] == true) {
                // update array[i]...
            }
        }
    }

(In this case a "coherent" if statement is likely to be worthwhile if the
``update`` array will tend to have sections that are either all-true or
all-false.)

How can I generate a single binary executable with support for multiple instruction sets?
-----------------------------------------------------------------------------------------

``ispc`` can generate output that supports multiple target instruction
sets, along with code that chooses the most appropriate one at run time, if
multiple targets are specified with the ``--target`` command-line
argument.

For example, if you run the command:

::

   ispc foo.ispc -o foo.o --target=sse2,sse4-x2,avx-x2

Then four object files will be generated: ``foo_sse2.o``, ``foo_sse4.o``,
``foo_avx.o``, and ``foo.o``. [#]_  Link all of these into your executable, and
when you call a function in ``foo.ispc`` from your application code,
``ispc`` will determine which instruction sets are supported by the CPU the
code is running on and will call the most appropriate version of the
function available.

.. [#] Similarly, if you choose to generate assembly language output or
   LLVM bitcode output, multiple versions of those files will be created.

In general, the version of the function that runs will be the one for the
most capable instruction set that is both compiled in and supported by the
system.  If you only compile SSE2 and SSE4 variants and run on a system that
supports AVX, for example, then the SSE4 variant will be executed.  If the
system is not able to run any of the available variants of the function (for
example, trying to run a function that only has SSE4 and AVX variants on a
system that only supports SSE2), then the standard library ``abort()``
function will be called.
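
The selection rule can be sketched in C++ as follows.  This is an
illustration of the rule as described here, not the dispatch code that
``ispc`` generates; the ``Isa`` enumeration, the ``pick_variant`` name, and
the use of ``std::optional`` in place of calling ``abort()`` are assumptions
made so that the sketch is easy to exercise.

::

    #include <cassert>
    #include <optional>
    #include <vector>

    // Targets ordered from least to most capable (illustrative values).
    enum class Isa { SSE2 = 0, SSE4 = 1, AVX = 2 };

    // Returns the most capable compiled variant that the CPU can run, or
    // std::nullopt when none is usable (where the real dispatch code
    // would call abort()).
    std::optional<Isa> pick_variant(const std::vector<Isa>& compiled,
                                    Isa cpuBest) {
        assert(!compiled.empty());
        std::optional<Isa> best;
        for (Isa isa : compiled) {
            if (isa > cpuBest)
                continue;  // the CPU cannot run this variant
            if (!best || isa > *best)
                best = isa;
        }
        return best;
    }

Calling ``pick_variant({Isa::SSE2, Isa::SSE4}, Isa::AVX)`` yields
``Isa::SSE4``, matching the example above, while SSE4 and AVX variants on an
SSE2-only system yield no usable variant.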

One subtlety is that all non-static global variables (if any) must have the
same size and layout across all of the targets used.  For example, if you
have the global variables:

::

   uniform int foo[2*programCount];
   int bar;

and compile to both SSE2 and AVX targets, both of these variables will have
different sizes (the first due to ``programCount`` having the value 4 for
SSE2 and 8 for AVX, and the second due to ``varying`` types having different
numbers of elements with the two targets, which is essentially the same
issue as the first).  ``ispc`` issues an error in this case.
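
The size mismatch can be made concrete with a small C++ model.  The
``Globals`` template below is a hypothetical mirror of the two variables,
assuming a 32-bit ``int`` and one element per program instance for the
``varying`` variable; it is not how ``ispc`` itself lays out globals.

::

    #include <cstdint>

    // Hypothetical C++ mirror of the two globals for a given gang size.
    template <int programCount>
    struct Globals {
        std::int32_t foo[2 * programCount];  // uniform int foo[2*programCount]
        std::int32_t bar[programCount];      // varying int bar
    };

    // SSE2 (programCount == 4) versus AVX (programCount == 8).
    static_assert(sizeof(Globals<4>::foo) == 32, "foo: 32 bytes with SSE2");
    static_assert(sizeof(Globals<8>::foo) == 64, "foo: 64 bytes with AVX");
    static_assert(sizeof(Globals<4>::bar) == 16, "bar: 16 bytes with SSE2");
    static_assert(sizeof(Globals<8>::bar) == 32, "bar: 32 bytes with AVX");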


How can I determine at run-time which vector instruction set's instructions were selected to execute?
-----------------------------------------------------------------------------------------------------

``ispc`` doesn't provide any API that allows querying which vector ISA's
instructions are running when multi-target compilation was used.  However,
this can be solved in "user space" by writing a small helper function.
Specifically, if you implement a function like this

::

    export uniform int isa() {
    #if defined(ISPC_TARGET_SSE2)
        return 0;
    #elif defined(ISPC_TARGET_SSE4)
        return 1;
    #elif defined(ISPC_TARGET_AVX)
        return 2;
    #else
        return -1;
    #endif
    }

and then call it from your application code at runtime, it will return 0,
1, or 2, depending on which target's instructions are running.

The way this works is a little surprising, but it's a useful trick.  Of
course the preprocessor ``#if`` checks are all compile-time only
operations.  What's actually happening is that the function is compiled
multiple times, once for each target, with the appropriate ``ISPC_TARGET``
preprocessor symbol set.  Then, a small dispatch function is generated for
the application to actually call.  This dispatch function in turn calls the
appropriate version of the function based on the CPU of the system it's
executing on, which in turn returns the appropriate value.

In a similar fashion, it's possible to find out at run-time the value of
``programCount`` for the target that's actually being used.

::

    export uniform int width() { return programCount; }
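
The mechanism can be modeled in plain C++.  In this sketch the three
``isa_*`` functions stand in for the per-target compilations of ``isa()``,
and ``isa_dispatch`` stands in for the generated dispatch function; all of
these names, and the ``cpuLevel`` parameter that replaces the run-time CPU
check, are hypothetical.

::

    #include <cassert>

    // One copy of isa() per target, as if compiled with a different
    // ISPC_TARGET_* symbol defined each time.
    int isa_sse2() { return 0; }
    int isa_sse4() { return 1; }
    int isa_avx()  { return 2; }

    // Hypothetical dispatch stub.  cpuLevel stands in for the run-time CPU
    // check: 0 = SSE2 only, 1 = up to SSE4, 2 = up to AVX.
    int isa_dispatch(int cpuLevel) {
        assert(cpuLevel >= 0 && cpuLevel <= 2);
        using Fn = int (*)();
        static const Fn variants[] = {isa_sse2, isa_sse4, isa_avx};
        return variants[cpuLevel]();
    }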


Programming Techniques
======================

What primitives are there for communicating between SPMD program instances?
---------------------------------------------------------------------------

The ``broadcast()``, ``rotate()``, and ``shuffle()`` standard library
routines provide a variety of mechanisms for the running program instances
to communicate values to each other during execution.  Note that there's no
need to synchronize the program instances before communicating between
them, due to the synchronized execution model of gangs of program instances
in ``ispc``.
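
As a rough guide to what these routines compute, here is a scalar C++
emulation for an 8-wide gang.  It reflects our reading of the standard
library documentation rather than the actual implementation, and the
``gang_``-prefixed names and the ``Gang`` type are hypothetical.

::

    #include <array>
    #include <cassert>

    constexpr int kGangSize = 8;              // assumed 8-wide gang
    using Gang = std::array<int, kGangSize>;  // one value per program instance

    // Every instance receives the value held by instance `index`.
    Gang gang_broadcast(const Gang& v, int index) {
        assert(index >= 0 && index < kGangSize);
        Gang r{};
        r.fill(v[index]);
        return r;
    }

    // Instance i receives the value held by instance (i + offset),
    // wrapping around at the ends of the gang.
    Gang gang_rotate(const Gang& v, int offset) {
        Gang r{};
        for (int i = 0; i < kGangSize; ++i)
            r[i] = v[((i + offset) % kGangSize + kGangSize) % kGangSize];
        return r;
    }

    // Instance i receives the value held by instance perm[i].
    Gang gang_shuffle(const Gang& v, const Gang& perm) {
        Gang r{};
        for (int i = 0; i < kGangSize; ++i) {
            assert(perm[i] >= 0 && perm[i] < kGangSize);
            r[i] = v[perm[i]];
        }
        return r;
    }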

How can a gang of program instances generate variable amounts of output efficiently?
------------------------------------------------------------------------------------

It's not unusual to have a gang of program instances where each program
instance generates a variable amount of output (perhaps some generate no
output, some generate one output value, some generate many output values
and so forth), and where one would like to have the output densely packed
in an output array.  The ``exclusive_scan_add()`` function from the
standard library is quite useful in this situation.

Consider the following function:

::

    uniform int func(uniform float outArray[], ...) {
       int numOut = ...;  // figure out how many to be output
       float outLocal[MAX_OUT]; // staging area

       // each program instance in the gang puts its results in
       //  outLocal[0], ..., outLocal[numOut-1]

       int startOffset = exclusive_scan_add(numOut);
       for (int i = 0; i < numOut; ++i)
           outArray[startOffset + i] = outLocal[i];
       return reduce_add(numOut);
    }

Here, each program instance has computed a number, ``numOut``, of values to
output, and has stored them in the ``outLocal`` array.  Assume that four
program instances are running and that the first one wants to output one
value, the second two values, and the third and fourth three values each.
In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
to the four program instances, respectively.

The first program instance will then write its one result to
``outArray[0]``, the second will write its two values to ``outArray[1]``
and ``outArray[2]``, and so forth.  The ``reduce_add()`` call at the end
returns the total number of values that all of the program instances have
written to the array.
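
The worked example can be checked with a scalar C++ emulation.  The
functions below only model the behavior described above for a 4-wide gang;
``exclusive_scan_add_emulated``, ``pack_outputs``, and the value chosen for
``kMaxOut`` are hypothetical and are not part of the ``ispc`` standard
library.

::

    #include <array>
    #include <cassert>

    constexpr int kGangSize = 4;  // assumed 4-wide gang, as in the example
    constexpr int kMaxOut = 4;    // hypothetical stand-in for MAX_OUT
    using Gang = std::array<int, kGangSize>;  // one int per program instance

    // Instance i receives the sum of the values held by instances 0..i-1.
    Gang exclusive_scan_add_emulated(const Gang& v) {
        Gang r{};
        int running = 0;
        for (int i = 0; i < kGangSize; ++i) {
            r[i] = running;
            running += v[i];
        }
        return r;
    }

    // Packs each instance's staged values densely into outArray and returns
    // the total count, which is what reduce_add(numOut) would report.
    int pack_outputs(const Gang& numOut,
                     const float outLocal[kGangSize][kMaxOut],
                     float outArray[]) {
        const Gang start = exclusive_scan_add_emulated(numOut);
        int total = 0;
        for (int lane = 0; lane < kGangSize; ++lane) {
            assert(numOut[lane] >= 0 && numOut[lane] <= kMaxOut);
            for (int i = 0; i < numOut[lane]; ++i)
                outArray[start[lane] + i] = outLocal[lane][i];
            total += numOut[lane];
        }
        return total;
    }

For the counts (1, 2, 3, 3) the scan yields the offsets (0, 1, 3, 6), and
``pack_outputs`` fills nine consecutive elements of ``outArray``.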

.. FIXME: add discussion of foreach_active as an option here once that's in

Is it possible to use ispc for explicit vector programming?
-----------------------------------------------------------

The typical model for programming in ``ispc`` is an *implicit* parallel
model, where one writes a program that is apparently doing scalar
computation on values and the program is then vectorized to run in parallel
across the SIMD lanes of a processor.  However, ``ispc`` also has some
support for explicit vector unit programming, where the vectorization is
explicit.  Some computations may be more effectively described in the
explicit model rather than the implicit model.

This support is provided via ``uniform`` instances of short vectors.
Specifically, if this short program

::

    export uniform float<8> madd(uniform float<8> a, uniform float<8> b,
                                 uniform float<8> c) {
        return a + b * c;
    }

is compiled with the AVX target, ``ispc`` generates the following assembly:

::

    _madd:
            vmulps        %ymm2, %ymm1, %ymm1
            vaddps        %ymm0, %ymm1, %ymm0
            ret

(And similarly, if compiled with a 4-wide SSE target, two ``mulps`` and two
``addps`` instructions are generated, and so forth.)

Note that ``ispc`` doesn't currently support control-flow based on
``uniform`` short vector types; it is thus not possible to write code like:

::

    export uniform int<8> count(uniform float<8> a, uniform float<8> b) {
        uniform int<8> sum = 0;
        while (a++ < b)
            ++sum;
    }


How can I debug my ispc programs using Valgrind?
------------------------------------------------

The `valgrind`_ memory checker is an extremely useful tool for
Linux and OSX; it detects a range of memory errors, including accessing
memory after it has been freed, accessing memory beyond the end of an
array, accessing uninitialized stack variables, and so forth.
In general, applications that use ``ispc`` code run with ``valgrind``
without modification and ``valgrind`` will detect the same range of memory
errors in ``ispc`` code that it does in C/C++ code.

.. _valgrind: http://valgrind.org

One issue to be aware of is that until recently, ``valgrind`` only
supported the SSE2 vector instructions; if you are using a version of
``valgrind`` older than the 3.7.0 release (5 November 2011), you should
compile your ``ispc`` programs with ``--target=sse2`` before running them
through ``valgrind``.  (Note that if no target is specified, then ``ispc``
chooses a target based on the capabilities of the system you're running
``ispc`` on.)  If you run an ``ispc`` program that uses instructions that
``valgrind`` doesn't support, you'll see an error message like:

::

    vex amd64->IR: unhandled instruction bytes: 0xC5 0xFA 0x10 0x0 0xC5 0xFA 0x11 0x84
    ==46059== valgrind: Unrecognised instruction at address 0x100002707.

The just-released valgrind 3.7.0 adds support for the SSE4.2 instruction
set; if you're using that version (and your system supports SSE4.2), then
you can use ``--target=sse4`` when compiling to run with ``valgrind``.

Note that ``valgrind`` does not yet support programs that use the AVX
instruction set.