Documentation work; first pass perf guide complete
124
docs/faq.txt
@@ -23,7 +23,7 @@ distribution.
 * Programming Techniques
 
   + `What primitives are there for communicating between SPMD program instances?`_
-  + `How can a gang of program instances generate variable output efficiently?`_
+  + `How can a gang of program instances generate variable amounts of output efficiently?`_
   + `Is it possible to use ispc for explicit vector programming?`_
 
 
@@ -48,8 +48,7 @@ If the SSE4 target is used, then the following assembly is printed:
 
 ::
 
-    _foo: ## @foo
-    ## BB#0: ## %allocas
+    _foo:
         addl %esi, %edi
         movl %edi, %eax
         ret
@@ -98,7 +97,7 @@ output array.
     }
 
 Here is the assembly code for the application-callable instance of the
-function--note that the selected instructions are ideal.
+function.
 
 ::
 
@@ -111,21 +110,7 @@ function--note that the selected instructions are ideal.
 
 
 And here is the assembly code for the ``ispc``-callable instance of the
-function. There are a few things to notice in this code.
-
-The current program mask is coming in via the %xmm0 register and the
-initial few instructions in the function essentially check to see if the
-mask is all-on or all-off. If the mask is all on, the code at the label
-LBB0_3 executes; it's the same as the code that was generated for ``_foo``
-above. If the mask is all off, then there's nothing to be done, and the
-function can return immediately.
-
-In the case of a mixed mask, a substantial amount of code is generated to
-load from and then store to only the array elements that correspond to
-program instances where the mask is on. (This code is elided below). This
-general pattern of having two-code paths for the "all on" and "mixed" mask
-cases is used in the code generated for almost all but the most simple
-functions (where the overhead of the test isn't worthwhile.)
+function.
 
 ::
 
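The paragraphs moved by this change describe a three-way dispatch on the program mask: return at once when the mask is all off, take the unmasked path when it is all on, and fall back to per-lane work when it is mixed. A minimal C sketch of that control flow follows. It assumes a gang of four program instances and a bitmask with one bit per instance; ``add_one_masked``, ``GANG_SIZE``, ``ALL_ON`` and the ``PATH_*`` codes are hypothetical names chosen for illustration, not anything ``ispc`` emits.

```c
#include <assert.h>

#define GANG_SIZE 4                        /* assumed gang size (4-wide SSE4 target) */
#define ALL_ON ((1u << GANG_SIZE) - 1u)    /* every program instance active */

/* Hypothetical path codes, returned only so the chosen path is observable. */
enum { PATH_ALL_OFF = 0, PATH_ALL_ON = 1, PATH_MIXED = 2 };

/* Scalar stand-in for an ispc-callable function that adds 1 to each
   element owned by an active program instance. */
static int add_one_masked(float a[GANG_SIZE], unsigned mask) {
    assert((mask & ~ALL_ON) == 0);         /* only GANG_SIZE mask bits are valid */

    if (mask == 0)
        return PATH_ALL_OFF;               /* nothing to do: return immediately */

    if (mask == ALL_ON) {
        /* Fast path: no per-lane tests, same work as the unmasked version. */
        for (int lane = 0; lane < GANG_SIZE; ++lane)
            a[lane] += 1.0f;
        return PATH_ALL_ON;
    }

    /* Mixed mask: touch only the elements whose instance is active. */
    for (int lane = 0; lane < GANG_SIZE; ++lane)
        if (mask & (1u << lane))
            a[lane] += 1.0f;
    return PATH_MIXED;
}
```

The early test costs a couple of instructions, which is why the FAQ text notes that very simple functions may skip it.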
@@ -148,11 +133,84 @@ functions (where the overhead of the test isn't worthwhile.)
         ####
         ret
 
+There are a few things to notice in this code. First, the current program
+mask is coming in via the ``%xmm0`` register and the initial few
+instructions in the function essentially check to see if the mask is all on
+or all off. If the mask is all on, the code at the label LBB0_3 executes;
+it's the same as the code that was generated for ``_foo`` above. If the
+mask is all off, then there's nothing to be done, and the function can
+return immediately.
+
+In the case of a mixed mask, a substantial amount of code is generated to
+load from and then store to only the array elements that correspond to
+program instances where the mask is on. (This code is elided above.) This
+general pattern of having two code paths for the "all on" and "mixed" mask
+cases is used in the code generated for all but the simplest functions
+(where the overhead of the test isn't worthwhile).
+
 How can I more easily see gathers and scatters in generated assembly?
 ---------------------------------------------------------------------
 
-FIXME
+Because CPU vector ISAs such as SSE4 don't have native gather and scatter
+instructions, these memory operations are expanded into sequences of
+instructions in the code that ``ispc`` generates. In some cases, it can be
+useful to see where gathers and scatters actually happen in code; there is
+an otherwise undocumented command-line flag that provides this information.
+
+Consider this simple program:
+
+::
+
+    void set(uniform int a[], int value, int index) {
+        a[index] = value;
+    }
+
+When compiled normally to the SSE4 target, this program generates the
+extensive code sequence below, which makes it more difficult to see what
+the program is actually doing.
+
+::
+
+    "_set___uptr<Ui>ii":
+        pmulld LCPI0_0(%rip), %xmm1
+        movmskps %xmm2, %eax
+        testb $1, %al
+        je LBB0_2
+        movd %xmm1, %ecx
+        movd %xmm0, (%rcx,%rdi)
+    LBB0_2:
+        testb $2, %al
+        je LBB0_4
+        pextrd $1, %xmm1, %ecx
+        pextrd $1, %xmm0, (%rcx,%rdi)
+    LBB0_4:
+        testb $4, %al
+        je LBB0_6
+        pextrd $2, %xmm1, %ecx
+        pextrd $2, %xmm0, (%rcx,%rdi)
+    LBB0_6:
+        testb $8, %al
+        je LBB0_8
+        pextrd $3, %xmm1, %eax
+        pextrd $3, %xmm0, (%rax,%rdi)
+    LBB0_8:
+        ret
+
+If this program is compiled with the
+``--opt=disable-handle-pseudo-memory-ops`` command-line flag, then the
+scatter is left as an unresolved function call. The resulting program
+won't link because of the unresolved symbol, but the assembly output is
+much easier to understand:
+
+::
+
+    "_set___uptr<Ui>ii":
+        movaps %xmm0, %xmm3
+        pmulld LCPI0_0(%rip), %xmm1
+        movdqa %xmm1, %xmm0
+        movaps %xmm3, %xmm1
+        jmp ___pseudo_scatter_base_offsets32_32 ## TAILCALL
+
 
 
 Interoperability
 ================
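The long SSE4 listing added in this hunk is easier to follow next to a scalar model of what it computes. The sketch below is an emulation in C, not ``ispc`` output. It assumes four program instances and a mask with one bit per instance, as ``movmskps`` produces; ``scatter_masked`` and ``GANG_SIZE`` are hypothetical names. The ``pmulld`` at the top of the listing presumably scales the indices to byte offsets, which C array indexing does implicitly.

```c
#include <assert.h>
#include <stdint.h>

#define GANG_SIZE 4   /* assumed gang size (4-wide SSE4 target) */

/* Scalar emulation of a masked 32-bit scatter, a[index] = value:
   each active program instance stores its own value at its own index. */
static void scatter_masked(int32_t *a,
                           const int32_t value[GANG_SIZE],
                           const int32_t index[GANG_SIZE],
                           unsigned mask) {
    for (int lane = 0; lane < GANG_SIZE; ++lane) {
        /* Corresponds to each "testb $bit, %al ; je ..." pair. */
        if ((mask & (1u << lane)) == 0)
            continue;
        assert(index[lane] >= 0);       /* sketch only: no upper-bound check */
        /* Corresponds to extracting one lane (movd / pextrd) and storing it. */
        a[index[lane]] = value[lane];
    }
}
```

Each loop iteration corresponds to one test, branch, extract and store group in the listing, which is why a single line of source turns into roughly two dozen instructions.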
@@ -301,13 +359,17 @@ need to synchronize the program instances before communicating between
 them, due to the synchronized execution model of gangs of program instances
 in ``ispc``.
 
-How can a gang of program instances generate variable output efficiently?
--------------------------------------------------------------------------
+How can a gang of program instances generate variable amounts of output efficiently?
+------------------------------------------------------------------------------------
 
-A useful application of the ``exclusive_scan_add()`` function in the
-standard library is when program instances want to generate a variable
-amount of output and when one would like that output to be densely packed
-in a single array. For example, consider the code fragment below:
+It's not unusual to have a gang of program instances where each program
+instance generates a variable amount of output (perhaps some generate no
+output, some generate one output value, some generate many output values
+and so forth), and where one would like to have the output densely packed
+in an output array. The ``exclusive_scan_add()`` function from the
+standard library is quite useful in this situation.
+
+Consider the following function:
 
 ::
 
@@ -331,11 +393,11 @@ value, the second two values, and the third and fourth three values each.
 In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
 to the four program instances, respectively.
 
-The first program instance will write its one result to ``outArray[0]``,
-the second will write its two values to ``outArray[1]`` and
-``outArray[2]``, and so forth. The ``reduce_add`` call at the end returns
-the total number of values that all of the program instances have written
-to the array.
+The first program instance will then write its one result to
+``outArray[0]``, the second will write its two values to ``outArray[1]``
+and ``outArray[2]``, and so forth. The ``reduce_add()`` call at the end
+returns the total number of values that all of the program instances have
+written to the array.
 
 FIXME: add discussion of foreach_active as an option here once that's in
 
 
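The packing scheme in this hunk can be checked with a small scalar model. The C sketch below emulates what the FAQ attributes to ``exclusive_scan_add()`` and ``reduce_add()`` for a gang of four; it is not the ``ispc`` standard library. The names ``exclusive_scan_add_emulated`` and ``pack_outputs`` are hypothetical, and the stored values ``instance * 100 + j`` are placeholders chosen only to make the layout visible.

```c
#include <assert.h>

#define GANG_SIZE 4   /* assumed gang size, matching the four-instance example */

/* Emulates an exclusive prefix sum across the gang: offset[i] is the sum of
   count[0..i-1]. Returns the sum of all counts, as reduce_add() would. */
static int exclusive_scan_add_emulated(const int count[GANG_SIZE],
                                       int offset[GANG_SIZE]) {
    int running = 0;
    for (int i = 0; i < GANG_SIZE; ++i) {
        assert(count[i] >= 0);
        offset[i] = running;    /* sum of the counts before this instance */
        running += count[i];
    }
    return running;
}

/* Each instance writes count[i] values starting at its own offset, so the
   output is densely packed with no gaps and no overlap. */
static int pack_outputs(const int count[GANG_SIZE], int *outArray, int capacity) {
    int offset[GANG_SIZE];
    int total = exclusive_scan_add_emulated(count, offset);
    assert(total <= capacity);
    for (int i = 0; i < GANG_SIZE; ++i)
        for (int j = 0; j < count[i]; ++j)
            outArray[offset[i] + j] = i * 100 + j;   /* placeholder payload */
    return total;
}
```

With counts (1, 2, 3, 3) the offsets come out as (0, 1, 3, 6) and the total is 9, matching the values quoted in the FAQ text.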