Documentation work; first pass perf guide complete

Matt Pharr
2011-12-01 09:42:56 -08:00
parent a2f118a14e
commit f90aa172a6
3 changed files with 611 additions and 257 deletions


@@ -23,7 +23,7 @@ distribution.
* Programming Techniques
+ `What primitives are there for communicating between SPMD program instances?`_
+ `How can a gang of program instances generate variable output efficiently?`_
+ `How can a gang of program instances generate variable amounts of output efficiently?`_
+ `Is it possible to use ispc for explicit vector programming?`_
@@ -48,8 +48,7 @@ If the SSE4 target is used, then the following assembly is printed:
::
_foo: ## @foo
## BB#0: ## %allocas
_foo:
addl %esi, %edi
movl %edi, %eax
ret
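For reference, an ``ispc`` function along the following lines (a sketch; the
exact source isn't shown in this excerpt) compiles down to the scalar add
above, since the parameters and return value are all ``uniform``:

::

    export uniform int foo(uniform int a, uniform int b) {
        return a + b;    // scalar add: a single addl, no vector code needed
    }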
@@ -98,7 +97,7 @@ output array.
}
Here is the assembly code for the application-callable instance of the
function--note that the selected instructions are ideal.
function.
::
@@ -111,21 +110,7 @@ function--note that the selected instructions are ideal.
And here is the assembly code for the ``ispc``-callable instance of the
function. There are a few things to notice in this code.
The current program mask comes in via the %xmm0 register, and the initial
few instructions in the function check whether the mask is all on or all
off. If the mask is all on, the code at the label LBB0_3 executes; it's
the same as the code that was generated for ``_foo`` above. If the mask is
all off, there's nothing to be done, and the function can return
immediately.

In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on. (This code is elided below.) This
general pattern of having two code paths for the "all on" and "mixed" mask
cases is used in the code generated for all but the simplest functions
(where the overhead of the test isn't worthwhile).
function.
::
@@ -148,11 +133,84 @@ functions (where the overhead of the test isn't worthwhile.)
####
ret
There are a few things to notice in this code. First, the current program
mask comes in via the ``%xmm0`` register, and the initial few instructions
in the function check whether the mask is all on or all off. If the mask is
all on, the code at the label LBB0_3 executes; it's the same as the code
that was generated for ``_foo`` above. If the mask is all off, there's
nothing to be done, and the function can return immediately.

In the case of a mixed mask, a substantial amount of code is generated to
load from and then store to only the array elements that correspond to
program instances where the mask is on. (This code is elided from the
listing above.) This general pattern of having two code paths for the
"all on" and "mixed" mask cases is used in the code generated for all but
the simplest functions (where the overhead of the test isn't worthwhile).
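At the source level, this dispatch pattern can be sketched roughly as
follows. This is a purely illustrative ``ispc``-style function (hypothetical
name, parameters, and element type; it is neither the guide's example nor
what the compiler literally emits):

::

    // Illustrative sketch of the "all on" / "mixed" / "all off" dispatch
    // that the compiler generates for the ispc-callable instance.
    export void add_arrays_sketch(uniform float a[], uniform float b[],
                                  uniform float result[],
                                  uniform int maskBits) {
        if (maskBits == (1 << programCount) - 1) {
            // "all on" path: plain vector loads, add, and store
            result[programIndex] = a[programIndex] + b[programIndex];
        } else if (maskBits != 0) {
            // "mixed" path: touch only elements of active program instances
            if (((maskBits >> programIndex) & 1) != 0)
                result[programIndex] = a[programIndex] + b[programIndex];
        }
        // "all off": nothing to do; return immediately
    }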
How can I more easily see gathers and scatters in generated assembly?
---------------------------------------------------------------------
FIXME
Because CPU vector ISAs don't have native gather and scatter instructions,
these memory operations are turned into sequences of regular instructions
in the code that ``ispc`` generates. In some cases, it can be useful to see
where gathers and scatters actually happen in the code; there is an
otherwise undocumented command-line flag that provides this information.
Consider this simple program:
::
void set(uniform int a[], int value, int index) {
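    // 'index' (and 'value') are varying, so this store is a scatter: each
    // program instance may write a different value to a different element.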
a[index] = value;
}
When compiled normally for the SSE4 target, this program generates the
following lengthy code sequence, which makes it harder to see what the
program is actually doing.
::
"_set___uptr<Ui>ii":
pmulld LCPI0_0(%rip), %xmm1
movmskps %xmm2, %eax
testb $1, %al
je LBB0_2
movd %xmm1, %ecx
movd %xmm0, (%rcx,%rdi)
LBB0_2:
testb $2, %al
je LBB0_4
pextrd $1, %xmm1, %ecx
pextrd $1, %xmm0, (%rcx,%rdi)
LBB0_4:
testb $4, %al
je LBB0_6
pextrd $2, %xmm1, %ecx
pextrd $2, %xmm0, (%rcx,%rdi)
LBB0_6:
testb $8, %al
je LBB0_8
pextrd $3, %xmm1, %eax
pextrd $3, %xmm0, (%rax,%rdi)
LBB0_8:
ret
If this program is compiled with the
``--opt=disable-handle-pseudo-memory-ops`` command-line flag, then the
scatter is left as an unresolved function call. The resulting program
won't link, due to the unresolved symbols, but the assembly output is much
easier to understand:
::
"_set___uptr<Ui>ii":
movaps %xmm0, %xmm3
pmulld LCPI0_0(%rip), %xmm1
movdqa %xmm1, %xmm0
movaps %xmm3, %xmm1
jmp ___pseudo_scatter_base_offsets32_32 ## TAILCALL
Interoperability
================
@@ -301,13 +359,17 @@ need to synchronize the program instances before communicating between
them, due to the synchronized execution model of gangs of program instances
in ``ispc``.
How can a gang of program instances generate variable output efficiently?
-------------------------------------------------------------------------
How can a gang of program instances generate variable amounts of output efficiently?
------------------------------------------------------------------------------------
A useful application of the ``exclusive_scan_add()`` function in the
standard library is when program instances want to generate a variable
amount of output and when one would like that output to be densely packed
in a single array. For example, consider the code fragment below:
It's not unusual to have a gang of program instances where each program
instance generates a variable amount of output (perhaps some generate no
output, some generate one output value, some generate many output values
and so forth), and where one would like to have the output densely packed
in an output array. The ``exclusive_scan_add()`` function from the
standard library is quite useful in this situation.
Consider the following function:
::
@@ -331,11 +393,11 @@ value, the second two values, and the third and fourth three values each.
In this case, ``exclusive_scan_add()`` will return the values (0, 1, 3, 6)
to the four program instances, respectively.
The first program instance will write its one result to ``outArray[0]``,
the second will write its two values to ``outArray[1]`` and
``outArray[2]``, and so forth. The ``reduce_add`` call at the end returns
the total number of values that all of the program instances have written
to the array.
The first program instance will then write its one result to
``outArray[0]``, the second will write its two values to ``outArray[1]``
and ``outArray[2]``, and so forth. The ``reduce_add()`` call at the end
returns the total number of values that all of the program instances have
written to the array.
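For concreteness, here is a minimal sketch of this pattern (the function
name, parameters, and the use of a single repeated ``value`` per program
instance are illustrative assumptions, not the guide's actual code):

::

    // Illustrative sketch: each program instance writes 'count' copies of
    // 'value', packed densely into outArray; returns the total written.
    uniform int emit_packed(uniform float outArray[], int count, float value) {
        // Starting offset for each instance = sum of the counts of the
        // program instances before it in the gang.
        int offset = exclusive_scan_add(count);
        for (int i = 0; i < count; ++i)
            outArray[offset + i] = value;
        // Total number of values the whole gang wrote to outArray
        return (uniform int)reduce_add(count);
    }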
FIXME: add discussion of foreach_active as an option here once that's in