Add FAQ about foreach code generation.
(i.e. "why's there that extra stuff at the end and what can I do about it if it's not necessary?) Issue #231.
This commit is contained in:
77
docs/faq.rst
77
docs/faq.rst
@@ -34,6 +34,7 @@ distribution.
|
|||||||
+ `How can a gang of program instances generate variable amounts of output efficiently?`_
|
+ `How can a gang of program instances generate variable amounts of output efficiently?`_
|
||||||
+ `Is it possible to use ispc for explicit vector programming?`_
|
+ `Is it possible to use ispc for explicit vector programming?`_
|
||||||
+ `How can I debug my ispc programs using Valgrind?`_
|
+ `How can I debug my ispc programs using Valgrind?`_
|
||||||
|
+ `foreach statements generate more complex assembly than I'd expect; what's going on?`_
|
||||||
|
|
||||||
Understanding ispc's Output
|
Understanding ispc's Output
|
||||||
===========================
|
===========================
|
||||||
@@ -693,3 +694,79 @@ you can use ``--target=sse4`` when compiling to run with ``valgrind``.
|
|||||||
Note that ``valgrind`` does not yet support programs that use the AVX
|
Note that ``valgrind`` does not yet support programs that use the AVX
|
||||||
instruction set.
|
instruction set.
|
||||||
|
|
||||||
|
foreach statements generate more complex assembly than I'd expect; what's going on?
|
||||||
|
-----------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
Given a simple ``foreach`` loop like the following:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
void foo(uniform float a[], uniform int count) {
|
||||||
|
foreach (i = 0 ... count)
|
||||||
|
a[i] *= 2;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
the ``ispc`` compiler generates approximately 40 instructions--why isn't
|
||||||
|
the generated code simpler?
|
||||||
|
|
||||||
|
There are two main components to the code: one handles
|
||||||
|
``programCount``-sized chunks of elements of the array, and the other
|
||||||
|
handles any excess elements at the end of the array that don't completely
|
||||||
|
fill a gang. The code for the main loop is essentially what one would
|
||||||
|
expect: a vector of values are laoded from the array, the multiply is done,
|
||||||
|
and the result is stored.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
LBB0_2: ## %foreach_full_body
|
||||||
|
movslq %edx, %rdx
|
||||||
|
vmovups (%rdi,%rdx), %ymm1
|
||||||
|
vmulps %ymm0, %ymm1, %ymm1
|
||||||
|
vmovups %ymm1, (%rdi,%rdx)
|
||||||
|
addl $32, %edx
|
||||||
|
addl $8, %eax
|
||||||
|
cmpl %ecx, %eax
|
||||||
|
jl LBB0_2
|
||||||
|
|
||||||
|
|
||||||
|
Then, there is a sequence of instructions that handles any additional
|
||||||
|
elements at the end of the array. (These instructions don't execute if
|
||||||
|
there aren't any left-over values to process, but they do lengthen the
|
||||||
|
amount of generated code.)
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## BB#4: ## %partial_inner_only
|
||||||
|
vmovd %eax, %xmm0
|
||||||
|
vinsertf128 $1, %xmm0, %ymm0, %ymm0
|
||||||
|
vpermilps $0, %ymm0, %ymm0 ## ymm0 = ymm0[0,0,0,0,4,4,4,4]
|
||||||
|
vextractf128 $1, %ymm0, %xmm3
|
||||||
|
vmovd %esi, %xmm2
|
||||||
|
vmovaps LCPI0_1(%rip), %ymm1
|
||||||
|
vextractf128 $1, %ymm1, %xmm4
|
||||||
|
vpaddd %xmm4, %xmm3, %xmm3
|
||||||
|
# ....
|
||||||
|
vmulps LCPI0_0(%rip), %ymm1, %ymm1
|
||||||
|
vmaskmovps %ymm1, %ymm0, (%rdi,%rax)
|
||||||
|
|
||||||
|
|
||||||
|
If you know that the number of elements to be processed will always be an
|
||||||
|
exact multiple of the 8, 16, etc., then adding a simple assignment to
|
||||||
|
``count`` like the one below gives the compiler enough information to be
|
||||||
|
able to eliminate the code for the additional array elements.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
void foo(uniform float a[], uniform int count) {
|
||||||
|
// This assignment doesn't change the value of count
|
||||||
|
// if it's a multiple of 16, but it gives the compiler
|
||||||
|
// insight into this fact, allowing for simpler code to
|
||||||
|
// be generated for the foreach loop.
|
||||||
|
count = (count & ~(16-1));
|
||||||
|
foreach (i = 0 ... count)
|
||||||
|
a[i] *= 2;
|
||||||
|
}
|
||||||
|
|
||||||
|
With this new version of ``foo()``, only the code for the first loop above
|
||||||
|
is generated.
|
||||||
|
|||||||
Reference in New Issue
Block a user