From 3459c75fbcd47c16d7c6e104bece73eef973b0a9 Mon Sep 17 00:00:00 2001
From: evghenii
Date: Tue, 8 Jul 2014 09:21:20 +0200
Subject: [PATCH] PTX documentation. first commit

---
 docs/ispc.rst | 110 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 72 insertions(+), 38 deletions(-)

diff --git a/docs/ispc.rst b/docs/ispc.rst
index 209ea64d..2f31755e 100644
--- a/docs/ispc.rst
+++ b/docs/ispc.rst
@@ -4945,77 +4945,76 @@ program instances improves performance.
 Experimental support for PTX
 ============================
 
-``ispc`` has a limited support for PTX code generation which currently targets
-NVIDIA GPUs with compute capability 3.5 [Kepler GPUs with support for dynamic
-parallelism]. Due to its experimental support in ``ispc``, the PTX backend
-currently impose several restrictions on the source code which will detailed
-below.
+``ispc`` provides experimental support for PTX code generation, currently
+targeting NVIDIA GPUs with compute capability 3.5 or higher [Kepler GPUs with
+support for dynamic parallelism]. Due to its experimental nature, the PTX
+backend currently imposes several restrictions on the ``ispc`` program; these
+are described below.
 
 Overview
 --------
 
-SPMD programming in ``ispc`` with PTX target in mind should be thought of a
-warp-synchronous CUDA programming. In particular, every program instances is
-mapped to a CUDA thread, and a gang is mapped to a CUDA warp. To run efficiently
-on GPU, `ispc`` program must use tasking functionality via ``launch`` keyword.
+SPMD programming in ``ispc`` is similar to warp-synchronous CUDA programming:
+program instances in a gang are the equivalent of CUDA threads in a single
+warp. Hence, to run efficiently on a GPU, an ``ispc`` program must use the
+tasking functionality via the ``launch`` keyword to ensure that multiple warps
+execute concurrently on the GPU.
 
-``export`` functions are also equipped with a CUDA C wrapper that schedule a
-single thread-block of 32 threads--a warp--. In contract to CPU programming, it
-is expected that this exported function, either directly or otherwise, will
-utilize ``launch`` keyword to schedule a work across GPU. In contrast to CPU,
-there is no other way to efficiently utilize rich GPU compute resources.
+``export`` functions are equipped with a CUDA C wrapper which schedules a
+single warp, i.e., a thread-block with a total of 32 threads. In contrast to
+CPU programming, this exported function, either directly or indirectly, should
+use the ``launch`` keyword to schedule work across the GPU.
 
-At PTX level, ``launch`` keyword is mapped to a CUDA Dynamic Parallelism that
-schedules a grid of thread-blocks each 128 threads--or 4 warps--wide
-[dim3(128,1,1)]. Therefore ``ispc`` currently tasking-granularity with PTX
-target is 4 tasks; this restriction will be eliminated in future.
+At the PTX level, the ``launch`` keyword is mapped to CUDA Dynamic Parallelism:
+it schedules a grid of thread-blocks, each 4 warps (128 threads) wide. As a
+result, ``ispc`` currently has a tasking granularity of 4 tasks with the PTX
+target; this restriction will be eliminated in the future.
 
-When passing pointers to an ``export`` function compiled for execution on GPU,
-it is important that these pointers remain legal when access from GPU. Prior to
-CUDA 6.0, this pointers has to hold address that is only accessible from the
-GPU. With the release of CUDA 6.0, it is possible to pass a pointer to unified
-memory. For this, ``ispc`` provides helper wrapper functions that call CUDA API
-for managed memory allocations, therefore allowing the programming to avoid
+When passing pointers to an ``export`` function, it is important that they
+remain legal when accessed from the GPU. Prior to CUDA 6.0, such a pointer had
+to hold an address that is only accessible from the GPU. With the release of
+CUDA 6.0, it is possible to pass a pointer to unified memory allocated with
+``cudaMallocManaged``. The examples provide rudimentary wrapper functions that
+call the CUDA API for managed memory allocations, allowing programmers to avoid
 explicit memory copies.
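+
+For illustration, here is a minimal sketch of this pattern; the function names
+are illustrative and not part of any API. An ``export`` function uses
+``launch`` to schedule one task per row of an array:
+
+::
+
+    task void row_task(uniform float vin[], uniform float vout[],
+                       uniform int width) {
+        // one task per row; a gang of program instances runs as one warp
+        uniform int row = taskIndex;
+        foreach (i = 0 ... width)
+            vout[row * width + i] = 2.0f * vin[row * width + i];
+    }
+
+    export void scale_rows(uniform float vin[], uniform float vout[],
+                           uniform int width, uniform int height) {
+        // the CUDA C wrapper runs this function in a single warp; launch
+        // schedules the real work, 4 tasks per 128-thread thread-block
+        launch[height] row_task(vin, vout, width);
+    }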
 
 Compiling For The NVIDIA Kepler GPU
 -----------------------------------
 
-Compilation for NVIDIA Kepler GPU is currently a several step procedure.
+Compilation for the NVIDIA Kepler GPU is a multi-step procedure.
 
-First we need to generate a LLVM bitcode from ``ispc`` source file:
+First, we need to generate LLVM bitcode from the ``ispc`` source file:
 
 ::
 
    $ISPC_HOME/ispc foo.ispc --emit-llvm --target=nvptx -o foo.bc
 
-If ``ispc`` is compiled with LLVM 3.2, the resulting bitcode can immediately be
-compile to PTX with the help of ``ptxgen`` tool which uses ``libNVVM`` [this
-requires CUDA Toolkit installation]:
+If ``ispc`` is compiled with LLVM 3.2, the resulting bitcode can immediately be
+compiled into PTX with the help of the ``ptxgen`` tool; this tool uses
+``libNVVM``, which is part of the CUDA Toolkit:
 
 ::
 
    $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.bc -o foo.ptx
 
-Otherwise, we need to decompile the bitcode with the ``llvm-dis`` that comes
-with LLVM 3.2 distribution; this "trick" is required to generate an IR
-compatible with libNVVM:
+If ``ispc`` is compiled with an LLVM version newer than 3.2, the resulting
+bitcode must first be disassembled with ``llvm-dis`` from the LLVM 3.2
+distribution; this "trick" is required to generate IR compatible with libNVVM:
 
 ::
 
    $LLVM32/bin/llvm-dis foo.bc -o foo.ll
   $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.ll -o foo.ptx
 
-At this point the resulting PTX code could be used to run on GPU with the help
-of, for example, CUDA Driver API. Instead, we provide a ``ptxcc`` tool, which
-compiles the PTX code into an object file:
+The resulting PTX code is ready for execution on a GPU, for example via the
+CUDA Driver API. Alternatively, we also provide a simple ``ptxcc`` tool, which
+compiles the PTX code into an object file:
 
 ::
 
    $ISPC_HOME/ptxtools/ptxcc foo.ptx -o foo_cu.o -Xnvcc="--maxrregcount=64 -Xptxas=-v"
 
-Finally, this object file can be linked with the main program via ``nvcc``:
+This object file can be linked with the main program via ``nvcc``:
 
 ::
 
@@ -5024,10 +5023,45 @@ Finally, this object file can be linked with the main program via ``nvcc``:
 
 Hints
 -----
 
-Few things to observe -
+- ``uniform`` arrays in a function scope are statically allocated in
+  ``__shared__`` memory, with all ensuing consequences. For example, if more
+  is allocated than the shared memory available per SMX, a linking or runtime
+  error will occur.
+- If large ``uniform`` arrays are desired, we recommend allocating them with
+  ``uniform new uniform T[size]``, ideally outside the tasking function (see
+  ``deferred/kernels.ispc`` in the deferred shading example, and the sketch at
+  the end of this section).
+
+The examples that produce executables for CPU, Xeon Phi and the Kepler GPU
+demonstrate several tuning approaches that can benefit GPU performance.
+``ispc`` may also generate performance warnings which, if addressed, can
+improve GPU application performance.
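+
+The following minimal sketch illustrates both hints; the function and variable
+names are illustrative. A small ``uniform`` array lives in function scope,
+where it is statically allocated in ``__shared__`` memory, while a large
+buffer is allocated on the heap in the ``export`` function and passed to the
+task as an argument:
+
+::
+
+    task void consume(uniform float big[], uniform int n) {
+        // small uniform array in function scope: statically allocated
+        // in __shared__ memory
+        uniform float tile[64];
+        foreach (i = 0 ... 64)
+            tile[i] = big[taskIndex * 64 + i];
+        foreach (i = 0 ... 64)
+            big[taskIndex * 64 + i] = 2.0f * tile[i];
+    }
+
+    export void run(uniform int n) {
+        // a large uniform array: allocate it on the heap rather than as a
+        // function-scope array, which may not fit in __shared__ memory
+        uniform float * uniform big = uniform new uniform float[n];
+        launch[n / 64] consume(big, n);   // assumes n is a multiple of 64
+        sync;
+        delete big;
+    }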
+
 
 Limitations & known issues
 --------------------------
 
+Due to its experimental status, PTX code generation imposes several
+limitations on the ``ispc`` program, which are documented in the following
+list (the storage mappings are illustrated in a short sketch at the end of
+this section):
+
+- Must use the ``ispc`` tasking functionality to run efficiently on the GPU
+- Must use ``new``/``delete`` and/or
+  ``ispc_malloc``/``ispc_free``/``ispc_memset``/``ispc_memcpy`` to
+  allocate/free/set/copy memory that is visible to the GPU
+- ``export`` functions must have a ``void`` return type
+- ``task``/``export`` functions do not accept varying data types
+- ``new``/``delete`` currently work only with ``uniform`` data types
+- ``aossoa``/``soaaos`` are not yet supported
+- ``sizeof(varying)`` is not yet supported
+- Function pointers do not work yet (they may or may not cause a compilation
+  failure)
+- ``memset``/``memcpy``/``memmove`` are not yet supported
+- ``uniform`` arrays in global scope are mapped to global memory
+- ``varying`` arrays in global scope are not yet supported
+- ``uniform`` arrays in local scope are mapped to shared memory
+- ``varying`` arrays in local scope are mapped to local memory
+- ``const uniform/varying`` arrays are mapped to local memory
+- ``const static uniform`` arrays are mapped to constant memory
+- ``const static varying`` arrays are mapped to global memory
+- ``static`` data types in local scope are not allowed; compilation will fail
+- Best performance is obtained with libNVVM (the LLVM PTX backend can also be
+  used, but it requires ``libdevice.compute_35.10.bc``, which comes with
+  libNVVM)
+
+There are likely more issues; these, together with some of the limitations
+mentioned above, will be fixed in due time.
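+
+As a quick illustration of the storage mappings listed above, consider the
+following hypothetical declarations; the names and sizes are arbitrary:
+
+::
+
+    static const uniform int lut[4] = { 1, 2, 3, 4 };  // constant memory
+    uniform float table[256];            // global scope: global memory
+
+    task void work() {
+        uniform float tile[32];          // local scope: __shared__ memory
+        float lanes[8];                  // varying, local scope: local memory
+
+        lanes[0] = lut[taskIndex & 3] * table[taskIndex & 255];
+        tile[taskIndex & 31] = reduce_add(lanes[0]);
+    }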