From 1fc75ed494cf1c67b4e4adf3d0be384309987c79 Mon Sep 17 00:00:00 2001
From: evghenii <egaburov@dds.nl>
Date: Tue, 8 Jul 2014 08:41:29 +0200
Subject: [PATCH] started to work on documentation

---
 docs/ispc.rst | 91 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 77 insertions(+), 14 deletions(-)

diff --git a/docs/ispc.rst b/docs/ispc.rst
index f315354e..209ea64d 100644
--- a/docs/ispc.rst
+++ b/docs/ispc.rst
@@ -181,9 +181,9 @@ Contents:
 * `Experimental support for PTX`_
 
   + `Overview`_
-  + `Generation of PTX`_
-  + `Execution of PTX`_
+  + `Compiling For The NVIDIA Kepler GPU`_
   + `Hints`_
+  + `Limitations & known issues`_
 
 * `Disclaimer and Legal Information`_
 
@@ -4945,27 +4945,90 @@ program instances improves performance.
 
 Experimental support for PTX
 ============================
-One of the ``ispc`` goals is also to offer performance portability of ISPC
-program across various parallel processors, in particular CPUs and GPUs. This
-section describes how to use ISPC in combination with CUDA Toolkit to generate
-and execute PTX.
+``ispc`` has a limited support for PTX code generation which currently targets
+NVIDIA GPUs with compute capability 3.5 [Kepler GPUs with support for dynamic
+parallelism]. Due to its experimental support in ``ispc``, the PTX backend
+currently impose several restrictions on the source code which will detailed
+below.
 
 Overview
 --------
-SPMD programming model can be mapped to CUDA cores.
+SPMD programming in ``ispc`` with PTX target in mind should be thought of a
+warp-synchronous CUDA programming. In particular, every program instances is
+mapped to a CUDA thread, and a gang is mapped to a CUDA warp. To run efficiently
+on GPU, `ispc`` program must use tasking functionality via ``launch`` keyword.
+
+``export`` functions are also equipped with a CUDA C wrapper that schedule a
+single thread-block of 32 threads--a warp--. In contract to CPU programming, it
+is expected that this exported function, either directly or otherwise, will
+utilize ``launch`` keyword to schedule a work across GPU. In contrast to CPU,
+there is no other way to efficiently utilize rich GPU compute resources.
+
+At PTX level, ``launch`` keyword is mapped to a CUDA Dynamic Parallelism that
+schedules a grid of thread-blocks each 128 threads--or 4 warps--wide
+[dim3(128,1,1)]. Therefore ``ispc`` currently tasking-granularity with PTX
+target is 4 tasks; this restriction will be eliminated in future. 
+
+When passing pointers to an ``export`` function compiled for execution on GPU,
+it is important that these pointers remain legal when access from GPU. Prior to
+CUDA 6.0, this pointers has to hold address that is only accessible from the
+GPU.  With the release of CUDA 6.0, it is possible to pass a pointer to unified
+memory. For this, ``ispc`` provides helper wrapper functions that call CUDA API
+for managed memory allocations, therefore allowing the programming to avoid
+explicit memory copies.
 
 
-Generation of PTX
-------------------
-To generate PTX.
-  
-Execution of PTX
-----------------
-To execute PTX
+
+Compiling For The NVIDIA Kepler GPU
+-----------------------------------
+Compilation for NVIDIA Kepler GPU is currently a several step procedure.
+
+First we need to generate a LLVM bitcode from ``ispc`` source file:
+
+::
+
+  $ISPC_HOME/ispc foo.ispc --emit-llvm --target=nvptx -o foo.bc
+
+If ``ispc`` is compiled with LLVM 3.2, the resulting bitcode  can immediately be
+compile to PTX with the help of ``ptxgen`` tool which uses ``libNVVM`` [this
+requires CUDA Toolkit installation]:
+
+::
+
+  $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.bc -o foo.ptx
+
+Otherwise, we need to decompile the bitcode with the ``llvm-dis`` that comes
+with LLVM 3.2 distribution; this "trick" is required to generate an IR
+compatible with libNVVM:
+
+::
+
+  $LLVM32/bin/llvm-dis foo.bc -o foo.ll
+  $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.ll -o foo.ptx
+
+At this point the resulting PTX code could be used to run on GPU with the help
+of, for example, CUDA Driver API. Instead, we provide a ``ptxcc`` tool, which
+compiles the PTX code into an object file:
+
+::
+
+   $ISPC_HOME/ptxtools/ptxcc foo.ptx -o foo_cu.o -Xnvcc="--maxrregcount=64
+   -Xptxas=-v"
+
+Finally, this object file can be linked with the main program via ``nvcc``:
+
+::
+
+    nvcc foo_cu.o foo_main.o -o foo
+
 
 Hints
 -----
 Few things to observe
+  
+Limitations & known issues
+--------------------------
+
 
 
 Disclaimer and Legal Information