PTX documentation. first commit

evghenii, 2014-07-08 09:21:20 +02:00
commit 3459c75fbc (parent 1fc75ed494)


Experimental support for PTX
============================

``ispc`` provides experimental support for PTX code generation, currently
targeting NVIDIA GPUs with compute capability 3.5 or higher [Kepler GPUs with
support for dynamic parallelism]. Due to its experimental nature, the PTX
backend currently imposes several restrictions on the ``ispc`` program, which
are described below.

Overview
--------

SPMD programming in ``ispc`` is similar to warp-synchronous CUDA programming.
Namely, the program instances in a gang are the equivalent of the CUDA threads
in a single warp. Hence, to run efficiently on a GPU, an ``ispc`` program must
use the tasking functionality via the ``launch`` keyword to ensure that
multiple warps execute concurrently on the GPU.

``export`` functions are equipped with a CUDA C wrapper which schedules a
single warp (a thread-block with a total of 32 threads). In contrast to CPU
programming, this exported function, either directly or otherwise, should
utilize the ``launch`` keyword to schedule work on the GPU.

At the PTX level, the ``launch`` keyword is mapped to CUDA Dynamic Parallelism
and schedules a grid of thread-blocks, each 4 warps wide (128 threads). As a
result, ``ispc`` currently has a tasking granularity of 4 tasks with the PTX
target; this restriction will be eliminated in the future.
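
As a minimal sketch of this programming model (the function and variable names
are illustrative and not taken from the ``ispc`` distribution), a ``task``
function processes one chunk of the data and an ``export`` function launches
enough tasks to keep many warps in flight:

::

  // scale.ispc -- illustrative sketch; names are not from the ispc distribution
  task void scale_task(uniform float data[], uniform int count)
  {
      // each task handles a contiguous chunk; the program instances of the
      // gang (a single warp on the GPU) process the elements in SPMD fashion
      uniform int chunk = (count + taskCount - 1) / taskCount;
      uniform int begin = taskIndex * chunk;
      uniform int end = min(begin + chunk, count);
      foreach (i = begin ... end)
          data[i] *= 2.0f;
  }

  export void scale(uniform float data[], uniform int count)
  {
      // launch many tasks so that many warps can run concurrently on the GPU;
      // with the PTX target, tasks are grouped 4 per thread-block
      launch[128] scale_task(data, count);
  }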

When passing pointers to an ``export`` function, it is important that they
remain valid when accessed from the GPU. Prior to CUDA 6.0, such a pointer had
to hold an address that is accessible from the GPU only. With the release of
CUDA 6.0, it is possible to pass a pointer to unified memory allocated with
``cudaMallocManaged``. The examples provide rudimentary wrapper functions that
call the CUDA API for managed memory allocations, allowing programmers to
avoid explicit memory copies.
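
For illustration, here is a minimal host-side C++ sketch that allocates
managed memory directly with the CUDA runtime and passes it to the exported
function from the sketch above; the ``extern "C"`` declaration of the wrapper
is an assumption made for this example, not a documented interface:

::

  // main.cpp -- illustrative host code; the declaration of the exported
  // function's CUDA C wrapper is an assumption, not a documented interface
  #include <cuda_runtime.h>
  #include <cstdio>

  extern "C" void scale(float *data, int count);

  int main()
  {
      const int count = 1 << 20;
      float *data = NULL;
      // unified (managed) memory is accessible from both CPU and GPU (CUDA >= 6.0)
      cudaMallocManaged(&data, count * sizeof(float));
      for (int i = 0; i < count; ++i)
          data[i] = (float)i;

      scale(data, count);          // executes on the GPU through the CUDA C wrapper
      cudaDeviceSynchronize();     // wait for the GPU before reading the results

      printf("data[1] = %f\n", data[1]);
      cudaFree(data);
      return 0;
  }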

Compiling For The NVIDIA Kepler GPU
-----------------------------------

Compilation for an NVIDIA Kepler GPU is currently a multi-step procedure.
First, we need to generate LLVM bitcode from the ``ispc`` source file:

::

  $ISPC_HOME/ispc foo.ispc --emit-llvm --target=nvptx -o foo.bc

If ``ispc`` is compiled with LLVM 3.2, the resulting bitcode can immediately
be compiled into PTX with the help of the ``ptxgen`` tool; this tool uses
``libNVVM``, which is part of the CUDA Toolkit:

::

  $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.bc -o foo.ptx

If ``ispc`` is compiled with an LLVM version newer than 3.2, the resulting
bitcode must first be disassembled with ``llvm-dis`` from the LLVM 3.2
distribution; this "trick" is required to generate IR compatible with libNVVM:

::

  $LLVM32/bin/llvm-dis foo.bc -o foo.ll
  $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.ll -o foo.ptx

The resulting PTX code is ready for execution on a GPU, for example via the
CUDA Driver API. Alternatively, we also provide a simple ``ptxcc`` tool, which
compiles the PTX code into an object file:

::

  $ISPC_HOME/ptxtools/ptxcc foo.ptx -o foo_cu.o -Xnvcc="--maxrregcount=64 -Xptxas=-v"

This object file can be linked with the main program via ``nvcc``:

::
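
  # illustrative link line; the file names and flags are assumptions made for
  # this example and may need to be adjusted for a particular setup
  nvcc -arch=sm_35 main.cpp foo_cu.o -o app -lcudadevrt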

Hints
-----

- ``uniform`` arrays in function scope are statically allocated in
  ``__shared__`` memory, with all ensuing consequences. For example, if more
  shared memory is allocated than is available per SMX, a linking or runtime
  error will occur
- If large ``uniform`` arrays are needed, we recommend allocating them with
  ``uniform new uniform T[size]``, ideally outside the tasking function (see
  ``deferred/kernels.ispc`` in the deferred shading example and the sketch
  after this list)
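
A minimal sketch of this allocation pattern is shown below (the names are
illustrative and not taken from ``deferred/kernels.ispc``):

::

  // illustrative sketch; names are not taken from deferred/kernels.ispc
  task void zero_chunk(uniform float data[], uniform int chunk)
  {
      uniform float * uniform begin = data + taskIndex * chunk;
      foreach (i = 0 ... chunk)
          begin[i] = 0.0f;
  }

  export void process(uniform int size, uniform int ntasks)
  {
      // allocate the large array with "uniform new" outside the tasking
      // function, instead of declaring "uniform float data[BIG]" inside a
      // task (a local uniform array would be placed in __shared__ memory)
      uniform float * uniform data = uniform new uniform float[size];
      launch[ntasks] zero_chunk(data, size / ntasks);
      sync;
      // ... launch further tasks that consume "data", then release it with delete
  }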

The examples that produce executables for CPU, Xeon Phi, and Kepler GPU
demonstrate several tuning approaches that can benefit GPU performance.

``ispc`` may also generate performance warnings which, if addressed, can
improve GPU application performance.

Limitations & known issues
--------------------------

Due to its experimental nature, PTX code generation is known to impose several
limitations on the ``ispc`` program, which are documented in the following list:

- Must use the ``ispc`` tasking functionality to run efficiently on the GPU
- Must use ``new``/``delete`` and/or ``ispc_malloc``/``ispc_free``/``ispc_memset``/``ispc_memcpy`` to allocate/free/set/copy memory that is visible to the GPU
- ``export`` functions must have a ``void`` return type
- ``task``/``export`` functions do not accept ``varying`` data types
- ``new``/``delete`` currently only work with ``uniform`` data types
- ``aossoa``/``soaaos`` are not yet supported
- ``sizeof(varying)`` is not yet supported
- Function pointers do not work yet (they may or may not cause a compilation failure)
- ``memset``/``memcpy``/``memmove`` are not yet supported
- ``uniform`` arrays in global scope are mapped to global memory
- ``varying`` arrays in global scope are not yet supported
- ``uniform`` arrays in local scope are mapped to shared memory
- ``varying`` arrays in local scope are mapped to local memory
- ``const uniform/varying`` arrays are mapped to local memory
- ``const static uniform`` arrays are mapped to constant memory
- ``const static varying`` arrays are mapped to global memory (these memory-space mappings are summarized in the sketch after this list)
- ``static`` data types in local scope are not allowed; compilation will fail
- Best performance is obtained with libNVVM (the LLVM PTX backend can also be used, but it requires ``libdevice.compute_35.10.bc``, which comes with libNVVM)
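
The memory-space mappings listed above can be summarized with a short sketch;
the comments simply restate the list, and the variable names are illustrative:

::

  // illustrative summary of the memory-space mappings listed above
  uniform float g_table[256];                          // uniform, global scope -> global memory
  const static uniform float g_cu[4] = {0, 1, 2, 3};   // const static uniform  -> constant memory
  const static float g_cv[4] = {0, 1, 2, 3};           // const static varying  -> global memory

  task void kernel()
  {
      uniform float tile[64];                          // uniform, local scope  -> __shared__ memory
      float per_instance[8];                           // varying, local scope  -> local memory
      const uniform float weights[2] = {0.5, 0.5};     // const uniform         -> local memory
      // (use of the arrays is omitted)
  }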

There are likely more issues which, together with some of those mentioned
above, will be fixed in due time.