PTX documentation. first commit

evghenii, 2014-07-08 09:21:20 +02:00
commit 3459c75fbc (parent 1fc75ed494)


Experimental support for PTX
============================

``ispc`` provides experimental support for PTX code generation, currently
targeting NVIDIA GPUs with compute capability 3.5 or higher [Kepler GPUs with
support for dynamic parallelism]. Due to its experimental nature, the PTX
backend currently imposes several restrictions on the ``ispc`` program, which
are described below.

Overview
--------

SPMD programming in ``ispc`` is similar to warp-synchronous CUDA programming.
Namely, the program instances in a gang are the equivalent of the CUDA threads
in a single warp. Hence, to run efficiently on a GPU, an ``ispc`` program must
use the tasking functionality via the ``launch`` keyword to ensure that
multiple warps execute concurrently on the GPU.

``export`` functions are equipped with a CUDA C wrapper which schedules a
single warp (a thread-block with a total of 32 threads). In contrast to CPU
programming, this exported function, either directly or otherwise, should
utilize the ``launch`` keyword to schedule work on the GPU.

At the PTX level, the ``launch`` keyword is mapped to CUDA Dynamic Parallelism
and schedules a grid of thread-blocks, each 4 warps wide (128 threads). As a
result, ``ispc`` currently has a tasking granularity of 4 tasks with the PTX
target; this restriction will be eliminated in the future.
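
As a minimal sketch of this programming model (the function and variable names
are illustrative and not taken from the ``ispc`` distribution), a ``task``
function processes one chunk of the data and an ``export`` function launches
enough tasks to keep many warps in flight:

::

  // scale.ispc -- illustrative sketch; names are not from the ispc distribution
  task void scale_task(uniform float data[], uniform int count)
  {
      // each task handles a contiguous chunk; the program instances of the
      // gang (a single warp on the GPU) process the elements in SPMD fashion
      uniform int chunk = (count + taskCount - 1) / taskCount;
      uniform int begin = taskIndex * chunk;
      uniform int end = min(begin + chunk, count);
      foreach (i = begin ... end)
          data[i] *= 2.0f;
  }

  export void scale(uniform float data[], uniform int count)
  {
      // launch many tasks so that many warps can run concurrently on the GPU;
      // with the PTX target, tasks are grouped 4 per thread-block
      launch[128] scale_task(data, count);
  }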

When passing pointers to an ``export`` function, it is important that they
remain valid when accessed from the GPU. Prior to CUDA 6.0, such a pointer had
to hold an address that is accessible from the GPU only. With the release of
CUDA 6.0, it is possible to pass a pointer to unified memory allocated with
``cudaMallocManaged``. The examples provide rudimentary wrapper functions that
call the CUDA API for managed memory allocations, allowing programmers to
avoid explicit memory copies.
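
For illustration, here is a minimal host-side C++ sketch that allocates
managed memory directly with the CUDA runtime and passes it to the exported
function from the sketch above; the ``extern "C"`` declaration of the wrapper
is an assumption made for this example, not a documented interface:

::

  // main.cpp -- illustrative host code; the declaration of the exported
  // function's CUDA C wrapper is an assumption, not a documented interface
  #include <cuda_runtime.h>
  #include <cstdio>

  extern "C" void scale(float *data, int count);

  int main()
  {
      const int count = 1 << 20;
      float *data = NULL;
      // unified (managed) memory is accessible from both CPU and GPU (CUDA >= 6.0)
      cudaMallocManaged(&data, count * sizeof(float));
      for (int i = 0; i < count; ++i)
          data[i] = (float)i;

      scale(data, count);          // executes on the GPU through the CUDA C wrapper
      cudaDeviceSynchronize();     // wait for the GPU before reading the results

      printf("data[1] = %f\n", data[1]);
      cudaFree(data);
      return 0;
  }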

Compiling For The NVIDIA Kepler GPU
-----------------------------------

Compilation for an NVIDIA Kepler GPU is currently a multi-step procedure.
First, we need to generate LLVM bitcode from the ``ispc`` source file:

::

  $ISPC_HOME/ispc foo.ispc --emit-llvm --target=nvptx -o foo.bc

If ``ispc`` is compiled with LLVM 3.2, the resulting bitcode can immediately
be compiled into PTX with the help of the ``ptxgen`` tool; this tool uses
``libNVVM``, which is part of the CUDA Toolkit:

::

  $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.bc -o foo.ptx

If ``ispc`` is compiled with an LLVM version newer than 3.2, the resulting
bitcode must first be disassembled with ``llvm-dis`` from the LLVM 3.2
distribution; this "trick" is required to generate IR compatible with libNVVM:

::

  $LLVM32/bin/llvm-dis foo.bc -o foo.ll
  $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.ll -o foo.ptx

The resulting PTX code is ready for execution on a GPU, for example via the
CUDA Driver API. Alternatively, we also provide a simple ``ptxcc`` tool, which
compiles the PTX code into an object file:

::

  $ISPC_HOME/ptxtools/ptxcc foo.ptx -o foo_cu.o -Xnvcc="--maxrregcount=64 -Xptxas=-v"

This object file can be linked with the main program via ``nvcc``:

::
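
  # illustrative link line; the file names and flags are assumptions made for
  # this example and may need to be adjusted for a particular setup
  nvcc -arch=sm_35 main.cpp foo_cu.o -o app -lcudadevrt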

Hints
-----

- ``uniform`` arrays in function scope are statically allocated in
  ``__shared__`` memory, with all ensuing consequences. For example, if more
  shared memory is allocated than is available per SMX, a linking or runtime
  error will occur
- If large ``uniform`` arrays are needed, we recommend allocating them with
  ``uniform new uniform T[size]``, ideally outside the tasking function (see
  ``deferred/kernels.ispc`` in the deferred shading example and the sketch
  after this list)
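
A minimal sketch of this allocation pattern is shown below (the names are
illustrative and not taken from ``deferred/kernels.ispc``):

::

  // illustrative sketch; names are not taken from deferred/kernels.ispc
  task void zero_chunk(uniform float data[], uniform int chunk)
  {
      uniform float * uniform begin = data + taskIndex * chunk;
      foreach (i = 0 ... chunk)
          begin[i] = 0.0f;
  }

  export void process(uniform int size, uniform int ntasks)
  {
      // allocate the large array with "uniform new" outside the tasking
      // function, instead of declaring "uniform float data[BIG]" inside a
      // task (a local uniform array would be placed in __shared__ memory)
      uniform float * uniform data = uniform new uniform float[size];
      launch[ntasks] zero_chunk(data, size / ntasks);
      sync;
      // ... launch further tasks that consume "data", then release it with delete
  }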

The examples that produce executables for CPU, Xeon Phi, and Kepler GPU
demonstrate several tuning approaches that can benefit GPU performance.

``ispc`` may also generate performance warnings which, if addressed, can
improve GPU application performance.

Limitations & known issues
--------------------------

Due to its experimental nature, PTX code generation is known to impose several
limitations on the ``ispc`` program, which are documented in the following list:

- Must use the ``ispc`` tasking functionality to run efficiently on the GPU
- Must use ``new``/``delete`` and/or ``ispc_malloc``/``ispc_free``/``ispc_memset``/``ispc_memcpy`` to allocate/free/set/copy memory that is visible to the GPU
- ``export`` functions must have a ``void`` return type
- ``task``/``export`` functions do not accept ``varying`` data types
- ``new``/``delete`` currently only work with ``uniform`` data types
- ``aossoa``/``soaaos`` are not yet supported
- ``sizeof(varying)`` is not yet supported
- Function pointers do not work yet (they may or may not cause a compilation failure)
- ``memset``/``memcpy``/``memmove`` are not yet supported
- ``uniform`` arrays in global scope are mapped to global memory
- ``varying`` arrays in global scope are not yet supported
- ``uniform`` arrays in local scope are mapped to shared memory
- ``varying`` arrays in local scope are mapped to local memory
- ``const uniform/varying`` arrays are mapped to local memory
- ``const static uniform`` arrays are mapped to constant memory
- ``const static varying`` arrays are mapped to global memory (these memory-space mappings are summarized in the sketch after this list)
- ``static`` data types in local scope are not allowed; compilation will fail
- Best performance is obtained with libNVVM (the LLVM PTX backend can also be used, but it requires ``libdevice.compute_35.10.bc``, which comes with libNVVM)
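
The memory-space mappings listed above can be summarized with a short sketch;
the comments simply restate the list, and the variable names are illustrative:

::

  // illustrative summary of the memory-space mappings listed above
  uniform float g_table[256];                          // uniform, global scope -> global memory
  const static uniform float g_cu[4] = {0, 1, 2, 3};   // const static uniform  -> constant memory
  const static float g_cv[4] = {0, 1, 2, 3};           // const static varying  -> global memory

  task void kernel()
  {
      uniform float tile[64];                          // uniform, local scope  -> __shared__ memory
      float per_instance[8];                           // varying, local scope  -> local memory
      const uniform float weights[2] = {0.5, 0.5};     // const uniform         -> local memory
      // (use of the arrays is omitted)
  }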

There are likely more issues which, together with some of those mentioned
above, will be fixed in due time.