PTX documentation. first commit

This commit is contained in:
evghenii
2014-07-08 09:21:20 +02:00
parent 1fc75ed494
commit 3459c75fbc


Experimental support for PTX
============================
``ispc`` provides experimental support for PTX code generation, currently
targeting NVIDIA GPUs with compute capability 3.5 and higher [Kepler GPUs with
support for dynamic parallelism]. Due to its experimental nature, the PTX
backend currently imposes several restrictions on the ``ispc`` program, which
are described below.
Overview
--------
SPMD programming in ``ispc`` with the PTX target in mind should be thought of
as warp-synchronous CUDA programming: every program instance is mapped to a
CUDA thread, and a gang is mapped to a CUDA warp. Hence, to run efficiently on
a GPU, an ``ispc`` program must use the tasking functionality via the
``launch`` keyword to ensure that multiple warps execute concurrently on the
GPU.
``export`` functions are equipped with a CUDA C wrapper which schedules a
single warp, i.e. a thread-block with a total of 32 threads. In contrast to
CPU programming, this exported function, either directly or indirectly, should
use the ``launch`` keyword to schedule work across the GPU; there is no other
way to efficiently utilize the GPU's rich compute resources.
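The following is a minimal sketch of this pattern; the function names, the
task count, and the chunking strategy are all arbitrary choices for
illustration:

::

   task void scale_task(uniform float data[], uniform float s, uniform int count) {
       // Each task processes a contiguous chunk of the array; with the PTX
       // target, each task maps onto one warp-sized gang.
       uniform int chunk = (count + taskCount - 1) / taskCount;
       uniform int begin = taskIndex * chunk;
       uniform int end   = min(begin + chunk, count);
       foreach (i = begin ... end)
           data[i] *= s;
   }

   export void scale(uniform float data[], uniform float s, uniform int count) {
       // The CUDA C wrapper enters this function with a single warp; "launch"
       // spreads the work over many warps.  64 is an arbitrary choice here.
       launch[64] scale_task(data, s, count);
   }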
At the PTX level, the ``launch`` keyword is mapped to CUDA Dynamic Parallelism:
it schedules a grid of thread-blocks, each 4 warps (128 threads) wide
[dim3(128,1,1)]. As a result, ``ispc`` currently has a tasking granularity of
4 tasks with the PTX target; this restriction will be eliminated in the future.
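For example, if this mapping holds, the ``launch[64]`` call in the sketch above
would correspond to a grid of 16 thread-blocks of 128 threads each, with 4
warp-sized tasks packed into every block; launch counts that are a multiple of
4 therefore leave no warps idle in the last block.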
When passing pointers to an ``export`` function compiled for execution on the
GPU, it is important that these pointers remain valid when accessed from the
GPU. Prior to CUDA 6.0, such a pointer had to hold an address that is only
accessible from the GPU. With the release of CUDA 6.0, it is possible to pass
a pointer to unified memory allocated with ``cudaMallocManaged``. The examples
provide rudimentary wrapper functions that call the CUDA API for managed
memory allocations, allowing the programmer to avoid explicit memory copies.
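As a minimal sketch (the function and its arguments are hypothetical), an
``export`` function simply dereferences whatever pointer it is handed, so the
host is expected to allocate that memory either through the CUDA API directly
(for example with ``cudaMallocManaged`` on CUDA 6.0) or through the
managed-memory wrappers shipped with the examples:

::

   export void fill(uniform float data[], uniform float value, uniform int count) {
       // "data" must point to GPU-visible (or, with CUDA 6.0, unified) memory;
       // the function body itself is ordinary ispc code.
       foreach (i = 0 ... count)
           data[i] = value;
   }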
Compiling For The NVIDIA Kepler GPU
-----------------------------------
Compilation for an NVIDIA Kepler GPU is currently a multi-step procedure.
First, we need to generate LLVM bitcode from the ``ispc`` source file:
::

   $ISPC_HOME/ispc foo.ispc --emit-llvm --target=nvptx -o foo.bc
If ``ispc`` is compiled with LLVM 3.2, the resulting bitcode can immediately be
compiled into PTX with the help of the ``ptxgen`` tool; this tool uses
``libNVVM``, which is part of the CUDA Toolkit:
::

   $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.bc -o foo.ptx
If ``ispc`` is compiled with an LLVM version newer than 3.2, the resulting
bitcode must first be disassembled with the ``llvm-dis`` that comes with the
LLVM 3.2 distribution; this "trick" is required to generate an IR compatible
with libNVVM:
::

   $LLVM32/bin/llvm-dis foo.bc -o foo.ll
   $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.ll -o foo.ptx
At this point the resulting PTX code is ready for execution on a GPU, for
example via the CUDA Driver API. Alternatively, we also provide a simple
``ptxcc`` tool, which compiles the PTX code into an object file:
::

   $ISPC_HOME/ptxtools/ptxcc foo.ptx -o foo_cu.o -Xnvcc="--maxrregcount=64 -Xptxas=-v"
Finally, this object file can be linked with the main program via ``nvcc``:
::
Hints
-----
A few things to observe:
- ``uniform`` arrays in function scope are statically allocated in
  ``__shared__`` memory, with all ensuing consequences. For example, if more
  shared memory is allocated than is available per SMX, a linking or runtime
  error will occur.
- If large ``uniform`` arrays are desired, we recommend allocating them with
  ``uniform new uniform T[size]``, ideally outside the tasking function (see
  ``deferred/kernels.ispc`` in the deferred shading example and the sketch
  after this list).
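A minimal sketch of this pattern, with made-up names and an arbitrary task
count:

::

   task void shade_task(uniform float * uniform scratch, uniform int n) {
       // A small uniform array in function scope: statically allocated in
       // __shared__ memory, so its size counts against the per-SMX limit.
       uniform float lut[64];
       for (uniform int i = 0; i < 64; i++)
           lut[i] = i * (1.0f / 64);
       foreach (i = 0 ... n)
           scratch[i] = lut[i & 63];
   }

   export void shade(uniform int n) {
       // A large buffer declared as "uniform float big[...]" inside the task
       // would also land in __shared__ memory, so heap-allocate it instead.
       uniform float * uniform scratch = uniform new uniform float[n];
       launch[4] shade_task(scratch, n);
       sync;
       delete scratch;
   }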
The examples that produce executables for the CPU, Xeon Phi and the Kepler GPU
demonstrate several tuning approaches that can benefit GPU performance.
``ispc`` may also generate performance warnings which, if followed, may
improve GPU application performance.
Limitations & known issues
--------------------------
Due to its experimental form, PTX code generation is known to impose several
limitations on the ``ispc`` program, which are documented in the following list:
- Must use the ``ispc`` tasking functionality to run efficiently on the GPU
- Must use ``new/delete`` and/or ``ispc_malloc``/``ispc_free``/``ispc_memset``/``ispc_memcpy`` to allocate/free/set/copy memory that is visible to the GPU
- ``export`` functions must have ``void`` return type.
- ``task``/``export`` functions do not accept varying data-types
- ``new``/``delete`` currently only works with ``uniform`` data-types
- ``aossoa``/``soaaos`` is not yet supported
- ``sizeof(varying)`` is not yet supported
- Function pointers do not work yet (compilation may or may not fail)
- ``memset``/``memcpy``/``memmove`` is not yet supported
- ``uniform`` arrays in global scope are mapped to global memory
- ``varying`` arrays in global scope are not yet supported
- ``uniform`` arrays in local scope are mapped to shared memory
- ``varying`` arrays in local scope are mapped to local memory
- ``const uniform/varying`` arrays are mapped to local memory
- ``const static uniform`` arrays are mapped to constant memory
- ``const static varying`` arrays are mapped to global memory
- ``static`` data types in local scope are not allowed; compilation will fail
- Best performance is obtained with libNVVM (the LLVM PTX backend can also be used, but it requires ``libdevice.compute_35.10.bc``, which comes with libNVVM)
There are likely more issues which, together with some of those mentioned
above, will be fixed in due time.
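To make the memory-space mapping in the list above concrete, here is a small
sketch (all names are invented and the kernel is not meant to do anything
useful); the comments restate where each declaration is expected to live under
the PTX target:

::

   uniform float g_buffer[256];         // uniform array in global scope -> global memory

   // const static uniform array -> constant memory
   const static uniform float g_coeff[4] = { 0.5f, 1.0f, 1.5f, 2.0f };

   task void spaces_task(uniform float out[], uniform int n) {
       uniform float tile[32];          // uniform array in local scope  -> shared memory
       float per_lane[2];               // varying array in local scope  -> local memory

       for (uniform int i = 0; i < 32; i++)
           tile[i] = g_buffer[i] * g_coeff[i & 3];

       per_lane[0] = 0.0f;
       per_lane[1] = 1.0f;
       foreach (i = 0 ... n)
           per_lane[0] += per_lane[1] * tile[i & 31];

       out[taskIndex] = reduce_add(per_lane[0]);
   }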