PTX documentation. first commit

docs/ispc.rst

Experimental support for PTX
============================

``ispc`` provides experimental support for PTX code generation, which
currently targets NVIDIA GPUs with compute capability 3.5 or higher [Kepler
GPUs with support for dynamic parallelism]. Due to its experimental nature,
the PTX backend currently imposes several restrictions on the ``ispc``
program, which are described below.

Overview
--------

SPMD programming in ``ispc`` with the PTX target in mind is similar to
warp-synchronous CUDA programming. Namely, program instances in a gang are the
equivalent of CUDA threads in a single warp. Hence, to run efficiently on a
GPU, an ``ispc`` program must use the tasking functionality via the ``launch``
keyword to ensure that multiple warps are executed concurrently on the GPU.
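
As a minimal sketch of this mapping (a hypothetical function, not part of the
shipped examples), each program instance below corresponds to one CUDA thread
of a warp, and ``programIndex`` plays the role of the lane id:

::

   // With the PTX target a gang is one warp, so programCount is the
   // warp size (32) and programIndex is the lane within the warp
   // (threadIdx.x % 32 in CUDA terms).
   void scale(uniform float a[], uniform int n, uniform float s)
   {
       foreach (i = 0 ... n)
           a[i] *= s;   // one array element per program instance
   }
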
``export`` functions are equipped with a CUDA C wrapper which schedules a
single warp, i.e. a thread-block with a total of 32 threads. In contrast to
CPU programming, this exported function, either directly or otherwise, should
utilize the ``launch`` keyword to schedule work across the GPU; there is no
other way to efficiently utilize the rich GPU compute resources.
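
A sketch of this pattern (hypothetical names; for brevity, assuming ``n`` is
divisible by ``ntasks``) might look like:

::

   task void process_chunk(uniform float a[], uniform int n)
   {
       // Each task handles one contiguous chunk of the array.
       uniform int begin = taskIndex * (n / taskCount);
       uniform int end = begin + (n / taskCount);
       foreach (i = begin ... end)
           a[i] = sqrt(a[i]);
   }

   export void process(uniform float a[], uniform int n, uniform int ntasks)
   {
       // The exported function itself occupies only a single warp; the
       // real work is spread across the GPU with launch.
       launch[ntasks] process_chunk(a, n);
   }
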
At the PTX level, the ``launch`` keyword is mapped to CUDA Dynamic Parallelism
and schedules a grid of thread-blocks, each 4 warps (128 threads) wide
[dim3(128,1,1)]. As a result, ``ispc`` currently has a tasking granularity of
4 tasks with the PTX target; this restriction will be eliminated in the
future.

When passing pointers to an ``export`` function compiled for execution on a
GPU, it is important that these pointers remain valid when accessed from the
GPU. Prior to CUDA 6.0, such a pointer had to hold an address that is only
accessible from the GPU. With the release of CUDA 6.0, it is possible to pass
a pointer to unified memory allocated with ``cudaMallocManaged``. The examples
provide rudimentary wrapper functions that call the CUDA API for managed
memory allocations, allowing programmers to avoid explicit memory copies.
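
For instance, an ``export`` function such as the following hypothetical one
only works correctly if ``data`` and ``result`` point to GPU-accessible
memory, e.g. an allocation made through ``cudaMallocManaged`` or through the
wrapper functions mentioned above:

::

   // data[] and result[] must be GPU-accessible: managed memory from
   // cudaMallocManaged() (CUDA >= 6.0) or memory obtained via the
   // ispc_malloc() wrapper; a plain malloc()ed host pointer is not
   // legal on the device.
   export void accumulate(uniform float data[], uniform int n,
                          uniform float result[])
   {
       float partial = 0;                // per-program-instance partials
       foreach (i = 0 ... n)
           partial += data[i];
       result[0] = reduce_add(partial);  // gang-wide (warp-wide) reduction
   }
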
Compiling For The NVIDIA Kepler GPU
-----------------------------------

Compilation for the NVIDIA Kepler GPU is currently a multi-step procedure.

First, we need to generate LLVM bitcode from the ``ispc`` source file:

::

   $ISPC_HOME/ispc foo.ispc --emit-llvm --target=nvptx -o foo.bc

If ``ispc`` is compiled with LLVM 3.2, the resulting bitcode can immediately
be compiled into PTX with the help of the ``ptxgen`` tool; this tool uses
``libNVVM``, which is a part of the CUDA Toolkit:

::

   $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.bc -o foo.ptx

If ``ispc`` is compiled with an LLVM version newer than 3.2, the resulting
bitcode must first be disassembled with ``llvm-dis`` from the LLVM 3.2
distribution; this "trick" is required to generate an IR compatible with
libNVVM:

::

   $LLVM32/bin/llvm-dis foo.bc -o foo.ll
   $ISPC_HOME/ptxtools/ptxgen --use_fast_math foo.ll -o foo.ptx

The resulting PTX code is ready for execution on a GPU, for example via the
CUDA Driver API. Alternatively, we also provide a simple ``ptxcc`` tool, which
compiles the PTX code into an object file:

::

   $ISPC_HOME/ptxtools/ptxcc foo.ptx -o foo_cu.o -Xnvcc="--maxrregcount=64 -Xptxas=-v"

Finally, this object file can be linked with the main program via ``nvcc``:

::
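
   # Hypothetical sketch: the exact command is not shown in this diff.
   # Dynamic parallelism requires linking the device runtime library.
   nvcc -arch=sm_35 main.cpp foo_cu.o -lcudadevrt -o foo
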
Hints
-----

A few things to observe:

- ``uniform`` arrays in function scope are statically allocated in
  ``__shared__`` memory, with all ensuing consequences. For example, if more
  memory is allocated than the amount of shared memory available per SMX, a
  linking or runtime error will occur.
- If large ``uniform`` arrays are desired, we recommend allocating them with
  ``uniform new uniform T[size]``, ideally outside the tasking function, as
  sketched below (see also ``deferred/kernels.ispc`` in the deferred shading
  example).

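A minimal sketch of this recommendation (hypothetical names; compare
``deferred/kernels.ispc`` for real usage):

::

   task void smooth_task(uniform float scratch[], uniform int n)
   {
       foreach (i = 0 ... n)
           scratch[i] = 0;   // placeholder work
   }

   export void run(uniform int n, uniform int ntasks)
   {
       // Allocated once, outside the tasking function: the array lives
       // in global memory instead of the scarce per-SMX __shared__
       // memory.
       uniform float * uniform scratch = uniform new uniform float[n];
       launch[ntasks] smooth_task(scratch, n);
       sync;                 // wait for the launched tasks
       delete scratch;
   }
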
The examples that produce executables for CPU, Xeon Phi, and Kepler GPU
demonstrate several tuning approaches that can benefit GPU performance.
``ispc`` may also generate performance warnings that, if followed, may improve
GPU application performance.

Limitations & known issues
--------------------------

Due to its experimental form, PTX code generation is known to impose several
limitations on the ``ispc`` program, which are documented in the following
list; a short sketch of the array placement rules follows the list.

- Must use the ``ispc`` tasking functionality to run efficiently on the GPU.
- Must use ``new``/``delete`` and/or ``ispc_malloc``/``ispc_free``/``ispc_memset``/``ispc_memcpy`` to allocate/free/set/copy memory that is visible to the GPU.
- ``export`` functions must have a ``void`` return type.
- ``task``/``export`` functions do not accept varying data types.
- ``new``/``delete`` currently only work with ``uniform`` data types.
- ``aossoa``/``soaaos`` is not yet supported.
- ``sizeof(varying)`` is not yet supported.
- Function pointers do not work yet (they may or may not cause a compilation failure).
- ``memset``/``memcpy``/``memmove`` are not yet supported.
- ``uniform`` arrays in global scope are mapped to global memory.
- ``varying`` arrays in global scope are not yet supported.
- ``uniform`` arrays in local scope are mapped to shared memory.
- ``varying`` arrays in local scope are mapped to local memory.
- ``const uniform/varying`` arrays are mapped to local memory.
- ``const static uniform`` arrays are mapped to constant memory.
- ``const static varying`` arrays are mapped to global memory.
- ``static`` data types in local scope are not allowed; compilation will fail.
- Best performance is obtained with libNVVM (the LLVM PTX backend can also be used, but it requires the libdevice.compute_35.10.bc that comes with libNVVM).

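A sketch of these placement rules (hypothetical declarations; the comments
name the PTX memory space each one maps to):

::

   uniform float gtab[64];          // uniform, global scope -> global memory
   const static uniform float ctab[4] = { 0, 1, 2, 3 };
                                    // const static uniform  -> constant memory
   const static float vtab[2] = { 0, 1 };
                                    // const static varying  -> global memory

   void f()
   {
       uniform float stab[64];      // uniform, local scope  -> __shared__ memory
       float ltab[8];               // varying, local scope  -> local memory
       // ... use the arrays ...
   }
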
Likely there are more... which, together with some of the above-mentioned
issues, will be fixed in due time.