Added updated task launch implementation that now tracks task groups.
Within each function that launches tasks, we now can easily track which tasks that function launched, so that the sync at the end of the function can just sync on the tasks launched by that function (not all tasks launched by all functions).

Implementing this led to a rework of the task system API that ispc generates code to call; the example task systems in examples/tasksys.cpp have been updated to conform to this API. (The updated API is also documented in the ispc user's guide.)

As part of this, "launch[n]" syntax was added to launch a number of tasks in a single launch statement, rather than requiring a loop over 'n' to launch n tasks.

This commit thus fixes issue #84 (enhancement to launch multiple tasks from a single launch statement) as well as issue #105 (recursive task launches were broken).
220
docs/ispc.txt
@@ -80,7 +80,8 @@ Contents:
   + `Program Instance Convergence`_
   + `Data Races`_
   + `Uniform Variables and Varying Control Flow`_
-  + `Task Parallelism in ISPC`_
+  + `Task Parallelism: Language Syntax`_
+  + `Task Parallelism: Runtime Requirements`_
 
 * `The ISPC Standard Library`_
 
@@ -838,8 +839,8 @@ by default. If a function is declared with a ``static`` qualifier, then it
 is only visible in the file in which it was declared.
 
 Any function that can be launched with the ``launch`` construct in ``ispc``
-must have a ``task`` qualifier; see `Task Parallelism in ISPC`_ for more
-discussion of launching tasks in ``ispc``.
+must have a ``task`` qualifier; see `Task Parallelism: Language Syntax`_
+for more discussion of launching tasks in ``ispc``.
 
 Functions that are intended to be called from C/C++ application code must
 have the ``export`` qualifier. This causes them to have regular C linkage
@@ -940,8 +941,9 @@ execution model is critical for writing efficient and correct programs in
 
 ``ispc`` supports both task parallelism to parallelize across multiple
 cores and SPMD parallelism to parallelize across the SIMD vector lanes on a
-single core. This section focuses on SPMD parallelism. See the section
-`Task Parallelism in ISPC`_ for discussion of task parallelism in ``ispc``.
+single core. This section focuses on SPMD parallelism. See the sections
+`Task Parallelism: Language Syntax`_ and `Task Parallelism: Runtime
+Requirements`_ for discussion of task parallelism in ``ispc``.
 
 The SPMD-on-SIMD Execution Model
 --------------------------------
@@ -1384,112 +1386,190 @@ be modified in the above code even if *none* of the program instances
 evaluated a true value for the test, given the ``ispc`` execution model.
 
 
-Task Parallelism in ISPC
-------------------------
+Task Parallelism: Language Syntax
+---------------------------------
 
 One option for combining task-parallelism with ``ispc`` is to just use
 regular task parallelism in the C/C++ application code (be it through
-Intel® Cilk(tm), Intel® Thread Building Blocks or another task system,
-etc.), and for tasks to use ``ispc`` for SPMD parallelism across the vector
-lanes as appropriate. Alternatively, ``ispc`` also has some support for
-launching tasks from ``ispc`` code. The approach is similar to Intel®
-Cilk's task launch feature. (See the ``examples/mandelbrot_tasks`` example
-to see it used in a non-trivial example.)
+Intel® Cilk(tm), Intel® Threading Building Blocks or another task system),
+and for tasks to use ``ispc`` for SPMD parallelism across the vector lanes
+as appropriate. Alternatively, ``ispc`` also has support for launching
+tasks from ``ispc`` code. The approach is similar to Intel® Cilk's task
+launch feature. (See the ``examples/mandelbrot_tasks`` example to see it
+used in a small example.)
 
-Any function that is launched as a task must be declared with the ``task``
-qualifier:
+First, any function that is launched as a task must be declared with the
+``task`` qualifier:
 
 ::
 
-    task void func(uniform float a[], uniform int start) {
-        ....
+    task void func(uniform float a[], uniform int index) {
+        ...
+        a[index] = ....
     }
 
 Tasks must return ``void``; a compile time error is issued if a
 non-``void`` task is defined.
 
-Given a task, one can then write code that launches tasks as follows:
+Given a task definition, there are two ways to write code that launches
+tasks, using the ``launch`` construct. First, one task can be launched at
+a time, with parameters passed to the task to help it determine what part
+of the overall computation it's responsible for:
 
 ::
 
     for (uniform int i = 0; i < 100; ++i)
-        launch < func(a, i); >
+        launch < func(a, i) >;
 
 Note the ``launch`` keyword and the brackets around the function call.
 This code launches 100 tasks, each of which presumably does some
-computation keyed off of given the value ``i``. In general, one should
-launch many more tasks than there are processors in the system to
+computation that is keyed off of the value ``i``. In general, one
+should launch many more tasks than there are processors in the system to
 ensure good load-balancing, but not so many that the overhead of scheduling
 and running tasks dominates the computation.
 
-Program execution continues asynchronously after task launch; thus, the
-function shouldn't access values being generated by the tasks without
-synchronization. A function uses a ``sync`` statement to wait for all
-launched tasks to finish:
+Alternatively, a number of tasks may be launched from a single ``launch``
+statement. We might instead write the above example with a single
+``launch`` like this:
 
 ::
 
-    for (uniform int i = 0; i < 100; ++i)
-        launch < func(a, i); >
+    launch[100] < func2(a) >;
+
+Here, an integer value (not necessarily a compile-time constant) is
+provided to the ``launch`` keyword in square brackets; this number of tasks
+will be enqueued to be run asynchronously. Within each of the tasks, two
+special built-in variables are available: ``taskIndex`` and ``taskCount``.
+The first, ``taskIndex``, ranges from zero to one less than the number of
+tasks provided to ``launch``, and ``taskCount`` equals the number of
+launched tasks. Thus, we might use ``taskIndex`` in the implementation of
+``func2`` to determine which array element to process.
+
+::
+
+    task void func2(uniform float a[]) {
+        ...
+        a[taskIndex] = ...
+    }
+
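When there are more array elements than tasks, each task typically derives a range of elements from its index rather than a single element. The following C++ sketch illustrates one way to do that arithmetic; ``TaskRange`` and ``rangeForTask`` are hypothetical names introduced here for illustration and are not part of ``ispc``. The same computation could be written inside an ``ispc`` task using ``taskIndex`` and ``taskCount``.

```cpp
#include <cassert>

// Hypothetical helper (not part of ispc): the half-open range [begin, end)
// of array elements that one task is responsible for.
struct TaskRange {
    int begin;
    int end;
};

// Split `count` elements as evenly as possible among `taskCount` tasks.
// The first (count % taskCount) tasks each take one extra element.
// Assumes taskCount > 0 and 0 <= taskIndex < taskCount.
inline TaskRange rangeForTask(int count, int taskIndex, int taskCount) {
    int base = count / taskCount;
    int extra = count % taskCount;
    int begin = taskIndex * base + (taskIndex < extra ? taskIndex : extra);
    int end = begin + base + (taskIndex < extra ? 1 : 0);
    return TaskRange{begin, end};
}
```

With 10 elements and 3 tasks, this gives the ranges [0, 4), [4, 7) and [7, 10), so every element is covered exactly once.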
+Program execution continues asynchronously after a ``launch`` statement;
+thus, a function shouldn't access values being generated by the tasks it
+has launched within the function without synchronization. If results are
+needed before function return, a function can use a ``sync`` statement to
+wait for all launched tasks to finish:
+
+::
+
+    launch[100] < func2(a) >;
     sync;
     // now safe to use computed values in a[]...
 
-Alternatively, any function that launches tasks has an implicit ``sync``
-before it returns, so that functions that call a function that launches
-tasks don't have to worry about outstanding asynchronous computation.
+Alternatively, any function that launches tasks has an automatically-added
+``sync`` statement before it returns, so that functions that call a
+function that launches tasks don't have to worry about outstanding
+asynchronous computation from that function.
 
 Inside functions with the ``task`` qualifier, two additional built-in
-variables are provided: ``threadIndex`` and ``threadCount``.
-``threadCount`` gives the total number of hardware threads that have been
-launched by the task system. ``threadIndex`` provides an index between
-zero and ``threadCount-1`` that gives a unique index that corresponds to
-the hardware thread that is executing the current task. The
-``threadIndex`` can be used for accessing data that is private to the
-current thread and thus doesn't require synchronization to access under
-parallel execution.
+variables are provided in addition to ``taskIndex`` and ``taskCount``:
+``threadIndex`` and ``threadCount``. ``threadCount`` gives the total
+number of hardware threads that have been launched by the task system.
+``threadIndex`` provides an index between zero and ``threadCount-1`` that
+gives a unique index that corresponds to the hardware thread that is
+executing the current task. The ``threadIndex`` can be used for accessing
+data that is private to the current thread and thus doesn't require
+synchronization to access under parallel execution.
+
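To make the idea of thread-private data concrete, here is a minimal C++ sketch of per-thread partial sums indexed by ``threadIndex``. The ``PerThreadSums`` type is a hypothetical illustration, not something ``ispc`` provides; it assumes that the task system never runs two tasks with the same ``threadIndex`` at the same time, as described above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration: one accumulator slot per hardware thread, so a
// task only ever writes the slot of the thread that is running it.
struct PerThreadSums {
    explicit PerThreadSums(int threadCount) : partial(threadCount, 0.0) {}

    // Called from inside a task. No lock is taken, on the assumption that
    // no two concurrently running tasks share a threadIndex.
    void add(int threadIndex, double value) { partial[threadIndex] += value; }

    // Called once all tasks have finished (for example, after a sync).
    double total() const {
        double sum = 0.0;
        for (double p : partial)
            sum += p;
        return sum;
    }

    std::vector<double> partial;
};
```

Adjacent slots in this sketch may share a cache line; padding each slot is a common refinement when the accumulators are updated frequently.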
+Task Parallelism: Runtime Requirements
+--------------------------------------
 
 If you use the task launch feature in ``ispc``, you must provide C/C++
-implementations of two functions and link them into your final executable
-file. Although these functions may be implemented in either language, they
-must have "C" linkage (i.e. their prototypes must be declared inside an
-``extern "C"`` block if they are defined in C++.)
+implementations of three specific functions that manage launching and
+synchronizing parallel tasks; these functions must be linked into your
+executable. Although these functions may be implemented in any
+language, they must have "C" linkage (i.e. their prototypes must be
+declared inside an ``extern "C"`` block if they are defined in C++.)
 
+By using user-supplied versions of these functions, ``ispc`` programs can
+easily interoperate with software systems that have existing task systems
+for managing parallelism. If you're using ``ispc`` with a system that
+isn't otherwise multi-threaded and don't want to write custom
+implementations of them, you can use the implementations of these functions
+provided in the ``examples/tasksys.cpp`` file in the ``ispc``
+distribution.
+
+If you are implementing your own task system, the remainder of this section
+discusses the requirements for these calls. You will also likely want to
+review the example task systems in ``examples/tasksys.cpp`` for reference.
+If you are not implementing your own task system, you can skip reading the
+remainder of this section.
+
+Here are the declarations of the three functions that must be provided to
+manage tasks in ``ispc``:
+
 ::
 
-    void ISPCLaunch(void *funcptr, void *data);
-    void ISPCSync();
+    void *ISPCAlloc(void **handlePtr, int64_t size, int32_t alignment);
+    void ISPCLaunch(void **handlePtr, void *f, void *data, int count);
+    void ISPCSync(void *handle);
 
-On Windows, two additional functions must be provided to dynamically
-allocate and free memory to store the arguments passed to tasks. (On OSX
-and Linux, the stack provides memory for task arguments; on Windows, the
-stack is generally not large enough to do this for large numbers of tasks.)
+All three of these functions take an opaque handle (or a pointer to an
+opaque handle) as their first parameter. This handle allows the task
+system runtime to distinguish between calls to these functions from
+different functions in ``ispc`` code. In this way, the task system
+implementation can efficiently wait for completion on just the tasks
+launched from a single function.
+
+The first time one of ``ISPCLaunch()`` or ``ISPCAlloc()`` is called in an
+``ispc`` function, the ``void *`` pointed to by the ``handlePtr`` parameter
+will be ``NULL``. The implementation of these functions should then
+initialize ``*handlePtr`` to a unique handle value of some sort. (For
+example, it might allocate a small structure to record which tasks were
+launched by the current function.) In subsequent calls to these functions
+in the emitted ``ispc`` code, the same value for ``handlePtr`` will be
+passed in, such that loading from ``*handlePtr`` will retrieve the value
+stored in the first call.
+
+At function exit (or at an explicit ``sync`` statement), a call to
+``ISPCSync()`` will be generated if ``*handlePtr`` is non-``NULL``.
+Therefore, the handle value is passed directly to ``ISPCSync()``, rather
+than a pointer to it, as in the other functions.
+
+The ``ISPCAlloc()`` function is used to allocate small blocks of memory to
+store parameters passed to tasks. It should return a pointer to memory
+with the given size and alignment. Note that there is no explicit
+``ISPCFree()`` call; instead, all memory allocated within an ``ispc``
+function should be freed when ``ISPCSync()`` is called.
+
+``ISPCLaunch()`` is called to launch one or more asynchronous
+tasks. Each ``launch`` statement in ``ispc`` code causes a call to
+``ISPCLaunch()`` to be emitted in the generated code. The three parameters
+after the handle pointer to this function are relatively straightforward;
+the ``void *f`` parameter holds a pointer to a function to call to run the
+work for this task, ``data`` holds a pointer to data to pass to this
+function, and ``count`` is the number of instances of this function to
+enqueue for asynchronous execution. (In other words, ``count`` corresponds
+to the value ``n`` in a multiple-task launch statement like ``launch[n]``.)
+
+The signature of the provided function pointer ``f`` is
 
 ::
 
-    void *ISPCMalloc(int64_t size, int32_t alignment);
-    void ISPCFree(void *ptr);
+    void (*TaskFuncPtr)(void *data, int threadIndex, int threadCount,
+                        int taskIndex, int taskCount)
 
-These are called by the task launch code generated by the ``ispc``
-compiler; the first is called to launch to launch a task and the second is
-called to wait for, respectively. (Factoring them out in this way
-allows ``ispc`` to inter-operate with the application's task system, if
-any, rather than having a separate one of its own.) To run a particular
-task, the task system should cast the function pointer to a ``void (*)(void
-*, int, int)`` function pointer and then call it with the provided ``void
-*`` data and then an index for the current hardware thread and the total
-number of hardware threads the task system has launched--in other words:
-
-::
-
-    typedef void (*TaskFuncType)(void *, int, int);
-    TaskFuncType tft = (TaskFuncType)(funcptr);
-    tft(data, threadIndex, threadCount);
-
-A number of sample task system implementations are provided with ``ispc``;
-see the files ``tasks_concrt.cpp``, ``tasks_gcd.cpp`` and
-``tasks_pthreads.cpp`` in the ``examples/mandelbrot_tasks`` directory of
-the ``ispc`` distribution.
+When this function pointer is called by one of the hardware threads managed
+by the task system, the ``data`` pointer passed to ``ISPCLaunch()`` should
+be passed to it for its first parameter; ``threadCount`` gives the total
+number of hardware threads that have been spawned to run tasks and
+``threadIndex`` should be an integer index between zero and
+``threadCount-1`` uniquely identifying the hardware thread that is running
+the task. (These values can be used to index into thread-local storage.)
+
+The value of ``taskCount`` should be the number of tasks launched in the
+``launch`` statement that caused the call to ``ISPCLaunch()`` and each of
+the calls to this function should be given a unique value of ``taskIndex``
+between zero and ``taskCount-1``, to distinguish which of the instances
+of the set of launched tasks is running.
 
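The requirements above can be sketched as a minimal serial task system in C++. This is an illustrative sketch under stated assumptions, not the implementation in ``examples/tasksys.cpp``: it runs every task immediately on the calling thread (so ``threadIndex`` is always 0 and ``threadCount`` is 1), and it assumes ``alignment`` is a positive power of two. ``TaskGroup``, ``getGroup`` and ``recordIndexTask`` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <stdint.h>
#include <stdlib.h>
#include <vector>

// Signature of the task entry point, as given in the text above.
typedef void (*TaskFuncPtr)(void *data, int threadIndex, int threadCount,
                            int taskIndex, int taskCount);

// Hypothetical bookkeeping hidden behind the opaque handle: one per
// launching function.
struct TaskGroup {
    std::vector<void *> allocations;  // raw blocks handed out by ISPCAlloc()
};

// Create the group on first use; later calls find it through *handlePtr.
static TaskGroup *getGroup(void **handlePtr) {
    if (*handlePtr == nullptr)
        *handlePtr = new TaskGroup;
    return static_cast<TaskGroup *>(*handlePtr);
}

extern "C" {

void *ISPCAlloc(void **handlePtr, int64_t size, int32_t alignment) {
    TaskGroup *group = getGroup(handlePtr);
    // Over-allocate, then round the address up to the requested alignment
    // (assumed to be a positive power of two).
    void *raw = malloc(static_cast<size_t>(size) + static_cast<size_t>(alignment));
    if (raw == nullptr)
        return nullptr;
    group->allocations.push_back(raw);
    uintptr_t mask = static_cast<uintptr_t>(alignment) - 1;
    uintptr_t aligned = (reinterpret_cast<uintptr_t>(raw) + mask) & ~mask;
    return reinterpret_cast<void *>(aligned);
}

void ISPCLaunch(void **handlePtr, void *f, void *data, int count) {
    getGroup(handlePtr);  // ensure the handle is non-NULL so ISPCSync() runs
    TaskFuncPtr func = reinterpret_cast<TaskFuncPtr>(f);
    // Serial stand-in: a single "hardware thread" runs each task in turn.
    for (int taskIndex = 0; taskIndex < count; ++taskIndex)
        func(data, /*threadIndex=*/0, /*threadCount=*/1, taskIndex, count);
}

void ISPCSync(void *handle) {
    TaskGroup *group = static_cast<TaskGroup *>(handle);
    // Nothing to wait for in the serial version; release the argument
    // memory allocated on behalf of this function.
    for (void *p : group->allocations)
        free(p);
    delete group;
}

}  // extern "C"

// Hypothetical task body used to exercise the sketch: each task records a
// value derived from its taskIndex and taskCount.
void recordIndexTask(void *data, int /*threadIndex*/, int /*threadCount*/,
                     int taskIndex, int taskCount) {
    static_cast<int *>(data)[taskIndex] = taskIndex * 10 + taskCount;
}
```

A caller mirroring the sequence described above would start with a ``NULL`` handle, pass its address to ``ISPCAlloc()`` and ``ISPCLaunch()``, and then pass the handle value itself to ``ISPCSync()`` if it is non-``NULL``. A real task system would enqueue the tasks for worker threads in ``ISPCLaunch()`` and block in ``ISPCSync()`` until the tasks recorded in the group have finished.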
 The ISPC Standard Library
 =========================
 