Added updated task launch implementation that now tracks task groups.

Within each function that launches tasks, we now can easily track which
tasks that function launched, so that the sync at the end of the function
can just sync on the tasks launched by that function (not all tasks
launched by all functions.)

Implementing this led to a rework of the task system API that ispc generates
code to call; the example task systems in examples/tasksys.cpp have been
updated to conform to this API.  (The updated API is also documented in
the ispc user's guide.)

As part of this, "launch[n]" syntax was added to launch a number of tasks
in a single launch statement, rather than requiring a loop over 'n' to
launch n tasks.

This commit thus fixes issue #84 (enhancement to launch multiple tasks from
a single launch statement) as well as issue #105 (recursive task launches
were broken).
Matt Pharr
2011-09-30 11:20:53 -07:00
parent 5ee4d7fce8
commit cb7976bbf6
43 changed files with 1309 additions and 1043 deletions

@@ -80,7 +80,8 @@ Contents:
+ `Program Instance Convergence`_
+ `Data Races`_
+ `Uniform Variables and Varying Control Flow`_
+ `Task Parallelism in ISPC`_
+ `Task Parallelism: Language Syntax`_
+ `Task Parallelism: Runtime Requirements`_
* `The ISPC Standard Library`_
@@ -838,8 +839,8 @@ by default. If a function is declared with a ``static`` qualifier, then it
is only visible in the file in which it was declared.
Any function that can be launched with the ``launch`` construct in ``ispc``
must have a ``task`` qualifier; see `Task Parallelism in ISPC`_ for more
discussion of launching tasks in ``ispc``.
must have a ``task`` qualifier; see `Task Parallelism: Language Syntax`_
for more discussion of launching tasks in ``ispc``.
Functions that are intended to be called from C/C++ application code must
have the ``export`` qualifier. This causes them to have regular C linkage
@@ -940,8 +941,9 @@ execution model is critical for writing efficient and correct programs in
``ispc`` supports both task parallelism to parallelize across multiple
cores and SPMD parallelism to parallelize across the SIMD vector lanes on a
single core. This section focuses on SPMD parallelism. See the section
`Task Parallelism in ISPC`_ for discussion of task parallelism in ``ispc``.
single core. This section focuses on SPMD parallelism. See the sections
`Task Parallelism: Language Syntax`_ and `Task Parallelism: Runtime
Requirements`_ for discussion of task parallelism in ``ispc``.
The SPMD-on-SIMD Execution Model
--------------------------------
@@ -1384,112 +1386,190 @@ be modified in the above code even if *none* of the program instances
evaluated a true value for the test, given the ``ispc`` execution model.
Task Parallelism in ISPC
------------------------
Task Parallelism: Language Syntax
---------------------------------
One option for combining task-parallelism with ``ispc`` is to just use
regular task parallelism in the C/C++ application code (be it through
Intel® Cilk(tm), Intel® Thread Building Blocks or another task system,
etc.), and for tasks to use ``ispc`` for SPMD parallelism across the vector
lanes as appropriate. Alternatively, ``ispc`` also has some support for
launching tasks from ``ispc`` code. The approach is similar to Intel®
Cilk's task launch feature. (See the ``examples/mandelbrot_tasks`` example
to see it used in a non-trivial example.)
Intel® Cilk(tm), Intel® Thread Building Blocks or another task system), and
for tasks to use ``ispc`` for SPMD parallelism across the vector lanes as
appropriate. Alternatively, ``ispc`` also has support for launching tasks
from ``ispc`` code. The approach is similar to Intel® Cilk's task launch
feature. (See the ``examples/mandelbrot_tasks`` example to see it used in
a small example.)
Any function that is launched as a task must be declared with the ``task``
qualifier:
First, any function that is launched as a task must be declared with the
``task`` qualifier:
::
task void func(uniform float a[], uniform int start) {
....
task void func(uniform float a[], uniform int index) {
...
a[index] = ....
}
Tasks must return ``void``; a compile time error is issued if a
non-``void`` task is defined.
Given a task, one can then write code that launches tasks as follows:
Given a task definition, there are two ways to write code that launches
tasks using the ``launch`` construct. First, one task can be launched at
a time, with parameters passed to the task to help it determine what part
of the overall computation it's responsible for:
::
for (uniform int i = 0; i < 100; ++i)
launch < func(a, i); >
launch < func(a, i) >;
Note the ``launch`` keyword and the brackets around the function call.
This code launches 100 tasks, each of which presumably does some
computation keyed off of given the value ``i``. In general, one should
launch many more tasks than there are processors in the system to
computation that is keyed off of the value ``i``. In general, one
should launch many more tasks than there are processors in the system to
ensure good load-balancing, but not so many that the overhead of scheduling
and running tasks dominates the computation.
Program execution continues asynchronously after task launch; thus, the
function shouldn't access values being generated by the tasks without
synchronization. A function uses a ``sync`` statement to wait for all
launched tasks to finish:
Alternatively, a number of tasks may be launched from a single ``launch``
statement. We might instead write the above example with a single
``launch`` like this:
::
for (uniform int i = 0; i < 100; ++i)
launch < func(a, i); >
launch[100] < func2(a) >;
Here, an integer value (not necessarily a compile-time constant) is
provided to the ``launch`` keyword in square brackets; this number of tasks
will be enqueued to be run asynchronously. Within each of the tasks, two
special built-in variables are available: ``taskIndex`` and ``taskCount``.
The first, ``taskIndex``, ranges from zero to one less than the number of
tasks provided to ``launch``, and ``taskCount`` equals the number of
launched tasks. Thus, we might use ``taskIndex`` in the implementation of
``func2`` to determine which array element to process.
::
task void func2(uniform float a[]) {
...
a[taskIndex] = ...
}
Program execution continues asynchronously after a ``launch`` statement;
thus, a function shouldn't access values being generated by the tasks it
has launched within the function without synchronization. If results are
needed before function return, a function can use a ``sync`` statement to
wait for all launched tasks to finish:
::
launch[100] < func2(a) >;
sync;
// now safe to use computed values in a[]...
Alternatively, any function that launches tasks has an implicit ``sync``
before it returns, so that functions that call a function that launches
tasks don't have to worry about outstanding asynchronous computation.
Alternatively, any function that launches tasks has an automatically-added
``sync`` statement before it returns, so that functions that call a
function that launches tasks don't have to worry about outstanding
asynchronous computation from that function.
Inside functions with the ``task`` qualifier, two additional built-in
variables are provided: ``threadIndex`` and ``threadCount``.
``threadCount`` gives the total number of hardware threads that have been
launched by the task system. ``threadIndex`` provides an index between
zero and ``threadCount-1`` that gives a unique index that corresponds to
the hardware thread that is executing the current task. The
``threadIndex`` can be used for accessing data that is private to the
current thread and thus doesn't require synchronization to access under
parallel execution.
variables are provided in addition to ``taskIndex`` and ``taskCount``:
``threadIndex`` and ``threadCount``. ``threadCount`` gives the total
number of hardware threads that have been launched by the task system.
``threadIndex`` provides an index between zero and ``threadCount-1`` that
gives a unique index that corresponds to the hardware thread that is
executing the current task. The ``threadIndex`` can be used for accessing
data that is private to the current thread and thus doesn't require
synchronization to access under parallel execution.
Task Parallelism: Runtime Requirements
--------------------------------------
If you use the task launch feature in ``ispc``, you must provide C/C++
implementations of two functions and link them into your final executable
file. Although these functions may be implemented in either language, they
must have "C" linkage (i.e. their prototypes must be declared inside an
``extern "C"`` block if they are defined in C++.)
implementations of three specific functions that manage launching and
synchronizing parallel tasks; these functions must be linked into your
executable. Although these functions may be implemented in any
language, they must have "C" linkage (i.e. their prototypes must be
declared inside an ``extern "C"`` block if they are defined in C++.)
By using user-supplied versions of these functions, ``ispc`` programs can
easily interoperate with software systems that have existing task systems
for managing parallelism. If you're using ``ispc`` with a system that
isn't otherwise multi-threaded and don't want to write custom
implementations of them, you can use the implementations of these functions
provided in the ``examples/tasksys.cpp`` file in the ``ispc``
distributions.
If you are implementing your own task system, the remainder of this section
discusses the requirements for these calls. You will also likely want to
review the example task systems in ``examples/tasksys.cpp`` for reference.
If you are not implementing your own task system, you can skip reading the
remainder of this section.
Here are the declarations of the three functions that must be provided to
manage tasks in ``ispc``:
::
void ISPCLaunch(void *funcptr, void *data);
void ISPCSync();
void *ISPCAlloc(void **handlePtr, int64_t size, int32_t alignment);
void ISPCLaunch(void **handlePtr, void *f, void *data, int count);
void ISPCSync(void *handle);
On Windows, two additional functions must be provided to dynamically
allocate and free memory to store the arguments passed to tasks. (On OSX
and Linux, the stack provides memory for task arguments; on Windows, the
stack is generally not large enough to do this for large numbers of tasks.)
All three of these functions take an opaque handle (or a pointer to an
opaque handle) as their first parameter. This handle allows the task
system runtime to distinguish between calls to these functions from
different functions in ``ispc`` code. In this way, the task system
implementation can efficiently wait for completion on just the tasks
launched from a single function.
The first time one of ``ISPCLaunch()`` or ``ISPCAlloc()`` is called in an
``ispc`` function, the ``void *`` pointed to by the ``handlePtr`` parameter
will be ``NULL``. The implementations of these functions should then
initialize ``*handlePtr`` to a unique handle value of some sort. (For
example, it might allocate a small structure to record which tasks were
launched by the current function.) In subsequent calls to these functions
in the emitted ``ispc`` code, the same value for ``handlePtr`` will be
passed in, such that loading from ``*handlePtr`` will retrieve the value
stored in the first call.
At function exit (or at an explicit ``sync`` statement), a call to
``ISPCSync()`` will be generated if ``*handlePtr`` is non-``NULL``.
Therefore, the handle value is passed directly to ``ISPCSync()``, rather
than a pointer to it, as in the other functions.
The ``ISPCAlloc()`` function is used to allocate small blocks of memory to
store parameters passed to tasks. It should return a pointer to memory
with the given size and alignment. Note that there is no explicit
``ISPCFree()`` call; instead, all memory allocated within an ``ispc``
function should be freed when ``ISPCSync()`` is called.
``ISPCLaunch()`` is called to launch one or more asynchronous
tasks. Each ``launch`` statement in ``ispc`` code causes a call to
``ISPCLaunch()`` to be emitted in the generated code. The three parameters
after the handle pointer to this function are relatively straightforward:
the ``void *f`` parameter holds a pointer to a function to call to run the
work for this task, ``data`` holds a pointer to data to pass to this
function, and ``count`` is the number of instances of this function to
enqueue for asynchronous execution. (In other words, ``count`` corresponds
to the value ``n`` in a multiple-task launch statement like ``launch[n]``.)
The signature of the provided function pointer ``f`` is
::
void *ISPCMalloc(int64_t size, int32_t alignment);
void ISPCFree(void *ptr);
void (*TaskFuncPtr)(void *data, int threadIndex, int threadCount,
int taskIndex, int taskCount)
These are called by the task launch code generated by the ``ispc``
compiler; the first is called to launch to launch a task and the second is
called to wait for, respectively. (Factoring them out in this way
allows ``ispc`` to inter-operate with the application's task system, if
any, rather than having a separate one of its own.) To run a particular
task, the task system should cast the function pointer to a ``void (*)(void
*, int, int)`` function pointer and then call it with the provided ``void
*`` data and then an index for the current hardware thread and the total
number of hardware threads the task system has launched--in other words:
::
typedef void (*TaskFuncType)(void *, int, int);
TaskFuncType tft = (TaskFuncType)(funcptr);
tft(data, threadIndex, threadCount);
A number of sample task system implementations are provided with ``ispc``;
see the files ``tasks_concrt.cpp``, ``tasks_gcd.cpp`` and
``tasks_pthreads.cpp`` in the ``examples/mandelbrot_tasks`` directory of
the ``ispc`` distribution.
When this function pointer is called by one of the hardware threads managed
by the task system, the ``data`` pointer passed to ``ISPCLaunch()`` should
be passed to it for its first parameter; ``threadCount`` gives the total
number of hardware threads that have been spawned to run tasks and
``threadIndex`` should be an integer index between zero and
``threadCount-1`` uniquely identifying the hardware thread that is running
the task. (These values can be used to index into thread-local storage.)
The value of ``taskCount`` should be the number of tasks launched in the
``launch`` statement that caused the call to ``ISPCLaunch()``, and each of
the calls to this function should be given a unique value of ``taskIndex``
between zero and ``taskCount-1``, to distinguish which of the instances
of the set of launched tasks is running.
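
To make these requirements concrete, here is a minimal sketch of a serial
task system that satisfies the three declarations above. It is an
illustration only, not the implementation in ``examples/tasksys.cpp``: the
``TaskGroup`` structure and the ``GetGroup()`` helper are hypothetical
names, tasks run immediately on the calling thread (so ``threadIndex`` is
always 0 and ``threadCount`` is always 1), and ``alignment`` is assumed to
be a power of two.

```cpp
#include <stdint.h>
#include <cstdlib>
#include <vector>

// Signature of the task entry point, as given in the text above.
typedef void (*TaskFuncPtr)(void *data, int threadIndex, int threadCount,
                            int taskIndex, int taskCount);

// Hypothetical per-function bookkeeping; the opaque handle points at this.
struct TaskGroup {
    std::vector<void *> allocations;  // raw blocks handed out by ISPCAlloc()
};

// Create the group on first use, when *handlePtr is still NULL.
static TaskGroup *GetGroup(void **handlePtr) {
    if (*handlePtr == nullptr)
        *handlePtr = new TaskGroup;
    return static_cast<TaskGroup *>(*handlePtr);
}

extern "C" {

void *ISPCAlloc(void **handlePtr, int64_t size, int32_t alignment) {
    TaskGroup *group = GetGroup(handlePtr);
    // Over-allocate, then round the address up to the requested alignment
    // (assumes alignment is a power of two).
    void *raw = std::malloc(static_cast<std::size_t>(size) +
                            static_cast<std::size_t>(alignment));
    if (raw == nullptr)
        return nullptr;
    group->allocations.push_back(raw);
    uintptr_t mask = static_cast<uintptr_t>(alignment) - 1;
    uintptr_t addr = reinterpret_cast<uintptr_t>(raw);
    return reinterpret_cast<void *>((addr + mask) & ~mask);
}

void ISPCLaunch(void **handlePtr, void *f, void *data, int count) {
    GetGroup(handlePtr);  // ensure the handle is non-NULL for the later sync
    TaskFuncPtr func = reinterpret_cast<TaskFuncPtr>(f);
    // Serial execution: each task gets a unique taskIndex in [0, count-1].
    for (int i = 0; i < count; ++i)
        func(data, 0, 1, i, count);
}

void ISPCSync(void *handle) {
    // All tasks already ran inside ISPCLaunch(), so only cleanup remains:
    // free everything ISPCAlloc() handed out for this function.
    TaskGroup *group = static_cast<TaskGroup *>(handle);
    for (void *p : group->allocations)
        std::free(p);
    delete group;
}

}  // extern "C"
```

A real task system would instead enqueue the ``count`` task instances for
worker threads in ``ISPCLaunch()`` and block in ``ISPCSync()`` until the
tasks recorded in that handle's group have finished, which is what lets a
``sync`` wait only on the tasks launched by the current function.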
The ISPC Standard Library
=========================