Within each function that launches tasks, we can now easily track which
tasks that function launched, so that the sync at the end of the function
only waits on the tasks launched by that function (not on all tasks
launched by all functions.)
Implementing this led to a rework of the task system API that ispc generates
code to call; the example task systems in examples/tasksys.cpp have been
updated to conform to this API. (The updated API is also documented in
the ispc user's guide.)
As part of this, "launch[n]" syntax was added to launch a number of tasks
in a single launch statement, rather than requiring a loop over 'n' to
launch n tasks.
This commit thus fixes issue #84 (enhancement to launch multiple tasks from
a single launch statement) as well as issue #105 (recursive task launches
were broken).
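As a rough sketch of how this looks in ispc code (the function and task
names here are made up for illustration, and the sketch assumes the
taskIndex built-in):

    task void do_row(uniform float image[], uniform int width) {
        // taskIndex identifies which of the launched tasks this one is
        for (uniform int x = 0; x < width; ++x)
            image[taskIndex * width + x] = 0.;
    }

    void clear_image(uniform float image[], uniform int width,
                     uniform int height) {
        // launch 'height' tasks with a single statement, no loop needed
        launch[height] do_row(image, width);
        sync;  // waits only on the tasks launched by clear_image()
    }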
Generalize the lScalarizeVector() utility routine (used in determining
when we can change gathers/scatters into vector loads/stores, respectively)
to handle vector shuffles and vector loads. This fixes issue #79, which
provided a case where a gather was being performed even though a vector
load was possible.
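To illustrate the general transformation involved (this is a made-up
example, not the case from issue #79): when the program instances in a
gang access consecutive array elements, the access can be issued as a
single vector load instead of a general gather:

    uniform float a[64];
    float x = a[programIndex];  // contiguous accesses: vector load, not gather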
Fix RNG state initialization for 16-wide targets
Fix a number of bugs in reduce_add builtin implementations for AVX.
Fix some tests that had incorrect expected results for the 16-wide
case.
For associative atomic ops (add, and, or, xor), we can take advantage of
their associativity to do just a single hardware atomic instruction,
rather than one for each of the running program instances (as the previous
implementation did.)
The basic approach is to locally compute a reduction across the active
program instances with the given op and to then issue a single HW atomic
with that reduced value as the operand. We then take the old value that
was stored in the location that is returned from the HW atomic op and
use that to compute the values to return to each of the program instances
(conceptually representing the cumulative effect of each of the preceding
program instances having performed their atomic operation.)
Issue #56.
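A minimal sketch of the approach in ispc-like code, written here with the
stdlib reduce_add()/exclusive_scan_add() routines and current pointer
syntax for clarity (the actual built-ins are implemented differently):

    int gang_atomic_add(uniform int * uniform ptr, int value) {
        uniform int total = reduce_add(value);           // reduce across active instances
        uniform int old = atomic_add_global(ptr, total); // one hardware atomic for the gang
        // each instance gets the value it would have seen if the preceding
        // active instances had issued their atomics one at a time
        return old + exclusive_scan_add(value);
    }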
Compute a "local" min/max across the active program instances and
then do a single atomic memory op.
Added a few tests to exercise global min/max atomics (which were
previously untested!)
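The core of that approach, sketched with assumed stdlib helper names and
current pointer syntax (the per-instance return values, which are
reconstructed from the old value as with the associative ops, are omitted
here):

    void gang_atomic_min(uniform int * uniform ptr, int value) {
        uniform int localMin = reduce_min(value);  // "local" min across active instances
        atomic_min_global(ptr, localMin);          // single atomic memory op
    }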
These give slightly wrong results for zero and denormalized numbers and
also don't handle Inf/NaN inputs correctly, but are much more efficient
than the full versions of these routines.
This commit adds support for swizzles like "foo.zy" (if "foo" is,
for example, a float<3> type) as rvalues. (Still need support for
swizzles as lvalues.)
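For example (with a made-up variable):

    float<3> foo = { 1., 2., 3. };
    float<2> zy = foo.zy;  // rvalue swizzle: { foo.z, foo.y }
    // foo.zy = zy;        // not yet supported: swizzle as an lvalue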
This way, we match C/C++ in that casting a bool to an int gives either the value
zero or the value one. There is a new stdlib function int sign_extend(bool)
that does sign extension for cases where that's desired.
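For example (names made up):

    int bool_demo(float x, uniform float threshold) {
        bool b = (x > threshold);
        int i = (int)b;          // 0 or 1, matching C/C++
        int m = sign_extend(b);  // 0 or -1 (all bits set), when that's wanted
        return i + m;
    }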
Add much more support for doubles and int64 types in the standard library, basically supporting everything for them that is supported for floats and int32s. (The notable exceptions being the approximate rcp() and rsqrt() functions, which don't really have sensible analogs for doubles (or at least no built-in instructions).)
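For instance (a sketch; the function names here are made up, the stdlib routines they call are the new overloads):

    uniform double length2d(uniform double x, uniform double y) {
        return sqrt(x * x + y * y);   // double overload of sqrt()
    }

    int64 clamp64(int64 v, uniform int64 lo, uniform int64 hi) {
        return min(max(v, lo), hi);   // int64 overloads of min()/max()
    }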
This checkin provides the standard set of atomic operations and a memory barrier in the ispc standard library. Both signed and unsigned 32- and 64-bit integer types are supported.
When creating function Symbols for functions that were defined in LLVM bitcode for the standard library, if any of the function parameters are integer types, create two ispc-side Symbols: one where the integer types are all signed and one where they are all unsigned. This allows us to provide, for example, both store_to_int16(reference int a[], uniform int offset, int val) and store_to_int16(reference unsigned int a[], uniform int offset, unsigned int val).
Added some additional tests to exercise the new variants of these.
Also fixed some cases where the __{load,store}_int{8,16} builtins would read from/write to memory even if the mask was all off (which could cause crashes in some cases.)
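A usage sketch of the new routines (the function name is made up, and this uses current pointer-based declarations; the checkin-era stdlib used reference parameters, as in the store_to_int16 signatures above):

    export uniform int32 next_ticket(uniform int32 * uniform counter) {
        uniform int32 old = atomic_add_global(counter, 1);  // single 32-bit atomic add
        memory_barrier();  // order the update with surrounding memory operations
        return old;
    }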
Remove support for initializing arrays and structs in declarations from
scalar values (which ispc used to smear across the array/struct
elements). Now, initializers in variable declarations must be
{ }-delimited lists, with one element per struct member or array
element, respectively.
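For example, under the new rules:

    struct Point { float x, y, z; };
    uniform float w[4] = { 0., 1., 2., 3. };  // one initializer per array element
    Point p = { 0., 1., 2. };                 // one initializer per struct member
    // uniform float w2[4] = 0.;  // no longer accepted: scalars aren't smeared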
There were a few problems with the previous implementation of the
functionality to initialize from scalars. First, the expression
would be evaluated once per value initialized, so if it had side
effects, those side effects would incorrectly be repeated. Next, for
large multidimensional arrays, the generated code would be a long
series of move instructions, rather than loops (and this in turn made
LLVM take a long time.)
While both of these problems are fixable, it's a non-trivial
amount of re-plumbing for a questionable feature anyway.
Fixes issue #50.