Makefile and vcxproj file updates.
Also modified vcxproj files so that the various files ispc generates go into $(TargetDir),
not the current directory.
Modified the ray tracer example to not have uniform short-vector types in its app-visible
datatypes (these are laid out differently on SSE vs AVX); there was an existing lurking
bug in the way this was done before.
Within each function that launches tasks, we now can easily track which
tasks that function launched, so that the sync at the end of the function
can just sync on the tasks launched by that function (not all tasks
launched by all functions).
Implementing this led to a rework of the task system API that ispc generates
code to call; the example task systems in examples/tasksys.cpp have been
updated to conform to this API. (The updated API is also documented in
the ispc user's guide.)
As part of this, "launch[n]" syntax was added to launch a number of tasks
in a single launch statement, rather than requiring a loop over 'n' to
launch n tasks.
This commit thus fixes issue #84 (enhancement to launch multiple tasks from
a single launch statement) as well as issue #105 (recursive task launches
were broken).
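
The per-function tracking described above can be sketched in C++ roughly as
follows. This is only an illustration under stated assumptions, not the API
that ispc generates calls to: TaskGroup, squaresViaLaunch, and nestedLaunchSum
are hypothetical names, and one std::thread per task stands in for a real
task system.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-function task group: it records only the tasks launched
// through it, so sync() never waits on tasks launched by other functions.
class TaskGroup {
public:
    // Analogue of "launch[count]": start `count` tasks in one call.
    // Each task receives its own index and the total task count.
    void launch(int count, const std::function<void(int, int)> &task) {
        for (int i = 0; i < count; ++i)
            threads_.emplace_back(task, i, count);
    }

    // Wait only for the tasks this group launched.
    void sync() {
        for (std::thread &t : threads_)
            t.join();
        threads_.clear();
    }

    // Mirror the implicit sync at the end of a launching function.
    ~TaskGroup() { sync(); }

private:
    std::vector<std::thread> threads_;
};

// Launches n tasks with a single launch call, then syncs on just those tasks.
std::vector<int> squaresViaLaunch(int n) {
    std::vector<int> out(n);
    TaskGroup group;
    group.launch(n, [&out](int index, int) { out[index] = index * index; });
    group.sync();
    return out;
}

// Recursive launches: every level owns its own group, so a nested sync
// waits only on the child task launched at that level.
int nestedLaunchSum(int depth) {
    if (depth == 0)
        return 0;
    int child = 0;
    TaskGroup group;
    group.launch(1, [&child, depth](int, int) { child = nestedLaunchSum(depth - 1); });
    group.sync();
    return depth + child;
}
```

Because each call owns its own group, a task that itself launches and syncs
does not wait on unrelated tasks, which is the property recursive launches
rely on.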
Fixed fp constants that were undesirably causing computation to be done in double precision.
Makes C scalar versions of the options pricing models, rt, and aobench 3-5% faster.
Makes scalar version of noise about 15% faster.
Others are unchanged.
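
As a minimal illustration of the issue (the function names are made up for
this sketch and are not taken from the examples): in C and C++ an unsuffixed
floating-point constant has type double, so multiplying a float by it promotes
the whole expression to double precision.

```cpp
#include <type_traits>

// 0.5 is a double constant: x is promoted and the multiply runs in double.
inline auto scaleWithDoubleConstant(float x) { return x * 0.5; }

// 0.5f is a float constant: the multiply stays in single precision.
inline auto scaleWithFloatConstant(float x) { return x * 0.5f; }

static_assert(std::is_same<decltype(scaleWithDoubleConstant(1.0f)), double>::value,
              "unsuffixed constant promotes the expression to double");
static_assert(std::is_same<decltype(scaleWithFloatConstant(1.0f)), float>::value,
              "f-suffixed constant keeps the expression in float");
```

Assigning the double result back to a float variable hides the promotion in
the source but still pays for the conversions and the double-precision
arithmetic.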
- Only have a single copy of all of the tasks_*.cpp sample implementations,
stored in examples/.
- Reduce dynamic storage allocation and locking in task launch code paths.
- Don't have a hard limit on the number of tasks that can be launched on
Windows (fix issue #85).
Modified this example to use reduce_equal() to see if all of the program
instances want to load the 8 sample values around the same voxel. When
this is the case, we can just do 8 scalar loads, rather than needing to
do a fully general gather. Once this check fails, it isn't done again,
since it's not likely to start succeeding in the future. This gives
a ~10% speedup with the low-res data set, and basically no performance
difference with the high-res one. (It makes sense that the lower the
resolution of the voxel sampling, the longer all of the rays will stay in
the same set of voxels.)
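
The control flow described above can be sketched in scalar C++ as follows.
This is a simplified stand-in, not the example's actual code: kLanes,
allLanesEqual, and VoxelSampler are hypothetical names, the program instances
are emulated with a fixed-size array, and a single value is loaded per lookup
instead of the 8 samples around a voxel.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed number of program instances

// Scalar stand-in for ispc's reduce_equal(): returns true if every lane
// holds the same value, and writes that value to *shared.
bool allLanesEqual(const std::array<int, kLanes> &v, int *shared) {
    for (std::size_t i = 1; i < kLanes; ++i)
        if (v[i] != v[0])
            return false;
    *shared = v[0];
    return true;
}

struct VoxelSampler {
    bool tryCoherent = true;  // cleared for good after the first failed check
    int scalarLoads = 0;      // lookups served by a single scalar load
    int gathers = 0;          // lookups that needed a per-lane gather

    std::array<float, kLanes> load(const float *data,
                                   const std::array<int, kLanes> &index) {
        std::array<float, kLanes> out{};
        int shared = 0;
        if (tryCoherent) {
            if (allLanesEqual(index, &shared)) {
                out.fill(data[shared]);  // one scalar load, broadcast to all lanes
                ++scalarLoads;
                return out;
            }
            tryCoherent = false;  // unlikely to start succeeding again
        }
        for (std::size_t i = 0; i < kLanes; ++i)
            out[i] = data[index[i]];  // fully general gather
        ++gathers;
        return out;
    }
};
```

Dropping the check after its first failure keeps the cost of the equality
test out of the incoherent case, at the price of missing any later lookups
that happen to be coherent again.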