Old run_tests.sh still lives (for now).
Changes include:
- Tests are run in parallel across all of the available CPU cores
- Option to create a statically-linked executable for each test
(rather than using the LLVM JIT). This is particularly useful
for AVX, which doesn't have good JIT support yet.
- Static executables also make it possible to test x86 codegen,
not just x86-64.
- Fixed a number of tests in failing_tests that were actually
failing because the expected function signature for tests had
changed.
Emit calls to masked_store, not masked_store_blend, when handling
masked stores emitted by the frontend.
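For context, here's a minimal ISPC-style sketch (the function and variable
names are made up) of the kind of code that makes the frontend emit a masked
store: a store performed under a varying condition, where only the active
program instances may write their lanes. A real masked store writes just
those lanes, whereas the blend-based masked_store_blend path presumably does
a read-modify-write of the full vector.

    // Hypothetical example; the store under the varying condition must only
    // write the lanes of the active program instances, so the compiler
    // lowers it to a masked store rather than a full-width store.
    export void clampNegative(uniform float vals[], uniform int count) {
        foreach (i = 0 ... count) {
            if (vals[i] < 0.0f)   // varying condition -> partial execution mask
                vals[i] = 0.0f;   // frontend emits a masked store here
        }
    }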
Fix bug in binary8to16 macro in builtins.m4
Fix bug in 16-wide version of __reduce_add_float
Remove the blend-based implementations of masked_store_blend for
AVX; just forward those calls on to the corresponding real masked
store functions.
Compute a "local" min/max across the active program instances and
then do a single atomic memory op.
Added a few tests to exercise global min/max atomics (which were
previously untested!).
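A hedged sketch of the pattern at the ISPC level, using reduce_min() and
atomic_min_global() from the standard library (the uniform-value overload
of the atomic is assumed here, and the function and variable names are
illustrative):

    // One atomic per gang rather than one per program instance.
    export void updateGlobalMin(uniform int seen[], uniform int count,
                                uniform int * uniform globalMin) {
        foreach (i = 0 ... count) {
            // "local" minimum across the currently active program instances
            uniform int localMin = reduce_min(seen[i]);
            // single atomic memory op for the whole gang
            atomic_min_global(globalMin, localMin);
        }
    }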
We had been prohibiting Windows users from providing #defines on the command
line, which has been the wrong thing to do ever since we switched to using the
clang preprocessor.
If no CPU is specified, use the host CPU type rather than defaulting to "nehalem".
Provide better feature strings to the LLVM target machinery.
-> This ensures that LLVM doesn't generate post-SSE2 instructions for the SSE2
target (fixes issue #82).
-> Slight improvements to the generated code, which now uses cmovs.
Use the LLVM popcnt intrinsic for the SSE2 target now (it generates code
that doesn't use the popcnt instruction, now that we properly tell LLVM
which instructions are and aren't available for SSE2).
- Only have a single copy of all of the tasks_*.cpp sample implementations,
stored in examples/.
- Reduce dynamic storage allocation and locking in task launch code paths.
- Remove the hard limit on the number of tasks that can be launched on
Windows (fixes issue #85); see the launch sketch after this list.
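For reference, a minimal hypothetical example of the language-level launch
construct that these code paths serve (the chunking scheme and all names here
are made up, and n is assumed to be an exact multiple of span):

    // Each chunk becomes one task; there's no longer a fixed cap on how
    // many tasks a program can launch on Windows.
    task void scaleChunk(uniform float data[], uniform int span) {
        uniform int start = taskIndex * span;
        foreach (i = start ... start + span)
            data[i] *= 2.0f;
    }

    export void scaleAll(uniform float data[], uniform int n, uniform int span) {
        launch[n / span] scaleChunk(data, span);
        sync;
    }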
Modified this example to use reduce_equal() to see if all of the program
instances want to load the 8 sample values around the same voxel. When
this is the case, we can just do 8 scalar loads, rather than needing to
do a fully general gather. Once this check fails, it isn't done again,
since it's not likely to start succeeding in the future. This gives
a ~10% speedup with the low-res data set, and basically no performance
difference with the high-res one. (It makes sense that the lower-resolution
the voxel sampling, the longer all of the rays will stay in the same set
of voxels.)
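A hedged sketch of that check, assuming the two-argument reduce_equal()
overload from the standard library that also returns the common value
(the names here are illustrative, and only one of the 8 loads is shown):

    // If all active program instances want the same sample, a uniform
    // (scalar) load suffices; otherwise fall back to a general gather
    // and stop checking, since the rays are unlikely to reconverge.
    float lookupDensity(uniform float density[], int voxelIndex,
                        uniform bool * uniform checkCoherence) {
        if (*checkCoherence) {
            uniform int uniformIndex;
            if (reduce_equal(voxelIndex, &uniformIndex))
                return density[uniformIndex];   // scalar load
            *checkCoherence = false;            // divergence: don't re-check
        }
        return density[voxelIndex];             // fully general gather
    }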
Set the Module's target appropriately when it's first created.
Compile separate 32- and 64-bit versions of the builtins-c bitcode
and load the appropriate one based on the target we're compiling
for.
Just pulling out the elements and doing a set of scalar equality tests
is the best approach for those (nearly 2x better than the rotate and
vector equality check that we use for 32-bit values).
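Roughly, the two strategies look like this at the ISPC level, assuming
extract(), rotate(), and all() from the standard library and that "those"
refers to 64-bit values (as the contrast with 32-bit suggests); the real
implementation lives in the builtins:

    // Pull out each element and compare scalars (better for 64-bit values).
    uniform bool allEqualScalar(int64 v) {
        uniform int64 first = extract(v, 0);
        for (uniform int i = 1; i < programCount; ++i)
            if (extract(v, i) != first)
                return false;
        return true;
    }

    // Rotate-and-compare vector check (used for the 32-bit case).
    uniform bool allEqualRotate(int32 v) {
        return all(v == rotate(v, 1));
    }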
These give slightly wrong results for zero and denormals and also
don't handle Inf/NaN correctly, but they are much more efficient
than the full versions of these routines.