This applies a floating-point scale factor to the image resolution;
it's useful for experiments with many-core systems where the
base image resolution may not give enough work for good load-balancing
with tasks.
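A minimal sketch of the idea in C (the base resolution and the way the scale factor is passed in are hypothetical, for illustration only):

    /* Hypothetical sketch: scale a base image resolution by a
       floating-point factor so many-core runs have enough pixels
       (and thus enough tasks) to load-balance well. */
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        int baseWidth = 768, baseHeight = 768;   /* assumed base resolution */
        float scale = (argc > 1) ? (float)atof(argv[1]) : 1.f;
        int width = (int)(baseWidth * scale);
        int height = (int)(baseHeight * scale);
        /* ... render a width x height image, decomposed into tasks ... */
        (void)width; (void)height;
        return 0;
    }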
All of the masked store calls were preventing values from being kept in
registers, which in turn led to a lot of unnecessary stack traffic.
This approach seems to give better code in the end.
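A loose scalar analogy of the problem, in hypothetical C (not the actual generated code): conditionally writing through memory at every step keeps the value out of a register, while blending into a local lets it stay in one.

    #include <stdbool.h>

    /* Pessimized pattern: each iteration conditionally writes through
       memory, so the running value keeps bouncing to and from the stack. */
    void sum_through_memory(float *result, const float *v,
                            const bool *mask, int n) {
        for (int i = 0; i < n; ++i)
            if (mask[i])
                *result += v[i];        /* load/store on every active step */
    }

    /* Friendlier pattern: keep the running value in a local (a register)
       and commit it to memory once at the end. */
    void sum_in_register(float *result, const float *v,
                         const bool *mask, int n) {
        float sum = *result;            /* lives in a register */
        for (int i = 0; i < n; ++i)
            sum += mask[i] ? v[i] : 0.f;    /* blend instead of masked store */
        *result = sum;
    }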
Fix RNG state initialization for 16-wide targets
Fix a number of bugs in reduce_add builtin implementations for AVX.
Fix some tests that had incorrect expected results for the 16-wide
case.
Fix an fp constant that was undesirably causing computation to be done in double precision.
Makes C scalar versions of the options pricing models, rt, and aobench 3-5% faster.
Makes scalar version of noise about 15% faster.
Others are unchanged.
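For example (illustrative C, not the changed source), an unsuffixed floating-point constant has type double and silently promotes the surrounding arithmetic:

    /* The multiply is done in double precision and converted back,
       because the unsuffixed constant 0.5 has type double. */
    float half_slow(float x) { return x * 0.5; }

    /* With the 'f' suffix the computation stays in single precision. */
    float half_fast(float x) { return x * 0.5f; }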
For associative atomic ops (add, and, or, xor), we can take advantage of
their associativity to do just a single hardware atomic instruction,
rather than one for each of the running program instances (as the previous
implementation did.)
The basic approach is to locally compute a reduction across the active
program instances with the given op and to then issue a single HW atomic
with that reduced value as the operand. The HW atomic op returns the old
value that was stored at the location; we use that to compute the value to
return to each program instance (conceptually representing the cumulative
effect of each of the preceding program instances having performed its
atomic operation).
Issue #56.
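A sketch of the approach in scalar C (hypothetical names and an assumed gang width; the real builtin operates on the execution mask and vector operands):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define WIDTH 8   /* assumed gang width, for illustration */

    /* Hypothetical sketch of a gang-wide atomic_add: one hardware atomic
       instead of one per active program instance. */
    void gang_atomic_add(_Atomic int32_t *ptr,
                         const int32_t value[WIDTH],
                         const bool mask[WIDTH],
                         int32_t ret[WIDTH]) {
        /* 1. Local reduction across the active program instances. */
        int32_t total = 0;
        for (int i = 0; i < WIDTH; ++i)
            if (mask[i])
                total += value[i];

        /* 2. Single hardware atomic with the reduced operand. */
        int32_t old = atomic_fetch_add(ptr, total);

        /* 3. Each active instance's return value is the old memory value
           plus the contributions of the preceding active lanes; inactive
           lanes are left untouched. */
        int32_t prefix = 0;
        for (int i = 0; i < WIDTH; ++i) {
            if (mask[i]) {
                ret[i] = old + prefix;
                prefix += value[i];
            }
        }
    }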
When replacing an 'all on' masked store with a regular store, set the
alignment to the vector element alignment, not the alignment of a whole
vector (i.e., 4 or 8 bytes, not 32 or 64).
Old run_tests.sh still lives (for now).
Changes include:
- Tests are run in parallel across all of the available CPU cores
- Option to create a statically-linked executable for each test
  (rather than using the LLVM JIT). This is particularly useful
  for AVX, which doesn't have good JIT support yet.
- Static executables also make it possible to test x86 codegen, not
  just x86-64.
- Fixed a number of tests in failing_tests that were actually failing
  because the expected function signature for tests had changed.
Emit calls to masked_store, not masked_store_blend, when handling
masked stores emitted by the frontend.
Fix bug in binary8to16 macro in builtins.m4
Fix bug in 16-wide version of __reduce_add_float
Remove blend function implementations for masked_store_blend for
AVX; just forward those on to the corresponding real masked store
functions.
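For reference, the difference between the two flavors, sketched in scalar C (hypothetical code): masked_store_blend is a read-blend-write of the whole vector, while a real masked store only writes the active lanes.

    #include <stdbool.h>

    #define WIDTH 8   /* assumed vector width, for illustration */

    /* Blend-style masked store: load the whole destination, blend in the
       new values under the mask, and write the whole vector back.
       Inactive lanes are still read and rewritten. */
    void masked_store_blend_sketch(float dst[WIDTH], const float src[WIDTH],
                                   const bool mask[WIDTH]) {
        float tmp[WIDTH];
        for (int i = 0; i < WIDTH; ++i)
            tmp[i] = mask[i] ? src[i] : dst[i];
        for (int i = 0; i < WIDTH; ++i)
            dst[i] = tmp[i];
    }

    /* True masked store: only the active lanes are written; inactive
       lanes are never touched. */
    void masked_store_sketch(float dst[WIDTH], const float src[WIDTH],
                             const bool mask[WIDTH]) {
        for (int i = 0; i < WIDTH; ++i)
            if (mask[i])
                dst[i] = src[i];
    }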
Compute a "local" min/max across the active program instances and
then do a single atomic memory op.
Added a few tests to exercise global min/max atomics (which were
previously untested!)
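A sketch of the same trick for atomic min, in C (hypothetical code; the per-instance return values are glossed over here): reduce to a single local minimum, then issue one atomic update.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define WIDTH 8   /* assumed gang width, for illustration */

    /* Hypothetical sketch of a gang-wide atomic_min: compute the local
       minimum across the active program instances, then update memory
       with a single CAS loop rather than one atomic per instance. */
    int32_t gang_atomic_min(_Atomic int32_t *ptr,
                            const int32_t value[WIDTH],
                            const bool mask[WIDTH]) {
        /* 1. Local reduction: minimum over the active lanes. */
        int32_t localMin = INT32_MAX;
        for (int i = 0; i < WIDTH; ++i)
            if (mask[i] && value[i] < localMin)
                localMin = value[i];

        /* 2. Single atomic update of the global minimum. */
        int32_t old = atomic_load(ptr);
        while (localMin < old &&
               !atomic_compare_exchange_weak(ptr, &old, localMin))
            ;   /* a failed CAS refreshes 'old' with the current value */
        return old;   /* previous value, as the atomic op would return */
    }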
We had been prohibiting Windows users from providing preprocessor #defines on the
command line, which has been the wrong thing to do ever since we switched to using
the clang preprocessor.
If no CPU is specified, use the host CPU type, not just a default of "nehalem".
Provide better features strings to the LLVM target machinery.
-> Thus ensuring that LLVM doesn't generate SSE>2 instructions for the SSE2
target (Fixes issue #82).
-> Slight code improvements from cmovs now being used in generated code.
Use the LLVM popcnt intrinsic for the SSE2 target now (it generates code
that doesn't use the popcnt instruction, now that we properly tell LLVM
which instructions are and aren't available for SSE2).
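For context (not the compiler's actual output), the fallback emitted in place of the popcnt instruction is roughly the classic SWAR bit-twiddling sequence:

    #include <stdint.h>

    /* Classic SWAR population count -- roughly the kind of sequence a
       compiler falls back to when the hardware popcnt instruction isn't
       available (e.g. on an SSE2-only target). */
    static uint32_t popcount32(uint32_t x) {
        x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums  */
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums  */
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums  */
        return (x * 0x01010101u) >> 24;                    /* total count */
    }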