We now do a single atomic hardware swap and then effectively do
swaps between the running program instances such that the result
is the same as if they had happened to run a particular ordering
of hardware swaps themselves.
Also cleaned up __atomic_swap_uniform_* built-in implementations
to not take the mask, which they weren't using anyway.
Finishes Issue #56.
Effectively, the patterns that detected when given a gather or
scatter in base+offsets form, the offsets were actually a multiple
of 2/4/8, were no longer working.
This change not only fixes this, but also expands the set of
patterns that are matched by this. For example, given offsets of
the form 4*v1 + 16*v2, it identifies a scale of 4 and new offsets
of v1 + 4*v2.
This fix makes the volume renderer run 1.19x faster, and noise 1.54x
faster.
In this case, we now emit calls to potentially-specialized functions for the
left/right shifts that take a single integer value for the shift amount. These
in turn can be matched to the corresponding intrinsics for the SSE target.
Issue #145.
Previously, when we had a switch statement with a uniform switch condition
but a 'break' statement that was under varying control flow inside the
switch, we'd promote the switch condition to be varying so that the
break would work correctly.
Now, we leave the condition as uniform and are thus able to use the
more-efficient LLVM switch instruction in this case.
Issue #156.
Specifically, don't use vector select for masked store blend there,
but emit a call to a undefined __masked_store_blend_*() functions.
Added implementations of these functions to the sse4.h and generic-16.h
in examples/instrinsics. (Calls to these will never be generated with
LLVM 3.1).
More specifically, we do a proper masked store (rather than a load-
blend-store) unless we can determine that we're accessing a stack-allocated
"varying" variable. This fixes a number of nefarious bugs where given
code like:
uniform float a[21];
foreach (i = 0 … 21)
a[i] = 0;
We'd use a blend and in turn read past the end of a[] in the last
iteration.
Also made slight changes to inlining in aobench; this keeps compiles
to ~5s, versus ~45s without them (with this change).
Fixes issue #160.
Specialize the code for the innermost loop to not do any masking
computations for the innermost dimension for the iterations where
we are certainly working on a full vector's worth of data.
This fix improves performance/code quality of "foreach" such that
it's essentially the same as the equivalent "for" loop.
Fixes issue #151.
(i.e., stop just reusing the ones for AVX1).
For now the only difference is that the int/uint min/max
functions call the new intrinsic for that. Once gather is
available from LLVM, that will go here as well.
If we call a function pointer, CallInst::getCalledFunction() returns NULL; we
need to be careful about this case when we're matching various function calls
in optimization passes.
(Fixes a crash.)
Switches with both uniform and varying "switch" expressions are
supported. Switch statements with varying expressions and very
large numbers of labels may not perform well; some issues to be
filed shortly will track opportunities for improving these.
Previously, we would return immediately if the current basic block
was NULL; however, this is the wrong thing to do in that goto labels
and case/default labels in switch statements will establish a new
current basic block even if the current one is NULL.
Allow to run from the build directory even if it is not on the path
properly decode subprocess stdout/stderr as UTF-8
Added newlines that were mistakenly left out of print->sys.stdout.wriote() conversion in previous CL
Python 3:
- fixed error message comparison
- explicit list creation
Windows:
- forward/back slash annoyances
- added stdint.h with definitions for int32_t, int64_t
- compile_error_files and run_error_files were being appended to improperly
In short, we were inadvertently trying to emit each function's
code a second time if the function had a mask check at the start
of it. StmtList::EmitCode() was covering this error up by
not emitting code if the current basic block is NULL.