We now do a single atomic hardware swap and then effectively do
swaps between the running program instances such that the result
is the same as if they had happened to run a particular ordering
of hardware swaps themselves.
Also cleaned up __atomic_swap_uniform_* built-in implementations
to not take the mask, which they weren't using anyway.
Finishes Issue #56.