This gets deferred closer to working with the scalar target, but some issues
remain. (At least part of the problem seems to be in gamma correction / final
clamping.)
This fix causes a ~0.5% performance degradation with e.g. the AVX target,
though it's not clear that a separate code path is worth having just to avoid
losing this small amount of perf.
(Partially addresses issue #167)