This was unnecessary overhead to impose on all callers; users should
add these themselves where needed.
Also added explanatory text to the documentation noting that
memory_barrier() is only needed across hardware threads/cores, not
across program instances in a gang.