- Call SSE versions for all the various scalar intrinsics
- Fix names of many (all?) AVX intrinsics; all were missing the .256 suffix, and some had additional issues.
were expecting vector-width-aligned pointers when, in fact, there is no
guarantee in general that the pointers are aligned.
Removed the aligned memory allocation routines from some of the examples;
they're no longer needed.
No performance difference on Core2/Core i5 CPUs; older CPUs may see some
regressions.
Still need to update the documentation for this change and finish reviewing
alignment issues in Load/Store instructions generated by .cpp files.
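
For illustration only, here is a minimal C sketch of the alignment issue
described above. It is not code from this change: the helper names
sum4_unaligned and demo_offset_sum are hypothetical, and the sketch assumes an
x86 target with SSE (it falls back to memcpy elsewhere). It loads four floats
from a pointer that is deliberately offset by one element from a plain malloc
result, so the pointer cannot be assumed to be 16-byte aligned. An aligned
load such as _mm_load_ps could fault on such a pointer; _mm_loadu_ps accepts
any float pointer.

```c
#include <stdlib.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || defined(__x86_64__)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

/* Hypothetical helper: sums the 4 floats starting at p.
 * p only needs the natural alignment of float, not 16-byte alignment. */
static float sum4_unaligned(const float *p) {
    float lanes[4];
#ifdef SKETCH_HAVE_SSE
    __m128 v = _mm_loadu_ps(p);   /* unaligned load: no alignment requirement */
    _mm_storeu_ps(lanes, v);      /* unaligned store into the local array */
#else
    memcpy(lanes, p, sizeof lanes);  /* portable fallback without SSE */
#endif
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Hypothetical demo: allocate with plain malloc (no aligned allocator)
 * and read starting one element past the start of the buffer. */
static float demo_offset_sum(void) {
    float *buf = malloc(8 * sizeof *buf);
    if (buf == NULL)
        return -1.0f;             /* allocation failure */
    for (int i = 0; i < 8; ++i)
        buf[i] = (float)i;        /* buf = {0, 1, 2, ..., 7} */
    float s = sum4_unaligned(buf + 1);  /* sums 1 + 2 + 3 + 4 */
    free(buf);
    return s;
}
```

With unaligned loads like this, callers can use ordinary malloc'd buffers,
which is consistent with dropping the aligned allocation routines from the
examples.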