__vec16_i64 improved with the addition of the following: __extract_element(), __insert_element(), __sub(), __mul(),
__sdiv(), __udiv(), __and(), __or(), __xor(), __shl(), __lshr(), __ashr(), __select()
Fixed a bug in the __mul(__vec16_i64, __vec16_i32) implementation
Constructors are all explicitly inlined; copy constructor and operator=() are explicitly provided
Loads and stores for __vec16_i64 and __vec16_d use aligned instructions when possible
__rotate_i32() now has a vector implementation
Added several reductions: __reduce_add_i32(), __reduce_min_i32(), __reduce_max_i32(),
__reduce_add_f(), __reduce_min_f(), __reduce_max_f()