Implement 256- and 512-bit vectors in terms of 128-bit vectors, for machines without native wide SIMD.
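As a rough illustration of the idea, a wide vector can be modeled as a pair of narrower ones, with each operation forwarded to both halves. The sketch below is only an assumption about how such a fallback might look: `V128` and `V256` are hypothetical stand-in types (plain arrays rather than real SIMD registers), not names taken from this crate.

```rust
/// Hypothetical stand-in for a native 128-bit vector of four 32-bit words.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct V128(pub [u32; 4]);

impl V128 {
    /// Word-wise wrapping addition.
    pub fn wrapping_add(self, rhs: Self) -> Self {
        V128(std::array::from_fn(|i| self.0[i].wrapping_add(rhs.0[i])))
    }
}

/// Hypothetical 256-bit vector emulated as two 128-bit halves.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct V256(pub [V128; 2]);

impl V256 {
    /// Applies the 128-bit operation to each half independently.
    pub fn wrapping_add(self, rhs: Self) -> Self {
        V256([
            self.0[0].wrapping_add(rhs.0[0]),
            self.0[1].wrapping_add(rhs.0[1]),
        ])
    }
}
```

A 512-bit vector could be built the same way, either from two `V256` values or from four `V128` values.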
Generate the full set of optimized implementations to take advantage of the most important hardware feature sets.
Generate only the basic implementations needed to operate efficiently on 128-bit vectors on this platform. For x86-64, that would mean SSE2 and AVX.
Generate only the basic implementations needed to operate efficiently on 256-bit vectors on this platform. For x86-64, that would mean SSE2, AVX, and AVX2.
Ops that depend on word size
Ops that are independent of word size and endianness
A vector composed of one or more lanes, each composed of four words.
A vector composed of multiple 128-bit lanes.
Exchange neighboring ranges of bits of the specified size.
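One common way to express this kind of exchange on a plain integer is mask-and-shift. The sketch below works on a `u128` rather than a real SIMD register. The function name `swap_adjacent` and the restriction to power-of-two sizes up to 64 bits are assumptions made for illustration, not this crate's API.

```rust
/// Swaps each pair of adjacent `bits`-wide groups in `x`.
/// Hypothetical helper; `bits` must be a power of two no larger than 64.
pub fn swap_adjacent(x: u128, bits: u32) -> u128 {
    assert!(bits.is_power_of_two() && bits <= 64);

    // Build a mask that selects the low group of every pair,
    // e.g. 0x00ff00ff... for bits == 8.
    let mut mask = (1u128 << bits) - 1;
    let mut shift = 2 * bits;
    while shift < 128 {
        mask |= mask << shift;
        shift *= 2;
    }

    // Move low groups up and high groups down, then recombine.
    ((x & mask) << bits) | ((x >> bits) & mask)
}
```

For example, `swap_adjacent(0x1122_3344, 8)` yields `0x2211_4433`, and applying the same swap twice returns the original value.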
Combine single vectors into a multi-lane vector.
A vector composed of two elements, which may be words or themselves vectors.
A vector composed of four elements, which may be words or themselves vectors.
A vector composed of four words; depending on their size, operations may cross lanes.