unsafe fn rows_to_cols(vs: &mut [[__m256i; 4]; 2])
avx2 only.
The goal of this function is to transform the state words from:
[a0, a1, a2, a3]    [ 0,  1,  2,  3]
[b0, b1, b2, b3] == [ 4,  5,  6,  7]
[c0, c1, c2, c3]    [ 8,  9, 10, 11]
[d0, d1, d2, d3]    [12, 13, 14, 15]

to:
[a0, a1, a2, a3]    [ 0,  1,  2,  3]
[b1, b2, b3, b0] == [ 5,  6,  7,  4]
[c2, c3, c0, c1]    [10, 11,  8,  9]
[d3, d0, d1, d2]    [15, 12, 13, 14]

so that we can apply add_xor_rot to the resulting columns, and have it compute the
“diagonal rounds” (as defined in RFC 7539) in parallel. In practice, this shuffle is
suboptimal: b is the last state word that add_xor_rot updates, so shuffling b
has to wait until that final result is available.
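The straightforward layout above can be modelled in scalar Rust as a minimal sketch. It uses plain `[u32; 4]` rows instead of `__m256i` vectors, and the helper names `rotate_row_left` and `naive_rows_to_cols` are hypothetical, chosen only for this illustration; they are not part of this crate.

```rust
/// Rotates a row of four state words left by `n` positions.
fn rotate_row_left(row: [u32; 4], n: usize) -> [u32; 4] {
    let mut out = row;
    out.rotate_left(n % 4);
    out
}

/// Straightforward diagonalization (hypothetical scalar model):
/// leave `a` alone and rotate `b`, `c`, `d` left by 1, 2 and 3 words.
fn naive_rows_to_cols(state: [[u32; 4]; 4]) -> [[u32; 4]; 4] {
    let [a, b, c, d] = state;
    [
        a,
        rotate_row_left(b, 1),
        rotate_row_left(c, 2),
        rotate_row_left(d, 3),
    ]
}

fn main() {
    let state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]];
    let shuffled = naive_rows_to_cols(state);
    for row in &shuffled {
        println!("{:?}", row);
    }
    // Column 0 now holds the first diagonal of the original state.
    let col0 = [shuffled[0][0], shuffled[1][0], shuffled[2][0], shuffled[3][0]];
    assert_eq!(col0, [0, 5, 10, 15]);
}
```

Note that row `b` is one of the rows being moved here, which is exactly the dependency described above.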
We can optimize this by observing that the four quarter rounds in add_xor_rot are
data-independent: they only access a single column of the state, and thus the order of
the columns does not matter. We therefore instead shuffle the other three state words,
to obtain the following equivalent layout:
[a3, a0, a1, a2]    [ 3,  0,  1,  2]
[b0, b1, b2, b3] == [ 4,  5,  6,  7]
[c1, c2, c3, c0]    [ 9, 10, 11,  8]
[d2, d3, d0, d1]    [14, 15, 12, 13]

See https://github.com/sneves/blake2-avx2/pull/4 for additional details. The earliest known occurrence of this optimization is in floodyberry’s SSE4 ChaCha code from 2014:
- https://github.com/floodyberry/chacha-opt/blob/0ab65cb99f5016633b652edebaf3691ceb4ff753/chacha_blocks_ssse3-64.S#L639-L643
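
To check the claim that the column order does not matter, the following scalar sketch runs a diagonal round with both layouts and compares the results. It is only a model under stated assumptions: rows are plain `[u32; 4]` arrays rather than `__m256i` vectors, `column_round` stands in for add_xor_rot, and all function names are hypothetical ones introduced for this illustration.

```rust
type State = [[u32; 4]; 4]; // rows a, b, c, d

/// Rotates a row of four state words left by `n` positions.
fn rot(row: [u32; 4], n: usize) -> [u32; 4] {
    let mut out = row;
    out.rotate_left(n % 4);
    out
}

/// ChaCha quarter round on one column of the state.
fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    *a = (*a).wrapping_add(*b); *d = (*d ^ *a).rotate_left(16);
    *c = (*c).wrapping_add(*d); *b = (*b ^ *c).rotate_left(12);
    *a = (*a).wrapping_add(*b); *d = (*d ^ *a).rotate_left(8);
    *c = (*c).wrapping_add(*d); *b = (*b ^ *c).rotate_left(7);
}

/// Scalar stand-in for add_xor_rot: one quarter round per column.
fn column_round(s: &mut State) {
    let [a, b, c, d] = s;
    for i in 0..4 {
        quarter_round(&mut a[i], &mut b[i], &mut c[i], &mut d[i]);
    }
}

/// Diagonal round using the straightforward layout (shuffles b, c, d).
fn diagonal_round_naive(s: State) -> State {
    let mut t = [s[0], rot(s[1], 1), rot(s[2], 2), rot(s[3], 3)];
    column_round(&mut t);
    // Undo the shuffle so the rows are back in their original order.
    [t[0], rot(t[1], 3), rot(t[2], 2), rot(t[3], 1)]
}

/// Diagonal round using the optimized layout (shuffles a, c, d; b untouched).
fn diagonal_round_optimized(s: State) -> State {
    // Rotating left by 3 is the same as rotating right by 1.
    let mut t = [rot(s[0], 3), s[1], rot(s[2], 1), rot(s[3], 2)];
    column_round(&mut t);
    [rot(t[0], 1), t[1], rot(t[2], 3), rot(t[3], 2)]
}

fn main() {
    let s: State = [
        [0x61707865, 0x3320646e, 0x79622d32, 0x6b206574],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15],
    ];
    // Both layouts feed the same four diagonals to the quarter rounds,
    // so the un-shuffled results must agree.
    assert_eq!(diagonal_round_naive(s), diagonal_round_optimized(s));
    println!("layouts agree");
}
```

In the optimized variant row `b` is never moved, so in a vectorized implementation the shuffles of the other three rows do not have to wait on the last instruction of the preceding round.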