Function rows_to_cols 

unsafe fn rows_to_cols([a, _, c, d]: &mut [__m128i; 4])
Available with target feature sse2 only.

The goal of this function is to transform the state words from:

[a0, a1, a2, a3]    [ 0,  1,  2,  3]
[b0, b1, b2, b3] == [ 4,  5,  6,  7]
[c0, c1, c2, c3]    [ 8,  9, 10, 11]
[d0, d1, d2, d3]    [12, 13, 14, 15]

to:

[a0, a1, a2, a3]    [ 0,  1,  2,  3]
[b1, b2, b3, b0] == [ 5,  6,  7,  4]
[c2, c3, c0, c1]    [10, 11,  8,  9]
[d3, d0, d1, d2]    [15, 12, 13, 14]

so that we can apply add_xor_rot to the resulting columns and have it compute the “diagonal rounds” (as defined in RFC 7539) in parallel. In practice, this shuffle is suboptimal: b is the last state word that add_xor_rot alters, so shuffling b has to wait until its result has been computed.
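The straightforward shuffle described above can be modelled with plain arrays. This is only a scalar sketch for illustration: `State` and `naive_rows_to_cols` are hypothetical names, each `[u32; 4]` stands in for one `__m128i` row, and the lane rotations stand in for whatever SIMD shuffles a real implementation would use.

```rust
/// Scalar stand-in for the four 128-bit state rows [a, b, c, d].
/// (Hypothetical type, used only for this illustration.)
type State = [[u32; 4]; 4];

/// Straightforward diagonalization: leave `a` alone and rotate
/// `b`, `c`, `d` left by 1, 2 and 3 lanes respectively.
fn naive_rows_to_cols(state: &mut State) {
    state[1].rotate_left(1); // [b0, b1, b2, b3] -> [b1, b2, b3, b0]
    state[2].rotate_left(2); // [c0, c1, c2, c3] -> [c2, c3, c0, c1]
    state[3].rotate_left(3); // [d0, d1, d2, d3] -> [d3, d0, d1, d2]
}

/// Builds the state whose words are numbered 0..=15 in row-major order,
/// matching the numbering used in the diagrams.
fn numbered_state() -> State {
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
}
```

Applying `naive_rows_to_cols` to `numbered_state()` yields the second layout shown above, with row `b` being the first row that must be touched.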

We can optimize this by observing that the four quarter rounds in add_xor_rot are data-independent: each one accesses only a single column of the state, so the order of the columns does not matter. We therefore shuffle the other three state words instead, leaving b in place, to obtain the following equivalent layout:

[a3, a0, a1, a2]    [ 3,  0,  1,  2]
[b0, b1, b2, b3] == [ 4,  5,  6,  7]
[c1, c2, c3, c0]    [ 9, 10, 11,  8]
[d2, d3, d0, d1]    [14, 15, 12, 13]
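The equivalence of the two layouts can be checked with a small scalar model. The sketch below is an illustration under assumptions, not the SSE2 code itself: `Words`, `shuffle_all_but_a`, `shuffle_all_but_b` and `sorted_columns` are hypothetical names, and each `[u32; 4]` stands in for one `__m128i` row.

```rust
/// Scalar stand-in for the four 128-bit state rows [a, b, c, d].
/// (Hypothetical type, used only for this illustration.)
type Words = [[u32; 4]; 4];

/// Straightforward layout: `a` stays, `b`, `c`, `d` rotate left by 1, 2, 3.
fn shuffle_all_but_a(s: &mut Words) {
    s[1].rotate_left(1);
    s[2].rotate_left(2);
    s[3].rotate_left(3);
}

/// Alternative layout: `b` stays, `a` rotates right by 1,
/// `c` rotates left by 1 and `d` rotates left by 2.
fn shuffle_all_but_b(s: &mut Words) {
    s[0].rotate_right(1); // [a0, a1, a2, a3] -> [a3, a0, a1, a2]
    s[2].rotate_left(1);  // [c0, c1, c2, c3] -> [c1, c2, c3, c0]
    s[3].rotate_left(2);  // [d0, d1, d2, d3] -> [d2, d3, d0, d1]
}

/// Collects the four columns and sorts them, so that two layouts compare
/// equal exactly when they contain the same columns in any order.
fn sorted_columns(s: &Words) -> Vec<[u32; 4]> {
    let mut cols: Vec<[u32; 4]> = (0..4)
        .map(|i| [s[0][i], s[1][i], s[2][i], s[3][i]])
        .collect();
    cols.sort();
    cols
}
```

Starting from the words numbered 0 to 15, both shuffles produce the same four columns; the layout that leaves `b` untouched merely has them rotated by one position, which the column-wise quarter rounds do not care about.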

See https://github.com/sneves/blake2-avx2/pull/4 for additional details. The earliest known occurrence of this optimization is in floodyberry’s SSSE3 ChaCha code from 2014:

  • https://github.com/floodyberry/chacha-opt/blob/0ab65cb99f5016633b652edebaf3691ceb4ff753/chacha_blocks_ssse3-64.S#L639-L643