Hi all,
More than a decade ago, a question came up on this forum about computing a fast transpose of a 3x3 matrix using SSE. The most sensible implementation stores the matrix internally as a 3x4 matrix (so each row holds 4 elements, aligned in a single vector). A version by ajas95, which I believe to be the fastest currently known, was presented at the time.
I am pleased to report that I have been able to come up with a version which should be faster:
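#include <xmmintrin.h>
inline void transpose(__m128& A, __m128& B, __m128& C) {
    // Input rows in A, B, and C; output in the same registers.
    // Lane 3 of each row is padding; its value on output is not meaningful.
    __m128 T0 = _mm_unpacklo_ps(A, B);                    // (a0, b0, a1, b1)
    __m128 T1 = _mm_unpackhi_ps(A, B);                    // (a2, b2, a3, b3)
    A = _mm_movelh_ps(T0, C);                             // (a0, b0, c0, c1)
    B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));   // (a1, b1, c1, c3)
    C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));   // (a2, b2, c2, c3)
}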
This should be 5 instructions instead of ajas95's 8 instructions. Of course, to get that level of performance with either version, you need to inline everything, or else you spend a lot of time moving floating-point arguments to and from the argument registers.
The other thing that is crucial is that the instructions be VEX encoded. This allows the compiler to generate three-operand instructions, like `vunpcklps`, instead of instructions like `unpcklps` that take only two operands. The two-operand forms overwrite one of their sources, so whenever a value is needed again afterwards (as `A`, `T0`, and `C` are here) the compiler has to insert extra register copies. VEX is only available with AVX and higher (usually passing e.g. `-mavx` is sufficient to get the compiler to generate VEX instructions).
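For anyone who wants to sanity-check the shuffles, here is a minimal sketch that compares the routine against a plain scalar transpose. The helpers `transpose_rows` and `transpose_matches_scalar` are hypothetical names of mine, not part of the original code, and the unaligned loads and stores are only there to keep the example simple.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_unpacklo_ps, ...)
#include <cassert>      // for assert-based checks by the caller

// The five-instruction transpose from this post, repeated so the
// example is self-contained.
inline void transpose(__m128& A, __m128& B, __m128& C) {
    __m128 T0 = _mm_unpacklo_ps(A, B);                   // (a0, b0, a1, b1)
    __m128 T1 = _mm_unpackhi_ps(A, B);                   // (a2, b2, a3, b3)
    A = _mm_movelh_ps(T0, C);                            // (a0, b0, c0, c1)
    B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));  // (a1, b1, c1, c3)
    C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));  // (a2, b2, c2, c3)
}

// Hypothetical helper: load three padded rows, transpose, store them back.
// Unaligned loads/stores are used so the arrays need no special alignment.
inline void transpose_rows(const float in[3][4], float out[3][4]) {
    __m128 A = _mm_loadu_ps(in[0]);
    __m128 B = _mm_loadu_ps(in[1]);
    __m128 C = _mm_loadu_ps(in[2]);
    transpose(A, B, C);
    _mm_storeu_ps(out[0], A);
    _mm_storeu_ps(out[1], B);
    _mm_storeu_ps(out[2], C);
}

// Hypothetical self-check: compares the 3x3 part with a scalar transpose.
// The fourth lane of each row is padding and is deliberately ignored.
inline bool transpose_matches_scalar() {
    const float in[3][4] = {{1, 2, 3, 0},
                            {4, 5, 6, 0},
                            {7, 8, 9, 0}};
    float out[3][4];
    transpose_rows(in, out);
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (out[r][c] != in[c][r]) return false;
    return true;
}
```

Note that the padding lane is not preserved: after the call it holds leftovers from row C, so treat it as don't-care. To see the instruction count for yourself, compile with optimization on and with and without `-mavx`, then compare the generated assembly for an inlined call.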
-G