Towards an Optimal VEX-SSE 3*3*float Matrix Transpose

Hi all,

More than a decade ago, a problem came up on this forum: computing a fast transpose of a 3x3 matrix using SSE. The most sensible implementation stores the matrix internally as a 3x4 matrix (so each row holds 4 elements, aligned in one vector). A version by ajas95, which I believe to be the fastest currently known, was presented at the time.

I am pleased to report that I have been able to come up with a version which should be faster:

    inline void transpose(__m128& A, __m128& B, __m128& C) {
        // Input rows in A, B, and C. Output in same.
        __m128 T0 = _mm_unpacklo_ps(A, B);                    // a0 b0 a1 b1
        __m128 T1 = _mm_unpackhi_ps(A, B);                    // a2 b2 a3 b3
        A = _mm_movelh_ps(T0, C);                             // a0 b0 c0 c1
        B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));   // a1 b1 c1 c3
        C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));   // a2 b2 c2 c3
    }

This should compile to 5 instructions instead of the 8 in ajas95's version. The fourth lane of each output row holds leftover values and should be treated as padding.

Of course, to get that level of performance with either version, you need to inline everything, or else you spend a lot of time moving floating-point arguments to and from the argument registers.

The other crucial thing is that the instructions be VEX encoded. This lets the compiler generate instructions that take three operands, like `vunpcklps`, instead of instructions like `unpcklps` that take only two. With the two-operand forms the destination doubles as a source, so the compiler has to insert extra register moves to preserve values that are still needed later. VEX is only available with AVX and higher (usually passing e.g. `-mavx` is sufficient to get the compiler to generate VEX instructions).

-G
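P.S. For anyone who wants to check the lane bookkeeping, here is a minimal sketch of a test, assuming an x86 target with `<xmmintrin.h>` available. The `transpose` function is the one from this post; the helper `transpose_matches_scalar` and the test values are my own hypothetical additions, not part of any library. It only checks correctness and says nothing about speed.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_unpacklo_ps, _mm_shuffle_ps, ...

// The 5-instruction transpose from the post.
inline void transpose(__m128& A, __m128& B, __m128& C) {
    __m128 T0 = _mm_unpacklo_ps(A, B);                    // a0 b0 a1 b1
    __m128 T1 = _mm_unpackhi_ps(A, B);                    // a2 b2 a3 b3
    A = _mm_movelh_ps(T0, C);                             // a0 b0 c0 c1
    B = _mm_shuffle_ps(T0, C, _MM_SHUFFLE(3, 1, 3, 2));   // a1 b1 c1 c3
    C = _mm_shuffle_ps(T1, C, _MM_SHUFFLE(3, 2, 1, 0));   // a2 b2 c2 c3
}

// Hypothetical helper (not from the post): runs the transpose on a small
// test matrix and compares lanes 0..2 of each row with a scalar transpose.
// Lane 3 is padding and is deliberately not checked.
inline bool transpose_matches_scalar() {
    const float in[3][4] = {
        {11.f, 12.f, 13.f, -1.f},
        {21.f, 22.f, 23.f, -1.f},
        {31.f, 32.f, 33.f, -1.f},
    };

    // Unaligned loads keep the sketch independent of array alignment.
    __m128 A = _mm_loadu_ps(in[0]);
    __m128 B = _mm_loadu_ps(in[1]);
    __m128 C = _mm_loadu_ps(in[2]);

    transpose(A, B, C);

    float out[3][4];
    _mm_storeu_ps(out[0], A);
    _mm_storeu_ps(out[1], B);
    _mm_storeu_ps(out[2], C);

    // Values are only moved, never computed, so exact comparison is safe.
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            if (out[r][c] != in[c][r]) return false;
    return true;
}
```

Calling `transpose_matches_scalar()` from your own `main` (or a unit test) should return `true`. To see the instruction count, compile with optimization and `-mavx` and look at the generated assembly for an inlined call site.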