Fast 4x4 Matrix Inverse with SSE SIMD, Explained
Before we start, think about this question: do we really need the inverse of a general matrix?
I came to this problem when writing a math library for my game engine. If you are making a game or 3D application, we use 4x4 matrix for object transform, which is a combination of 3D translation, rotation and scale. If most of your matrices are used as transform matrices, because of their special property, we have a fast route for calculating their inverse. In fact transform matrix inverse is only 50% of the cost compared to the optimized general matrix inverse. In the first half of this post we will talk about transform matrix. In the second half we will dive in and explain the SIMD version of general 4x4 matrix inverse, and we compare the performance of our method with commonly used math libraries from UE4, Eigen and DirectX Math.
The matrices used in this post are row major. This is mainly for (1) easier to demonstrate and visualize with matrix data layout; (2) easier to compare with other math library. The same matrix inverse function works for both row major and column major, because \(A^{-1}=((A^{T})^{-1})^{T}\) (inverse is the same as transpose, inverse then transpose again). However if you are a column major guy like myself, I have a full-on column major version for you in [Appendix].
Transform Matrix Inverse
The transform matrix we are talking about here is defined as following:
The first 3 component of the last row is the translation \(\vec{T}\). The top left 3x3 sub-matrix is the scaled rotation matrix, with each row as a scaled axis. We have \(\vec{X}\cdot\vec{Y}=\vec{X}\cdot\vec{Z}=\vec{Y}\cdot\vec{Z}=0\), and \(\left|\vec{X}\right|=\left|\vec{Y}\right|=\left|\vec{Z}\right|=1\). And the scale is \((a,b,c)\).
Most matrices in the game are of this form. For example, \(M\) represents a local to world transform, \(\vec{X}\), \(\vec{Y}\), \(\vec{Z}\) are your local space axes. If you have a point \(\vec{P}(P_0,P_1,P_2)\), and you want to transform it from local space to world space, you do this:
This is the same as extend \(\vec{P}\) to a 4 component vector \(\vec{P}(P_0,P_1,P_2,1)\) and multiply by matrix \(M\). Now what does inverse matrix \(M^{-1}\) mean? In this case it represents a world to local transform, so if we multiply \(\vec{P'}\) by \(M^{-1}\), we should get \(\vec{P}\) back. How do we transform the point \(\vec{P'}\) from world space back in local space? We subtract the local space origin (aka the translation \(\vec{T}\)), then dot each axes to get its local space coordinate and rescale it:
With this, we can actually directly write the form of the inverse of our matrix.
The nice thing about transform matrix \(M\) is its first 3 rows are perpendicular to each other. Its inverse form is basically transpose the 3x3 rotation matrix, and rescale it, and change translation part by doing dot product with 3 rescaled axes. It should be easy to confirm \(MM^{-1}=I\).
Now let’s bake in the scale (so \(\left|\vec{X}\right|=a\),\(\left|\vec{Y}\right|=b\),\(\left|\vec{Z}\right|=c\)) and get a more generic form.
Notice that for rescaling, we divide by squared size of scaled axis, instead of size, which is good news for implementation. And if our transform is of unit scale, which is also common in games, our target becomes even simpler.
Alright, enough theory, let’s see some code. This is our matrix definition.
__declspec(align(16)) struct Matrix4
{
public:
union
{
float m[4][4];
__m128 mVec[4];
};
};
Before we jump in intrinsics, I would like to define a bunch of shuffle/swizzle macros. Hopefully they will make it easier to read. We also make use of faster instructions for special shuffles.
(Thank you Stefan Kaps for pointing out single register shuffle instruction _mm_shuffle_epi32!)
#define MakeShuffleMask(x,y,z,w) (x | (y<<2) | (z<<4) | (w<<6))
// vec(0, 1, 2, 3) -> (vec[x], vec[y], vec[z], vec[w])
#define VecSwizzleMask(vec, mask) _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(vec), mask))
#define VecSwizzle(vec, x, y, z, w) VecSwizzleMask(vec, MakeShuffleMask(x,y,z,w))
#define VecSwizzle1(vec, x) VecSwizzleMask(vec, MakeShuffleMask(x,x,x,x))
// special swizzle
#define VecSwizzle_0022(vec) _mm_moveldup_ps(vec)
#define VecSwizzle_1133(vec) _mm_movehdup_ps(vec)
// return (vec1[x], vec1[y], vec2[z], vec2[w])
#define VecShuffle(vec1, vec2, x,y,z,w) _mm_shuffle_ps(vec1, vec2, MakeShuffleMask(x,y,z,w))
// special shuffle
#define VecShuffle_0101(vec1, vec2) _mm_movelh_ps(vec1, vec2)
#define VecShuffle_2323(vec1, vec2) _mm_movehl_ps(vec2, vec1)
Here is our first function to inverse transform matrix without scaling (always unit scale).
// Requires this matrix to be transform matrix, NoScale version requires this matrix be of scale 1
inline Matrix4 GetTransformInverseNoScale(const Matrix4& inM)
{
Matrix4 r;
// transpose 3x3, we know m03 = m13 = m23 = 0
__m128 t0 = VecShuffle_0101(inM.mVec[0], inM.mVec[1]); // 00, 01, 10, 11
__m128 t1 = VecShuffle_2323(inM.mVec[0], inM.mVec[1]); // 02, 03, 12, 13
r.mVec[0] = VecShuffle(t0, inM.mVec[2], 0,2,0,3); // 00, 10, 20, 23(=0)
r.mVec[1] = VecShuffle(t0, inM.mVec[2], 1,3,1,3); // 01, 11, 21, 23(=0)
r.mVec[2] = VecShuffle(t1, inM.mVec[2], 0,2,2,3); // 02, 12, 22, 23(=0)
// last line
r.mVec[3] = _mm_mul_ps(r.mVec[0], VecSwizzle1(inM.mVec[3], 0));
r.mVec[3] = _mm_add_ps(r.mVec[3], _mm_mul_ps(r.mVec[1], VecSwizzle1(inM.mVec[3], 1)));
r.mVec[3] = _mm_add_ps(r.mVec[3], _mm_mul_ps(r.mVec[2], VecSwizzle1(inM.mVec[3], 2)));
r.mVec[3] = _mm_sub_ps(_mm_setr_ps(0.f, 0.f, 0.f, 1.f), r.mVec[3]);
return r;
}
Very straight forward. This is the fastest function you can have, it only does a transpose and some dot products. If we add in scales, it takes a little more time to do rescaling, but still pretty fast. There is a little trick for calculating squared size, we can make use of the fact that we need to transpose 3x3 rotation part anyway, do squared size after and calculate 3 axes in one go.
#define SMALL_NUMBER (1.e-8f)
// Requires this matrix to be transform matrix
inline Matrix4 GetTransformInverse(const Matrix4& inM)
{
Matrix4 r;
// transpose 3x3, we know m03 = m13 = m23 = 0
__m128 t0 = VecShuffle_0101(inM.mVec[0], inM.mVec[1]); // 00, 01, 10, 11
__m128 t1 = VecShuffle_2323(inM.mVec[0], inM.mVec[1]); // 02, 03, 12, 13
r.mVec[0] = VecShuffle(t0, inM.mVec[2], 0,2,0,3); // 00, 10, 20, 23(=0)
r.mVec[1] = VecShuffle(t0, inM.mVec[2], 1,3,1,3); // 01, 11, 21, 23(=0)
r.mVec[2] = VecShuffle(t1, inM.mVec[2], 0,2,2,3); // 02, 12, 22, 23(=0)
// (SizeSqr(mVec[0]), SizeSqr(mVec[1]), SizeSqr(mVec[2]), 0)
__m128 sizeSqr;
sizeSqr = _mm_mul_ps(r.mVec[0], r.mVec[0]);
sizeSqr = _mm_add_ps(sizeSqr, _mm_mul_ps(r.mVec[1], r.mVec[1]));
sizeSqr = _mm_add_ps(sizeSqr, _mm_mul_ps(r.mVec[2], r.mVec[2]));
// optional test to avoid divide by 0
__m128 one = _mm_set1_ps(1.f);
// for each component, if(sizeSqr < SMALL_NUMBER) sizeSqr = 1;
__m128 rSizeSqr = _mm_blendv_ps(
_mm_div_ps(one, sizeSqr),
one,
_mm_cmplt_ps(sizeSqr, _mm_set1_ps(SMALL_NUMBER))
);
r.mVec[0] = _mm_mul_ps(r.mVec[0], rSizeSqr);
r.mVec[1] = _mm_mul_ps(r.mVec[1], rSizeSqr);
r.mVec[2] = _mm_mul_ps(r.mVec[2], rSizeSqr);
// last line
r.mVec[3] = _mm_mul_ps(r.mVec[0], VecSwizzle1(inM.mVec[3], 0));
r.mVec[3] = _mm_add_ps(r.mVec[3], _mm_mul_ps(r.mVec[1], VecSwizzle1(inM.mVec[3], 1)));
r.mVec[3] = _mm_add_ps(r.mVec[3], _mm_mul_ps(r.mVec[2], VecSwizzle1(inM.mVec[3], 2)));
r.mVec[3] = _mm_sub_ps(_mm_setr_ps(0.f, 0.f, 0.f, 1.f), r.mVec[3]);
return r;
}
Notice the top and bottom of the function is exactly the same as the NoScale version. In the middle we calculate squared size, with an optional divide-by-small-number test.
General Matrix Inverse
For general matrix, things are getting complicated. You can find most of the theory part in the following wiki pages: Invertible Matrix, Adjugate Matrix, Determinant, Trace.
We will introduce some of them as we go. The method is based on the same block matrices method Intel used for its Optimized Matrix Library.
A 4x4 matrix can be described as 4 2x2 sub matrices. The good things about 2x2 matrix are not only it is easy to calculate their inverse or determinant, but also because they can fit in one vector register, their calculation can be done very fast.
For the following derivation, we are going to assume these properties: submatrix \(A\) and \(D\) are invertible, \(C\) and \(D\) commute (\(CD=DC\)). (credits to wychmaster for pointing out the assumptions). These are rather strong assumptions, which would help us derive the final form we use for calculation. Later on in appendix we will prove that the result of derivation still holds for 4x4 matrix even if none of these assumptions is true.
Matrix block-wise inverse is given by the following:
We actually use a mix of these two forms, 2nd row from the first form, and 1st row from the second form.
This choice might not seem obvious. Take the first form for example, it seems we only need to calculate two 2x2 matrix inverse: \(A^{-1}\) and \((D-CA^{-1} B)^{-1}\), however it can be further simplified by proper derivation. Since each corresponding sub-matrix equals to each other, it doesn’t matter which form you choose to work your math on. We just select the easier row from both forms.
Before we start derivation, we need to introduce some concepts. The adjugate of matrix \(A\) is defined as \(A\operatorname{adj}(A)=\left|A\right|I\), where \(\left|A\right|\) is determinant of \(A\). For convenience, in this post we denote adjugate matrix as \(A^{\#}=\operatorname{adj}(A)\). So we can change inverse calculation to adjugate calculation by \(A^{-1}=\frac{1}{\left|A\right|}A^{\#}\). Adjugate of 2x2 matrix is:
Adjugate of 2x2 matrix has the following property: \((AB)^{\#}=B^{\#}A^{\#}\),\((A^{\#})^{\#}=A\), \((cA)^{\#}=cA^{\#}\).
For determinant of 2x2 matrix, we will use the following properties: \(\left|A\right|={A_0}{A_3}-{A_1}{A_2}\), \(\left|-A\right|=\left|A\right|\), \(\left|AB\right|=\left|A\right|\left|B\right|\), \(\left|A+B\right|=\left|A\right| + \left|B\right| + \operatorname{tr}(A^{\#}{B})\).
For trace of matrix we have \(\operatorname{tr}(AB)=\operatorname{tr}(BA)\), \(\operatorname{tr}(-A)=-\operatorname{tr}(A)\).
Finally for our block matrices \(M={\left( \begin{matrix} A & B \\ C & D \\ \end{matrix} \right)}\), the determinant is
I only listed properties needed for derivation. If you are not familiar with these concepts, or want to know more about them, take a look at the wiki pages above.
Let \(M^{-1}={\left( \begin{matrix} A & B \\ C & D \\ \end{matrix} \right)}^{-1}={\left( \begin{matrix} X & Y \\ Z & W \\ \end{matrix} \right)}\).Let’s start with the top left corner.
Similarly we can derive the bottom right corner:
Notice that we put parentheses around \(D^{\#}C\) and \(A^{\#}B\), and you will see the reason soon.
Now let’s do the top right corner, and make use of the result of top left corner \(X\):
Similarly we can derive the bottom left corner:
Here we also changed from \(B^{\#}A\) to \((A^{\#}B)^{\#}\), so we can reuse the result of \(A^{\#}B\). Putting them together:
Now it is clear what kind of calculation we need. We need 2x2 matrix multiply and multiply by adjugate: \(AB\), \(A^{\#}B\) and \(AB^{\#}\). We already know how to do adjugate, but in this case, adjugate can be combined with multiplication so we don’t waste instructions. Just expand the result and rearrange the order, for example:
Here’s the code for these three functions:
// for row major matrix
// we use __m128 to represent 2x2 matrix as A = | A0 A1 |
// | A2 A3 |
// 2x2 row major Matrix multiply A*B
__forceinline __m128 Mat2Mul(__m128 vec1, __m128 vec2)
{
return
_mm_add_ps(_mm_mul_ps( vec1, VecSwizzle(vec2, 0,3,0,3)),
_mm_mul_ps(VecSwizzle(vec1, 1,0,3,2), VecSwizzle(vec2, 2,1,2,1)));
}
// 2x2 row major Matrix adjugate multiply (A#)*B
__forceinline __m128 Mat2AdjMul(__m128 vec1, __m128 vec2)
{
return
_mm_sub_ps(_mm_mul_ps(VecSwizzle(vec1, 3,3,0,0), vec2),
_mm_mul_ps(VecSwizzle(vec1, 1,1,2,2), VecSwizzle(vec2, 2,3,0,1)));
}
// 2x2 row major Matrix multiply adjugate A*(B#)
__forceinline __m128 Mat2MulAdj(__m128 vec1, __m128 vec2)
{
return
_mm_sub_ps(_mm_mul_ps( vec1, VecSwizzle(vec2, 3,0,3,0)),
_mm_mul_ps(VecSwizzle(vec1, 1,0,3,2), VecSwizzle(vec2, 2,1,2,1)));
}
Another trick is after we calculate the 2x2 sub matrix, for example \(\left|D\right|A-B(D^{\#}C)\), the final adjugate to get \(X=(\left|D\right|A-B(D^{\#}C))^{\#}\) can be combined with storing 2x2 sub matrices to the final result 4x4 matrix. You can see this at the end of the function.
The only thing left if determinant. 2x2 determinant is easy, the problem really is the whole 4x4 matrix determinant. Remember the determinant property we give above:
This is good. We need to calculate all sub matrices determinants and matrix \(A^{\#}B\) and \(D^{\#}C\) anyway. And if you derive the trace of 2x2 matrix multiplication:
This is just a shuffle and a dot product, should be easy enough to translate into instructions.
Now we have all pieces ready, here is our function for general 4x4 matrix inverse:
// Inverse function is the same no matter column major or row major
// this version treats it as row major
inline Matrix4 GetInverse(const Matrix4& inM)
{
// use block matrix method
// A is a matrix, then i(A) or iA means inverse of A, A# (or A_ in code) means adjugate of A, |A| (or detA in code) is determinant, tr(A) is trace
// sub matrices
__m128 A = VecShuffle_0101(inM.mVec[0], inM.mVec[1]);
__m128 B = VecShuffle_2323(inM.mVec[0], inM.mVec[1]);
__m128 C = VecShuffle_0101(inM.mVec[2], inM.mVec[3]);
__m128 D = VecShuffle_2323(inM.mVec[2], inM.mVec[3]);
#if 0
__m128 detA = _mm_set1_ps(inM.m[0][0] * inM.m[1][1] - inM.m[0][1] * inM.m[1][0]);
__m128 detB = _mm_set1_ps(inM.m[0][2] * inM.m[1][3] - inM.m[0][3] * inM.m[1][2]);
__m128 detC = _mm_set1_ps(inM.m[2][0] * inM.m[3][1] - inM.m[2][1] * inM.m[3][0]);
__m128 detD = _mm_set1_ps(inM.m[2][2] * inM.m[3][3] - inM.m[2][3] * inM.m[3][2]);
#else
// determinant as (|A| |B| |C| |D|)
__m128 detSub = _mm_sub_ps(
_mm_mul_ps(VecShuffle(inM.mVec[0], inM.mVec[2], 0,2,0,2), VecShuffle(inM.mVec[1], inM.mVec[3], 1,3,1,3)),
_mm_mul_ps(VecShuffle(inM.mVec[0], inM.mVec[2], 1,3,1,3), VecShuffle(inM.mVec[1], inM.mVec[3], 0,2,0,2))
);
__m128 detA = VecSwizzle1(detSub, 0);
__m128 detB = VecSwizzle1(detSub, 1);
__m128 detC = VecSwizzle1(detSub, 2);
__m128 detD = VecSwizzle1(detSub, 3);
#endif
// let iM = 1/|M| * | X Y |
// | Z W |
// D#C
__m128 D_C = Mat2AdjMul(D, C);
// A#B
__m128 A_B = Mat2AdjMul(A, B);
// X# = |D|A - B(D#C)
__m128 X_ = _mm_sub_ps(_mm_mul_ps(detD, A), Mat2Mul(B, D_C));
// W# = |A|D - C(A#B)
__m128 W_ = _mm_sub_ps(_mm_mul_ps(detA, D), Mat2Mul(C, A_B));
// |M| = |A|*|D| + ... (continue later)
__m128 detM = _mm_mul_ps(detA, detD);
// Y# = |B|C - D(A#B)#
__m128 Y_ = _mm_sub_ps(_mm_mul_ps(detB, C), Mat2MulAdj(D, A_B));
// Z# = |C|B - A(D#C)#
__m128 Z_ = _mm_sub_ps(_mm_mul_ps(detC, B), Mat2MulAdj(A, D_C));
// |M| = |A|*|D| + |B|*|C| ... (continue later)
detM = _mm_add_ps(detM, _mm_mul_ps(detB, detC));
// tr((A#B)(D#C))
__m128 tr = _mm_mul_ps(A_B, VecSwizzle(D_C, 0,2,1,3));
tr = _mm_hadd_ps(tr, tr);
tr = _mm_hadd_ps(tr, tr);
// |M| = |A|*|D| + |B|*|C| - tr((A#B)(D#C)
detM = _mm_sub_ps(detM, tr);
const __m128 adjSignMask = _mm_setr_ps(1.f, -1.f, -1.f, 1.f);
// (1/|M|, -1/|M|, -1/|M|, 1/|M|)
__m128 rDetM = _mm_div_ps(adjSignMask, detM);
X_ = _mm_mul_ps(X_, rDetM);
Y_ = _mm_mul_ps(Y_, rDetM);
Z_ = _mm_mul_ps(Z_, rDetM);
W_ = _mm_mul_ps(W_, rDetM);
Matrix4 r;
// apply adjugate and store, here we combine adjugate shuffle and store shuffle
r.mVec[0] = VecShuffle(X_, Y_, 3,1,3,1);
r.mVec[1] = VecShuffle(X_, Y_, 2,0,2,0);
r.mVec[2] = VecShuffle(Z_, W_, 3,1,3,1);
r.mVec[3] = VecShuffle(Z_, W_, 2,0,2,0);
return r;
}
As side products of this function, it also gives you optimized version of calculating determinant and adjugate of 4x4 matrix. There are two things I want to talk a little bit more.
When we calculate the determinants of sub matrices, I do have a version to calculate 4 determinants in one go. However calculate them separately and use _mm_set1_ps to load into vector unit is proven to be faster on my CPU. My guess is since we need them to be separated anyway, even if I can calculate them together I need to use 4 shuffles to separate them out, which is not worth the effort, but I’m not sure. You should test performance in both versions.
(Edit: in my new CPU (Coffee Lake) the second method (4 determinants in one go) is 20% faster than the first method)
Also when calculating trace, I’m using two _mm_hadd_ps to sum up 4 components and have the result in all 4 components. There are a lot of ways to do the same thing. From what I tested, they yield similar performance, so I choose the one with less instructions. Again it could be different on different target platforms, and you should test them.
So how our functions perform? The following measurement and comparison is done in August 2017. We use __rdtsc to count cycles. For each test we loop 10 million times and measure the average cycle counts. We do 5 groups of tests and here is the result on Intel Haswell:
The first three columns are our 3 versions of functions. The SIMD version of general 4x4 matrix inverse only cost less than half (44%) of the float version. And if you know the matrix is a transform matrix, it would cost less than a quarter (21%) of the float version. The more information you have as a programmer, the less work the machine need to do.
Think about that question again, do we really need to inverse a matrix. If we are using transform matrix and all we do is inverse transform a point or vector temporarily (so no need to save inverse matrix for other calculations), write an inverse transform function, which is faster than get inverse matrix and then transform. Hopefully this will help you choose which function to write or use, and how to make it fast.
Appendix 1
We have one more thing to do, prove that this method is valid regardless of our assumptions before derivation. Let’s look back what we assumed:
Assume these properties: submatrix \(A\) and \(D\) are invertible, \(C\) and \(D\) commute (\(CD=DC\)).
Consider this example:
Apparently none of our assumptions holds, but \(M'\) is invertible (its inverse is itself \((M')^{-1}=M'\)). If you use the above method to calculate the inverse of \(M'\), surprisingly you do get the correct result. Now we need to prove our calculation holds for any invertible 4x4 matrix, with no above assumptions. Here’s our final form for calculation:
Remember the definition of adjugate matrix \(M^{-1}=\frac{1}{\left|M\right|}M^{\#}\), here we are going to prove
Starting from proving the top left submatrix \(X=(\left|D\right|A-B(D^{\#}C))^{\#}\),
The adjugate matrix of \(M\) is the transpose of the cofactor matrix \(C\) of \(M\) (\(M^{\#}=C^{T}\)), and the cofactor matrix \(C=((-1)^{i+j} M_{ij})\) where \(M_{ij}\) is the determinant of the (i,j)-minor of \(M\). Thus \(M^{\#}= ((-1)^{j+i}M_{ji})\). Remember the TRANSPOSE here! For details visit Adjugate Matrix wiki page.
Remember
We have
Similarly we can prove other submatrices \(Y\),\(Z\),\(W\).
Now we need to prove the determinant form
Again we start from the left hand side
Remember
We have
We have proved the derivation result holds for any invertible 4x4 matrix. Why this is the case? I think it is due to special properties of 2x2 matrices. With that said I believe there must be a more elegant way to derive the same result, if you know such a way, please leave a comment below!
Appendix 2
This is column major area. The first two functions for transform matrix is exactly the same in column major. Here is the general matrix inverse and helper functions:
// for column major matrix
// we use __m128 to represent 2x2 matrix as A = | A0 A2 |
// | A1 A3 |
// 2x2 column major Matrix multiply A*B
__forceinline __m128 Mat2Mul(__m128 vec1, __m128 vec2)
{
return
_mm_add_ps(_mm_mul_ps( vec1, VecSwizzle(vec2, 0,0,3,3)),
_mm_mul_ps(VecSwizzle(vec1, 2,3,0,1), VecSwizzle(vec2, 1,1,2,2)));
}
// 2x2 column major Matrix adjugate multiply (A#)*B
__forceinline __m128 Mat2AdjMul(__m128 vec1, __m128 vec2)
{
return
_mm_sub_ps(_mm_mul_ps(VecSwizzle(vec1, 3,0,3,0), vec2),
_mm_mul_ps(VecSwizzle(vec1, 2,1,2,1), VecSwizzle(vec2, 1,0,3,2)));
}
// 2x2 column major Matrix multiply adjugate A*(B#)
__forceinline __m128 Mat2MulAdj(__m128 vec1, __m128 vec2)
{
return
_mm_sub_ps(_mm_mul_ps( vec1, VecSwizzle(vec2, 3,3,0,0)),
_mm_mul_ps(VecSwizzle(vec1, 2,3,0,1), VecSwizzle(vec2, 1,1,2,2)));
}
// Inverse function is the same no matter column major or row major
// this version treats it as column major
inline Matrix4 GetInverse(const Matrix4& inM)
{
// use block matrix method
// A is a matrix, then i(A) or iA means inverse of A, A# (or A_ in code) means adjugate of A, |A| (or detA in code) is determinant, tr(A) is trace
// sub matrices
__m128 A = VecShuffle_0101(inM.mVec[0], inM.mVec[1]);
__m128 C = VecShuffle_2323(inM.mVec[0], inM.mVec[1]);
__m128 B = VecShuffle_0101(inM.mVec[2], inM.mVec[3]);
__m128 D = VecShuffle_2323(inM.mVec[2], inM.mVec[3]);
#if 0
__m128 detA = _mm_set1_ps(inM.m[0][0] * inM.m[1][1] - inM.m[0][1] * inM.m[1][0]);
__m128 detC = _mm_set1_ps(inM.m[0][2] * inM.m[1][3] - inM.m[0][3] * inM.m[1][2]);
__m128 detB = _mm_set1_ps(inM.m[2][0] * inM.m[3][1] - inM.m[2][1] * inM.m[3][0]);
__m128 detD = _mm_set1_ps(inM.m[2][2] * inM.m[3][3] - inM.m[2][3] * inM.m[3][2]);
#else
// determinant as (|A| |C| |B| |D|)
__m128 detSub = _mm_sub_ps(
_mm_mul_ps(VecShuffle(inM.mVec[0], inM.mVec[2], 0,2,0,2), VecShuffle(inM.mVec[1], inM.mVec[3], 1,3,1,3)),
_mm_mul_ps(VecShuffle(inM.mVec[0], inM.mVec[2], 1,3,1,3), VecShuffle(inM.mVec[1], inM.mVec[3], 0,2,0,2))
);
__m128 detA = VecSwizzle1(detSub, 0);
__m128 detC = VecSwizzle1(detSub, 1);
__m128 detB = VecSwizzle1(detSub, 2);
__m128 detD = VecSwizzle1(detSub, 3);
#endif
// let iM = 1/|M| * | X Y |
// | Z W |
// D#C
__m128 D_C = Mat2AdjMul(D, C);
// A#B
__m128 A_B = Mat2AdjMul(A, B);
// X# = |D|A - B(D#C)
__m128 X_ = _mm_sub_ps(_mm_mul_ps(detD, A), Mat2Mul(B, D_C));
// W# = |A|D - C(A#B)
__m128 W_ = _mm_sub_ps(_mm_mul_ps(detA, D), Mat2Mul(C, A_B));
// |M| = |A|*|D| + ... (continue later)
__m128 detM = _mm_mul_ps(detA, detD);
// Y# = |B|C - D(A#B)#
__m128 Y_ = _mm_sub_ps(_mm_mul_ps(detB, C), Mat2MulAdj(D, A_B));
// Z# = |C|B - A(D#C)#
__m128 Z_ = _mm_sub_ps(_mm_mul_ps(detC, B), Mat2MulAdj(A, D_C));
// |M| = |A|*|D| + |B|*|C| ... (continue later)
detM = _mm_add_ps(detM, _mm_mul_ps(detB, detC));
// tr((A#B)(D#C))
__m128 tr = _mm_mul_ps(A_B, VecSwizzle(D_C, 0,2,1,3));
tr = _mm_hadd_ps(tr, tr);
tr = _mm_hadd_ps(tr, tr);
// |M| = |A|*|D| + |B|*|C| - tr((A#B)(D#C))
detM = _mm_sub_ps(detM, tr);
const __m128 adjSignMask = _mm_setr_ps(1.f, -1.f, -1.f, 1.f));
// (1/|M|, -1/|M|, -1/|M|, 1/|M|)
__m128 rDetM = _mm_div_ps(adjSignMask, detM);
X_ = _mm_mul_ps(X_, rDetM);
Y_ = _mm_mul_ps(Y_, rDetM);
Z_ = _mm_mul_ps(Z_, rDetM);
W_ = _mm_mul_ps(W_, rDetM);
Matrix4 r;
// apply adjugate and store, here we combine adjugate shuffle and store shuffle
r.mVec[0] = VecShuffle(X_, Z_, 3,1,3,1);
r.mVec[1] = VecShuffle(X_, Z_, 2,0,2,0);
r.mVec[2] = VecShuffle(Y_, W_, 3,1,3,1);
r.mVec[3] = VecShuffle(Y_, W_, 2,0,2,0);
return r;
}