[转帖]矩阵相乘的快速算法
<P>算法介绍<BR>矩阵相乘在进行3D变换的时候是经常用到的。在应用中常用矩阵相乘的定义算法对其进行<BR>计算。这个算法用到了大量的循环和相乘运算,这使得算法效率不高。而矩阵相乘的计算<BR>效率很大程度上的影响了整个程序的运行速度,所以对矩阵相乘算法进行一些改进是必要<BR>的。<BR><BR>这里要介绍的矩阵算法称为斯特拉森方法,它是由v.斯特拉森在1969年提出的一个方法。<BR><BR><BR>我们先讨论二阶矩阵的计算方法。<BR><BR>对于二阶矩阵<BR><BR>a11 a12 b11 b12<BR>A = a21 a22 B = b21 b22 <BR><BR>先计算下面7个量(1)<BR><BR>x1 = (a11 + a22) * (b11 + b22);<BR>x2 = (a21 + a22) * b11;<BR>x3 = a11 * (b12 - b22);<BR>x4 = a22 * (b21 - b11);<BR>x5 = (a11 + a12) * b22;<BR>x6 = (a21 - a11) * (b11 + b12);<BR>x7 = (a12 - a22) * (b21 + b22);<BR><BR>再设C = AB。根据矩阵相乘的规则,C的各元素为(2)<BR><BR>c11 = a11 * b11 + a12 * b21<BR>c12 = a11 * b12 + a12 * b22<BR>c21 = a21 * b11 + a22 * b21<BR>c22 = a21 * b12 + a22 * b22<BR><BR>比较(1)(2),C的各元素可以表示为(3)<BR><BR>c11 = x1 + x4 - x5 + x7<BR>c12 = x3 + x5<BR>c21 = x2 + x4<BR>c22 = x1 + x3 - x2 + x6<BR><BR>根据以上的方法,我们就可以计算4阶矩阵了,先将4阶矩阵A和B划分成四块2阶矩阵,分<BR>别利用公式计算它们的乘积,再使用(1)(3)来计算出最后结果。<BR><BR>ma11 ma12 mb11 mb12<BR>A4 = ma21 ma22 B4 = mb21 mb22 <BR><BR>其中<BR><BR>a11 a12 a13 a14 b11 b12 b13 b14<BR>ma11 = a21 a22 ma12 = a23 a24 mb11 = b21 b22 mb12 = b23 b24 <BR><BR>a31 a32 a33 a34 b31 b32 b33 b34<BR>ma21 = a41 a42 ma22 = a43 a44 mb21 = b41 b42 mb22 = b43 b44 <BR><BR>实现<BR>// 计算2X2矩阵<BR>void Multiply2X2(float& fOut_11, float& fOut_12, float& fOut_21, float&<BR>fOut_22,<BR>float f1_11, float f1_12, float f1_21, float f1_22,<BR>float f2_11, float f2_12, float f2_21, float f2_22)<BR>{<BR>const float x1((f1_11 + f1_22) * (f2_11 + f2_22));<BR>const float x2((f1_21 + f1_22) * f2_11);<BR>const float x3(f1_11 * (f2_12 - f2_22));<BR>const float x4(f1_22 * (f2_21 - f2_11));<BR>const float x5((f1_11 + f1_12) * f2_22);<BR>const float x6((f1_21 - f1_11) * (f2_11 + f2_12));<BR>const float x7((f1_12 - f1_22) * (f2_21 + f2_22));<BR><BR>fOut_11 = x1 + x4 - x5 + x7;<BR>fOut_12 = x3 + x5;<BR>fOut_21 = x2 + x4;<BR>fOut_22 = x1 - x2 + x3 + x6;<BR>}<BR><BR>// 计算4X4矩阵<BR>void Multiply(CLAYMATRIX& mOut, const CLAYMATRIX& m1, const CLAYMATRIX& m2)<BR>{<BR>float fTmp;<BR><BR>// (ma11 + ma22) * (mb11 + mb22)<BR>Multiply2X2(fTmp, fTmp, fTmp, fTmp,<BR>m1._11 + m1._33, m1._12 + m1._34, m1._21 + m1._43, m1._22 + m1._44,<BR>m2._11 + m2._33, m2._12 + m2._34, m2._21 + m2._43, m2._22 + m2._44);<BR><BR>// (ma21 + ma22) * mb11<BR>Multiply2X2(fTmp, fTmp, fTmp, fTmp,<BR>m1._31 + m1._33, m1._32 + m1._34, m1._41 + m1._43, m1._42 + m1._44,<BR>m2._11, m2._12, m2._21, m2._22);<BR><BR>// ma11 * (mb12 - mb22)<BR>Multiply2X2(fTmp, fTmp, fTmp, fTmp,<BR>m1._11, m1._12, m1._21, m1._22,<BR>m2._13 - m2._33, m2._14 - m2._34, m2._23 - m2._43, m2._24 - m2._44);<BR><BR>// ma22 * (mb21 - mb11)<BR>Multiply2X2(fTmp, fTmp, fTmp, fTmp,<BR>m1._33, m1._34, m1._43, m1._44,<BR>m2._31 - m2._11, m2._32 - m2._12, m2._41 - m2._21, m2._42 - m2._22);<BR><BR>// (ma11 + ma12) * mb22<BR>Multiply2X2(fTmp, fTmp, fTmp, fTmp,<BR>m1._11 + m1._13, m1._12 + m1._14, m1._21 + m1._23, m1._22 + m1._24,<BR>m2._33, m2._34, m2._43, m2._44);<BR><BR>// (ma21 - ma11) * (mb11 + mb12)<BR>Multiply2X2(fTmp, fTmp, fTmp, fTmp,<BR>m1._31 - m1._11, m1._32 - m1._12, m1._41 - m1._21, m1._42 - m1._22,<BR>m2._11 + m2._13, m2._12 + m2._14, m2._21 + m2._23, m2._22 + m2._24);<BR><BR>// (ma12 - ma22) * (mb21 + mb22)<BR>Multiply2X2(fTmp, fTmp, fTmp, fTmp,<BR>m1._13 - m1._33, m1._14 - m1._34, m1._23 - m1._43, m1._24 - m1._44,<BR>m2._31 + m2._33, m2._32 + m2._34, m2._41 + m2._43, m2._42 + m2._44);<BR><BR>// 第一块<BR>mOut._11 = fTmp + fTmp - fTmp + fTmp;<BR>mOut._12 = fTmp + fTmp - fTmp + fTmp;<BR>mOut._21 = fTmp + fTmp - fTmp + fTmp;<BR>mOut._22 = fTmp + fTmp - fTmp + fTmp;<BR><BR>// 第二块<BR>mOut._13 = fTmp + fTmp;<BR>mOut._14 = fTmp + fTmp;<BR>mOut._23 = fTmp + fTmp;<BR>mOut._24 = fTmp + fTmp;<BR><BR>// 第三块<BR>mOut._31 = fTmp + fTmp;<BR>mOut._32 = fTmp + fTmp;<BR>mOut._41 = fTmp + fTmp;<BR>mOut._42 = fTmp + fTmp;<BR><BR>// 第四块<BR>mOut._33 = fTmp - fTmp + fTmp + fTmp;<BR>mOut._34 = fTmp - fTmp + fTmp + fTmp;<BR>mOut._43 = fTmp - fTmp + fTmp + fTmp;<BR>mOut._44 = fTmp - fTmp + fTmp + fTmp;<BR>}<BR><BR>比较<BR>在标准的定义算法中我们需要进行n * n * n次乘法运算,新算法中我们需要进行7log2n<BR>次乘法,对于最常用的4阶矩阵: 原算法 新算法 <BR>加法次数 48 72(48次加法,24次减法) <BR>乘法次数 64 49 <BR>需要额外空间 16 * sizeof(float) 28 * sizeof(float) <BR><BR>新算法要比原算法多了24次减法运算,少了15次乘法。但因为浮点乘法的运算速度要远远<BR>慢于加/减法运算,所以新算法的整体速度有所提<BR><BR>来在矩阵论坛</P> 这个倒不错,BLAS是不是用的这个算法?
页:
[1]