Problem:
I converted MMX code to the corresponding SSE2 code and expected a speedup of roughly 1.5x-2x, but both versions take exactly the same time. Why is that?
Scenario:
I am learning the SIMD instruction sets and comparing their performance. As a test I took an array operation, Z = X^2 + Y^2,
where X and Y are large one-dimensional arrays of type "char". The values of X and Y are restricted to be less than 10, so every element of Z is below 255 and fits in one byte (no need to worry about overflow).
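For reference, the operation can be written as a plain scalar loop. This is only a minimal sketch: the function name `sum_squares_scalar` and the use of `unsigned char` are illustrative assumptions, not the code that was actually timed.

```cpp
#include <cstddef>

// Scalar baseline: z[i] = x[i]^2 + y[i]^2 for each element.
// Assumes every x[i] and y[i] is below 10, so the result (at most 162)
// fits in one byte. The name and signature are illustrative only.
void sum_squares_scalar(const unsigned char* x, const unsigned char* y,
                        unsigned char* z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        z[i] = static_cast<unsigned char>(x[i] * x[i] + y[i] * y[i]);
    }
}
```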
I wrote the C++ code first and timed it. Then I wrote the corresponding assembly code (~3x speedup over C++), and then an MMX version (~12x vs. C++). Finally I converted the MMX code to SSE2, and it runs at exactly the same speed as the MMX code. In theory, since SSE2 processes 16 bytes per register instead of 8, I expected a speedup of ~2x over MMX.
To convert from MMX to SSE2, I replaced all mmx registers with xmm registers, changed a couple of move instructions, and so on.
My MMX and SSE2 code is pasted here: https://gist.github.com/abidrahmank/5281486
(I don't want to paste it all here.)
These functions are called from the main.cpp file, where the arrays are passed as arguments.
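To show the shape of the 16-bytes-at-a-time computation without the full assembly listing, here is a rough sketch using SSE2 intrinsics. It is not the code from the gist: the function name `sum_squares_sse2` is hypothetical, and the widen-multiply-pack approach is just one assumed way to do it, since SSE2 has no 8-bit multiply.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Illustrative sketch only: computes z[i] = x[i]^2 + y[i]^2, 16 bytes per step.
// Assumes inputs are below 10 so each result fits in one byte.
void sum_squares_sse2(const unsigned char* x, const unsigned char* y,
                      unsigned char* z, std::size_t n) {
    const __m128i zero = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        // Unaligned loads (MOVDQU); _mm_load_si128 would be the MOVDQA form.
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));

        // Zero-extend the bytes to 16-bit words (low and high halves).
        __m128i xlo = _mm_unpacklo_epi8(vx, zero);
        __m128i xhi = _mm_unpackhi_epi8(vx, zero);
        __m128i ylo = _mm_unpacklo_epi8(vy, zero);
        __m128i yhi = _mm_unpackhi_epi8(vy, zero);

        // Square and add in 16-bit lanes.
        __m128i lo = _mm_add_epi16(_mm_mullo_epi16(xlo, xlo),
                                   _mm_mullo_epi16(ylo, ylo));
        __m128i hi = _mm_add_epi16(_mm_mullo_epi16(xhi, xhi),
                                   _mm_mullo_epi16(yhi, yhi));

        // Pack the 16-bit results back to bytes and store 16 of them.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i),
                         _mm_packus_epi16(lo, hi));
    }
    // Scalar tail for lengths that are not a multiple of 16.
    for (; i < n; ++i) {
        z[i] = static_cast<unsigned char>(x[i] * x[i] + y[i] * y[i]);
    }
}
```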
What I have done:
1 - I went through some optimization manuals from Intel and other websites. The main problem they mention with SSE2 code is 16-byte memory alignment. When I checked the addresses manually, they were all 16-byte aligned. I tried both MOVDQU and MOVDQA, but both give the same result and no speedup compared to MMX.
2 - I stepped through the code in debug mode and checked the register values after each instruction. Everything executes exactly as I expected, i.e. 16 bytes are read and the resulting 16 bytes are written out.
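The alignment check from step 1 can also be done in code rather than by inspecting addresses by hand. This is a small sketch; the helper name `is_aligned` is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if pointer p is a multiple of `alignment` bytes
// (16 for MOVDQA). Hypothetical helper, for illustration only.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Calling `is_aligned(X, 16)` on each array before the SIMD function runs would confirm that MOVDQA is safe to use.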
Resources:
I am using an Intel Core i5 processor with Windows 7 and Visual C++ 2010.
Question:
So the final question is: why is there no performance improvement for the SSE2 code compared to the MMX code? Am I doing anything wrong in the SSE2 code, or is there some other explanation?