As codeaholic pointed out to me on IRC, if I want feedback I have to explain why I did this.
First, the complete, original implementation is generic, and I don't have a good track record of integrating video code into SDL.
The tarball contains a contrib directory, which contains sdl/scale2x.c. At first this file looked like exactly what I was looking for.
Then I saw that this file doesn't even use the optimisation explained here.
So I wrote the code above by merging the simple2x I used in the Zelda games with this one. (BTW, you can see that my version doesn't support 24bpp.)
My performance concerns are:
- I'm using #define for readability, whereas the one in contrib/sdl uses variables. So I'm trading some stores for a shift and an increment. I'm not sure which is faster.
- The original implementation handles the first and last columns separately; I'm using (i>0?1:0) in my #define, so this test is done for every pixel. Is it worth reducing readability to remove that per-pixel test?
- I'm using "register" for the two loop counters, but I don't even know how many registers are available on this SoC. Is it worth doing the research to find which variables should be declared register, or is gcc doing a good enough job?
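To make the second question concrete, here is a rough sketch of the two ways to handle the edge columns. This is not my code from above and not the contrib/sdl one: the names scale2x_px, scale2x_tested and scale2x_peeled are made up for the example, and it assumes 16bpp pixels in tightly packed buffers (pitch equal to the width, no SDL surface locking). Both variants should give the same output, so the only thing that changes is where the edge test lives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scale2x rule for one 16bpp pixel E with neighbours B (up), D (left),
   F (right) and H (down). Writes the 2x2 result to o0 (top) and o1 (bottom).
   Hypothetical helper, named for this example only. */
static inline void scale2x_px(uint16_t B, uint16_t D, uint16_t E,
                              uint16_t F, uint16_t H,
                              uint16_t *o0, uint16_t *o1)
{
    if (B != H && D != F) {
        o0[0] = (D == B) ? D : E;
        o0[1] = (B == F) ? F : E;
        o1[0] = (D == H) ? D : E;
        o1[1] = (H == F) ? F : E;
    } else {
        o0[0] = o0[1] = o1[0] = o1[1] = E;
    }
}

/* Variant A: the edge test is evaluated for every pixel. */
void scale2x_tested(const uint16_t *src, uint16_t *dst, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        const uint16_t *cur = src + (size_t)y * w;
        const uint16_t *up = (y > 0) ? cur - w : cur;     /* clamp at top */
        const uint16_t *dn = (y < h - 1) ? cur + w : cur; /* clamp at bottom */
        uint16_t *o0 = dst + (size_t)(2 * y) * (2 * w);
        uint16_t *o1 = o0 + 2 * w;
        for (int x = 0; x < w; x++) {
            scale2x_px(up[x], cur[x - (x > 0 ? 1 : 0)], cur[x],
                       cur[x + (x < w - 1 ? 1 : 0)], dn[x],
                       o0 + 2 * x, o1 + 2 * x);
        }
    }
}

/* Variant B: first and last columns are peeled out of the inner loop,
   so the middle pixels need no edge test at all. */
void scale2x_peeled(const uint16_t *src, uint16_t *dst, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        const uint16_t *cur = src + (size_t)y * w;
        const uint16_t *up = (y > 0) ? cur - w : cur;
        const uint16_t *dn = (y < h - 1) ? cur + w : cur;
        uint16_t *o0 = dst + (size_t)(2 * y) * (2 * w);
        uint16_t *o1 = o0 + 2 * w;
        int last = w - 1;

        /* First column: the left neighbour is the pixel itself. */
        scale2x_px(up[0], cur[0], cur[0], cur[w > 1 ? 1 : 0], dn[0], o0, o1);

        /* Middle columns: both neighbours always exist. */
        for (int x = 1; x < last; x++)
            scale2x_px(up[x], cur[x - 1], cur[x], cur[x + 1], dn[x],
                       o0 + 2 * x, o1 + 2 * x);

        /* Last column: the right neighbour is the pixel itself. */
        if (last > 0)
            scale2x_px(up[last], cur[last - 1], cur[last], cur[last],
                       dn[last], o0 + 2 * last, o1 + 2 * last);
    }
}
```

dst has to hold (2*w)*(2*h) pixels. Which variant is actually faster on this SoC is exactly what I don't know, so timing both on the device is probably the only honest answer.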
I'm not going to unroll the loops: that's what "-funroll-loops" is for.
I don't want to optimize this to death (anyway, if you really need performance, use the complete implementation; I know Pickle has done this). I see this as a good base for talking about performance in code and sharing our experience. (Mine is limited, so I intend to learn from your answers.)