Advanced Optimization Via Profiling With Gcc4
#1
Posted 22 May 2006 - 07:49 AM
-fprofile-use
and
-fprofile-generate
It turns out that these flags are quite useful for optimization.
If a program is compiled and linked with -fprofile-generate, then when you run it, profiling data will be generated for every .o file that was linked and used during the execution.
When you recompile/relink the program with -fprofile-use, the gcc backend will use the profiling data to generate appropriately optimized code. I'm not an expert here, but I'm guessing decisions about inverting loop/branch logic, unrolling if worthwhile, etc. will all be more optimal with stats.
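To make the idea concrete, here is a minimal sketch of the kind of code where branch statistics could matter. It is not taken from the emulator; the opcode names, the file name pgo_demo.c and the PGO_DEMO_MAIN macro are all made up for illustration. The two flags in the build comments are the ones named above.

```c
/* pgo_demo.c: toy interpreter loop with one hot branch and two cold ones.
 *
 * Possible build sequence (file and macro names are hypothetical):
 *   gcc -O2 -DPGO_DEMO_MAIN -fprofile-generate pgo_demo.c -o pgo_demo
 *   ./pgo_demo        (the instrumented run writes the .gcda profile data)
 *   gcc -O2 -DPGO_DEMO_MAIN -fprofile-use pgo_demo.c -o pgo_demo
 */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical opcodes, for illustration only. */
enum { OP_ADD = 0, OP_SUB = 1, OP_RESET = 2 };

/* Runs a tiny "program"; the branch mix below is what the profile records. */
long run_ops(const unsigned char *ops, size_t n)
{
    long acc = 0;
    size_t i;

    assert(ops != NULL);
    for (i = 0; i < n; i++) {
        if (ops[i] == OP_ADD)        /* hot in the workload below */
            acc += 1;
        else if (ops[i] == OP_SUB)   /* rare */
            acc -= 1;
        else                         /* never taken by fill_workload() */
            acc = 0;
    }
    return acc;
}

/* Fills a buffer with a skewed mix: 99 OP_ADD for every OP_SUB. */
void fill_workload(unsigned char *ops, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        ops[i] = (i % 100 == 99) ? OP_SUB : OP_ADD;
}

#ifdef PGO_DEMO_MAIN
int main(void)
{
    static unsigned char ops[100000];

    fill_workload(ops, sizeof ops);
    printf("%ld\n", run_ops(ops, sizeof ops));
    return 0;
}
#endif
```

With the profile in hand, the compiler knows that the OP_ADD branch is taken almost every time and can lay out the loop accordingly. Whether that gives a measurable gain on a loop this small is something you would have to measure.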
The problem I ran into was that the app you are writing MUST compile AND run on x86, as the flags don't do very much when running on the gp2x itself (although it could be that I needed to copy the whole dev tree).
With SDL it's not much of a problem, but with ryleh's minimal lib, you will need to fake out or reimplement some calls.
For files with mismatched checksums on functions, you will need to delete the profile data for those files in the gp2x tree (see step 8 below).
High level instructions:
1) set up the dev tree (for gp2x)
2) set up a mirrored dev tree with lndir (for x86)
3) add -fprofile-use to the CFLAGS and link flags in the Makefiles (for the gp2x tree)
4) add -fprofile-generate to the CFLAGS and link flags in the Makefiles (for the x86 tree)
5) compile x86 tree
6) run the x86 version. You should now end up with many *.gcda and *.gcno files, ideally one per object file.
7) copy *.gcda and *.gcno from the x86 tree into the gp2x tree
8) compile gp2x tree. You will probably run into some checksum errors. For each file that errors out, delete *.gcda and *.gcno for that file only.
9) Hopefully this will build with a sufficient number of *.gcda and *.gcno files still in place.
10) Be pleasantly surprised with the speed boost.
On the emulator I ported, I got about a 10-15% speed boost. :)
(I've not tried this with gcc 3.x or below.)
#4
Posted 22 May 2006 - 09:48 AM
zzhu8192, on May 22 2006, 08:49 AM, said:
Was this boost on the gp2x exec? I'm a bit surprised that optimisations based on data from an x86 build could help an arm build. Do you think that it would make much difference if the data was sampled on the gp2x?
#5
Posted 22 May 2006 - 10:32 AM
MadDog, on May 22 2006, 04:48 AM, said:
Was this boost on the gp2x exec? I'm a bit surprised that optimisations based on data from an x86 build could help an arm build. Do you think that it would make much difference if the data was sampled on the gp2x?
Well, I believe the profiling data indicate number of calls, number of positive/negative branches. So the architecture shouldn't matter. If it were sampled on the gp2x, the number of calls would still be the same, unless the code is doing something very architecture specific. Plus, my code was all C. Of course I'm not a gcc expert....
#6
Posted 22 May 2006 - 11:31 AM
zzhu8192, on May 22 2006, 11:32 AM, said:
Well, I believe the profiling data indicate number of calls, number of positive/negative branches. So the architecture shouldn't matter. If it were sampled on the gp2x, the number of calls would still be the same, unless the code is doing something very architecture specific. Plus, my code was all C. Of course I'm not a gcc expert....
Yes, true, I guess it depends on what the optimiser does with that data. I would have thought that some of the timings would have been different and so caused some errors. It would be interesting to see if it was better if sampled on the gp2x.
Just out of interest, is Gcc4 something you set up yourself or is it in the new SDK?
#7
Posted 22 May 2006 - 12:56 PM
Quote
Just out of interest, is Gcc4 something you set up yourself or is it in the new SDK?
I built my own gcc environment. Mainly because I wanted to use GCJ actually.
I based it off gcc 4.1.0.
#8
Posted 22 May 2006 - 01:05 PM
zzhu8192, on May 22 2006, 11:32 AM, said:
The ARM architecture has conditional execution, which x86 doesn't; this is used quite a bit to replace branches over short distances (typically 1-2 instructions). This, combined with the more flexible register set etc., may make parts less optimal. However, if the overall result is still better, then it's something to look into, especially if you (or anyone) can get the GP2X to run the profiling code for better results!
#9
Posted 22 May 2006 - 01:31 PM
zzhu8192, on May 22 2006, 07:49 AM, said:
-fprofile-use
and
-fprofile-generate
It turns out that these flags are quite useful for optimization.
[...]
(I've not tried this with gcc 3.x or below.)
Just to point out that these features are available in 3.3.x and above (not sure about versions before that). The two flags have different names, but they do essentially the same thing:
-fprofile-arcs -- for the initial compile
-fbranch-probabilities -- to take account of the information written after the run
I have used these extensively with my chess program and, as the original poster stated, they can be worth 10-15% on an already optimized build. This will vary considerably with what your program is doing, however.
#10
Posted 12 June 2006 - 06:02 AM
In gcc, -O3 is a flag used to include relatively expensive optimizations
so there are [many] times where -O2 (only the safe optimizations) is actually quicker than blindly using -O3.
Try measuring the speed increase between -O2 and -O3 with profiling; if it's still around 10-15%, then it's a deal, for sure!
Fix: fixing my bad english
This post has been edited by Lint: 12 June 2006 - 06:03 AM
#11
Posted 26 June 2006 - 02:32 PM
#12
Posted 07 July 2006 - 06:35 AM
Trenki, on Jun 26 2006, 10:32 AM, said:
Maybe this fits here, maybe not. Just started gp2x development this week, and just finished the first dhrystone runs on the gcc from gamepark. http://www.dwelch.com/gp2x
-O3 has, in general, produced faster code than -O2 since I started using gcc (2.95.x), and that continues to be true here: the code is 20-something percent faster just using -O3. I have been told that many would prefer to avoid -O3, perhaps to avoid having their code optimized out. You certainly have to manage your volatiles or your code will not work (that may be true for -O2 as well), that is, if you are talking to the hardware directly...
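As a rough illustration of "managing your volatiles", here is a minimal sketch of a polling loop. The STATUS_READY bit and the wait_ready() helper are hypothetical and not taken from any GP2X register map; the point is only the volatile qualifier on the pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status bit, for illustration only. */
#define STATUS_READY 0x1u

/* Polls a status register until the ready bit is set or the poll budget
 * runs out. Returns 1 if ready was seen, 0 on timeout. */
int wait_ready(volatile uint32_t *status_reg, unsigned max_polls)
{
    unsigned i;

    assert(status_reg != 0);
    for (i = 0; i < max_polls; i++) {
        /* volatile makes the compiler load the register on every pass */
        if (*status_reg & STATUS_READY)
            return 1;
    }
    return 0;
}
```

Without the volatile qualifier, an optimizing compiler is allowed to read the location once, keep the value in a register, and never look at the hardware again, which is one way code that talks to hardware stops working at higher optimization levels.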
This might be an RTFM question, but, does the gp2x have a full 32 bit data bus, or did they do the crippled GBA thing and cut it down to 16 bit? The reason for the question, how does thumb perform on the gp2x compared to arm mode? (32 bit bus, zero wait state memory, arm mode will out run thumb, add enough wait states or run on a 16 bit bus, thumb performance goes up dramatically).
#14
Posted 14 July 2006 - 07:02 AM
Squidge, on Jul 7 2006, 02:53 AM, said:
Only OAM and IWRAM are 32 bit, Palette Ram, Vram, EWRAM and the rom are 16 bit data busses, cartridge ram is an 8 bit bus. I think IWRAM is zero wait states and EWRAM is 2 wait states. rom by default was 3 wait states for the first read then 1 per halfword after that if read sequentially. Could be set for 2/1. And there was a prefetch buffer. In the end, rom for instructions and iwram for data was your fastest combination (thumb mode of course). the fastest I was able to run was 7.35 mips with ARMs RVCT 2.0.1. 45% mips to mhz, which isnt great for an ARM, esp with an ARM compiler. I wonder how many developers actually used the arm tools as they are fairly expensive (I certainly cant afford them, would rather buy a car, or upgrade the one I have with that kind of money). So with gcc the last gcc test I recorded (3.x) was 5.9 mips or about 35% mips to mhz. It would be interesting to see what 4.1.1 does on the GBA.
Anyway, so far, for the gp2x, I have seen 177.48 mips on the 920 and 135.3 on the 940 using gcc 4.x. 89% and 68%.
I think the gba was also crippled by not having a cache...The gp2x relies heavily on the cache, without it (on the 940) I am getting a little over 3 mips or 1.7% mips to mhz, so the memory is very slow compared to the processor clock. It looks like the gp2x is almost the same speed as the gba, clock for clock in a fair fight (no cache).
Hmm, PC100 SDRAM is $11 as is PC133, I would have paid an extra $10 or so to get one wait state ram for this platform <g>.
Yes, at least on the gba where there was no cache to ruin the (compiler comparison) numbers, by "simply" switching compilers your code could run twice as fast. This 2x gain was using the same source, before profiling.
Looks like the caches in both the 920 and 940 are the same, just different sizes. They have the ability to lock down code in the cache, so if your profiling finds a key code segment (that is not too big) you can lock it down in the cache and in theory greatly improve your performance. Rearranging functions within files and files on the command line can change things. I have not done this in a while, so I have to think about how to optimize using that approach. The goal is to avoid having one highly used code segment bump out the other highly used code segment; set them up to avoid each other instead of bumping each other. I think what you want is to have the two of them right next to each other in memory within a cache line. I think a cache line is 64 bytes in the 920 and 32 in the 940, but I would have to re-check that. We had a developer on a project who had an uncanny knack for producing code that would increase the cache misses, no matter how we modified the cache. We need someone who is the opposite of that, with an uncanny ability to increase hits... The cache is clearly the key to success or failure on this platform, as the system relies heavily on it.
You want to write to memory in units as big as you can: don't write four bytes, write one word; don't write two halfwords, write one word. Then from there use STMs as much as you can. For both the 920 and 940 the write buffer has only four addresses, so four byte writes and the write buffer is full. But four word writes will fill it too, so basically you can get a 4x speed improvement by writing blocks of data a word at a time instead of a byte at a time. I assume a good arm compiler already uses STMs for memcpys and strcpys where it can; need to see what gcc does. If it doesn't, I would be curious to see one of your real-world applications that uses strs instead of strhs and strbs, and stms instead of strs where possible. See how much of this 4x gain you can absorb. Actually, if you compare STMs to strbs, you can get a 16x performance increase in writing to memory on the 920 and 8x on the 940. Definitely worth investigating, if your programs use any data or variables <g>.
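A minimal sketch of the byte versus word idea in C follows. The function names fill_bytes() and fill_words() are made up for illustration, and the word version assumes the destination is 4-byte aligned. Whether the compiler turns the word loop into STMs depends on the compiler and flags, so check the generated assembly rather than assuming it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time fill: one store per byte written. */
void fill_bytes(uint8_t *dst, uint8_t value, size_t n_bytes)
{
    size_t i;

    for (i = 0; i < n_bytes; i++)
        dst[i] = value;
}

/* Word-at-a-time fill: one 32-bit store per four bytes written.
 * Assumes dst is 4-byte aligned; n_words counts 32-bit words. */
void fill_words(uint32_t *dst, uint8_t value, size_t n_words)
{
    /* Replicate the byte into all four lanes of a word. */
    uint32_t v = (uint32_t)value * 0x01010101u;
    size_t i;

    assert(((uintptr_t)dst & 3u) == 0);
    for (i = 0; i < n_words; i++)
        dst[i] = v;
}
```

Both functions leave the same bytes in memory; the second one simply issues a quarter as many stores, which is where the 4x figure above would come from if the write buffer is the bottleneck.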
I think with some of these additional optimizations you can gain even more than the 10-15%, but who knows until someone tries. ARM's compiler is definitely worth it if you plan on earning a living off of your game/application; it used to cost around $5500 a few years ago, I don't know what it is today. It looks like they bought Keil, which is very interesting. And Keil has an arm compiler, with a license-free evaluation version, limited to 16 kbytes. It might be worth looking at, just to see how well it generates code.
#15
Posted 29 July 2006 - 08:09 AM
dwelch, on Jul 14 2006, 07:02 AM, said:
I'm sorry guys, can somebody give the following code a try (and provide results to me)?
http://maemo.org/pip...rch/003269.html
It works quite fast on ARM926 and appears to use half cache line bursts when writing data. Maybe it can be good for 920 too. Also here are the results for different ARM based devices:
http://maemo.org/pip...rch/003373.html

