GP32X.com - GP32 GP2X Pandora The Wiz - open source entertainment: Advanced Optimization Via Profiling With Gcc4

Advanced Optimization Via Profiling With Gcc4

#1 zzhu8192

  • GP32 User
  • Group: Members
  • Posts: 50
  • Joined: 10-May 06
  • Location: Austin, TX
  • Interests: Video Games (duh), More Video Games, Computers, Piano, Anime

Posted 22 May 2006 - 07:49 AM

When using gcc 4.1.0, I noticed there are the following flags:

-fprofile-use
and
-fprofile-generate

It turns out that these flags are quite useful for optimization.

If a program is compiled and linked with -fprofile-generate, running it generates profiling data for every .o file that was linked in and used during execution.

When you recompile/relink the program with -fprofile-use, the gcc backend uses that profiling data to generate appropriately optimized code. I'm not an expert here, but I'm guessing decisions such as inverting loop/branch logic, or unrolling where worthwhile, are all better informed with real stats.

The problem I ran into was that the app you are writing MUST compile AND run on x86, as the flags don't seem to do very much when the instrumented build runs on the gp2x itself (although it could be that I needed to copy the whole dev tree over).
With SDL that's not much of a problem, but with rlyeh's minimal lib you will need to fake out or reimplement some calls.

For files with mismatched checksums on functions, you will need to delete the profile data in the

High level instructions:

1) Set up the dev tree (for gp2x).
2) Set up a mirrored dev tree with lndir (for x86).
3) Add -fprofile-use to CFLAGS and the link flags in the Makefiles (for the gp2x tree).
4) Add -fprofile-generate to CFLAGS and the link flags in the Makefiles (for the x86 tree).
5) Compile the x86 tree.
6) Run the x86 version. You should now end up with many *.gcda and *.gcno files, ideally one per object file.
7) Copy the *.gcda and *.gcno files from the x86 tree into the gp2x tree.
8) Compile the gp2x tree. You will probably run into some checksum errors. For each file that errors out, delete the *.gcda and *.gcno for that file only.
9) Hopefully this will build with a sufficient number of *.gcda and *.gcno files still in place.
10) Be pleasantly surprised with the speed boost.
On the emulator I ported, I got about 10-15% speed boost. :)


(I've not tried this with gcc 3.x or below.)
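
As a rough illustration of the kind of code that can benefit, here is a minimal sketch of a loop with a lopsided branch, with the two-pass build spelled out in the header comment. The file name, the function sum_samples, and the arm-linux-gcc cross-compiler name are made up for the example; only -O3 and the two -fprofile flags come from this thread.

```c
/* hot_branch.c: toy workload for trying profile-guided optimization.
 *
 * Two-pass build, following the steps above (names are hypothetical):
 *   x86 tree:   gcc -O3 -fprofile-generate -c hot_branch.c
 *               link with -fprofile-generate as well, then run the program
 *               so that the profile data for hot_branch.o gets written
 *   gp2x tree:  copy the profile data over, then
 *               arm-linux-gcc -O3 -fprofile-use -c hot_branch.c
 */
#include <stddef.h>

/* Sum the non-negative samples and count the negative ones.
 * In a typical run almost every sample is non-negative, so the
 * "bad sample" path is cold; a profile tells the compiler that. */
long sum_samples(const int *samples, size_t count, size_t *bad)
{
    long total = 0;
    size_t i;

    *bad = 0;
    for (i = 0; i < count; i++) {
        if (samples[i] < 0) {   /* rarely true with realistic input */
            (*bad)++;
            continue;
        }
        total += samples[i];
    }
    return total;
}
```

The profile is only as good as the run that produced it, so the instrumented x86 build should be exercised with input that resembles real use.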

#2 Vimacs

  • Don't be evil!
  • Group: X-treme Team
  • Posts: 5,208
  • Joined: 22-October 03
  • Location: Germany

Posted 22 May 2006 - 08:07 AM

10-15% over what? -O3?
Sounds great.

#3 zzhu8192

Posted 22 May 2006 - 09:38 AM

Vimacs, on May 22 2006, 03:07 AM, said:

10-15% over what? -O3?
Sounds great.


10-15% over -O3 without profiling. -O3 is still used as a flag.

#4 MadDog

  • GP32 Hardcore
  • Group: GP32 Hardcore
  • Posts: 262
  • Joined: 04-March 06
  • Location: UK

Posted 22 May 2006 - 09:48 AM

zzhu8192, on May 22 2006, 08:49 AM, said:

On the emulator I ported, I got about 10-15% speed boost. :)


Was this boost on the gp2x exec? I'm a bit surprised that optimisations based on data from an x86 build could help an ARM build. Do you think it would make much difference if the data was sampled on the gp2x?

#5 zzhu8192

Posted 22 May 2006 - 10:32 AM

MadDog, on May 22 2006, 04:48 AM, said:

Was this boost on the gp2x exec? I'm a bit surprised that optimisations based on data from an x86 build could help an ARM build. Do you think it would make much difference if the data was sampled on the gp2x?


Well, I believe the profiling data indicate number of calls, number of positive/negative branches. So the architecture shouldn't matter. If it were sampled on the gp2x, the number of calls would still be the same, unless the code is doing something very architecture specific. Plus, my code was all C. Of course I'm not a gcc expert....
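
To make that concrete, here is a hand-rolled sketch of the sort of counts being described. It is not what gcc actually emits: the struct, the counter variable and both functions are invented for illustration, and real instrumentation is inserted by the compiler rather than written by hand. The point is only that call counts and branch counts depend on the input, not on which CPU runs the code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the data an instrumented build records. */
struct edge_counts {
    unsigned long calls;      /* times classify() was entered */
    unsigned long taken;      /* times the branch condition was true */
    unsigned long not_taken;  /* times the branch condition was false */
};

static struct edge_counts counts;

/* Return 1 for odd values, 0 for even, counting each path by hand. */
int classify(int value)
{
    counts.calls++;
    if (value & 1) {
        counts.taken++;
        return 1;
    }
    counts.not_taken++;
    return 0;
}

/* Reset the counters, feed the input through classify(), report the totals. */
struct edge_counts profile_run(const int *input, size_t n)
{
    size_t i;

    counts.calls = counts.taken = counts.not_taken = 0;
    for (i = 0; i < n; i++)
        (void)classify(input[i]);
    return counts;
}
```

Feeding the same input through profile_run() on x86 and on ARM gives the same three numbers, which is the sense in which the architecture shouldn't matter for plain C code.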

#6 MadDog

Posted 22 May 2006 - 11:31 AM

zzhu8192, on May 22 2006, 11:32 AM, said:

Well, I believe the profiling data indicate number of calls, number of positive/negative branches. So the architecture shouldn't matter. If it were sampled on the gp2x, the number of calls would still be the same, unless the code is doing something very architecture specific. Plus, my code was all C. Of course I'm not a gcc expert....

Yes true, I guess it depends on what the optimiser does with that data. I would have thought that some of the timings would be different and so cause some errors. It would be interesting to see whether it is better if sampled on the gp2x.

Just out of interest, is Gcc4 something you set up yourself or is it in the new SDK?

#7 zzhu8192

Posted 22 May 2006 - 12:56 PM

Quote

Yes true, I guess it depends on what the optimiser does with that data. I would have thought that some of the timings would be different and so cause some errors. It would be interesting to see whether it is better if sampled on the gp2x.

Just out of interest, is Gcc4 something you set up yourself or is it in the new SDK?

I built my own gcc environment. Mainly because I wanted to use GCJ actually.
I based it off gcc 4.1.0.

#8 paeryn

  • Reclusive maniac
  • Group: GP Guru
  • Posts: 389
  • Joined: 28-November 05
  • Location: Sheffield, England
  • Interests: Programming, reading (sci-fi, horror, humour) and cats!

Posted 22 May 2006 - 01:05 PM

zzhu8192, on May 22 2006, 11:32 AM, said:

Well, I believe the profiling data indicate number of calls, number of positive/negative branches. So the architecture shouldn't matter. If it were sampled on the gp2x, the number of calls would still be the same, unless the code is doing something very architecture specific. Plus, my code was all C. Of course I'm not a gcc expert....

The ARM architecture has conditional execution on most instructions, which x86 doesn't. This is used quite a bit to replace branches over short distances (typically 1-2 instructions). Combined with the more flexible register set etc., this may make parts of the ARM build less optimal when the profile comes from x86. However, if the overall result is still better, then it's something to look into, especially if you (or anyone) can get the GP2X to run the profiling code for better results!
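
A small sketch of the kind of short branch meant here. The function names are invented, and the ARM instructions in the comment are what a compiler can typically produce for this pattern, not a guaranteed output of any particular gcc version.

```c
/* Absolute value: the "if" skips over a single instruction.
 * On ARM this can typically become two instructions with no branch:
 *     cmp   r0, #0
 *     rsblt r0, r0, #0     ; executed only when r0 < 0
 * whereas a target without conditional execution may need a real
 * branch (or a conditional move) here. */
int abs_value(int x)
{
    if (x < 0)
        x = -x;
    return x;
}

/* Clamp to 0..255: two more short branches of the same kind. */
int clamp_u8(int v)
{
    if (v < 0)
        v = 0;
    if (v > 255)
        v = 255;
    return v;
}
```

When such a branch disappears on ARM, the taken/not-taken counts gathered for it matter much less than they would on x86.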

#9 evening2005

  • GP32 Hardcore
  • Group: Members
  • Posts: 137
  • Joined: 23-September 05

Posted 22 May 2006 - 01:31 PM

zzhu8192, on May 22 2006, 07:49 AM, said:

When using gcc 4.1.0, I noticed there are the following flags:

-fprofile-use
and
-fprofile-generate

It turns out that these flags are quite useful for optimization.

[...]

(I've not tried this with gcc 3.x or below.)


Just to point out that these features are available in 3.3.x and above (not sure about versions before that). The two flags have different names, but they do essentially the same thing:
-fprofile-arcs -- for the initial compile
-fbranch-probabilities -- to take account of the information written after the run

I have used these extensively with my chess program and, as the original poster stated, they can be worth 10-15% on an already optimized build. This will vary considerably with what your program is doing, however.

#10 Lint

  • GP32 Hardcore
  • Group: Members
  • Posts: 186
  • Joined: 05-June 06

Posted 12 June 2006 - 06:02 AM

I'm not sure about this, but here goes:
In gcc, -O3 is a flag that enables relatively expensive optimizations,
so there are [many] times when -O2 (only the safe optimizations) is actually quicker than a blind -O3.
Try measuring the speed increase between -O2 and -O3 with profiling; if it's still 10-15%, then it's a deal, for sure!

Fix: fixing my bad English

#11 Trenki

  • GP32 Hardcore
  • Group: Members
  • Posts: 114
  • Joined: 15-June 06
  • Location: South Tyrol, Italy
  • Interests: Programming

Posted 26 June 2006 - 02:32 PM

Hi all! I just found an interesting optimization tool called Acovea which I think could also be put to good use on the gp2x. Using this tool one could find the best compiler flags for e.g. the polygon rendering functions. It finds the compiler flags producing the fastest executable by using a genetic algorithm.

#12 dwelch

  • GP32 Hardcore
  • Group: Members
  • Posts: 114
  • Joined: 07-July 06

Posted 07 July 2006 - 06:35 AM

Trenki, on Jun 26 2006, 10:32 AM, said:

Hi all! I just found an interesting optimization tool called Acovea which I think could also be put to good use on the gp2x. Using this tool one could find the best compiler flags for e.g. the polygon rendering functions. It finds the compiler flags producing the fastest executable by using a genetic algorithm.


Maybe this fits here, maybe not. I just started gp2x development this week, and just finished the first dhrystone runs on the gcc from gamepark: http://www.dwelch.com/gp2x
-O3 has, in general, produced faster code than -O2 ever since I started using gcc (2.95.x), and that continues to be true here: the code is 20-something % faster just using -O3. I have been told that many would prefer to avoid -O3, perhaps to avoid having their code optimized out. You certainly have to manage your volatiles or your code will not work (that may be true for -O2 as well), at least if you are talking to the hardware directly...
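
A minimal sketch of what managing your volatiles means in practice. The register layout, the STATUS_READY bit and the function name are hypothetical and do not describe any real GP2X register; the pointer would normally be a memory-mapped address, but any variable works for trying it out.

```c
#include <stdint.h>

#define STATUS_READY 0x1u   /* hypothetical "ready" bit */

/* Poll a status register until the ready bit is set or we give up.
 * Returns 1 if the bit was seen, 0 on timeout.
 *
 * The volatile qualifier forces a fresh load on every pass. Without it,
 * an optimizing build is allowed to read the register once, keep the
 * value in a CPU register, and spin on that stale copy. */
int wait_ready(volatile uint32_t *status_reg, unsigned max_polls)
{
    while (max_polls-- > 0) {
        if (*status_reg & STATUS_READY)
            return 1;
    }
    return 0;
}
```

The same rule applies to any variable that hardware or an interrupt handler can change behind the compiler's back.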

This might be an RTFM question, but does the gp2x have a full 32-bit data bus, or did they do the crippled GBA thing and cut it down to 16 bits? The reason I ask: how does Thumb perform on the gp2x compared to ARM mode? (With a 32-bit bus and zero-wait-state memory, ARM mode will outrun Thumb; add enough wait states or run on a 16-bit bus, and Thumb performance goes up dramatically.)

#13 Squidge

  • Mega GP Mania
  • Group: X-treme Team
  • Posts: 8,496
  • Joined: 16-November 03
  • Gender: Male
  • Location: UK

Posted 07 July 2006 - 06:53 AM

The GP2X has a full 32-bit data bus to its 64MB of RAM. The GBA also has a full 32-bit data bus to its RAM; the only part that was crippled was access to the ROM cartridge (hence the reason why a lot of people copied code from ROM to RAM before execution).

#14 dwelch

Posted 14 July 2006 - 07:02 AM

Squidge, on Jul 7 2006, 02:53 AM, said:

The GP2X has a full 32-bit data bus to its 64MB of RAM. The GBA also has a full 32-bit data bus to its RAM; the only part that was crippled was access to the ROM cartridge (hence the reason why a lot of people copied code from ROM to RAM before execution).


Only OAM and IWRAM are 32-bit. Palette RAM, VRAM, EWRAM and the ROM are on 16-bit data buses, and cartridge RAM is on an 8-bit bus. I think IWRAM is zero wait states and EWRAM is 2 wait states. ROM by default was 3 wait states for the first read, then 1 per halfword after that if read sequentially; it could be set to 2/1. And there was a prefetch buffer. In the end, ROM for instructions and IWRAM for data was your fastest combination (Thumb mode, of course).

The fastest I was able to run was 7.35 mips with ARM's RVCT 2.0.1. That is 45% mips to mhz, which isn't great for an ARM, especially with an ARM compiler. I wonder how many developers actually used the ARM tools, as they are fairly expensive (I certainly can't afford them; I would rather buy a car, or upgrade the one I have, with that kind of money). With gcc, the last test I recorded (3.x) was 5.9 mips, or about 35% mips to mhz. It would be interesting to see what 4.1.1 does on the GBA.

Anyway, so far, for the gp2x, I have seen 177.48 mips on the 920 and 135.3 on the 940 using gcc 4.x. That is 89% and 68% mips to mhz, respectively.
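
For anyone following the arithmetic, the percentages are just measured MIPS divided by the clock in MHz. The clock speed is not stated in this post; the sketch below assumes 200 MHz for both cores, which is an assumption chosen because it reproduces the quoted figures. The function name is made up.

```c
/* Measured MIPS as a percentage of the clock rate in MHz.
 * Example with the numbers above and an ASSUMED 200 MHz clock:
 *   mips_per_mhz_percent(177.48, 200.0)  is about 88.7 (quoted as 89%)
 *   mips_per_mhz_percent(135.3,  200.0)  is about 67.7 (quoted as 68%) */
double mips_per_mhz_percent(double mips, double clock_mhz)
{
    return 100.0 * mips / clock_mhz;
}
```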

I think the gba was also crippled by not having a cache... The gp2x relies heavily on the cache; without it (on the 940) I am getting a little over 3 mips, or 1.7% mips to mhz, so the memory is very slow compared to the processor clock. It looks like the gp2x is almost the same speed as the gba, clock for clock, in a fair fight (no cache).
Hmm, PC100 SDRAM is $11, as is PC133; I would have paid an extra $10 or so to get one-wait-state RAM for this platform <g>.

Yes, at least on the gba, where there was no cache to ruin the (compiler comparison) numbers, by "simply" switching compilers your code could run twice as fast. This 2x gain was using the same source, before profiling.

Looks like the caches in both the 920 and 940 are the same, just different sizes. They have the ability to lock down code in the cache, so if your profiling finds a key code segment (that is not too big) you can lock it down in the cache and, in theory, greatly improve your performance.

Rearranging functions within files, and files on the command line, can change things. I have not done this in a while, so I have to think about how to optimize using that approach. The goal is to keep one highly used code segment from bumping out another highly used code segment: set them up to avoid each other instead of bumping each other. I think what you want is to have the two of them right next to each other in memory, within a cache line. I think a cache line is 64 bytes in the 920 and 32 in the 940, but I would have to re-check that.

We had a developer on a project who had an uncanny knack for producing code that would increase the cache misses, no matter how we modified the cache. We need someone who is the opposite of that, with an uncanny ability to increase hits... The cache is clearly the key to success or failure on this platform, as the system relies heavily on it.

You want to write to memory in units as big as you can: don't write four bytes, write one word; don't write two halfwords, write one word. Then from there use STMs as much as you can. For both the 920 and 940 the write buffer has only four addresses, so four byte writes and the write buffer is full. But four word writes will fill it too, so basically you can get a 4x speed improvement by writing blocks of data a word at a time instead of a byte at a time. I assume a good ARM compiler already uses STMs for memcpys and strcpys where it can; I need to see what gcc does. If it doesn't, I would be curious to see one of your real-world applications that uses STRs instead of STRHs and STRBs, and STMs instead of STRs, where possible, to see how much of this 4x gain you can absorb. Actually, if you compare STMs to STRBs, you can get a 16x performance increase in writing to memory on the 920 and 8x on the 940. Definitely worth investigating, if your programs use any data or variables <g>.
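
In C terms, the difference looks roughly like the sketch below. Both function names are invented, and whether the word loop ends up as STR or STM instructions is up to the compiler; an optimizing build may also rewrite either loop on its own, so check the generated assembly before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>

/* One store per byte: each write takes a write buffer slot for 8 bits. */
void fill_bytes(uint8_t *dst, uint8_t value, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        dst[i] = value;
}

/* One store per 32-bit word: same data, a quarter of the stores.
 * dst must be 4-byte aligned, which the uint32_t pointer type implies. */
void fill_words(uint32_t *dst, uint8_t value, size_t words)
{
    /* Replicate the byte into all four lanes of the word. */
    uint32_t pattern = value * 0x01010101u;
    size_t i;

    for (i = 0; i < words; i++)
        dst[i] = pattern;
}
```

Filling a 16-byte buffer takes 16 stores with fill_bytes() and 4 with fill_words(), which is where the 4x figure comes from.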

I think with some of these additional optimizations you can gain even more than the 10-15%, but who knows until someone tries. ARM's compiler is definitely worth it if you plan on earning a living off your game/application; it used to cost around $5500 a few years ago, I don't know what it is today. It looks like they bought Keil, which is very interesting. And Keil has an ARM compiler with a license-free evaluation version, limited to 16 kbytes. It might be worth looking at, just to see how well it generates code.

#15 Serge

  • Member
  • Group: Members
  • Posts: 8
  • Joined: 18-June 06

Posted 29 July 2006 - 08:09 AM

dwelch, on Jul 14 2006, 07:02 AM, said:

You want to write to memory in units as big as you can: don't write four bytes, write one word; don't write two halfwords, write one word. Then from there use STMs as much as you can. For both the 920 and 940 the write buffer has only four addresses, so four byte writes and the write buffer is full. But four word writes will fill it too, so basically you can get a 4x speed improvement by writing blocks of data a word at a time instead of a byte at a time. I assume a good ARM compiler already uses STMs for memcpys and strcpys where it can; I need to see what gcc does. If it doesn't, I would be curious to see one of your real-world applications that uses STRs instead of STRHs and STRBs, and STMs instead of STRs, where possible, to see how much of this 4x gain you can absorb. Actually, if you compare STMs to STRBs, you can get a 16x performance increase in writing to memory on the 920 and 8x on the 940. Definitely worth investigating, if your programs use any data or variables <g>.

I'm sorry guys, can somebody give the following code a try (and provide results to me)?
http://maemo.org/pip...rch/003269.html

It works quite fast on ARM926 and appears to use half cache line bursts when writing data. Maybe it can be good for 920 too. Also here are the results for different ARM based devices:
http://maemo.org/pip...rch/003373.html
