Mupen64Plus


728 replies to this topic

#1 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • PipPipPipPipPip
  • 477 posts

Posted 29 August 2009 - 01:48 PM

I have rewritten the dynamic recompiler for Mupen64plus to generate ARM code. This will run on OpenPandora, TouchBook, and BeagleBoard.

Unfortunately OpenGL acceleration does not work. It is possible to use software rendering, but it is very slow and some of the textures display incorrectly. Since several people are working on OpenGL ES libraries, I am posting this as-is for testing purposes.

Running the emulator requires approximately 100MB free memory. Large ROMs, or games that use the memory expansion, require slightly more. Be sure to close other applications first.

Because the emulator is not in a usable state, I have not made any attempt to test the controls. However, the mupen64_input.so plugin seems to work. blight_input does not. You will need to set the appropriate plugin.

Compiling mupen64plus (or at least rice_video) requires more than 256MB RAM. You will need to create and activate a swap partition on an external hard disk before compiling mupen64plus.

If you are using Angstrom, you will need to install the following packages: gcc, g++, make, pkgconfig, libsdl-ttf-dev, gtk+-dev, libgl-dev, libglu-dev, xserver-xorg-extension-glx

Some of the plugins have x86 assembly code which will not build on ARM. After building mupen64plus, do "make plugins NO_ASM=1" to build these plugins.

The mesa-7.2 package in the current version of Angstrom is broken. If you wish to use the software rasterizer, you will need to compile mesa from source, then copy swrast_dri.so into /usr/lib/dri/

The original dynamic recompiler in Mupen64 used far too much memory to run on the Pandora because it retained all of the translated code and metadata. To limit memory usage, it now allocates a 16MB buffer for code translation, and if this gets full, old blocks are removed to make space. 16MB appears to be sufficient for most games without causing excessive thrashing. This code is in r4300/new_dynarec/.

http://dl.free.fr/tmkLzOpvB


Edit: Updated version here. OpenGL ES plugin here.

Edited by Ari64, 02 April 2010 - 04:33 PM.


#2 guizm

guizm

    GP32 Hardcore

  • Member
  • PipPipPipPip
  • 119 posts
  • Gender:Male

Posted 29 August 2009 - 05:46 PM

What a first post, man :lol: :D
I don't know how to help, but it's nice to see that Mupen can be ported.

#3 midna25

midna25

    GP32 Hardcore

  • Member
  • PipPipPipPip
  • 141 posts

Posted 29 August 2009 - 06:29 PM

Finally someone giving N64 emulation a shot. I don't know what will come of this particular project, but if nothing else it's at least a PoC.

#4 fischju2000

fischju2000

    Mega GP Mania

  • GP32 Hardcore
  • PipPipPipPipPipPip
  • 763 posts
  • Gender:Male

Posted 29 August 2009 - 07:46 PM

Sounds great. I can't really help, but not everybody can read French and that site is very, very slow, so I uploaded the files somewhere faster:

http://hotfile.com/dl/11090512/8190d1e/mupen64plus-arm-20090829.tar.gz.html

#5 craigix

craigix

    Mega GP Mania

  • GP Guru
  • 6345 posts
  • Gender:Male
  • Location:England

Posted 29 August 2009 - 08:33 PM

Hi Ari,

Could you give us an idea of the kind of PC-based system this needs to run? I'd be interested to know the sort of speeds required.

It should be possible to get OpenGL going; Pickle might be able to give some advice there.

#6 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • PipPipPipPipPip
  • 477 posts

Posted 29 August 2009 - 08:43 PM

fischju2000 wrote:
> Sounds great. I can't really help, but not everybody can read French and that site is very, very slow, so I uploaded the files somewhere faster:
>
> http://hotfile.com/d...829.tar.gz.html

I originally tried to upload to mediafire, but it was even slower.

#7 Vorporeal

Vorporeal

    Yes, no, I, this is.

  • GP32 Hardcore
  • PipPipPipPipPipPip
  • 1475 posts
  • Gender:Male

Posted 29 August 2009 - 08:45 PM

Yeah, this is a job for the illustrious Pickle. We just need another virgin sacrifice to bring him out of hiding.

#8 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 29 August 2009 - 08:57 PM

I don't suppose there's any way I could get you to post MIPS to ARM block comparisons? I love those.

If you can remove a block in isolation as opposed to flushing the entire cache on overflow then it must mean that the recompiler is not directly linking blocks, which is a pretty big red flag for performance :/

Edited by Exophase, 29 August 2009 - 09:00 PM.


#9 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • PipPipPipPipPip
  • 477 posts

Posted 29 August 2009 - 08:58 PM

craigix wrote:
> Hi Ari,
>
> Could you give us an idea of the kind of PC-based system this needs to run? I'd be interested to know the sort of speeds required.
>
> It should be possible to get OpenGL going; Pickle might be able to give some advice there.

Compatibility should be roughly the same as the original mupen64plus; I just replaced the dynamic recompiler. It will compile and run on x86. It will work on x86-64 too, although it only generates 32-bit code. I didn't really try to optimize x86-64 since my focus was on ARM. It only uses around 15-20% CPU time on my Core 2, so there probably isn't much to be gained from optimizing for this type of CPU anyway.

#10 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • PipPipPipPipPip
  • 477 posts

Posted 29 August 2009 - 09:20 PM

Exophase wrote:
> I don't suppose there's any way I could get you to post MIPS to ARM block comparisons? I love those.
>
> If you can remove a block in isolation as opposed to flushing the entire cache on overflow then it must mean that the recompiler is not directly linking blocks, which is a pretty big red flag for performance :/

If you define assem_debug in new_dynarec.c, you will get debugging output from the code generator - I assume that's what you want.
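
A compile-time tracing switch of this kind is often written as a macro that either forwards to printf or expands to nothing. The sketch below only illustrates that general pattern. The `ASSEM_DEBUG` guard and the `emit_add` helper are hypothetical names chosen for the example and are not taken from new_dynarec.c.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical switch: build with -DASSEM_DEBUG to trace generated code. */
#ifdef ASSEM_DEBUG
#define assem_debug(...) printf(__VA_ARGS__)
#else
#define assem_debug(...) ((void)0)
#endif

/* Hypothetical emitter: returns the ARM encoding of "add rd, rn, rm"
 * (condition AL, no shift, flags not updated) and traces it when enabled. */
static unsigned int emit_add(int rd, int rn, int rm)
{
    assert(rd >= 0 && rd < 16);
    assert(rn >= 0 && rn < 16);
    assert(rm >= 0 && rm < 16);
    assem_debug("add r%d, r%d, r%d\n", rd, rn, rm);
    return 0xe0800000u | ((unsigned int)rn << 16)
                       | ((unsigned int)rd << 12)
                       | (unsigned int)rm;
}
```

With the switch off, the trace calls compile away entirely, so the generator pays nothing for them in a normal build.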

It is directly linking blocks, and it does flush the cache. I thought I could get away with not doing so, but the Cortex-A8 has a random replacement policy, which means you're always left with a few old cache lines.

It doesn't flush the cache every time it links a block though. If an old, unresolved branch address is present in the i-cache, then we just end up in dyna_linker again, which has code to check if this happened.

Edited by Ari64, 29 August 2009 - 09:29 PM.


#11 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 29 August 2009 - 09:45 PM

I was referring to the translation cache, not the Cortex-A8's L1/L2 caches. Maybe I'm misunderstanding this line:

> To limit memory usage, it now allocates a 16MB buffer for code translation, and if this gets full, old blocks are removed to make space.


That suggests that you're not flushing the entire translation cache when you run out of space, but only individual blocks. If the blocks are directly linked to each other then this is more or less impossible. I'll have to look at your code, I suppose.

I don't know what debugging output is provided, but I also can't test it myself because I don't have any hardware that can run this. If it does what I want (shows original MIPS blocks vs recompiled ARM blocks) then would it be possible for you to run it and give me some samples? This of course means that it'd have to have an ARM disassembler to be intelligible. I am curious to see what the general quality of the recompiler is, and it's much easier to discern this from examples than from looking at the code.

#12 Laurent

Laurent

    Mega GP Mania

  • GP32 Hardcore
  • PipPipPipPipPipPip
  • 1036 posts
  • Location:France

Posted 29 August 2009 - 09:53 PM

Exophase wrote:
> If you can remove a block in isolation as opposed to flushing the entire cache on overflow then it must mean that the recompiler is not directly linking blocks, which is a pretty big red flag for performance :/

What makes you think removing a translated block implies blocks are not linked? QEMU and DynamoRIO can do that (by chaining block descriptors in the case of QEMU).

#13 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 29 August 2009 - 10:43 PM

Laurent wrote:
> Exophase wrote:
> > If you can remove a block in isolation as opposed to flushing the entire cache on overflow then it must mean that the recompiler is not directly linking blocks, which is a pretty big red flag for performance :/
>
> What makes you think removing a translated block implies blocks are not linked? QEMU and DynamoRIO can do that (by chaining block descriptors in the case of QEMU).


Because if you remove a block you have to remove everything that's linked to it, recursively. Tracking such a thing is not worth it when you're going to end up removing a major chunk of your blocks that way anyway. You'll end up fragmenting memory pretty heavily this way too.

The Dynamo paper made it clear that it flushed the entire cache for these reasons. I don't know if this "RIO" is different in this regard, since I've never heard of it. Also, the description you gave of QEMU's dispatcher makes it clear that this is not happening there either.

Just to be clear, when I say "direct linking" I mean a translated direct branch from one block to another. If it goes through any kind of indirect lookup then it's not direct linking, even if there's a cached value stored somewhere to make it faster than doing a full emulated PC to translated block conversion (plus potential new block compilation). No matter how you slice it, this is going to be considerably slower than the alternative.
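
The difference between the two kinds of block exit can be modelled in plain C. This is a minimal sketch, not code from any of the emulators discussed: a real recompiler patches a branch instruction in generated code, whereas here the patched branch is stood in for by a pointer. All names (`struct block`, `lookup_block`, `follow_exit`, `TABLE_SIZE`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256  /* hypothetical size; must be a power of two */

struct block {
    uint32_t guest_pc;     /* emulated address this block translates */
    struct block *linked;  /* stands in for a patched direct branch */
    uint32_t next_pc;      /* emulated address of the branch target */
};

static struct block *table[TABLE_SIZE];
static unsigned long lookups;  /* counts trips through the indirect path */

static unsigned hash_pc(uint32_t pc)
{
    return (pc >> 2) & (TABLE_SIZE - 1);
}

static void register_block(struct block *b)
{
    assert(b != NULL);
    table[hash_pc(b->guest_pc)] = b;
}

/* Indirect path: convert an emulated PC to a translated block. */
static struct block *lookup_block(uint32_t pc)
{
    struct block *b = table[hash_pc(pc)];
    lookups++;
    return (b != NULL && b->guest_pc == pc) ? b : NULL;
}

/* Leave block b. With linking allowed, the first exit resolves the
 * target and records it; later exits skip the lookup completely. */
static struct block *follow_exit(struct block *b, int allow_linking)
{
    struct block *next;
    if (b->linked != NULL)
        return b->linked;              /* direct: no lookup at all */
    next = lookup_block(b->next_pc);   /* indirect: pay for the lookup */
    if (next != NULL && allow_linking)
        b->linked = next;
    return next;
}
```

In a two-block loop, the linked version performs one lookup per block and none afterwards, while the unlinked version performs one lookup on every exit.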

Edited by Exophase, 29 August 2009 - 10:56 PM.


#14 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • PipPipPipPipPip
  • 477 posts

Posted 29 August 2009 - 11:03 PM

Exophase wrote:
> I was referring to the translation cache, not the Cortex-A8's L1/L2 caches. Maybe I'm misunderstanding this line:
>
> > To limit memory usage, it now allocates a 16MB buffer for code translation, and if this gets full, old blocks are removed to make space.
>
> That suggests that you're not flushing the entire translation cache when you run out of space, but only individual blocks. If the blocks are directly linked to each other then this is more or less impossible. I'll have to look at your code, I suppose.

The 16MB buffer is split up into 2MB regions, and when it needs space, it dumps an entire 2MB region. Before this happens it makes a pass through the entire cache, removing links to this area. This is done incrementally and not all at once, so there isn't a single point in time where we have to stop everything to clean up pointers. The code that does this is at the end of new_recompile_block().
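
The scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in new_recompile_block(): the block table, `SCAN_STEP`, and the function names are hypothetical, and links are modelled as pointers rather than patched branch instructions.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGIONS 8   /* 16MB buffer / 2MB regions, as in the post */
#define MAX_BLOCKS  64  /* hypothetical cap for this sketch */
#define SCAN_STEP   16  /* hypothetical: blocks examined per call */

struct block {
    int in_use;
    int region;          /* which 2MB region holds this block's code */
    struct block *link;  /* direct branch target, NULL if unresolved */
};

static struct block blocks[MAX_BLOCKS];
static int scan_pos;     /* where the incremental pass resumes */

/* Examine one slice of the block table and clear links that point into
 * the region about to be dumped. Returns 1 once a full pass is done.
 * Intended to be called a little at a time, e.g. once per compiled block. */
static int unlink_step(int victim)
{
    int i;
    assert(victim >= 0 && victim < NUM_REGIONS);
    for (i = 0; i < SCAN_STEP && scan_pos < MAX_BLOCKS; i++, scan_pos++) {
        struct block *b = &blocks[scan_pos];
        if (b->in_use && b->link != NULL && b->link->region == victim)
            b->link = NULL;  /* branch must be resolved again later */
    }
    if (scan_pos < MAX_BLOCKS)
        return 0;
    scan_pos = 0;
    return 1;
}

/* After a complete pass nothing points into the region, so every block
 * in it can be dropped and the space reused. */
static void evict_region(int victim)
{
    int i;
    assert(victim >= 0 && victim < NUM_REGIONS);
    for (i = 0; i < MAX_BLOCKS; i++)
        if (blocks[i].in_use && blocks[i].region == victim)
            blocks[i].in_use = 0;
}
```

One simplification to note: the sketch assumes no new links into the victim region are created while the pass is in progress. A real implementation would have to refuse such links or rescan.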

Exophase wrote:
> I don't know what debugging output is provided, but I also can't test it myself because I don't have any hardware that can run this. If it does what I want (shows original MIPS blocks vs recompiled ARM blocks) then would it be possible for you to run it and give me some samples? This of course means that it'd have to have an ARM disassembler to be intelligible. I am curious to see what the general quality of the recompiler is, and it's much easier to discern this from examples than from looking at the code.

It basically just prints out every instruction that it disassembles, and every instruction that it generates. This happens before the linker stage, so the branches are unresolved and the literal pool is not generated yet, but it shows all of the instructions. Is there any code sequence in particular you want to see? It generates quite a lot of output.

#15 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 30 August 2009 - 12:08 AM

Ari64 wrote:
> The 16MB buffer is split up into 2MB regions, and when it needs space, it dumps an entire 2MB region. Before this happens it makes a pass through the entire cache, removing links to this area. This is done incrementally and not all at once, so there isn't a single point in time where we have to stop everything to clean up pointers. The code that does this is at the end of new_recompile_block().


Okay, this is clearer - the 2MB region splits make sense, but the lazy scanning and removing of direct links is not something I'd personally consider doing. Instead I'd prefer to just force direct branches that cross that region to go through indirection. But only profiling would really show which is worth it.
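
One way to read the alternative suggested here is as a link-time policy: a branch is linked directly only when source and target sit in the same 2MB region, so dumping a region never leaves a dangling direct branch in another one. The sketch below is an interpretation of that idea, with hypothetical names (`choose_branch`, `region_of`) and cache offsets standing in for real code addresses.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE   (16u << 20)  /* 16MB translation buffer */
#define REGION_SHIFT 21           /* 2MB regions: 1 << 21 bytes */

enum branch_kind { BRANCH_DIRECT, BRANCH_INDIRECT };

/* Region index of an offset into the translation buffer. */
static unsigned region_of(uint32_t offset)
{
    assert(offset < CACHE_SIZE);
    return offset >> REGION_SHIFT;
}

/* Link directly within a region; go through a lookup across regions. */
static enum branch_kind choose_branch(uint32_t from, uint32_t to)
{
    return region_of(from) == region_of(to) ? BRANCH_DIRECT
                                            : BRANCH_INDIRECT;
}
```

The trade-off is the one named in the post: no scanning pass is needed, but every cross-region branch pays the indirect cost, and only profiling would show which approach wins.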

Ari64 wrote:
> It basically just prints out every instruction that it disassembles, and every instruction that it generates. This happens before the linker stage, so the branches are unresolved and the literal pool is not generated yet, but it shows all of the instructions. Is there any code sequence in particular you want to see? It generates quite a lot of output.


I'll take any snippets at all ;D But the more you think it's representative of typical executed code the better. If you have any profiling data, code from near the top of the profile on some popular games would be neat.

By the way, I'm very glad that a new recompiler author has joined, and I hope that we'll have lots of interesting discussions in the future.. heheh..

Congratulations on your work so far. I think you're going to end up being very important to the Pandora.

Edited by Exophase, 30 August 2009 - 12:12 AM.