Mupen64Plus


728 replies to this topic

#31 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 30 August 2009 - 08:54 PM

Your code generation looks great, I think that people won't have much to worry about getting good speed N64 emulation out of this. I really like how simple the memory map emulation is, not that it's news to me regarding N64 but just seeing it is nice, especially since you don't need a register for it on ARM. It's at least one nice break you have emulating N64 over other things.. Just a few questions:

If you're referring to the cmp/bvc, it should look familiar; Daedalus does something very similar.
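A minimal sketch of how a cmp/bvc pair can act as a RAM range check, written in C so the flag logic is visible. The 8 MB RDRAM window at 0x80000000 and the constant 0x800000 are assumptions for illustration, not taken from the posted code, and the function names are hypothetical. The idea is that subtracting the RAM size from an address only overflows as a signed 32-bit operation when the address sits in the lowest 8 MB above 0x80000000, so one compare and one branch on the V flag cover the whole check without a scratch register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 8 MB of RDRAM mapped at 0x80000000..0x807FFFFF. */
#define RDRAM_SIZE 0x800000

/* Models the V (signed overflow) flag after "cmp addr, #RDRAM_SIZE",
   i.e. after computing addr - RDRAM_SIZE as a 32-bit signed subtraction. */
static bool cmp_overflows(uint32_t addr)
{
    /* Reinterpret the 32-bit pattern as signed, widened to 64 bits. */
    int64_t s = (addr & 0x80000000u) ? (int64_t)addr - 4294967296LL
                                      : (int64_t)addr;
    int64_t diff = s - RDRAM_SIZE;
    return diff < INT32_MIN || diff > INT32_MAX;
}

/* "bvc slow_path" branches when V is clear, so the inline fast path
   is taken exactly when the subtraction overflowed. */
static bool takes_fast_path(uint32_t addr)
{
    return cmp_overflows(addr);
}

/* Boundary cases for the assumed layout. */
void range_check_examples(void)
{
    assert(takes_fast_path(0x80000000u));   /* first byte of RDRAM */
    assert(takes_fast_path(0x807FFFFFu));   /* last byte of RDRAM */
    assert(!takes_fast_path(0x80800000u));  /* just past RDRAM */
    assert(!takes_fast_path(0xA4000000u));  /* outside the assumed window */
    assert(!takes_fast_path(0x00001000u));  /* low (mapped) address */
}
```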

- Any plans for any global register allocation strategies? Of course this would help with the 64bit stuff too, I think. Not that pretty much anything would be at all simple to implement..

There is a preferred mapping, which is r1->r1, r2->r2...r7->r7, r8->r0, r9->r1, etc. This helps with branches, as things are usually in the same registers. However, if it runs out of registers, it will use whatever registers are available.
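The mapping described above can be sketched roughly like this. It assumes eight allocatable host registers (r0 to r7), which is what the r8->r0, r9->r1 wrap-around suggests; the function names and the `in_use` array are hypothetical and not from the actual allocator.

```c
#include <assert.h>
#include <stdbool.h>

#define HOST_REGS 8  /* assumption: eight allocatable host registers, r0..r7 */

/* Preferred host register for a guest (MIPS) register:
   r1->r1 ... r7->r7, r8->r0, r9->r1, and so on. */
static int preferred_host_reg(int guest_reg)
{
    assert(guest_reg >= 0 && guest_reg < 32);
    return guest_reg % HOST_REGS;
}

/* in_use[h] is true when host register h already holds a live value.
   Returns the preferred register if it is free, otherwise any free
   register, otherwise -1 (the caller would have to spill something). */
static int alloc_host_reg(int guest_reg, const bool in_use[HOST_REGS])
{
    int pref = preferred_host_reg(guest_reg);
    if (!in_use[pref])
        return pref;
    for (int h = 0; h < HOST_REGS; h++)
        if (!in_use[h])
            return h;
    return -1;
}
```

Because two blocks usually pick the same host register for the same guest register, a branch between them often needs no register shuffling at the boundary.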

- Are you considering using MMU protection (mmap) for the self modifying code check on the store? Since you're using the same page granularity anyway, which kind of suggests to me that you plan for it later.

I hadn't seriously considered it, but it would be possible. The page size is 4K because that's the native page size on the r4300, although almost no N64 games actually use the MMU. (And if any do use the MMU, they probably won't work, since I haven't tested this.)

- Any plans on scheduling for Cortex-A8? Naturally this will make the register allocation more constricted but with a lot of loads in the picture it seems worthwhile.

To schedule instructions this way would require an instruction-reordering pass after code generation. This could be done but would take some work. I guess the Cortex-A9 CPU will do this in hardware. It might be almost as effective to simply change the register allocation so that registers are allocated/loaded one instruction before they are needed.

- Option for shadow stack pointer? Of course, since the example code is not even using the stack I can't tell if you aren't already..

I can't see any advantage to this. r29 is generally used as a stack, so the example code is in fact using the stack.

And of course any other optimization plans you have, or tricks you're currently doing, I'd love to hear.

Something could probably be done to improve the floating point performance. Currently it calls libc/libm, which I assume is softvfp. However, floating point operations are typically less than 5% of the instructions in most N64 games, so it's not a showstopper.

I can't imagine this current OpenGL ES problem is going to be a huge barrier. Pandora could very well launch with good N64 emulation.

I don't think I have the correct binary blob to test hardware acceleration, so maybe someone more familiar with this can comment.

Also some of the textures are clearly not right. I don't know if this is a bug in mesa or rice video, but it doesn't happen on x86.

#32 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 30 August 2009 - 09:32 PM

So does sound emulation work, at least in theory? Or would that require a bit of work as well?

In theory. I don't have any hardware to test it, but I can't think of any reason why it wouldn't work, other than that it increases CPU usage by 10-20%.

#33 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 30 August 2009 - 09:52 PM

If you're referring to the cmp/bvc, it should look familiar; Daedalus does something very similar.


Indeed, I'm familiar with it. It's just nice that here you don't need to use a register. Was just hoping you were aware of the simple mapping, and fortunately you are.

There is a preferred mapping, which is r1->r1, r2->r2...r7->r7, r8->r0, r9->r1, etc. This helps with branches, as things are usually in the same registers. However, if it runs out of registers, it will use whatever registers are available.


Okay, so it's a kind of mix between static and dynamic. But I was referring to dynamic between blocks.

I hadn't seriously considered it, but it would be possible. The page size is 4K because that's the native page size on the r4300, although almost no N64 games actually use the MMU. (And if any do use the MMU, they probably won't work, since I haven't tested this.)


Would at least cut down the generated code size a bit. I'm always nervous about blowing away L1 icache with huge translated code blocks. Past a certain amount I'd move things to a function (probably for a store like that).

To schedule instructions this way would require an instruction-reordering pass after code generation. This could be done but would take some work. I guess the Cortex-A9 CPU will do this in hardware. It might be almost as effective to simply change the register allocation so that registers are allocated/loaded one instruction before they are needed.


Right, it's not the easiest thing in the world to do. But you'd be able to exploit parallelism a little more in general. If you can manage the register space for that then it should at least help a little, yeah.

I can't see any advantage to this. r29 is generally used as a stack, so the example code is in fact using the stack.


You're right, I missed that, because it was an add and I was looking for loads/stores. What I also mean by shadow stack pointer is to assume that the stack is always in RAM, so that loads/stores plus additions and subtractions can go to the shadow stack. So long as games don't copy from the stack pointer to other registers much (i.e., to a frame pointer) then it should be faster this way. Of course it also fails if games use the stack pointer on memory that has to be trapped, but I doubt they do.
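A rough sketch of the shadow stack pointer idea, under loud assumptions: 8 MB of RDRAM at 0x80000000, a host buffer backing it, and words stored in host byte order. All names here (`ShadowStack`, `shadow_store32`, and so on) are hypothetical. The point is that once the stack is assumed to live in RAM, stack-relative accesses skip the address check and go straight through a host pointer that is kept in step with r29.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RDRAM_BASE 0x80000000u  /* assumption: stack addresses live here */
#define RDRAM_SIZE 0x800000u    /* assumption: 8 MB of RDRAM */

typedef struct {
    uint8_t *rdram;      /* host buffer backing guest RAM */
    uint32_t guest_sp;   /* architectural r29 */
    uint8_t *shadow_sp;  /* host pointer kept in step with guest_sp */
} ShadowStack;

/* Any write to r29 re-derives the host pointer. */
static void shadow_set_sp(ShadowStack *s, uint32_t sp)
{
    assert(sp - RDRAM_BASE < RDRAM_SIZE);  /* scheme assumes the stack is in RAM */
    s->guest_sp = sp;
    s->shadow_sp = s->rdram + (sp - RDRAM_BASE);
}

/* addiu sp, sp, imm */
static void shadow_adjust(ShadowStack *s, int32_t imm)
{
    shadow_set_sp(s, s->guest_sp + (uint32_t)imm);
}

/* sw value, offset(sp): direct host store, no address check.
   The offset is not bounds-checked in this sketch. */
static void shadow_store32(ShadowStack *s, int32_t offset, uint32_t value)
{
    memcpy(s->shadow_sp + offset, &value, sizeof value);
}

/* lw value, offset(sp) */
static uint32_t shadow_load32(const ShadowStack *s, int32_t offset)
{
    uint32_t value;
    memcpy(&value, s->shadow_sp + offset, sizeof value);
    return value;
}
```

The cost shows up whenever r29 is copied into another register, because accesses through that copy still need the normal checked path.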

Something could probably be done to improve the floating point performance. Currently it calls libc/libm, which I assume is softvfp. However, floating point operations are typically less than 5% of the instructions in most N64 games, so it's not a showstopper.


Yes, definitely start outputting VFP code, if not NEON code (where possible), however..

By 5% do you mean of executable space or of code that's being run? Because if only 5% of executed code is floating point then that means the Wiz probably can do N64 emulation after all. In which case, I wonder if you'd be interested in attempting a port there too.

I don't think I have the correct binary blob to test hardware acceleration, so maybe someone more familiar with this can comment.

Also some of the textures are clearly not right. I don't know if this is a bug in mesa or rice video, but it doesn't happen on x86.


Instead, if you're interested you can test OpenGL ES on x86, using PowerVR's wrapper. I'm sure that'd make it a lot easier for someone else to port it.

#34 cowai

cowai

    hellolo.

  • GP32 Hardcore
  • 457 posts
  • Gender:Male
  • Location:Norway,
  • Interests:F5

Posted 30 August 2009 - 10:05 PM

Someone please send Ari some hardware.

#35 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 30 August 2009 - 10:45 PM

It seems to me like he already has something, although exactly what I'm not sure.

#36 Adventus

Adventus

    GP Mania

  • GP32 Hardcore
  • 460 posts
  • Gender:Male
  • Location:Canberra, Australia

Posted 30 August 2009 - 10:59 PM

Very impressive.

Are most of the floating point computations done in the graphics plugins? I had a look at rice_video and Glide64, they seem like a very good target for NEON.... most are already SSE optimized. From initial impressions I think we could port rice_video pretty easily using NanoGL. Porting Glide64 might be a bit harder since the glide wrapper uses shaders.

BTW, I'm pretty close to done on a NEON-ized softfp/hardfp math library (in my sig), however I need some hardware to test it.

Edited by Adventus, 30 August 2009 - 11:22 PM.


#37 MonkeyChops

MonkeyChops

    NO! I don't play basketball

  • GP32 Hardcore
  • 949 posts
  • Gender:Male
  • Location:OHIO
  • Interests:retro games, music, drums, beer brewing

Posted 30 August 2009 - 11:13 PM

Ari, if you don't mind me asking, how long have you been working on this project?

I think that most of us here were hoping that someone was lurking behind the scenes with an N64 project. It's awesome that you made that dream come true. You sir, are my hero. Thank you.

#38 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 31 August 2009 - 12:03 AM

Something could probably be done to improve the floating point performance. Currently it calls libc/libm, which I assume is softvfp. However, floating point operations are typically less than 5% of the instructions in most N64 games, so it's not a showstopper.


Yes, definitely start outputting VFP code, if not NEON code (where possible), however..

By 5% do you mean of executable space or of code that's being run? Because if only 5% of executed code is floating point then that means the Wiz probably can do N64 emulation after all. In which case, I wonder if you'd be interested in attempting a port there too.

Code that's being run. There's no specific profiling support in the dynarec, but I added a counter to the code generator in float_assemble() to see how many times this code got run. It was 2.02% of instructions in Super Mario 64, and 3.54% in Ocarina of Time. The latter is probably more typical; Super Mario 64 spends a lot of time in its idle loop.
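As an illustration of that kind of measurement, a minimal sketch in C. It is not the code from float_assemble(); the struct and function names are hypothetical, and it only shows the bookkeeping: bump one counter for every executed instruction, a second one for floating point instructions, and report the ratio.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t total;  /* all guest instructions executed */
    uint64_t fp;     /* floating point instructions among them */
} InsnCounters;

/* Called once per executed guest instruction. */
static void count_insn(InsnCounters *c, int is_fp)
{
    c->total++;
    if (is_fp)
        c->fp++;
}

/* Share of executed instructions that were floating point, in percent. */
static double fp_percent(const InsnCounters *c)
{
    assert(c->total > 0);
    return 100.0 * (double)c->fp / (double)c->total;
}
```

In a dynarec the increments would be emitted into the translated code rather than called from C, so the count reflects executed instructions and not just translated ones.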

The problem with Wiz isn't the floating point, it's that the CPU is slower overall, the RAM is a lot slower, and there is no L2 cache. I don't have a Wiz, but go ahead and try it if you do. You'll have to replace the UXTH instructions with ARM9 code, and you might as well get rid of the PLD instructions since these do nothing on the ARM926EJ.
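For anyone attempting the UXTH replacement, a sketch of one possible substitute, modelled in C. UXTH zero-extends the low halfword of a register; on a core without it, a shift left by 16 followed by a logical shift right by 16 gives the same result. A single AND with 0xFFFF is not an option in ARM code because that constant cannot be encoded as an immediate. The function names below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* What "uxth rd, rm" computes: zero-extend the low 16 bits. */
static uint32_t uxth(uint32_t rm)
{
    return rm & 0xFFFFu;
}

/* Shift-pair equivalent for cores without UXTH:
     mov rd, rm, lsl #16
     mov rd, rd, lsr #16 */
static uint32_t uxth_shift_pair(uint32_t rm)
{
    uint32_t rd = rm << 16;
    return rd >> 16;
}

/* Spot checks that both forms agree. */
void uxth_examples(void)
{
    assert(uxth_shift_pair(0x12345678u) == 0x5678u);
    assert(uxth_shift_pair(0xFFFF8000u) == 0x8000u);
    assert(uxth_shift_pair(0x00000000u) == uxth(0x00000000u));
    assert(uxth_shift_pair(0xFFFFFFFFu) == uxth(0xFFFFFFFFu));
}
```

The replacement costs one extra instruction per use, so it adds a little code size on top of the Wiz's other disadvantages.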

Instead, if you're interested you can test OpenGL ES on x86, using PowerVR's wrapper. I'm sure that'd make it a lot easier for someone else to port it.

I guess I'd have to do something like: mupen -> NanoGL -> wrapper -> OpenGL
I wonder if I can get that to build...

Edited by Ari64, 31 August 2009 - 12:15 AM.


#39 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 31 August 2009 - 12:18 AM

I'm aware how much Wiz sucks compared to Pandora, and for what reasons. It's just that people find N64 emulation on PSP worthwhile, even if it's far from realtime speed. If floats are not executed too often then something on Wiz can potentially be about PSP level, barring the difference a lack of VFPU makes in assisting RSP T&L + clipping.

Unfortunately I can't port anything to Wiz right now and I'm not really a porting kind of person.. but I have a good feeling someone else would be up to the task. Quite possibly Pickle?

I think that converting from OGL to OGL ES2 won't be that bad. I'm not very familiar with Rice, but I would guess that it'd already be using shaders in order to get the blend modes correct. Might just have to convert primitive state commands to VBOs. I bet Adventus would be willing to take a look at it.

By the way, how many instructions per frame are we looking at here? I take it you're doing a fixed number of cycles per instruction and it's probably something like 2 90MHz cycles, so that'd mean 45 million per second? How complex are the idle loops? Can you give any examples of what they're like?
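For scale, the estimate in the question works out as below. The 90MHz clock and 2 cycles per instruction are the guesses from the post itself; the 60 frames per second used for the per-frame figure is an extra assumption added here, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Guest instructions per second for a fixed cycles-per-instruction model. */
static uint32_t insns_per_second(uint32_t clock_hz, uint32_t cycles_per_insn)
{
    assert(cycles_per_insn > 0);
    return clock_hz / cycles_per_insn;
}

/* Guest instructions per video frame at a given frame rate. */
static uint32_t insns_per_frame(uint32_t per_second, uint32_t frames_per_second)
{
    assert(frames_per_second > 0);
    return per_second / frames_per_second;
}
```

With those inputs, `insns_per_second(90000000, 2)` gives 45,000,000, and `insns_per_frame(45000000, 60)` gives 750,000.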

Edited by Exophase, 31 August 2009 - 12:20 AM.


#40 zodttd

zodttd

    Solving your premature emulation since the Tapwave Zodiac!

  • GP Guru
  • 1151 posts

Posted 31 August 2009 - 12:43 AM

Hi Ari64!

Thank you so much for this! Amazing work! I'm looking through the sources as we speak.

I'm very much looking forward to porting this to the Apple iPhone 3GS. My port's source will be available on my github.com/zodttd.

You have no clue how happy you just made me! :)

Thanks again,
ZodTTD

#41 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 31 August 2009 - 12:59 AM

There is a preferred mapping, which is r1->r1, r2->r2...r7->r7, r8->r0, r9->r1, etc. This helps with branches, as things are usually in the same registers. However, if it runs out of registers, it will use whatever registers are available.


Okay, so it's a kind of mix between static and dynamic. But I was referring to dynamic between blocks.

How would that work?

I hadn't seriously considered it, but it would be possible. The page size is 4K because that's the native page size on the r4300, although almost no N64 games actually use the MMU. (And if any do use the MMU, they probably won't work, since I haven't tested this.)


Would at least cut down the generated code size a bit. I'm always nervous about blowing away L1 icache with huge translated code blocks. Past a certain amount I'd move things to a function (probably for a store like that).

This would probably result in a gain in speed. I'd need to write a signal 11 handler tho, and this would make debugging more complicated, so this isn't one of the first things I'd want to optimize.
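A minimal sketch of what the signal 11 approach could look like, assuming Linux-style POSIX behaviour (a write to a read-only page raises SIGSEGV with the faulting address in `si_addr`). Everything here is hypothetical: the `region` buffer stands in for guest RAM, and `watch_init`/`watch_protect` are made-up names. Pages holding translated code are made read-only; the first store to one faults, the handler marks the page dirty and unprotects it, and the store is retried when the handler returns.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_COUNT 16  /* hypothetical: pages in the watched region */

static uint8_t *region;   /* stand-in for guest RAM */
static size_t page_size;  /* host page size, not necessarily 4K */
static volatile sig_atomic_t page_dirty[PAGE_COUNT];

/* SIGSEGV (signal 11) handler: a store hit a write-protected page. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + PAGE_COUNT * page_size)
        _exit(1);  /* a genuine crash, not one of our pages */
    size_t page = (addr - base) / page_size;
    page_dirty[page] = 1;  /* translated code for this page is now stale */
    mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
    /* Returning re-executes the faulting store, which now succeeds. */
}

/* Map the region and install the handler. Returns 0 on success. */
int watch_init(void)
{
    struct sigaction sa;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, PAGE_COUNT * page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    region = p;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Write-protect one page after code in it has been translated. */
int watch_protect(size_t page)
{
    assert(page < PAGE_COUNT);
    page_dirty[page] = 0;
    return mprotect(region + page * page_size, page_size, PROT_READ);
}
```

This sketch sidesteps instruction decoding by unprotecting the page and retrying the store, which only works because the stand-in memory is mapped directly. It also leaves the page writable afterwards, so a game that keeps storing data next to code pays for one fault per re-protection rather than one per store.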

#42 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 31 August 2009 - 01:13 AM

Ari, if you don't mind me asking, how long have you been working on this project?

I think that most of us here were hoping that someone was lurking behind the scenes with an N64 project. It's awesome that you made that dream come true. You sir, are my hero. Thank you.

Around 4-5 months. I didn't want to say anything until I was sure that this would actually work, otherwise it would have just been speculation about something that might never be finished.

#43 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 31 August 2009 - 01:34 AM

How would that work?


Sorry, I can't give you any clear ideas because I haven't spec'd something like this. But with a "deep" recursive recompiler you can translate blocks linked to before translating the current block, then register allocate weighted around that. Anything you can do to relieve pressure around block transitions.. I'm sure there are a lot of schemes, but any of them sound like they'd be fairly complex.

This would probably result in a gain in speed. I'd need to write a signal 11 handler tho, and this would make debugging more complicated, so this isn't one of the first things I'd want to optimize.


*nod*

Said handler would also have to decipher the instruction. Blegh. And it wouldn't be hard for a game to run it into the ground (although it's not really hard for one now) with data accesses in the page. You'd probably have to track and patch the instruction if it gets out of hand.

I hope you're looking forward to being a big time celebrity around here, at any rate ;D Good job keeping quiet about it this long. I much prefer it when people make a substantial first showing instead of generating a ton of hype that they usually end up being incapable of delivering on.

#44 RenegadeChic

RenegadeChic

    Mega GP Mania

  • GP32 Hardcore
  • 517 posts

Posted 31 August 2009 - 01:43 AM

Around 4-5 months. I didn't want to say anything until I was sure that this would actually work, otherwise it would have just been speculation about something that might never be finished.

good lord! the way you just casually posted this topic it was like you had just had a quick look over it. major MAJOR props for all your hard work! i think you may possibly be on the way to something that pushes the interest in the pandora massively! killer app much! i bow my head to you and hope it all bears fruit

#45 greendots

greendots

    Its finally here!

  • GP32 Hardcore
  • 590 posts
  • Gender:Male

Posted 31 August 2009 - 01:54 AM

Thanks a lot for your work on this Ari64!