Mupen64Plus


884 replies to this topic

#61 Adventus

Adventus

    GP Mania

  • GP32 Hardcore
  • 460 posts
  • Gender:Male
  • Location:Canberra, Australia

Posted 04 June 2010 - 01:24 AM

There is no copy to screen part with something rendered with OGL ES 2, or if there is the drivers are not very well done at all. Should be rendered straight to a flipped framebuffer.

Is this true with X11? If you want to upscale the rendering I do render to a framebuffer, which is not ideal, but you have to enable this in the configs (it's not on by default).

Granted, removing discards just lightens SGX load which usually isn't a problem, but you did mention Banjo Kazooie being render limited.

Yeah, I have a config option to disable alpha testing. It has a measurable effect but it's not huge.

I would be curious to see just how many unique shaders are present among the 80 shader changes you've recorded for OoT.

I'm implementing this test at the moment: every shader change increments that shader's usage count in the output. I know that well-written GL drivers have some sort of state cache, so that flipping regularly between just a few states doesn't incur much of a performance hit.
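A rough sketch of the counting test described above, written in Python for brevity rather than in the plugin's own language. The `ShaderUsageCounter` class and the string shader keys are hypothetical; a real plugin would key on whatever identifies a compiled program.

```python
from collections import Counter


class ShaderUsageCounter:
    """Counts shader changes and how often each shader is switched to."""

    def __init__(self):
        self.current = None      # shader bound right now
        self.changes = 0         # total number of shader changes
        self.usage = Counter()   # per-shader count of switches to it

    def bind(self, shader_key):
        # Re-binding the shader that is already active is not a change.
        if shader_key == self.current:
            return False
        self.current = shader_key
        self.changes += 1
        self.usage[shader_key] += 1
        return True

    def unique_shaders(self):
        return len(self.usage)


# Made-up bind sequence standing in for one frame of rendering.
counter = ShaderUsageCounter()
for key in ["fog", "tex", "tex", "fog", "env", "fog"]:
    counter.bind(key)
```

Comparing `counter.changes` with `counter.unique_shaders()` gives the ratio asked about: many changes spread over only a few unique shaders would suggest that sorting or batching draws by shader is worth trying.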

#62 Exophase

Exophase

    Exophase is bad. Nothing good will ever come of him.

  • GP Guru
  • 5463 posts
  • Location:Cleveland OH

Posted 04 June 2010 - 01:59 AM

Is this true with X11? If you want to upscale the rendering I do render to a framebuffer, which is not ideal, but you have to enable this in the configs (it's not on by default).


I don't know if it's true; it probably depends on the X11 drivers. I would assume that if they're worth anything then it's rendering straight to the screen with a render mask for windows. Or at least it should have the whole screen when in fullscreen mode, which it normally is in, right?

Yeah, I have a config option to disable alpha testing. It has a measurable effect but it's not huge.


Does it only improve fill-limited games, or does it actually win CPU? Can you tell?

I'm implementing this test at the moment: every shader change increments that shader's usage count in the output. I know that well-written GL drivers have some sort of state cache, so that flipping regularly between just a few states doesn't incur much of a performance hit.


State caching should only apply to reads, not writes. I.e., if you read state it can return it from the cache instead of querying hardware. All writes would have to take effect without any very useful reordering, although I suppose things could be buffered so you can go on doing non-GPU things afterwards. But that's probably not usually how the calls go.

I've actually been pondering the exact nature of all the wasted CPU time for shader changes.

Let's say within a tile you have two polygons using different shaders. To make things work without major stalls there would have to be shader IDs for each pixel since you constantly weave between topmost visible polygons, rather than an IMR which draws an entire polygon contiguously. There would also have to be very efficient context switches. IMG claims zero-overhead switches, but this may not apply to switches between different shaders - the fetches may be synchronous for all threads on a USSE core.

So if it doesn't support this what does it mean for a shader change? The only alternative that comes to mind is performing a separate pass, or a "resolve" in AMD lingo, for each shader change. This would require synchronization between GPU and CPU and would diminish the benefits of a TBDR. A bit less so if the hardware can queue shader changes per tile then perform multi-pass within the tile, without having to go to main memory or synchronize with the CPU. Then that should just cost fillrate.

I guess some of this can be observed by seeing if the shader change causes a big stall immediately or just adds to the flush time.
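One way that observation could be set up, as a sketch only: time the shader change and the following flush separately. The function name and its parameters are hypothetical; in a real test the two callables would wrap the actual GL calls (for example `glUseProgram` and `glFinish`), and the clock is injectable so the measurement logic can be checked without a GPU.

```python
import time


def time_change_and_flush(change_shader, flush, clock=time.perf_counter):
    """Return (seconds spent in change_shader, seconds spent in flush)."""
    t0 = clock()
    change_shader()   # a big number here suggests an immediate stall
    t1 = clock()
    flush()           # a big number here suggests the cost is deferred
    t2 = clock()
    return t1 - t0, t2 - t1


# Dry run with a fake clock and no-op callables.
fake_ticks = iter([0.0, 1.0, 4.0])
change_time, flush_time = time_change_and_flush(
    lambda: None, lambda: None, clock=lambda: next(fake_ticks)
)
```

Running it once with a shader change and once without, over the same draw calls, would show whether the extra time lands in the change itself or in the flush.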

#63 silver

silver

    GP32 Hardcore

  • Members
  • 149 posts

Posted 07 June 2010 - 02:25 PM

Here's me profiling LoZ:OOT over 10 sec



Forgot to ask earlier re your profiling results - could you paste up a profile of everything when mupen64p is running? I've seen ED post saying "sound is the bottleneck in N64" ( http://www.gp32x.com...post__p__869056 ) and I was curious to know the overall breakdown....

Edited by silver, 07 June 2010 - 02:26 PM.


#64 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 07 June 2010 - 03:01 PM

I've seen ED post saying "sound is the bottleneck in N64" ( http://www.gp32x.com...post__p__869056 ) and I was curious to know the overall breakdown....

Sound isn't really the bottleneck; however, at high frameskip the emulator will block on sound output, meaning the CPU usage will be less than 100% because it will be waiting on sound output rather than rendering more frames. It's something to look into if we ever get around to doing an auto-frameskip option. Ideally the rendering would happen in a separate thread so it can render as many frames as possible without waiting on the sound output.

#65 silver

silver

    GP32 Hardcore

  • Members
  • 149 posts

Posted 07 June 2010 - 04:19 PM

Sound isn't really the bottleneck, however at high frameskip, the emulator will block on sound output, meaning the cpu usage will be less than 100% because it will be waiting on sound output rather than rendering more frames.


Thanks for the clarification. I'll have a play with profiling when I get my 'dora, just curious mainly.

Did you ever get a Dev Pandora, by the way? (hope so)

#66 HackModford

HackModford

    Mega GP Mania

  • GP32 Hardcore
  • 779 posts
  • Gender:Male

Posted 07 June 2010 - 04:22 PM

Is there a pnd of this? If not, how can I beta test this?

#67 Pickle

Pickle

    Mega GP Mania

  • X-treme Team
  • 4074 posts
  • Gender:Male
  • Location:Detroit, Michigan

Posted 07 June 2010 - 04:30 PM

Is there a pnd of this? If not, how can I beta test this?


Read the first post.

Run it in the terminal window

#68 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 07 June 2010 - 05:04 PM

Did you ever get a Dev Pandora, by the way? (hope so)

Nope. Still waiting for my Pandora.

#69 silver

silver

    GP32 Hardcore

  • Members
  • 149 posts

Posted 08 June 2010 - 08:55 AM


Did you ever get a Dev Pandora, by the way? (hope so)

Nope. Still waiting for my Pandora.


Ouch, sorry to hear that. Thought you had been offered a dev unit after knocking out your ARM recompiler... have you also ordered a pandora?


(EDIT: Completely OT: reading about the ARM port of MAME, and its lack of a MIPS dynamic recompiler for ARM - how portable is your recompiler to MAME? That has an R4000 MIPS driver (and an R3000/R5000 - not sure if they are separate drivers). Just curious, and don't want to derail this thread....)

Edited by silver, 08 June 2010 - 09:22 AM.


#70 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 08 June 2010 - 01:17 PM

Ouch, sorry to hear that. Thought you had been offered a dev unit after knocking out your ARM recompiler... have you also ordered a pandora?

Nope, and I really didn't need a pandora to work on the recompiler, just a device with the same CPU (beagleboard). But since I don't have a Pandora, I haven't been able to test anything related to the OS or controls.

I did order a Pandora, and I assume my order would be within the first 1500 boards already produced, but there is this little problem where they only have 1000 cases.

(EDIT: Completely OT: reading about the ARM port of MAME, and its lack of a MIPS dynamic recompiler for ARM - how portable is your recompiler to MAME? That has an R4000 MIPS driver (and an R3000/R5000 - not sure if they are separate drivers). Just curious, and don't want to derail this thread....)

I haven't looked at the MAME code, so I don't know about that. The main difference between R3000 and R4000 is the load delay slots. I know that PSX emulators generally do try to emulate the load delay slots accurately, so I assume that this is necessary for PSX games. N64 doesn't need this, and it's kind of a pain to do it so I haven't tried.
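To make the difference concrete, here is a toy interpreter sketch in Python. The tuple instruction format, the `run` function and the `load_delay` flag are all made up for illustration and are not taken from any emulator. With `load_delay=True` a loaded value only becomes visible one instruction after the load (R3000 style); with `load_delay=False` it is visible immediately (R4000 style). Corner cases, such as the delay slot instruction writing the same register as the load, are not modelled.

```python
def run(program, regs, memory, load_delay=True):
    """Interpret a toy MIPS subset and return the final register file.

    program entries: ("lw", rt, base, offset) or ("addu", rd, rs, rt)
    """
    regs = dict(regs)   # work on a copy
    pending = None      # (register, value) of a load not yet visible
    for op in program:
        committing, pending = pending, None
        if op[0] == "lw":
            _, rt, base, offset = op
            value = memory[regs[base] + offset]
            if load_delay:
                pending = (rt, value)   # visible after the next instruction
            else:
                regs[rt] = value        # visible immediately
        elif op[0] == "addu":
            _, rd, rs, rt = op
            regs[rd] = (regs[rs] + regs[rt]) & 0xFFFFFFFF
        else:
            raise ValueError(f"unsupported op: {op[0]}")
        # The previous load lands only after this instruction read its operands.
        if committing is not None:
            regs[committing[0]] = committing[1]
    if pending is not None:
        regs[pending[0]] = pending[1]
    return regs


program = [("lw", "a0", "a1", 0), ("addu", "a3", "a0", "zero")]
regs = {"zero": 0, "a0": 1, "a1": 0x100, "a3": 0}
memory = {0x100: 42}

r3000 = run(program, regs, memory, load_delay=True)
r4000 = run(program, regs, memory, load_delay=False)
```

In the delayed run the `addu` sees the old `a0`, so `a3` differs between the two results even though `a0` ends up the same. That extra bookkeeping per load is what makes accurate emulation a pain.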

#71 silver

silver

    GP32 Hardcore

  • Members
  • 149 posts

Posted 08 June 2010 - 01:50 PM

I did order a Pandora, and I assume my order would be within the first 1500 boards already produced, but there is this little problem where they only have 1000 cases.


Aha, yes join the club - I thought I'd be safely in the first lot, although it appears queue numbers did not include overseas orders... Still, we've had plenty of practice waiting...

I haven't looked at the MAME code, so I don't know about that. The main difference between R3000 and R4000 is the load delay slots. I know that PSX emulators generally do try to emulate the load delay slots accurately, so I assume that this is necessary for PSX games. N64 doesn't need this, and it's kind of a pain to do it so I haven't tried.


Thanks for that - realise it sounded like a "can you please port" question, which wasn't the intention - I've been meaning to get properly into the Mame source for years (I fixed up some basic lightgun code years ago, but then got distracted by x86 assembler hacking up ancient dos driver binaries on - ironically - certain N64 backup devices... )

Anyway, apologies for the OT post...

#72 hlide

hlide

    GP32 Hardcore

  • GP32 Hardcore
  • 225 posts

Posted 08 June 2010 - 02:20 PM

I haven't looked at the MAME code, so I don't know about that. The main difference between R3000 and R4000 is the load delay slots. I know that PSX emulators generally do try to emulate the load delay slots accurately, so I assume that this is necessary for PSX games. N64 doesn't need this, and it's kind of a pain to do it so I haven't tried.

As far as I saw in most PSX emulator source, there is nothing specific to do for a load delay slot. You seem to imply a load delay slot may work like a branch delay slot, which I'm not sure about. Normally games will add a NOP right after the load if they need to read its target register, because otherwise the result is totally unpredictable. I'm pretty sure the behavior of " lw $a0, off16($a1); addu $a3, $a0, 0 " is considered undetermined, so "forbidden". If so, there is no point in handling such a case.

Even if it could be determined in the same way as a branch delay slot, you probably just need to swap the load instruction and the load delay slot instruction before recompiling, whenever the second instruction uses the first's rt register as its rs, rt or rd. Or, in any case, just interpret the load delay slot instruction before the load instruction.
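The swap idea could look roughly like this pre-pass, sketched in Python over a made-up tuple format (`("lw", rt, base, offset)`, `("addu", rd, rs, rt)`). The function name is hypothetical. It deliberately skips the swap when the second instruction is itself a load or writes the load's base register, and it ignores branches and other corner cases entirely.

```python
def reorder_load_delay(program):
    """Swap a load with the following instruction when that instruction
    mentions the load's target register, so that a recompiler without
    delay-slot support would still feed it the old register value."""
    out = list(program)
    i = 0
    while i + 1 < len(out):
        first, second = out[i], out[i + 1]
        depends = first[0] == "lw" and first[1] in second[1:]
        # Unsafe cases: second is a load too, or it overwrites the base register.
        safe = second[0] != "lw" and second[1] != first[2]
        if depends and safe:
            out[i], out[i + 1] = second, first
            i += 2   # skip past the pair that was just swapped
        else:
            i += 1
    return out


dependent = [("lw", "a0", "a1", 0), ("addu", "a3", "a0", "zero")]
independent = [("lw", "a0", "a1", 0), ("addu", "a3", "a2", "zero")]

swapped = reorder_load_delay(dependent)
untouched = reorder_load_delay(independent)
```

Only the dependent pair is reordered; code that already has a NOP or an unrelated instruction after the load is left alone.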

Edited by hlide, 08 June 2010 - 02:29 PM.


#73 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 08 June 2010 - 03:07 PM

I've been meaning to get properly into the Mame source for years (I fixed up some basic lightgun code years ago, but then got distracted by x86 assembler hacking up ancient dos driver binaries on - ironically - certain N64 backup devices... )

Yeah, the parallel port interface of the v64jr is a pain in the butt. My only computer that has a parallel port is ten years old and doesn't work well anymore.

I'm pretty sure the behavior of " lw $a0, off16($a1); addu $a3, $a0, 0 " is considered as undetermined, so "forbidden". If so, there is no point to handle such a case.

Well, from looking at the pcsx source code, it seems that someone went to a lot of trouble to handle exactly that case, so I can only assume there must be psx games that depend on it.

#74 hlide

hlide

    GP32 Hardcore

  • GP32 Hardcore
  • 225 posts

Posted 08 June 2010 - 03:45 PM

I've been meaning to get properly into the Mame source for years (I fixed up some basic lightgun code years ago, but then got distracted by x86 assembler hacking up ancient dos driver binaries on - ironically - certain N64 backup devices... )

Yeah, the parallel port interface of the v64jr is a pain in the butt. My only computer that has a parallel port is ten years old and doesn't work well anymore.

I'm pretty sure the behavior of " lw $a0, off16($a1); addu $a3, $a0, 0 " is considered as undetermined, so "forbidden". If so, there is no point to handle such a case.

Well, from looking at the pcsx source code, it seems that someone went to a lot of trouble to handle exactly that case, so I can only assume there must be psx games that depend on it.

Hummm, let us say they need to do so.

- Is it possible to find a branch instruction in the load delay slot? If so, nightmare in sight for emulation, I agree.
- Is it possible to find a load instruction in the load delay slot ? if so, what about the delay slot of the loading instruction ?

EDIT: i'm looking at their source...
EDIT2: it looks a little bit academic though
EDIT3:

huh... "0: JAL 9f; 1: LW $A0, 0($A1); 2: MOVE $A2, $A0; ...; 9: JAL PRINTINT; 10: MOVE $A0, $A2" will give us this execution, if I'm following their code:

0: JAL 9f ---> first run branch delay slot : "LW $A0, 0($A1)" then jump to 9
1: LW $A0, 0($A1) ---> first run load delay slot : "MOVE $A2, $A0"
2: MOVE $A2, $A0 ---> keep old value of $A0 in $A2 before loading
1: LW $A0, 0($A1) ---> $A0 contains a new value before jumping
0: JAL 9f --> jump to 9 now
9: ...

oh wait, how is it possible ? this order cannot be done in a pipeline !?

i would expect in the pipeline :

[0 --> 1 --> 9] and not [0 --> 1 --> 2 --> 9]. How is it possible to insert 2 into [0 -> 1 -> 9] ?

I probably miss something.

Edited by hlide, 08 June 2010 - 04:08 PM.


#75 Ari64

Ari64

    Magic Emulator Fairy

  • GP32 Hardcore
  • 477 posts

Posted 09 June 2010 - 01:28 PM

- Is it possible to find a branch instruction in the load delay slot? If so, nightmare in sight for emulation, I agree.

In this case you need to use the old value for the branch condition, which is not too hard. The more difficult case is where a load occurs in the branch delay slot.

- Is it possible to find a load instruction in the load delay slot ? if so, what about the delay slot of the loading instruction ?

The loads are pipelined and don't interfere with each other.

i would expect in the pipeline :

[0 --> 1 --> 9] and not [0 --> 1 --> 2 --> 9]. How is it possible to insert 2 into [0 -> 1 -> 9] ?

I probably miss something.

It should be 0, 1, 9. Basically what happens is that during the branch it calls psxDelayTest/psxTestLoadDelay to figure out what to do with the next instruction.

I wonder if this is really necessary. You could try removing it and seeing how many games still work. If most do then this might be an option for speeding up psx4all.
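For reference, the 0, 1, 9 order falls out of a plain branch delay slot model. The sketch below is a toy in Python, not pcsx code: `execution_order` and the instruction tuples are made up, and a jump sitting in a delay slot is not modelled.

```python
def execution_order(program, steps, pc=0):
    """Return the addresses executed, honouring branch delay slots."""
    order = []
    pending_target = None   # jump target to take after the delay slot
    for _ in range(steps):
        order.append(pc)
        op = program[pc]
        # If the previous instruction was a jump, this one is its delay
        # slot and control transfers to the target afterwards.
        next_pc = pc + 1 if pending_target is None else pending_target
        pending_target = None
        if op[0] == "jal":
            pending_target = op[1]
        pc = next_pc
    return order


# Addresses mirror the numbered listing in the quoted example.
program = {
    0: ("jal", 9),
    1: ("lw",),
    2: ("move",),
    9: ("jal", 20),
    10: ("move",),
}
order = execution_order(program, steps=3)
```

Instruction 2 never runs on this path, so the only question left for the emulator is which value of `$A0` the code at 9 and 10 observes.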