
  #11  
Old 22nd August 2014, 11:04 PM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

I've been learning about recompilers. I'm actually baffled at how people end up with inaccurate recompilers. It just doesn't make any sense to me. One should start with a slow and accurate recompiler and eventually tweak it for speed. In the case of PJ64's RSP, I can understand the inaccuracy, since even the interpreter wasn't super accurate itself.

I'm curious though. For android, what does it use for RSP?

At this rate, I'm almost certain you are better off doing it yourself instead of waiting for someone else to do it. In my case, it looks like no one is going to bother making or improving an RSP recompiler anytime soon, so it's up to me to make that happen. I'm already making progress.

It doesn't matter how little experience/knowledge you have with programming. Tweaking code is much easier than writing your own code. Even a noob can change numbers that end up benefiting the program. Adding a few if statements could speed up a recompiler that hasn't been well optimized.

Wow I didn't know Mupen64Plus AE was written in Java. That pretty much seals the deal. Would not even bother looking at it.

Last edited by RPGMaster; 21st September 2014 at 10:24 AM.
  #12  
Old 23rd August 2014, 12:28 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Yeah, wouldn't use MMX for RSP. I was happy using MMX with certain parts of the RDP since it had a few facilities for 64-bit pack or unpack, shuffle and other decoding processes with the DP command shuffler. With the RSP...not so much. Most vector stuff on the RSP is 128-bit (except stuff like SP DMA, SDV/LDV, few others), so it wouldn't really be "accurate" to resign oneself to MMX on the RSP.
  #13  
Old 23rd August 2014, 01:05 AM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

Quote:
Originally Posted by HatCat View Post
Yeah, wouldn't use MMX for RSP. I was happy using MMX with certain parts of the RDP since it had a few facilities for 64-bit pack or unpack, shuffle and other decoding processes with the DP command shuffler. With the RSP...not so much. Most vector stuff on the RSP is 128-bit (except stuff like SP DMA, SDV/LDV, few others), so it wouldn't really be "accurate" to resign oneself to MMX on the RSP.
Lol, after reviewing the simple instructions like ADD, XOR, etc., I briefly looked at instructions like VMULF and decided to not even bother going further. It was fun being able to contribute minor optimizations and implement a few instructions for the recompiler, like NOR, though.

I agree that there's no point in using MMX when dealing with 128-bit data. Looking at the recompiler source was definitely worthwhile, just because it made me realize how much room for improvement there is. When I looked at PJ64 1.6 and 1964's core recompiler, I thought to myself "Wow this looks complex, I doubt I can improve this". Then I looked at the RSP recompiler and was intrigued by how much simpler and more incomplete it was. I used to be so confused as to how a compiler could affect the speed of the recompiler; then I found out it still uses the interpreter.

So ya, I'm going to continue learning from your interpreter, then figure out a few things about the recompiler, and I'll be able to make my own RSP recompiler. I wonder how the speed for LLE audio will compare to HLE audio.
  #14  
Old 23rd August 2014, 01:19 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

ADD is not a simple vector instruction.

NOR isn't as simple as OR.
With vector NOR you have to do ~(VR[vs][i] | VR[vt][i]). With OR, it's just VR[vs][i] | VR[vt][i].

All it really means is you're taking 128 bits and doing a bit-wise operation with that and another 128 bits.
If you're trying to practice SSE or MMX then you're looking the wrong way. VMULF is not a simple application of SSE, and ADD, XOR and NOR are only scalar opcodes with no use for SIMD. You should be practicing on VAND, VOR, VXOR, ...
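
A minimal sketch of the OR/NOR difference described above, assuming eight 16-bit lanes per vector register and leaving out element shuffling and the accumulator writeback. The names `vor_lanes` and `vnor_lanes` are made up for illustration and do not come from any plugin source.

```c
#include <stdint.h>

#define LANES 8  /* eight 16-bit elements = 128 bits */

/* VOR: plain lane-by-lane OR. */
static void vor_lanes(uint16_t vd[LANES],
                      const uint16_t vs[LANES],
                      const uint16_t vt[LANES])
{
    for (int i = 0; i < LANES; i++)
        vd[i] = (uint16_t)(vs[i] | vt[i]);
}

/* VNOR: the same OR followed by a complement.
 * The cast matters: ~ works on a promoted int, so the result
 * has to be truncated back to 16 bits. */
static void vnor_lanes(uint16_t vd[LANES],
                       const uint16_t vs[LANES],
                       const uint16_t vt[LANES])
{
    for (int i = 0; i < LANES; i++)
        vd[i] = (uint16_t)~(vs[i] | vt[i]);
}
```

Both loops touch all 128 bits with one bit-wise operation per lane, which is why these opcodes are a natural first target for SIMD practice.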
  #15  
Old 23rd August 2014, 01:26 AM
V1del V1del is offline
Project Supporter
Senior Member
 
Join Date: Feb 2012
Posts: 442
Default

Quote:
Originally Posted by RPGMaster View Post
I'm curious though. For android, what does it use for RSP?
mupen64plus-hle-rsp

Quote:
Wow I didn't know Mupen64Plus AE was written in Java. That pretty much seals the deal. Would not even bother looking at it.
Except it isn't, it wouldn't run in any acceptable way. That's just frontend/UI code which has to be in java on android afaik. To know what it REALLY uses you should look at the JNI folder which contains the native code: https://github.com/mupen64plus-ae/mu...ree/master/jni

Specifically for the arm dynarec: https://github.com/mupen64plus-ae/mu...00/new_dynarec
  #16  
Old 23rd August 2014, 01:43 AM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

Quote:
Originally Posted by HatCat View Post
ADD is not a simple vector instruction.

NOR isn't as simple as OR.
With vector NOR you have to do ~(VR[vs][i] | VR[vt][i]). With OR, it's just VR[vs][i] | VR[vt][i].

All it really means is you're taking 128 bits and doing a bit-wise operation with that and another 128 bits.
If you're trying to practice SSE or MMX then you're looking the wrong way. VMULF is not a simple application of SSE, and ADD, XOR and NOR are only scalar opcodes with no use for SIMD. You should be practicing on VAND, VOR, VXOR, ...
I first looked at the scalar opcodes, just to see how they were implemented in the RSP recompiler. Since those were done right, accuracy-wise, I felt I should look at those first. I was going to try implementing SSE in PJ64's RSP, but I'm pretty sure they didn't arrange the data in an optimal way, unlike yours, which is done right, so I'd rather not bother doing much more with that RSP source. Now I've decided to write an RSP recompiler from scratch.

I've already practiced with instructions like VAND, VOR, VXOR with 1964 audio HLE a while back. Now I just need to understand the complex vector instructions.

I'm not solely focusing on SSE and MMX, I want to learn all of the RSP instructions so I can do a complete job.

Quote:
Originally Posted by V1del View Post
mupen64plus-hle-rsp
Oh ok.

Quote:
Originally Posted by V1del View Post
Except it isn't, it wouldn't run in any acceptable way. That's just frontend/UI code which has to be in java on android afaik. To know what it REALLY uses you should look at the JNI folder which contains the native code: https://github.com/mupen64plus-ae/mu...ree/master/jni
Lol. I had a hard time believing it was all in Java, so I looked through a few of the folders, but ironically skipped jni, I guess.

Quote:
Originally Posted by V1del View Post
Oh alright, thanks. I might take a peek out of curiosity.
  #17  
Old 23rd August 2014, 02:40 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by RPGMaster View Post
I've already practiced with instructions like VAND, VOR, VXOR with 1964 audio HLE a while back. Now I just need to understand the complex vector instructions.

I'm not solely focusing on SSE and MMX, I want to learn all of the RSP instructions so I can do a complete job.
In my next release of RSP LLE, I'm deciding to re-centralize vector shuffling to the interpreter core loop. At first I was already doing that; then I changed my mind. Now, I changed my mind again.

zilmar ESPECIALLY needs to do this. To be honest, one of zilmar's biggest screw-ups with his RSP interpreter (which, by his own claim, was meant to be "readable" and "accurate" rather than fast) was forcibly inlining his horrible shuffle code, which on top of that doesn't match the real hardware algorithm.

Because he forces shuffling to happen within EVERY vector opcode, ALL of them are annoying to read to figure out what zilmar is trying to do. Take the simple Vector AND opcode as an example from zilmar's Interpreter Ops.c:
Code:
void RSP_Vector_VAND (void) {
    int count, el, del;
    VECTOR result;

    for ( count = 0; count < 8; count ++ ){
        el = Indx[RSPOpC.rs].B[count];
        del = EleSpec[RSPOpC.rs].B[el];
        result.HW[el] = RSP_Vect[RSPOpC.rd].HW[el] & RSP_Vect[RSPOpC.rt].HW[del];
        RSP_ACCUM[el].HW[1] = result.HW[el];
    }	
    RSP_Vect[RSPOpC.sa] = result;
}
First, `Indx` should not exist. That was zilmar's LUT from like, 1999, that he wrote to transpose the vector reads to prevent premature overwriting. This confused him 13 years later into wondering why his RSP plugin didn't exhibit what he thought was supposed to be a bug.

Second, the names "el" and "del" are switched. The SOURCE element is the target vector register slice of VR[vt], and the destination is VR[vd] computed off of VR[vs].

Third, if he had just centralized this horrible word-swapping shuffle algorithm into one single function, rather than repeating it in 40+ vector opcode functions, his VAND interpreter opcode would shrink considerably. That is also how accurate RSP timing would do it: shuffle in the main vector scheduler thread rather than inside each opcode (not that making this change exactly promotes cycle-accuracy, but still, it's readability++ and filesize--). The opcode would be reduced down to only this:
Code:
void RSP_Vector_VAND (void) {
    int count;
    VECTOR result;

    for ( count = 0; count < 8; count ++ ){
        result.HW[count] = RSP_Vect[RSPOpC.rd].HW[count] & RSP_Vect[RSPOpC.rt].HW[count];
        RSP_ACCUM[count].HW[1] = result.HW[count];
    }	
    RSP_Vect[RSPOpC.sa] = result;
}
Which is WAY the fuck more readable.
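
A rough sketch of what "centralized" shuffling could look like, under some loud assumptions: lane 0 is the first element in RSP order (zilmar's `HW[]` indexing is swapped for the host, which this ignores), the 4-bit element specifier follows the usual whole / quarter / half / single-lane grouping, and every name here (`source_lane`, `shuffle_vt`, `vand_lanes`, `execute_vand`) is hypothetical rather than taken from either plugin.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Which lane of VT feeds destination lane i for element specifier e. */
static int source_lane(int e, int i)
{
    assert(e >= 0 && e < 16 && i >= 0 && i < LANES);
    if (e < 2) return i;                   /* 0-1: vector used as-is       */
    if (e < 4) return (i & ~1) | (e & 1);  /* 2-3: one lane of each pair   */
    if (e < 8) return (i & ~3) | (e & 3);  /* 4-7: one lane of each four   */
    return e & 7;                          /* 8-15: broadcast one lane     */
}

/* Runs once in the dispatcher, before any opcode handler. */
static void shuffle_vt(uint16_t out[LANES], const uint16_t vt[LANES], int e)
{
    for (int i = 0; i < LANES; i++)
        out[i] = vt[source_lane(e, i)];
}

/* The opcode itself no longer knows that shuffling exists. */
static void vand_lanes(uint16_t vd[LANES],
                       const uint16_t vs[LANES],
                       const uint16_t vt_shuffled[LANES])
{
    for (int i = 0; i < LANES; i++)
        vd[i] = (uint16_t)(vs[i] & vt_shuffled[i]);
}

/* Hypothetical dispatcher step for one VAND instruction. */
static void execute_vand(uint16_t vr[32][LANES], int vd, int vs, int vt, int e)
{
    uint16_t tmp[LANES];
    shuffle_vt(tmp, vr[vt], e);  /* copy first, so vd == vt cannot clobber unread lanes */
    vand_lanes(vr[vd], vr[vs], tmp);
}
```

Because the shuffled operand is copied before the opcode runs, the premature-overwrite problem is handled in one place instead of through a lookup table inside every handler.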

He also made that change after RSP 1.7 to use the VECTOR union type so he could say VR[vd] = result;, bragging about how it moved 128 bits at once using the C language. What he doesn't get is, he's actually multiplying the number of 16-bit moves he's doing, not reducing it. He's moving 16-bit elements into a temporary `result` union, delaying the actual writeback to VR[vd = RSPOpC.sa]. So it adds to the file size, adds to the C code, adds to the number of variables/objects declared...there is no need for a "result" temporary.

Thus, further reduced to:
Code:
void RSP_Vector_VAND (void) {
    int count;

    for ( count = 0; count < 8; count ++ ){
        RSP_Vect[RSPOpC.sa].HW[count] = RSP_Vect[RSPOpC.rd].HW[count] & RSP_Vect[RSPOpC.rt].HW[count];
        RSP_ACCUM[count].HW[1] = RSP_Vect[RSPOpC.sa].HW[count];
    }
}
Has zil's VAND function been 100% optimized yet? No, not yet, because RSP_ACCUM is really 48 bits, not 64. The low 16 bits of the 48-bit accumulator are really RSP_ACCUM[count].HW[1], i.e. bits 31..16 in his emulator. In theory it can be smaller code and faster to just say .HW[0], OR to do what Michael Tedder did in Project Unreality when he reversed the RSP, and create separate arrays of accumulator low, mid and high elements (although this sacrifices "accuracy" of the algorithm).
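
To make the "separate arrays" idea concrete, here is a minimal sketch. The `acc48` layout and the function names are assumptions for illustration only; they are not Project Unreality's or zilmar's actual definitions, and shuffling is left out.

```c
#include <stdint.h>

#define LANES 8

/* Hypothetical layout: one array per 16-bit slice of the 48-bit accumulator. */
typedef struct {
    uint16_t lo[LANES];   /* bits 15..0  */
    uint16_t mid[LANES];  /* bits 31..16 */
    uint16_t hi[LANES];   /* bits 47..32 */
} acc48;

/* VAND only writes the low slice; mid and hi are left alone. */
static void vand_acc(acc48 *acc, uint16_t vd[LANES],
                     const uint16_t vs[LANES], const uint16_t vt[LANES])
{
    for (int i = 0; i < LANES; i++) {
        acc->lo[i] = (uint16_t)(vs[i] & vt[i]);
        vd[i] = acc->lo[i];
    }
}

/* Reassemble one lane as a 48-bit value when the whole thing is needed. */
static uint64_t acc_lane(const acc48 *acc, int i)
{
    return ((uint64_t)acc->hi[i] << 32)
         | ((uint64_t)acc->mid[i] << 16)
         |  (uint64_t)acc->lo[i];
}
```

With this layout each slice is a contiguous run of eight 16-bit values, so a logical opcode writes one 128-bit block instead of striding through eight 64-bit entries.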

Finally, it shouldn't be:
Code:
void RSP_Vector_VAND (void) {
It should really be:
Code:
#ifdef ARCH_MIN_SSE2
typedef __m128i v128;
#else
typedef short * v128;
#endif

void RSP_Vector_VAND (v128 vd, v128 vs, v128 vt) {
Since, in SSE2 __m128i type, no pushes/pops happen with interpreter functions. It's just xmm0, xmm1, xmm2 being directly written in the main interpreter loop, which eliminates some register decode time.
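
A hedged sketch of the by-value idea. It differs from the three-argument signature above by returning the result instead of taking `vd`, it keys off the compiler's own `__SSE2__` macro rather than `ARCH_MIN_SSE2`, and the helper names (`v128_load`, `v128_store`, `vand128`) are invented here. Whether the operands really stay in xmm registers depends on the compiler and calling convention.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>

typedef __m128i v128;

static v128 v128_load(const uint16_t lanes[8])
{
    return _mm_loadu_si128((const __m128i *)lanes);
}

static void v128_store(uint16_t lanes[8], v128 v)
{
    _mm_storeu_si128((__m128i *)lanes, v);
}

/* One PAND covers all eight 16-bit lanes. */
static v128 vand128(v128 vs, v128 vt)
{
    return _mm_and_si128(vs, vt);
}

#else  /* portable fallback with the same interface */

typedef struct { uint16_t lane[8]; } v128;

static v128 v128_load(const uint16_t lanes[8])
{
    v128 v;
    memcpy(v.lane, lanes, sizeof v.lane);
    return v;
}

static void v128_store(uint16_t lanes[8], v128 v)
{
    memcpy(lanes, v.lane, sizeof v.lane);
}

static v128 vand128(v128 vs, v128 vt)
{
    for (int i = 0; i < 8; i++)
        vs.lane[i] &= vt.lane[i];
    return vs;
}

#endif
```

A caller would do something like `v128_store(vd, vand128(v128_load(vs), v128_load(vt)));`, and an interpreter loop could keep the loaded values around between opcodes.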

And then there's more shit I could go on about.

I'm not telling you all this to bore you though. I'm telling you because you're learning about RSP recompiler from Jabo's MMX recompiler ... well the unfortunate thing is it was based on zilmar's interpreter, so you might have to understand both sides and how the interpreter really could have been different, and simpler.
  #18  
Old 23rd August 2014, 03:12 AM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

I'm actually really interested in the RSP right now. I honestly can't stand looking at some of those vector instructions in PJ64's RSP.

I was thinking that using 3 arrays would be great for speed. That was one of the reasons I didn't want to use PJ64's rsp as a base.

Now that I have a basic understanding of how recompilers work, I'm going to need to fully understand how the interpreter works before I begin the recompiler plugin. At this point, I wouldn't even benefit from looking at that MMX code in the recompiler. Reading your interpreter source and figuring out the SSE code generated is far more useful. Once I know exactly what each instruction does, I'll be able to make my own SSE recompiler implementation.

Lol one weird thing about zilmar's RSP is that it assigns addresses to a function pointer array in the BuildInterpreterCPU and BuildRecompilerCPU functions. Is there any reason to assign the addresses at runtime? It just seems like overhead, to me.
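
For comparison, a table filled in by a static initializer might look like the sketch below, so no build step runs before the first dispatch. Everything in it is hypothetical (handler names, slot numbers, the `last_op` marker); it is not how either Build function is actually written. If those functions only run once at startup, the runtime cost being asked about would be small either way.

```c
#include <stddef.h>

typedef void (*op_handler)(void);

static int last_op;  /* only here so the demo handlers do something observable */

static void op_vand(void)     { last_op = 1; }
static void op_vor(void)      { last_op = 2; }
static void op_reserved(void) { last_op = -1; }

/* Slot numbers are illustrative. */
enum { FUNCT_VAND = 0x28, FUNCT_VOR = 0x2A, FUNCT_COUNT = 64 };

/* C99 designated initializers: unlisted slots are NULL, and the whole
 * table is constant data laid out by the compiler. */
static const op_handler vector_ops[FUNCT_COUNT] = {
    [FUNCT_VAND] = op_vand,
    [FUNCT_VOR]  = op_vor,
};

static void dispatch_vector(unsigned funct)
{
    op_handler h = vector_ops[funct & (FUNCT_COUNT - 1)];
    (h != NULL ? h : op_reserved)();
}
```

One trade-off of the static form is that swapping handlers at runtime (for example between two CPU cores) needs a second table or a non-const one.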
  #19  
Old 17th November 2014, 08:53 AM
retroben retroben is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jul 2013
Posts: 687
Default

Gotta bring this up again, since porting just became a lot less difficult, if I understand correctly.
PJ64 primarily uses the .NET Framework, right?

Microsoft recently made the .NET Framework open source and is getting into stuff for other platforms, including Android!
I think Visual Studio or Visual C (might be C# instead) is in the picture as well.


At least there already is so much progress with Mupen64Plus AE since DK64 collision has been completely fixed and Banjo-Tooie is now crashless for me after play-testing it for twelve hours straight! (got to just before fighting Weldar)

I wonder if PCSX2 is now more capable of being ported, even if it is incredibly slow.
(Dolphin Emulator for Gamecube AND Wii got ported to Android)
  #20  
Old 17th November 2014, 02:57 PM
V1del V1del is offline
Project Supporter
Senior Member
 
Join Date: Feb 2012
Posts: 442
Default

It doesn't. So no this doesn't help at all. And yeah the .NET framework implies C# which again doesn't matter at all in this context.

This doesn't help PCSX2 at all. PCSX2 is maybe even worse off than PJ64: it doesn't run on anything except x86 (not even x64), and the core is so dependent on the architecture that it would need a rewrite to even start to think about running on ARM.

You can't compare dolphin and pcsx2, this won't do you any good (don't ever do this with any emulator of different machines anyway). The reason that dolphin runs is

a) It was designed to be portable from the getgo (relatively small, but substantial nonetheless, part of why it works)

b) It has at least 2 developers who are pretty much solely dedicated to making it work on Android (the biggest part, and something PJ64 and PCSX2 lack; for the reason that those 2 are even there, see a) )

Last edited by V1del; 17th November 2014 at 03:02 PM.