  #391  
Old 29th August 2013, 02:09 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by BatCat View Post
I gave the file I/O crazy idea a second try and somehow it's no longer unpredictable, at least not under Mupen64 CPU interpreter. ;P
Errr... minor detail I forgot to mention: you have to use the (pure) interpreter CPU to make this technique work.
  #392  
Old 29th August 2013, 06:36 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Maybe, maybe not.

Usually in Mupen64 0.5.1 the regular interpreter is more stable than the pure interpreter.

I don't know if that's a bug in the RCP feedback from the CPU under LLE, perhaps with the regular interpreter sending wrong information and making games falsely appear stable.

But otherwise, several ROMs (like Mario64, Action Replay Pro) won't even boot on the "Pure Interpreter" mode (or in Mario's case, get past the logo), only the regular "Interpreter" mode or sometimes of course, the dynarec (but not for Action Replay Pro).

These things may have been fixed at some point during Mupen64Plus development, but I am not really interested in Mupen64Plus. ShadowPrince's port of Hacktarux's original emulator has always been sufficient for my needs.
  #393  
Old 1st September 2013, 06:39 PM
shunyuan's Avatar
shunyuan shunyuan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Apr 2013
Posts: 491
Default

Send you some motivation to speed up bug fixing.
__________________
---------------------
CPU: Intel U7300 1.3 GHz
GPU: Mobile Intel 4 Series (on board)
AUDIO: Realtek HD Audio (on board)
RAM: 4 GB
OS: Windows 7 - 32 bit
  #394  
Old 1st September 2013, 07:27 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

This accumulation was helpful, although unnecessary.

I actually got so tired of endlessly guessing at the problem that I decided to watch some shit.
Not the kind that you linked, but I took a break from bug-fixing and this RSP emulator, and watched all 66 episodes of the animated Teen Titans.

It only took 3, maybe 4 days to watch them all, but I guess I am ready to go back to work now. :P
  #395  
Old 2nd September 2013, 07:15 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Just what I was afraid of at first!
I had a bug in my debugger.

I was so amazed at how lucky I felt to get MM's file I/O debugger idea working on the very first try (I had to guess a lot of things related to argument vector endianness between imports/exports) that I forgot it wouldn't necessarily be perfect.

No, still haven't committed anything to Git because I hate this stupid debugger and I want this to be a one-time deal.

Anyway, Super Mario 64 had these results for the very first sync fail:
Code:
task number:  1
PC offset:  0x7D4
count:  70
SR[23] = 0xFFFFFEB8
wrong :  0x00227000
Now this is a much more believable outcome!
I ran this with HLE audio so audio ucodes wouldn't get in the way.
Mario64 triggers the bug in this plugin right at the very start, which is why I picked it.

It means, in order of listing:
  • The very first task ever executed on the RSP at game boot compares successfully, but not the next one straight after that.
  • Instruction slot at this PC displacement: 0x04001000 | 0x7D4, i.e. offset 0x7D4 into SP IMEM.
  • 70 total instructions executed so far (in this task only).
  • RSP scalar GPR no. 23 is where the desync shows up: the correct result is -328 (0xFFFFFEB8), but the unstable beta produced 0x00227000 instead.

So tomorrow (EDIT: today, when I wake up :P) I can jump into the RSP disassembler for my current unstable beta, step it 70 times (70 total instructions executed), compare all the results, and find out which instruction had just executed when the bug was raised.

Also, I am no longer using the prototype of zilmar's 1.4 RSP plugin as the stable comparison base; I plugged in my newer RSP public release 4 source code from this very thread instead, to guarantee an accurate RSP comparison. ;P

Last edited by HatCat; 2nd September 2013 at 07:18 AM.
  #396  
Old 3rd September 2013, 02:03 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Caught the bug finally.
I find this hard to believe even now, but it really was a bug with LBU/LHU.

I only find it hard to believe because I could have sworn the new zero-extension algorithm wasn't added until after I had already started fixing the pre-existing bugs.

Anyway, to be more specific:
Code:
...
1FC 304200FE ANDI    $2, $2, 0x00FE
200 84420076 LH      $2, 0x076($2)
204 00400008 JR      $2
208 9361FFFF LBU     $at, -1($27) # DMEM[0x6B4]:  0x80
288 9361FFFB LBU     $at, -5($27) # DMEM[0x6B0]:  0x06
...
SP data cache stores 0x80 at cache byte 0x6B4 on both unstable beta and old stable version.
The difference was that the latest release in this thread wrote it out to the assembler temporary as 0x00000080, whereas my unstable beta was writing it as 0x00000000.

Reason why:
The zero-extension macro I wrote zero-extended it by the little-endian bit number, using this purposefully slower and software-forced mechanism:
Code:
#define ZE(x, b)    (-(x & (0 << b)) | (x & ~(~0 << b)))
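The mask ~(~0 << b) only keeps the bits below bit b, and the first term, which was supposed to carry bit b itself, is masked with (0 << b) and so is always zero. If I had shifted ~0 by b + 1 instead of b, the MSB of the intended width would have been preserved.
Instead, the macro flushed that bit accidentally, so 0x80 came out as 0x00.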

Rather than directly fix it I just changed to use type conversion instead of the ZE macro, though I rewrote it anyway.
Code:
#define ZE(x, b)    (+(x & (1 << b)) | (x & ~(~0 << b)))
Again, I know that's slow and overly forceful. I made it that way on purpose.

Anyway, all the games are booting without issues now.
It appears I'm back to the way I was.
I just need to work out how I am going to test speeds now with MarathonMan's advice.

Last edited by HatCat; 3rd September 2013 at 02:09 AM.
  #397  
Old 3rd September 2013, 03:47 AM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by BatCat View Post
Caught the bug finally.
I find this hard to believe even now, but it really was a bug with LBU/LHU
Sucks, doesn't it?

I actually did the opposite (kind of) and didn't sign-extend LH and it took me a long time to find.
  #398  
Old 3rd September 2013, 03:56 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Current speed tests are the same as they were when I posted them for Conker's.

Code:
new:  57-65 VI/s
old:  58-66 VI/s
for that motionless File Select screen in Mario64

It's still just 1 VI/s slower, all the time, consistently it appears.

I just need to switch some shit back on, and I am pretty confident it will be faster than the old version.
(Maybe not the LH/LW/SH/SW if/else-if trees for trying to write multiple bytes at once though...this branch tree checking seems to make it an extra VI/s slower rather than faster unfortunately.)

Quote:
Originally Posted by MarathonMan View Post
Sucks, doesn't it?

I actually did the opposite (kind of) and didn't sign-extend LH and it took me a long time to find.
Due to my habit of constantly using plain C keywords (signed/unsigned followed by short/int all the time, unless the signed-ness is irrelevant to the algorithm) that would have been an extremely easy bug for me to catch.

I always knew to sign them, but I had an earlier bug with the offsets. For LSV, LLV, LDV etc., the offset used to compute the address is multiplied by 2, 4 or 8, and I falsely assumed you're supposed to follow the same pattern for LH/LW.

I'm still amazed my bug was with zero-extension though.
What a simple thing. All you do is & 0xFF or & 0xFFFF.
I added that ZE() macro while I was already searching for the bug, so it seemed impossible to me that it had anything to do with that.

Proof speaks otherwise though. :/
  #399  
Old 4th September 2013, 05:03 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

More speed gains along the way!

In the meantime, I have made a discovery.

The union-indexing service for decoding the 6-bit MIPS opcode:
Code:
            EX_SCALAR[inst.J.op][inst.W>>sub_op_table[inst.J.op] & 077]();
...you would think is already optimized.

After all, if you use a union instead of trying to hardcode all the bit masks and shifts yourself, the compiler can figure out shortcuts for you, right?

Well, even though the answer is yes, it seems that factor may sometimes be outweighed.
Code:
            EX_SCALAR[inst.W >> 26][inst.W>>sub_op_table[inst.W >> 26] & 077]();
I just thought of doing this as an experiment, to reduce the risk of extra stack traffic in case the compiler treats the union members as extra variables.
[(inst.W >> 26) is the same value as (inst.R.op, inst.I.op, inst.J.op).]

It is fortunate that I tried this experiment.
The second method, while a little bit more arrogantly written (rather than letting the compiler *virtually* do the shift by 26 for me), trims a little bit of excess fetching off the end computation.

AT&T x86 code output for the faster, second way to write it:
Code:
L1055:
    movl    $0, _SR
    movl    _inst, %eax
    movl    %eax, %edx
    shrl    $26, %edx
    movl    _sub_op_table(,%edx,4), %ecx
    shrl    %cl, %eax
    andl    $63, %eax
    sall    $6, %edx
    addl    %edx, %eax
    call    *_EX_SCALAR(,%eax,4)
For the previous, slower way of writing it:
Code:
L1055:
    movl    $0, _SR
    movb    _inst+3, %al
    shrb    $2, %al
    movzbl  %al, %edx
    movl    _sub_op_table(,%edx,4), %ecx
    movl    _inst, %eax
    shrl    %cl, %eax
    andl    $63, %eax
    sall    $6, %edx
    addl    %edx, %eax
    call    *_EX_SCALAR(,%eax,4)
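Both code outputs are identical except for the instructions that load `_inst` and extract the opcode (highlighted in red in the original post).

In short, the latter output block:
  • Relies on slower 8-bit partial-register operations (`movb`, `shrb`, `movzbl`), instead of complete 32-bit registers.
  • Has an extra memory read of `_inst` after the sub-op-code shifter table lookup (the `movl _inst, %eax`), because the first access only fetched one byte.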
I gained a similar simplification of the assembly by changing the LWC2/SWC2 group I vector transfers (L/S BV,SV,LV,DV) from this:
Code:
void LS_Group_I(int direction, int length)
{ /* Group I vector loads and stores, as defined in SGI's patent. */
    register unsigned long addr;
    register int i;
    register int e = (inst.R.sa >> 1) & 0xF;
    const signed int offset = -(inst.SW & 0x00000040) | inst.R.func;

    addr = (SR[inst.R.rs] + length*offset);
    if (direction == 0) /* "Load %s to Vector Unit" */
        for (i = 0; i < length; i++)
            VR_B(inst.R.rt, (e + i) | 0x0) = RSP.DMEM[BES(addr + i) & 0xFFF];
    else /* "Store %s from Vector Unit" */
        for (i = 0; i < length; i++)
            RSP.DMEM[BES(addr + i) & 0xFFF] = VR_B(inst.R.rt, (e + i) & 0xF);
    return;
}
to this:
Code:
void LS_Group_I(int direction, int length)
{ /* Group I vector loads and stores, as defined in SGI's patent. */
    register unsigned long addr;
    register int i;
    register int e = (inst.R.sa >> 1) & 0xF;
    const signed int offset = SE(inst.SW, 6);

    addr = (SR[inst.R.rs] + length*offset);
    if (direction == 0) /* "Load %s to Vector Unit" */
        for (i = 0; i < length; i++)
            VR_B(inst.R.rt, (e + i) | 0x0) = RSP.DMEM[BES(addr + i) & 0xFFF];
    else /* "Store %s from Vector Unit" */
        for (i = 0; i < length; i++)
            RSP.DMEM[BES(addr + i) & 0xFFF] = VR_B(inst.R.rt, (e + i) & 0xF);
    return;
}
(where SE() is a static sign-extension macro forced in software as follows)
Code:
#define SE(x, b)    (-(x & (1 << b)) | (x & ~(~0 << b)))
#define ZE(x, b)    (+(x & (1 << b)) | (x & ~(~0 << b)))
  #400  
Old 4th September 2013, 05:10 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Oh yeah, I forgot.

I also made a commit to Git for "fixing" the SP_PC_REG.

This isn't something I would expect any games to exploit, but if they do exploit it, it will surely break zilmar's RSP emulator.

The PC used for fetching instructions from IMEM no longer lives in SP_PC_REG; a local register tracks it instead.
Meanwhile, there is some evidence suggesting that SP_PC_REG = 0x04001000 | (PC offset),
where 0x04001000 is the base address of SP IMEM.

So if a game exploited the RCP memory map by reading SP_PC_REG from the CPU, outside of the RSP emulator and in the main Project64 CPU core, then it could break the current Project64 RSP emulator.

No speed cost behind adding this, just compensating for the small rewrite to maintain the same speed.