#1  
Old 25th February 2013, 04:37 AM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,255
Default "Static" interpreter?

Any chance somebody can tell me if this sort of method was ever implemented in an emulator they know of?

The basic (but somewhat arbitrary) structure I thought up (for MIPS):
Code:
while (CPU_running)
{
    unsigned int instWord;

    fetch(instWord, cachePtr);
    cachePtr += sizeof(int);
    switch (instWord)
    {
        case 0x00000000:
            continue; // SLL $zero, $zero, 0 (NOP)
        case 0x0000000D:
            CPU_running = false;
            break; // BREAK
        case 0x00008020:
            GPR[16] = 0 + 0;
            continue; // ADD $s0, $zero, $zero
        case 0x08000200:
            PC = (0x0000200 << 2);
            continue; // J <target>
/* ... */
        default:
            printf("reserved/unimplemented");
            CPU_running = false;
            return;
    }
}
It's still an interpreter in a limited sense, but the decode phase is voided entirely. A recompiler, unless perhaps a static one, still has to undergo that decode phase. So this could well be faster than a dynamic recompiler, barring the problems that arise from such a massive switch block.

It would be a terrible amount of work, but why shouldn't somebody try it?

It beats pushing massive parameter lists onto the call stack for interpreter functions all the time (much slower than this, though more readable and maintainable).
  #2  
Old 25th February 2013, 05:12 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,255
Default

(Damn, this forum needs more coders. It's like I'm starting to scare away all the gamers now with all this blog shit.)

I've just worked out an example of the benefit of using this alleged "static interpreter" for the vector computational operations on the RSP.

Since I'm losing home PC access again for a week, so far I've implemented just VMULF:
Code:
inline void execute_VU(unsigned int inst)
{
    register int i = 0; /* iteration index for (liably unrolled) loops */

    switch (inst)
    {
        case 0x4A000000: // VMULF $v0, $v0, $v0
            for (i = 0; i < 8; i++)
                VACC[i].DW = 2*VR[0].s[i]*VR[0].s[i] + 0x8000;
            for (i = 0; i < 8; i++) /* .s[] is signed 16-bit, so compare against -0x8000 */
                VR[0].s[i] = (VR[0].s[i] == -0x8000) ? 0x7FFF : VACC[i].s[MD];
            return;
        default:
        {
            char debugger[] = "XXXXXXXX\nTreating as reserved.";
            const char digits[16] = {
                '0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'
            };
            debugger[0] = digits[(inst & 0xF0000000) >> 28];
            debugger[1] = digits[(inst & 0x0F000000) >> 24];
            debugger[2] = digits[(inst & 0x00F00000) >> 20];
            debugger[3] = digits[(inst & 0x000F0000) >> 16];
            debugger[4] = digits[(inst & 0x0000F000) >> 12];
            debugger[5] = digits[(inst & 0x00000F00) >>  8];
            debugger[6] = digits[(inst & 0x000000F0) >>  4];
            debugger[7] = digits[(inst & 0x0000000F) >>  0];
            MessageBoxA(NULL, debugger, NULL, 0x00000030); /* MB_ICONWARNING */
            for (i = 0; i < 8; i++)
                VR[(inst >> 6) % 32].s[i] = 0x0000;
            return;
        }
    }
}
If the machine word == 0x4A000000, then we can statically conduct the multiplication on VR[0].s[i] (i = 0..7) using the more optimized formula: 2 * x * x (the square of x, shifted left by 1) + 0x8000 (or 2*(x*x + 0x4000), whichever the compiler deems more fitting).

Since the register specifiers are both the same (zero), we do NOT have to check whether VACC[i].DW == 0x000080008000 (the only possible overflow result that needs clamping) the long way. That result is caused only by 0x8000 * 0x8000 (-32768 * -32768), and since we know the two source registers are exactly the same, we only need to check one 16-bit value.

If the current exact vector instruction is something I haven't implemented yet, it draws a message box printing the 32-bit machine word in hex (in a way rebelliously defiant of CRT or sprintf dependencies, mind you).
Until then, I assume it's a reserved instruction on the RSP and treat it the same way Michael Tedder treated VSUT in PUR when he didn't realize the instruction was reserved.
  #3  
Old 26th February 2013, 12:29 AM
MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat
Any chance somebody can tell me if this sort of method was ever implemented in an emulator they know of?

[... structure sketch and code snipped; see post #1 ...]
Sorry, been busy working on my emulator.

This will have absolutely horrific performance. The space of instructions is _far_ too large; you'd need to minimize the problem space. Even then, a dynarec would kill you in your sleep, because you're branching over large regions.

Read: http://stackoverflow.com/questions/1...unsorted-array

Not totally relevant, since you're not "missing" much in this case, but your TLB and instruction cache are going to be destroyed.
  #4  
Old 26th February 2013, 12:36 AM
MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Posting again like a bad person.

A well-written interpreter has very short functions anywho:
Code:
0000000000000000 <VR4300ADDU>:
   0:   8b 87 e8 01 00 00       mov    0x1e8(%rdi),%eax #\ 
   6:   01 d6                   add    %edx,%esi          # -- dispatched in parallel
   8:   48 63 f6                movslq %esi,%rsi         \
   b:   48 89 b7 f0 01 00 00    mov    %rsi,0x1f0(%rdi)   / group 1
  12:   c1 e8 0b                shr    $0xb,%eax        \
  15:   83 e0 1f                and    $0x1f,%eax        |  group 2
  18:   48 89 87 f8 01 00 00    mov    %rax,0x1f8(%rdi) /
  1f:   c3                      retq
The first two instructions can be dispatched in parallel, and the other 2 groups of instructions can run in parallel following that. Any aggressive out-of-order uarch will do this.

If you were to do the special-case version... you'd probably get something like this [note: NOT valid byte code; I was just moving mnemonics around]:

Code:
0000000000000000 <VR4300ADDU>:
   6:   01 d6                   add    %edx,%esi
   8:   48 63 f6                movslq %esi,%0x1f0($some_constant_addr)
  12:   c3                      retq

So a highly optimized version for one specific case might shave off a couple of instructions or so. However, as I was saying earlier, the problem space is _so_ big that you're going to obliterate whatever cache you have.

Last edited by MarathonMan; 26th February 2013 at 12:48 AM.
  #5  
Old 26th February 2013, 10:18 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,255
Default

I never really learned about the ICACHE, at least on Intel. I wasn't sure whether the cache gets updated after an indexed jump off of a switch expression, for example, or whether it takes a function call to trigger an update to cache memory.

So with a huge switch on 2^32 possibilities for RSP instructions, it would still try to load all of that at once at some point... it sounds like you're saying.

(ohh my aching arms...typing feels so weird after rigorously moving shit out of someone's apartment for like the third time of the year)

Anyway, this could explain why the MAME RSP recompiled as a Windows N64 plugin was so much slower when built by MSVC than by GCC/MinGW. That huge switch() on the instruction word was optimized better by GCC for linear jumps.

I always pictured an optimized block pointer as something like this: allocate inline functions at aligned intervals (e.g. 256 bytes per case block: 0x000-0x0FF for case A, 0x100-0x1FF for case B, etc.), shift the case value (0 to 15 or whatever) left by 8, and you have the block pointer.

But I never see that sort of thing done for complete switch tables. Or maybe the linker manages that.
  #6  
Old 26th February 2013, 11:02 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,255
Default

Also.

While you did certainly explain the problem with using a switch() this way... does the same apply to building this alleged "static interpreter" out of a function pointer table?

So what if we made one function for each MIPS 32-bit word, like execute_00000000() through execute_FFFFFFFF()?

Would that share the same problem you mentioned about overloading the "problem space"?

(if it doesn't overload the user's RAM first, lmao)

For the sake of analysis, let's address the case of a machine with 16-bit instructions rather than 32-bit ones, which makes this a bit cleaner.
  #7  
Old 26th February 2013, 11:07 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,255
Default

By the way.

Is branch prediction (or even branch avoidance) so important that you would consider this simplified method of clamping VMULF accumulator results into the destination vector register file:

Code:
; 20   :         VR[vd].s[i] = (VACC[i].W[0] == 0x80008000) ? 0x7FFF : VACC[i].s[MD];
...inferior to this method?
Code:
; 20   :         VR[vd].s[i] = VACC[i].s[MD] - (VACC[i].W[0] == 0x80008000);
Because if the vector accumulator result was 0x000080008000, then VACC[i].s[MD] is 0x8000, so we subtract 1 (the value of the condition (acc == 0x000080008000)) from the result, which wraps it around to 0x7FFF, the signed 16-bit overflow clamp result.

This is one way to prevent branching altogether, but would you say it's worth it, based on the clues you've given me?
  #8  
Old 26th February 2013, 11:20 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,255
Default

I was just looking back at a couple of your posts back on EmuTalk.

This isn't very important (unless you're a perfectionist like me), but I looked back at a couple of the verbose displays you have in CEN64:
Code:
[RSP]: "Initializing COP0."
[RSP]: "Initializing COP2."
To be clear, it would probably be preferable to say "CP0 and CP2" instead of "COP0 and COP2", unless you literally are initializing the actual instructions in the opcode matrix or something.

MIPS family revisions documentation always denotes the coprocessors as "CPz", where 0 <= z <= 3.

"COP" most likely means "coprocessor operation" (the updated MIPS32 manuals, which don't apply to the N64, use "CO" in the opcode matrix): the "C" meaning "coprocessor" and the "OP" meaning "operation", though I don't know that for sure.

Nobody gives a shit but me.
  #9  
Old 27th February 2013, 12:00 AM
MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat
I never really learned about the ICACHE, at least on Intel. I wasn't sure whether the cache gets updated after an indexed jump off of a switch expression, for example, or whether it takes a function call to trigger an update to cache memory.
Caches are (1) completely transparent, (2) small, and (3) fast. If you have to go all the way out to memory, you're easily going to wait 100+ cycles, even on a modern machine. Caches compensate by storing subsets of memory (based on access patterns). However, the combination of addresses that can live in a cache at any given time is limited (you can't just store arbitrary addresses on a common stride in the cache, due to architectural limitations), so it's a non-trivial topic. I'm not a math person, so sorry if that doesn't make much sense.

However, considering that L1 caches (2-3 cycle access) are only 16-32KB on most systems, and given that you can only store certain combinations of addresses simultaneously in the cache, you can see how things get out of hand rather quickly.

Quote:
Originally Posted by FatCat
So with a huge switch on 2^32 possibilities for RSP instructions, it would still try to load all of that at once at some point... it sounds like you're saying.
No; things are loaded at a granularity of 64-byte blocks on Intel systems (the lower bits are masked off, and the remaining bits index into the cache). I'm just saying that your switch statement is going to be so big, and your accesses so spread out, that you're going to take conflict and capacity misses on most iterations, forcing you to go all the way out to main memory.

Quote:
Originally Posted by FatCat
Anyway, this could explain why the MAME RSP recompiled as a Windows N64 plugin was so much slower when built by MSVC than by GCC/MinGW. That huge switch() on the instruction word was optimized better by GCC for linear jumps.
Maybe. As hesitant as people are to accept it, GCC just has the better optimizer. End of story. I don't know what version your MinGW install uses, but GCC has a feature called '-flto' (link-time optimization) in the newest versions (really only stable >= 4.6.0) that basically (with another flag) can treat your program as if it were one giant file composed entirely of functions with static linkage. To give you an idea of what it does, my binary's size drops ~25%+ after I turn it on. Yeah.

Quote:
Originally Posted by FatCat
I always pictured an optimized block pointer as something like this: allocate inline functions at aligned intervals (e.g. 256 bytes per case block: 0x000-0x0FF for case A, 0x100-0x1FF for case B, etc.), shift the case value (0 to 15 or whatever) left by 8, and you have the block pointer.

But I never see that sort of thing done for complete switch tables. Or maybe the linker manages that.
That's not ideal unless you have rather large cases, but that's more or less the idea of what should happen with a densely populated switch statement. Again, all I'm trying to say is that you have so many cases that you'll essentially render your cache useless. There might be a "middle ground" that you can find that sits in between your method and a generic interpreter, though.

Good engineering is the product of a lot of guess-and-check work, gambling, head scratching, and sleepless nights.
  #10  
Old 27th February 2013, 12:05 AM
MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat
Would that share the same problem you mentioned about overloading the "problem space"?
Yes. Your problem is going to be capacity misses; your problem size is too big for the constraints given to you. See the PM I'm about to send you.
Powered by vBulletin® Version 3.7.3