#21
26th January 2013, 12:11 AM
HatCat
Alpha Tester
Project Supporter
Senior Member

Join Date: Feb 2007
Location: In my hat.
Posts: 16,256

Sounds great! I'll save some pages on that for home.
Got to go in a few seconds; back tomorrow.

By the way does that also handle the vector accumulator? Is that internally supported within those SSE2 methods, or is that something that is supposed to get managed some other way / different than the way e.g. zilmar did it in VAND?

#22
26th January 2013, 12:33 AM
MarathonMan
Alpha Tester
Project Supporter
Senior Member

Join Date: Jan 2013
Posts: 454

Quote:
Originally Posted by FatCat View Post
Sounds great! I'll save some pages on that for home.
Got to go in a few seconds; back tomorrow.

By the way does that also handle the vector accumulator? Is that internally supported within those SSE2 methods, or is that something that is supposed to get managed some other way / different than the way e.g. zilmar did it in VAND?
AFAIK, SSE is useless for anything that requires carry bits (like multiprecision arithmetic functions). I have found uses for it in VAND, VOR, VXOR, etc. but not VADDC, ...
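
As a rough sketch of why the bitwise ops map so easily: assuming a vector register is modelled as eight 16-bit lanes, VAND is a single 128-bit AND. The names `vreg_t` and `vand` are made up for illustration and are not taken from any plugin source; the accumulator write-back is left out, and a scalar fallback is included for builds without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical register type: eight 16-bit lanes, as on the RSP. */
typedef struct { uint16_t lane[8]; } vreg_t;

/* vd = vs & vt across all eight lanes (accumulator handling omitted). */
static void vand(vreg_t *vd, const vreg_t *vs, const vreg_t *vt)
{
#ifdef HAVE_SSE2
    __m128i s = _mm_loadu_si128((const __m128i *)vs);
    __m128i t = _mm_loadu_si128((const __m128i *)vt);
    _mm_storeu_si128((__m128i *)vd, _mm_and_si128(s, t)); /* PAND */
#else
    int i;
    for (i = 0; i < 8; i++) /* scalar fallback for non-x86 builds */
        vd->lane[i] = (uint16_t)(vs->lane[i] & vt->lane[i]);
#endif
}
```

No lane depends on its neighbour here, which is what makes the bitwise group the easy case compared with VADDC and friends.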

zilmar doesn't appear to use SSE intrinsic functions... maybe MSVC is taking care of it for him, or maybe he just never got around to it!

Last edited by MarathonMan; 26th January 2013 at 12:36 AM.

#23
26th January 2013, 12:43 AM
zilmar
Core Team
Alpha Tester
Project Supporter
Administrator

Join Date: Jun 2005
Posts: 988

The RSP interpreter is meant to be accurate. The recompiler is meant for speed; I believe it uses MMX2, since that is closer to the RSP.

MMX uses fixed point, whereas SSE was using floating point.

#24
26th January 2013, 12:52 AM
MarathonMan

Quote:
Originally Posted by zilmar View Post
The RSP interpreter is meant to be accurate. The recompiler is meant for speed; I believe it uses MMX2, since that is closer to the RSP.

MMX uses fixed point, whereas SSE was using floating point.
Whoops, my bad -- I was looking at the interpreter core. SSE is for integers, too, though!

msdn.microsoft.com/en-us/library/8ayabe4k(v=vs.71).aspx
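
To illustrate the integer side, here is a minimal sketch using one SSE2 integer intrinsic, `_mm_adds_epi16` (signed saturating add on eight 16-bit lanes). The function name `add_sat_8x16` is invented for this example, and the scalar branch is only there so the snippet still builds where SSE2 is unavailable.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 integer intrinsics */
#define HAVE_SSE2 1
#endif

/* out[i] = clamp(a[i] + b[i]) to the signed 16-bit range, for eight lanes. */
static void add_sat_8x16(int16_t out[8], const int16_t a[8], const int16_t b[8])
{
#ifdef HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epi16(va, vb)); /* PADDSW */
#else
    int i;
    for (i = 0; i < 8; i++) { /* scalar fallback for non-x86 builds */
        int32_t sum = (int32_t)a[i] + b[i];
        if (sum > 32767) sum = 32767;
        if (sum < -32768) sum = -32768;
        out[i] = (int16_t)sum;
    }
#endif
}
```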

Last edited by MarathonMan; 26th January 2013 at 12:55 AM.

#25
26th January 2013, 05:30 AM
squall_leonhart
Alpha Tester
Project Supporter
Senior Member

Join Date: Mar 2007
Location: Sydney, Australia
Posts: 2,915

the original SSE was known as MMX2 well into its RnD ;D

#26
26th January 2013, 07:09 PM
HatCat

Quote:
Originally Posted by MarathonMan View Post
Whoops, my bad -- I was looking at the interpreter core. SSE is for integers, too, though!

msdn.microsoft.com/en-us/library/8ayabe4k(v=vs.71).aspx
Of course, what zilmar didn't mention is that the recompiler has the same basic kinds of limitations.

Instead of revising either the recompiler or the interpreter for accuracy in all cases, he's more interested in making either one more accurate only for cases where games are forcing him to apply the change.

For example, you can see lots of cases in his source, such as MFC0, that are unguarded if the game tries to write to $zero. He also does not clear the RSP_Running macro if a programmer uses MTC0 on SP_STATUS to set the HALT status flag themselves and break out of execution without directly executing BREAK (which, of course, I have never seen a game try yet).
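
One way to guard the $zero case without a branch, as a minimal sketch: route every scalar register write through a helper that re-zeroes register 0 afterwards. `write_gpr` is a hypothetical name, and the `SR` array here only stands in for the plugin's scalar register file.

```c
#include <stdint.h>

static int32_t SR[32]; /* stand-in scalar register file; SR[0] is $zero */

/* Hypothetical helper: every scalar register write (MFC0 included) goes here. */
static void write_gpr(unsigned rt, int32_t value)
{
    SR[rt & 31] = value;
    SR[0] = 0; /* $zero is hard-wired to 0, so undo any write that hit it */
}
```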

Either way, if SSE/SSE2 is our ticket to doing data transfers in parallel, he's missing the big point. Speed is just a coincidence; it's more accurate to use SSE2 intrinsics if it means emulating the parallel data transfer behavior.
Otherwise he would not have only just finished revising the RSP to use a temporary vector prebuffer in place of the legacy source-element transposition matrix from 1.4 that he had forgotten about.

Oftentimes in my development I find myself making the RSP faster only if it also makes it more accurate, though there are times I have made accuracy fixes that did not have the side effect of also making it faster.

#27
26th January 2013, 07:28 PM
HatCat

So I have an update, MarathonMan, thanks to the information you referred me to about using compiler intrinsics for swaps, rotates, bit-pattern exchanges, etc.

There is still a minor problem (though not a hard one to work around) which the three methods you gave as examples earlier (16-bit, 32-bit and 64-bit byte swaps) won't directly handle.

First, I made a diagram to help me analyze this stuff.
Code:
 *              00010203   00010203
 * MIPS native:  A-B C-D    E-F G-H
 * Intel flip :  D-C B-A    H-G F-E
 *              00  01     00  01
That diagram helps us deduce the following rules on halfword transfers:
  1. Move AB by XORing the addr with 2, which lands on B-A.
  2. Move CD with the same halfword endian swap (addr ^ 2), which lands on D-C.
  3. Move BC by directly accessing *(short *) on Intel; do NOT ^ by 2.
  4. Move EF as in case 1, GH as in case 2, FG as in case 3.
  5. Never seen a game use DE ((addr & 03) == 03), but it's harder: D and E sit in different words.
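
The rules above fall out of one index mapping, which can be sketched as follows. This assumes the word-flipped layout from the diagram; `byte_index`, `half_index` and `flip_words` are illustrative names, not functions from the plugin.

```c
#include <stdint.h>

/* Host index of the byte at N64 address a: bytes are reversed within each word. */
static unsigned byte_index(unsigned a) { return a ^ 3; }

/* Host index of an aligned halfword at N64 address a (a is even). */
static unsigned half_index(unsigned a) { return a ^ 2; }

/* Copy n bytes (n a multiple of 4) from MIPS-native order into the flipped layout. */
static void flip_words(uint8_t *flipped, const uint8_t *native, unsigned n)
{
    unsigned a;
    for (a = 0; a < n; a++)
        flipped[byte_index(a)] = native[a];
}
```

Feeding it the bytes A through H reproduces the "Intel flip" row (D C B A H G F E): AB ends up at host offset 2, BC at offset 1, and D and E end up at offsets 0 and 7, which is why case 5 cannot be done with a single halfword access.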

Let's say that ((addr & 0x003) == 0x003) (#5 in the above list).

In a correct-endian DMEM segment such as what you are able to set up, all we have to do is load the short starting at point D in the MIPS native ordering, and E will be the LSB.

In the zilmar plugin system, however, where constantly correcting the SP memory endianness is not so convenient, we have to adapt: load the left-most byte of the 8-byte (two-word) span as the MSB and its right-most byte as the LSB. To solve the problem with the byteswap intrinsics, we may have to apply the 32-bit byteswap you pointed me to twice (once on each of the two contiguous DMEM words), load DE as directly as you already can with the endianness pre-corrected, and then undo the byteswaps so that zilmar's plugin system can "understand it".
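
A portable sketch of that swap, load, swap-back sequence, under the assumption that DMEM is a plain byte array in the word-flipped layout. `bswap32` stands in for `_byteswap_ulong`, and `swap_word_at` and `load_de` are made-up helper names.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for _byteswap_ulong. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Byteswap the 32-bit word at p in place; memcpy sidesteps aliasing issues. */
static void swap_word_at(uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, 4);
    w = bswap32(w);
    memcpy(p, &w, 4);
}

/* Load halfword "DE" for (addr & 3) == 3, addr != 0xFFF, from flipped DMEM. */
static uint16_t load_de(uint8_t *dmem, unsigned addr)
{
    uint16_t value;
    swap_word_at(dmem + addr - 3); /* word holding D -> native order */
    swap_word_at(dmem + addr + 1); /* word holding E -> native order */
    value = (uint16_t)((dmem[addr] << 8) | dmem[addr + 1]); /* D is the MSB */
    swap_word_at(dmem + addr - 3); /* restore the flipped layout */
    swap_word_at(dmem + addr + 1);
    return value;
}
```

Four word swaps for one halfword is the "mess" in question; it is shown here only to make the sequence concrete.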

Kind of a mess.

I decided to do a hybrid of solutions though.

Taking LHU for example (the most common RSP primary opcode where N64 games try to do unaligned addr):
Code:
    switch (addr & 0x003)
    {
        case 0:
        case 2:
            addr ^= 2; /* halfword endian swap */
            /* fall through */
        case 1: /* unaligned, but contiguous inside the flipped word */
            SR[rt] = *(unsigned short *)(RSP.DMEM + addr);
            return;
        case 3: /* halfword straddles two flipped words */
            if (addr == 0xFFF) /* LOL */
            {
                message("LHU\nCrossed DMEM allocation barrier.", 1);
                SR[rt] = (RSP.DMEM[0xFFF ^ 3] << 8) | RSP.DMEM[0x000 ^ 3];
                return;
            }
            message("LHU\nCrossed word endian boundary.", 0);
            /* MSB (D) is byte 0 of this word; LSB (E) is byte 3 of the next */
            SR[rt] = (RSP.DMEM[addr - 3] << 8) | RSP.DMEM[addr + 4];
            return;
    }
I did something different for Store Halfword to practice using the methods you referred me to:
Code:
    switch (addr & 0x003)
    {
        case 0:
        case 2:
            addr ^= 2; /* halfword endian swap */
            /* fall through */
        case 1:
            *(short *)(RSP.DMEM + addr) = (short)SR[rt];
            return;
        case 3: {
            if (addr == 0xFFF) /* LOL */
            {
                message("SH\nCrossed DMEM allocation barrier.", 1);
                RSP.DMEM[0xFFF ^ 3] = (unsigned char)(SR[rt] >> 8);
                RSP.DMEM[0x000 ^ 3] = (unsigned char)SR[rt];
                return; /* No, MSVC didn't employ swaps/rotates. :( */
            }
            message("SH\nUntested, unoptimized.", 2);
            /* put both words into native (big-endian) byte order */
            *(int *)(RSP.DMEM + addr - 3)
          = _byteswap_ulong(*(int *)(RSP.DMEM + addr - 3));
            *(int *)(RSP.DMEM + addr + 1)
          = _byteswap_ulong(*(int *)(RSP.DMEM + addr + 1));
            /* the halfword has to be big-endian too, so swap it before storing */
            *(short *)(RSP.DMEM + addr)
          = (short)_byteswap_ushort((unsigned short)SR[rt]);
            /* restore the word-flipped layout */
            *(int *)(RSP.DMEM + addr - 3)
          = _byteswap_ulong(*(int *)(RSP.DMEM + addr - 3));
            *(int *)(RSP.DMEM + addr + 1)
          = _byteswap_ulong(*(int *)(RSP.DMEM + addr + 1));
            return;
        }
    }
Mostly for examples, but I'm sure in time I will be able to better integrate what I've learned.

#28
26th January 2013, 08:09 PM
MarathonMan

Quote:
Originally Posted by FatCat View Post
...
I hate to rain on your parade, but 'pshufb' (and its intrinsic function, "_mm_shuffle_epi8") will do exactly what you're looking for.

msdn.microsoft.com/en-us/library/bb531427(v=vs.90).aspx

IIRC, it's part of SSSE3 (Supplemental SSE3, not plain SSE2/SSE3), so it needs a somewhat newer CPU. However, I've found so many uses for it that I'm using it anyway.
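
A minimal sketch of the idea, assuming the goal is to reverse the bytes inside each 32-bit word of a 16-byte block in one shuffle. `flip_words_16` is an invented name; the intrinsic path is only compiled when the compiler defines `__SSSE3__` (e.g. GCC/Clang with -mssse3), and a scalar loop with the same result is used otherwise.

```c
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */
#endif

/* Reverse the bytes inside each of the four 32-bit words of a 16-byte block.
 * dst and src must not overlap. */
static void flip_words_16(uint8_t dst[16], const uint8_t src[16])
{
#if defined(__SSSE3__)
    /* Output byte i takes input byte (i ^ 3); _mm_set_epi8 lists byte 15 first. */
    const __m128i mask = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11,
                                      4, 5, 6, 7, 0, 1, 2, 3);
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, mask)); /* PSHUFB */
#else
    int i;
    for (i = 0; i < 16; i++) /* scalar equivalent when SSSE3 is not enabled */
        dst[i] = src[i ^ 3];
#endif
}
```

Because the mask is just data, other permutations (halfword swaps, element broadcasts) only need a different constant rather than a different code path.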

Last edited by MarathonMan; 26th January 2013 at 08:17 PM.

#29
26th January 2013, 08:22 PM
HatCat

I can see how that may also solve the problem.

I didn't want to jump immediately to operations dealing with 64-bit or wider data, under the assumption that they would be slow compared to just doing 32-bit or 16-bit byteswaps. Not sure how much slower one method will be than the other.

Regardless, saved.

By the way, long before mudlord told me about the Ultra64 RSP programming guide, much of the official info I got about traditional vector systems was from other patents. There was one in particular by SGI themselves which directly corresponds to LWC2 and SWC2 under the RCP: U.S. patent number 5,812,147. Since, last I remember, you were just implementing some things under those groups, most of it is probably already in the info you have, but they explain some things there as well, in case you find an alternate resource useful.

#30
26th January 2013, 08:29 PM
MarathonMan
Quote:
Originally Posted by FatCat View Post
I can see how that may also solve the problem.

I didn't want to jump immediately to operations dealing with 64-bit or wider data, under the assumption that they would be slow compared to just doing 32-bit or 16-bit byteswaps. Not sure how much slower one method will be than the other.

Regardless, saved.

By the way, long before mudlord told me about the Ultra64 RSP programming guide, much of the official info I got about traditional vector systems was from other patents. There was one in particular by SGI themselves which directly corresponds to LWC2 and SWC2 under the RCP: U.S. patent number 5,812,147. Since, last I remember, you were just implementing some things under those groups, most of it is probably already in the info you have, but they explain some things there as well, in case you find an alternate resource useful.
When you're optimizing for speed later, you might want to try it out. In my experience and from what I know, the additional cost of loading an %xmm register over a general-purpose register is minimal compared to the cost of conditional branching to select a code path specifically designed for x-byte transfers. Specifically, branch misprediction will absolutely crush your performance. I've tried to counter this by making 'general' code paths that work for multiple byte sizes but have no branches. I might have overdone it in some cases, though...
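
To make "general code path, no branches" concrete, here is a scalar sketch (not the SSE version, and not code from either emulator) for halfword loads and stores on a word-flipped 4 KiB DMEM. It assumes each byte of N64 address a lives at host index a ^ 3; `lhu_any` and `sh_any` are hypothetical names.

```c
#include <stdint.h>

/* Load a big-endian halfword from word-flipped 4 KiB DMEM at any alignment,
 * wrapping at 0xFFF, without any conditional branches. */
static uint16_t lhu_any(const uint8_t *dmem, unsigned addr)
{
    unsigned hi = dmem[(addr & 0xFFF) ^ 3];       /* MSB at N64 address addr     */
    unsigned lo = dmem[((addr + 1) & 0xFFF) ^ 3]; /* LSB at N64 address addr + 1 */
    return (uint16_t)((hi << 8) | lo);
}

/* Matching store: same index math, one byte at a time. */
static void sh_any(uint8_t *dmem, unsigned addr, uint16_t value)
{
    dmem[(addr & 0xFFF) ^ 3] = (uint8_t)(value >> 8);
    dmem[((addr + 1) & 0xFFF) ^ 3] = (uint8_t)value;
}
```

The same two lines cover all four alignments and the 0xFFF wrap, at the price of two byte accesses even in the aligned case.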

Thanks for the patent #. I think I already have that one archived (there's one SGI patent that discusses all the VICE LWC2/SWC2 instructions, but it doesn't go into much detail IIRC -- maybe that's the one?).