Go Back   Project64 Forums > General Discussion > Open Discussion

  #21
Old 16th February 2013, 02:20 AM
MarathonMan
Alpha Tester / Project Supporter / Senior Member
Join Date: Jan 2013
Posts: 454

Oh man, I did it again.

Been looking at code for over 12 hours today.

I did find some great optimizations for VMAD* when I relooked earlier today though.
  #22
Old 16th February 2013, 02:26 AM
HatCat
Alpha Tester / Project Supporter / Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256

Sounds great! VMAD* is, after all, by far the most frequently executed group of RSP instructions (under any division of the primary opcode matrix, I believe) in commercial 3-D applications, so it is definitely important to make sure those are optimized.

Quote:
Originally Posted by MarathonMan
Oh man, I did it again.

Been looking at code for over 12 hours today.
Meh it's alright, kind of wish I could say the same.

I had to leave for my own health, to use a public PC where I couldn't program/test.

So I just spent the afternoon playing 60-second chess and GODDAMN I HATE THIS FUKKIN SERVER SFSDDFDF, fell into a losing streak cause of connection drops / server issues. Man, losing timed chess is like getting your ass kicked at SSB. Kept fisting my table XD

ANYWAY lololz of course I meant, "ultimately", MAME's algorithm is identical to mine.
There are some subtle/indirect differences, like using that "vres" parent variable as a temporary vector register to prebuffer everything over to the real destination register at the end of the operation, and using that inner if-else chain to split the two possible conditions for setting the VCC flag. But the results written back to all registers will always be identical when using MAME's method or mine and that's what counts.
  #23
Old 16th February 2013, 02:40 AM
MarathonMan

Right, I noticed... just after you slapped me in the face.

VMADH:
Code:
0000000000000220 <RSPVMADH>:
     220:       89 f2                   mov    %esi,%edx
     222:       89 f0                   mov    %esi,%eax
     224:       c5 f9 6f 0d 00 00 00    vmovdqa 0x0(%rip),%xmm1        # 22c <RSPVMADH+0xc>
     22b:       00 
     22c:       c1 ea 10                shr    $0x10,%edx
     22f:       c5 fa 6f 97 00 02 00    vmovdqu 0x200(%rdi),%xmm2
     236:       00 
     237:       c1 e8 06                shr    $0x6,%eax
     23a:       83 e2 1f                and    $0x1f,%edx
     23d:       83 e0 1f                and    $0x1f,%eax
     240:       48 c1 e2 04             shl    $0x4,%rdx
     244:       c4 e2 79 23 ea          vpmovsxwd %xmm2,%xmm5
     249:       c5 fa 6f 24 17          vmovdqu (%rdi,%rdx,1),%xmm4
     24e:       89 f2                   mov    %esi,%edx
     250:       c1 ee 0b                shr    $0xb,%esi
     253:       83 e6 1f                and    $0x1f,%esi
     256:       c1 ea 15                shr    $0x15,%edx
     259:       c5 e9 73 da 08          vpsrldq $0x8,%xmm2,%xmm2
     25e:       48 c1 e6 04             shl    $0x4,%rsi
     262:       83 e2 0f                and    $0xf,%edx
     265:       c4 e2 79 23 d2          vpmovsxwd %xmm2,%xmm2
     26a:       c5 fa 6f 1c 37          vmovdqu (%rdi,%rsi,1),%xmm3
     26f:       48 c1 e2 04             shl    $0x4,%rdx
     273:       c4 e2 59 00 a2 00 00    vpshufb 0x0(%rdx),%xmm4,%xmm4
     27a:       00 00 
     27c:       c4 e2 79 23 f4          vpmovsxwd %xmm4,%xmm6
     281:       89 c2                   mov    %eax,%edx
     283:       c4 e2 61 00 d9          vpshufb %xmm1,%xmm3,%xmm3
     288:       c5 d9 73 dc 08          vpsrldq $0x8,%xmm4,%xmm4
     28d:       c4 e2 79 23 c3          vpmovsxwd %xmm3,%xmm0
     292:       c5 e1 73 db 08          vpsrldq $0x8,%xmm3,%xmm3
     297:       c4 e2 79 23 e4          vpmovsxwd %xmm4,%xmm4
     29c:       48 c1 e2 04             shl    $0x4,%rdx
     2a0:       c4 e2 79 23 db          vpmovsxwd %xmm3,%xmm3
     2a5:       c4 e2 49 40 c0          vpmulld %xmm0,%xmm6,%xmm0
     2aa:       c4 e2 59 40 db          vpmulld %xmm3,%xmm4,%xmm3
     2af:       c5 d1 fe c0             vpaddd %xmm0,%xmm5,%xmm0
     2b3:       c5 e9 fe d3             vpaddd %xmm3,%xmm2,%xmm2
     2b7:       c4 e2 79 09 c0          vpsignw %xmm0,%xmm0,%xmm0
     2bc:       c4 e2 69 09 d2          vpsignw %xmm2,%xmm2,%xmm2
     2c1:       c4 e2 79 2b c2          vpackusdw %xmm2,%xmm0,%xmm0
     2c6:       c5 fa 6f 97 10 02 00    vmovdqu 0x210(%rdi),%xmm2
     2cd:       00 
     2ce:       c5 f9 7f 87 00 02 00    vmovdqa %xmm0,0x200(%rdi)
     2d5:       00 
     2d6:       c5 e9 61 d8             vpunpcklwd %xmm0,%xmm2,%xmm3
     2da:       c5 e9 69 c0             vpunpckhwd %xmm0,%xmm2,%xmm0
     2de:       c5 e1 6b c0             vpackssdw %xmm0,%xmm3,%xmm0
     2e2:       c4 e2 79 00 c9          vpshufb %xmm1,%xmm0,%xmm1
     2e7:       c5 f9 7f 0c 17          vmovdqa %xmm1,(%rdi,%rdx,1)
     2ec:       89 87 80 02 00 00       mov    %eax,0x280(%rdi)
     2f2:       c3                      retq   
     2f3:       66 66 66 66 2e 0f 1f    data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
     2fa:       84 00 00 00 00 00
VMADN:
Code:
0000000000000450 <RSPVMADN>:
     450:       89 f2                   mov    %esi,%edx
     452:       89 f0                   mov    %esi,%eax
     454:       c5 f9 6f 1d 00 00 00    vmovdqa 0x0(%rip),%xmm3        # 45c <RSPVMADN+0xc>
     45b:       00 
     45c:       c1 ea 10                shr    $0x10,%edx
     45f:       c5 fa 6f 8f 10 02 00    vmovdqu 0x210(%rdi),%xmm1
     466:       00 
     467:       c1 e8 06                shr    $0x6,%eax
     46a:       83 e2 1f                and    $0x1f,%edx
     46d:       c5 fa 6f b7 00 02 00    vmovdqu 0x200(%rdi),%xmm6
     474:       00 
     475:       83 e0 1f                and    $0x1f,%eax
     478:       48 c1 e2 04             shl    $0x4,%rdx
     47c:       c5 fa 6f 2c 17          vmovdqu (%rdi,%rdx,1),%xmm5
     481:       89 f2                   mov    %esi,%edx
     483:       c1 ee 0b                shr    $0xb,%esi
     486:       83 e6 1f                and    $0x1f,%esi
     489:       c1 ea 15                shr    $0x15,%edx
     48c:       c5 f1 61 c6             vpunpcklwd %xmm6,%xmm1,%xmm0
     490:       48 c1 e6 04             shl    $0x4,%rsi
     494:       83 e2 0f                and    $0xf,%edx
     497:       c5 f1 69 f6             vpunpckhwd %xmm6,%xmm1,%xmm6
     49b:       c5 fa 6f 24 37          vmovdqu (%rdi,%rsi,1),%xmm4
     4a0:       48 c1 e2 04             shl    $0x4,%rdx
     4a4:       c4 e2 51 00 aa 00 00    vpshufb 0x0(%rdx),%xmm5,%xmm5
     4ab:       00 00 
     4ad:       c4 e2 79 23 fd          vpmovsxwd %xmm5,%xmm7
     4b2:       c5 f1 73 dd 08          vpsrldq $0x8,%xmm5,%xmm1
     4b7:       c4 e2 59 00 e3          vpshufb %xmm3,%xmm4,%xmm4
     4bc:       c4 e2 79 33 d4          vpmovzxwd %xmm4,%xmm2
     4c1:       c5 d9 73 dc 08          vpsrldq $0x8,%xmm4,%xmm4
     4c6:       c4 e2 41 40 d2          vpmulld %xmm2,%xmm7,%xmm2
     4cb:       c4 e2 79 33 e4          vpmovzxwd %xmm4,%xmm4
     4d0:       c4 e2 79 23 c9          vpmovsxwd %xmm1,%xmm1
     4d5:       c4 e2 71 40 cc          vpmulld %xmm4,%xmm1,%xmm1
     4da:       89 c2                   mov    %eax,%edx
     4dc:       48 c1 e2 04             shl    $0x4,%rdx
     4e0:       c5 e9 fe d0             vpaddd %xmm0,%xmm2,%xmm2
     4e4:       c5 f9 ef c0             vpxor  %xmm0,%xmm0,%xmm0
     4e8:       c5 f9 75 e8             vpcmpeqw %xmm0,%xmm0,%xmm5
     4ec:       c5 f1 fe ce             vpaddd %xmm6,%xmm1,%xmm1
     4f0:       c5 e9 6b e1             vpackssdw %xmm1,%xmm2,%xmm4
     4f4:       c4 e2 69 2b f1          vpackusdw %xmm1,%xmm2,%xmm6
     4f9:       c5 f9 65 fc             vpcmpgtw %xmm4,%xmm0,%xmm7
     4fd:       c5 c9 75 f5             vpcmpeqw %xmm5,%xmm6,%xmm6
     501:       c5 d9 65 e0             vpcmpgtw %xmm0,%xmm4,%xmm4
     505:       c5 41 db c6             vpand  %xmm6,%xmm7,%xmm8
     509:       c5 c1 eb f6             vpor   %xmm6,%xmm7,%xmm6
     50d:       c5 c9 ef ed             vpxor  %xmm5,%xmm6,%xmm5
     511:       c5 b9 eb ed             vpor   %xmm5,%xmm8,%xmm5
     515:       c5 7a 6f 87 20 02 00    vmovdqu 0x220(%rdi),%xmm8
     51c:       00 
     51d:       c4 e3 39 4c e4 50       vpblendvb %xmm5,%xmm4,%xmm8,%xmm4
     523:       c4 e2 59 00 db          vpshufb %xmm3,%xmm4,%xmm3
     528:       c5 d9 72 d2 10          vpsrld $0x10,%xmm2,%xmm4
     52d:       c5 f9 7f 1c 17          vmovdqa %xmm3,(%rdi,%rdx,1)
     532:       c4 e3 69 0e d0 aa       vpblendw $0xaa,%xmm0,%xmm2,%xmm2
     538:       c5 e1 72 d1 10          vpsrld $0x10,%xmm1,%xmm3
     53d:       c4 e3 71 0e c0 aa       vpblendw $0xaa,%xmm0,%xmm1,%xmm0
     543:       c4 e2 69 2b c0          vpackusdw %xmm0,%xmm2,%xmm0
     548:       c4 e2 59 2b db          vpackusdw %xmm3,%xmm4,%xmm3
     54d:       c5 f9 7f 87 10 02 00    vmovdqa %xmm0,0x210(%rdi)
     554:       00 
     555:       c5 f9 7f 9f 00 02 00    vmovdqa %xmm3,0x200(%rdi)
     55c:       00 
     55d:       89 87 80 02 00 00       mov    %eax,0x280(%rdi)
     563:       c3                      retq   
     564:       66 66 66 2e 0f 1f 84    data32 data32 nopw %cs:0x0(%rax,%rax,1)
     56b:       00 00 00 00 00
The crazy part in VMADN is essentially MESS's SATURATE_ACCUM function in SSE, so things get a little hairy.

I also tried compiling my code with -Os (instead of -O3/-O2); performance boosted by ~7%. (-Os = optimize for size).
  #24
Old 16th February 2013, 03:17 AM
HatCat

vpand, vpor, vpxor ... amazing that you / the compiler should find a use for those in emulating the vector multiplies.

I'll be honest all those SSE vector assembly lines of code make me tl;dr.
Then again I've never actually coded a full-fledged program in an assembly language before, just analyzed them.

And that SATURATE_ACCUM thing was a real nuisance.
The ANSI compiler kept warning me that it was inline-expanded into the function rather than kept as a call. Then I realized that almost every single time that function ever got called in MAME, the parameters were basically the exact same, so I could arrange it to be more static and eventually remove the need for a "SATURATE_ACCUM" extern.

Like I said it's all in the clamping code I put at the bottom of each vector multiply op header.
His method is basically identical to rspsim/doc but lets the parameters be passed dynamically, so I was like, uh, who needs dat shit.
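For illustration, here's a minimal sketch of the kind of specialized clamp being described. The name and widths are mine, not MAME's; the real RSP accumulator is 48 bits and the exact clamp differs per multiply opcode, but once the parameters are known constants, the generic SATURATE_ACCUM call collapses into something this simple:

```c
#include <stdint.h>

/* Hypothetical stand-in for a per-opcode specialization of SATURATE_ACCUM:
 * with the negative/positive saturation values fixed at compile time, the
 * dynamic helper reduces to a plain signed 16-bit clamp. */
static int16_t clamp_signed16(int32_t x)
{
    if (x > INT16_MAX)
        return INT16_MAX;   /*  0x7FFF */
    if (x < INT16_MIN)
        return INT16_MIN;   /* -0x8000 */
    return (int16_t)x;
}
```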

Quote:
Originally Posted by MarathonMan
I also tried compiling my code with -Os (instead of -O3/-O2); performance boosted by ~7%. (-Os = optimize for size).
Seriously lmao?

Optimize for size = faster than optimize for speed?
I never even tried that.
  #25
Old 16th February 2013, 03:35 AM
MarathonMan

Quote:
Originally Posted by FatCat
vpand, vpor, vpxor ... amazing I think that you / the compiler should find a use for those in emulation of the vector multiplies.
That was my doing. [link]
  #26
Old 16th February 2013, 05:24 AM
HatCat

I'm gonna pretend I understood one single word from that hardware electronics Wiki you linked to, and say, a taco well eaten!

Me is starting to think VCL may be even more complex than VCH. >.>
  #27
Old 17th February 2013, 12:43 AM
HatCat

Wow, VCL is definitely more complex than any of the other RSP operations I have rewritten (at least in terms of implementing it correctly and reverse-engineering its operation, not so much the size of the opcode). So many tricks.

The linear operation of VCL on an interpreter can be classified into four register-set conditions, for (int i = 0; i < 8; i++):
  1. Bit `i + 0` of VCO[15..0] is set (CARRY), but not bit `i + 8`.
  2. Bit `i + 8` of VCO[15..0] is set (NOTEQUAL), but not bit `i + 0`.
  3. Both CARRY and NOTEQUAL are set.
  4. Neither CARRY nor NOTEQUAL is set.
I'm currently looking at case 1 (the most complicated quadrant).
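For reference, those four cases can be decoded from the 16-bit VCO word like this (a sketch with my own names, following the bit layout above: bit i is the lane's CARRY flag and bit i + 8 its NOTEQUAL flag):

```c
#include <stdint.h>

/* Returns 1..4, matching the case numbering above, for vector lane i. */
static int vcl_case(uint16_t vco, int i)
{
    int carry = (vco >> (i + 0)) & 1;   /* CARRY flag for lane i    */
    int ne    = (vco >> (i + 8)) & 1;   /* NOTEQUAL flag for lane i */

    if (carry && !ne) return 1;
    if (!carry && ne) return 2;
    if (carry && ne)  return 3;
    return 4;                           /* neither flag set */
}
```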

I don't know where to begin explaining this shit.
But the amazing thing is, it's 99% identical to the algorithm zilmar hacked out. (So far I am seeing one edge condition not accounted for.)
The original code is so static and branch-free that it's difficult to read.
  #28
Old 17th February 2013, 12:58 AM
HatCat

Hmm...
Which is faster?

Code:
int get_flag_state(unsigned int mask, int bit_no)
{
    return ((mask >> bit_no) & 1);
}
-- or --

Code:
int get_flag_state(unsigned int mask, int bit_no)
{
    return ((mask & (1 << bit_no)) != 0);
}
Various experiments suggest that the latter may be faster only under inline-optimized conditions, where `bit_no` is a known constant in any particular case.
Plus, on some systems, shifting left is faster than shifting right. (Compare how multiplication is faster than division; there is also sign-extension to worry about when shifting a signed value rightwards.)

The first example, however, seems way more direct in the ultimate result for variable, dynamic conditions.
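Whichever compiles better, the two forms are interchangeable; a brute-force check over 16-bit flag words (the width of VCO/VCC) confirms they always agree:

```c
/* The two formulations from the post. */
static int get_flag_state_shr(unsigned int mask, int bit_no)
{
    return ((mask >> bit_no) & 1);
}

static int get_flag_state_shl(unsigned int mask, int bit_no)
{
    return ((mask & (1u << bit_no)) != 0);
}

/* Returns 1 if both forms agree for every 16-bit mask and bit index. */
static int flag_forms_agree(void)
{
    unsigned int mask;
    int bit;

    for (mask = 0; mask <= 0xFFFFu; mask++)
        for (bit = 0; bit < 16; bit++)
            if (get_flag_state_shr(mask, bit) != get_flag_state_shl(mask, bit))
                return 0;
    return 1;
}
```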
  #29
Old 17th February 2013, 02:35 AM
HatCat

Now I know why everybody who reversed the RSP is emulating a part of VCL this way:

Code:
if ( RSP_Vect[RSPOpC.rd].UHW[el] + RSP_Vect[RSPOpC.rt].UHW[del] > 0x10000) {
    RSP_ACCUM[el].HW[1] = RSP_Vect[RSPOpC.rd].HW[el];
    RSP_Flags[1].UW &= ~(1 << (7 - el));
}
The real algorithm is actually to check whether the low 16 bits of the sum are all zeroes, or whether the high bits are all zeroes (i.e. no carry-out); only when neither half is zero do you land in zilmar's sum > 0x10000 case.

My implementation of the accurate VCL formula:

Code:
const unsigned short VS = (unsigned short)VR[vs].s[i];
const unsigned short VT = (unsigned short)VR[vt].s[j=i];
/*...*/
{
    const int sum = VS + VT;
    int ce = (VCE >> i) & 1;
    int lz = ((sum & 0x0000FFFF) == 0x00000000);
    int uz = ((sum & 0xFFFF0000) == 0x00000000); /* !carryout */

    le = (ce & (lz | uz)) | (!ce & (lz & uz));
}
lz and uz (and ce) are all my names; they're not copyrighted labels.
The real names were confusing as fuck to read.
Detect carry vs. LO HW?
I would rather just say lz and uz: lower 16-bit immediate is a string of all zeroes?
upper 16-bit immediate is a string of all zeroes?

If either is true, then ce (zilmar's first check in the if statement I just pasted), which is true (1), is AND'd with 1, which returns true, and we buffer the negation of the u16 VT source slice over to the destination accumulator slice and vector register slice.

zilmar's method is effectively identical because:
`if (x > 0x00010000)`

all possibilities for x > 0x10000 mean that both the low halfword and the high halfword contain nonzero bits, except values like 0x00020000, which WOULD violate zilmar's algorithm and cause a failure in his RSP plugin, had 0x00020000 been a reachable sum of two 16-bit unsigned integers (it isn't; the maximum possible sum is 0xFFFF + 0xFFFF = 0x1FFFE).
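That equivalence is small enough to verify exhaustively: every reachable sum of two unsigned 16-bit integers lies in 0 .. 0x1FFFE, so a quick loop (my own sketch, not code from either plugin) can confirm that zilmar's x > 0x10000 test matches the !(lz | uz) condition:

```c
/* Returns 1 if, for every reachable sum of two u16 values (0 .. 0x1FFFE),
 * (sum > 0x10000) agrees with "low halfword nonzero AND carry-out nonzero",
 * i.e. !(lz | uz) in the notation of the post. */
static int zilmar_check_matches(void)
{
    unsigned long sum;

    for (sum = 0; sum <= 0x1FFFEul; sum++) {
        int lz = ((sum & 0x0000FFFFul) == 0);   /* low 16 bits all zero */
        int uz = ((sum >> 16) == 0);            /* no carry-out         */

        if ((sum > 0x10000ul) != !(lz | uz))
            return 0;
    }
    return 1;
}
```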

Crazy shit.

Last edited by HatCat; 17th February 2013 at 02:41 AM.
  #30
Old 17th February 2013, 04:37 AM
HatCat

const int eq = (((VCO >> (i + 8)) & 1) == 0); /* !(NOTEQUAL) */
Code:
	mov	dx, WORD PTR _VCO
	mov	DWORD PTR _VS$8219[ebp], eax
	lea	eax, DWORD PTR [ebx+8]
	movzx	esi, ax
	movzx	eax, dx
	mov	DWORD PTR _VT$8221[ebp], ecx
	mov	ecx, esi
	shr	eax, cl
const int eq = ((VCO & (0x0100 << i)) == 0x0000);
Code:
	mov	bx, WORD PTR _VCO
	mov	DWORD PTR _VT$8221[ebp], eax
	mov	DWORD PTR _VS$8219[ebp], ecx
	mov	ecx, edx
	mov	eax, 256				; 00000100H
	shl	eax, cl
	movzx	ecx, bx
	and	eax, ecx
	neg	eax
Damn... those are extremely close.
Yes, the second one has one extra Intel instruction, but there's a chance it might still be faster... I'm not the one to ask about cycle counts across a range of instructions.

Either way, YAY, I made it through VCL, and it would seem I retained accuracy for all results in the games.
The annotation-free assembly output for my VCH method is 498 lines, and for my VCL function, 538 lines... survived the entire day of rewriting this crazy operation with only a 40-line overhead in the emitted result.

[It's a good idea to weigh the execution time of VCH against that of VCL, since very many ucodes have a pattern of using both instructions an equal number of times per graphics task.]
Powered by vBulletin® Version 3.7.3
Copyright ©2000 - 2019, Jelsoft Enterprises Ltd.