#1011 - 6th October 2014, 05:46 AM
RPGMaster (Alpha Tester, Project Supporter, Super Moderator)
Join Date: Dec 2013 | Posts: 2,008

Quote:
Originally Posted by Vardius2985
I'm definitely going to stick with this RSP plugin. Just out of curiosity, how slow will Jabo's Direct3D8 1.7.x be when using its LLE side with the RSP set to LLE?
Jabo's LLE is typically fast. Most games should run full speed due to hardware acceleration. Some games are still slow (like Rogue Squadron, WDC, Stunt Racer, etc).

Like HatCat said, it depends on the game. With Jabo's, it's even more complicated, because it depends on how much hardware acceleration was implemented for that specific game. I believe some games (like Quake II, F-Zero, etc.) are heavy on the RDP, so the RSP hardly affects speed when using the Pixel-Accurate plugin. Since you're interested in Jabo 1.7, it has a decent amount of hardware acceleration implemented for F-Zero, so even that game benefits from a faster RSP with Jabo's plugin.

For those slower parts of Mario, I'd imagine it's mostly the RDP that needs optimizations.
#1012 - 6th October 2014, 09:01 PM
RPGMaster (Alpha Tester, Project Supporter, Super Moderator)
Join Date: Dec 2013 | Posts: 2,008

Since you're trying to optimize your RSP, I might as well post some stuff. This was taken before you fixed VCH, so you can ignore that one.

Here's a comparison between Intel and GCC 4.8.1:

Code:
                   Intel          GCC
VMULF           :  0.432 s      0.276 s
VMACF           :  0.299 s      0.286 s
VMULU           :  0.392 s      0.243 s
VMACU           :  0.341 s      0.347 s
VMUDL           :  0.128 s      0.190 s
VMADL           :  0.329 s      0.308 s
VMUDM           :  0.216 s      0.218 s
VMADM           :  0.304 s      0.283 s
VMUDN           :  0.221 s      0.187 s
VMADN           :  0.375 s      0.297 s
VMUDH           :  0.153 s      0.161 s
VMADH           :  0.227 s      0.230 s
VADD            :  0.101 s      0.135 s
VSUB            :  0.126 s      0.173 s
VABS            :  0.118 s      0.232 s
VADDC           :  0.135 s      0.163 s
VSUBC           :  0.143 s      0.178 s
VSAW            :  0.070 s      0.092 s
VEQ             :  0.101 s      0.150 s
VNE             :  0.100 s      0.155 s
VLT             :  0.132 s      0.197 s
VGE             :  0.127 s      0.203 s
VCH             :  0.257 s      0.206 s
VCL             :  0.255 s      0.301 s
VCR             :  0.160 s      0.170 s
VMRG            :  0.093 s      0.081 s
VAND            :  0.080 s      0.082 s
VNAND           :  0.082 s      0.082 s
VOR             :  0.080 s      0.081 s
VNOR            :  0.082 s      0.084 s
VXOR            :  0.080 s      0.081 s
VNXOR           :  0.082 s      0.082 s
VRCPL           :  0.352 s      0.347 s
VRSQL           :  0.524 s      0.439 s
VRCPH           :  0.129 s      0.115 s
VRSQH           :  0.128 s      0.116 s
VMOV            :  0.141 s      0.141 s
VNOP            :  0.035 s      0.041 s
Total time spent:  7.130 s      7.153 s
So my idea of mixing compiler output turned out to be a success. It was a great starting point when I was not too familiar with SSE, and even now I still use some of it. I'm glad to know that I can trust my intuition, because I was able to guess the better choice the majority of the time.
#1013 - 6th October 2014, 09:29 PM
HatCat (Alpha Tester, Project Supporter, Senior Member)
Join Date: Feb 2007 | Location: In my hat. | Posts: 16,255

It's not really all that important to optimize VCL and VCH any further, at least. (Hell, even what I went through already to optimize them exposed them to bug reports!) I'm glad it's vectorized, but it doesn't need to be perfect SSE4 or AVX code or anything like that. Those two are actually a pretty narrow slice compared to the other, more populous opcodes, which show up so often in complex algorithms that they see a greater, more visible benefit from optimization.
#1014 - 6th October 2014, 10:41 PM
RPGMaster (Alpha Tester, Project Supporter, Super Moderator)
Join Date: Dec 2013 | Posts: 2,008

Quote:
Originally Posted by HatCat
It's not really all that important to optimize VCL and VCH anyway, any further at least.
Perhaps you're right. I already know VCH is well done. Part of the reason I want to give optimizing VCL a shot is that it's good SSE practice; I've learned a lot so far. I also want to finish up these last few optimizations and then start focusing more on accuracy / game compatibility.
Quote:
Originally Posted by HatCat
(Hell even what I went through already to optimize them, exposed them to bug reports!)
Speaking of bug reports, I'm totally confused! I was testing Beetle Adventure Racing and noticed there were random glitches caused by the RSP. What's confusing is that it only happens on some builds. Your official version 6 has the bug, and the same goes for the Intel build I had. Yet when I compiled it with GCC with SSE4 (months ago), it did not have the bug. I think your current source fixed the problem, because I don't see it in the latest Intel build. I don't think it's related to the VCH bug, so I suspect it was some undefined behavior, but it seems to be fixed anyway. That's something you might want to check out.

Quote:
Originally Posted by HatCat
I'm glad it's vectorized, but it doesn't need to be perfect SSE4 or AVX code or anything like that.
Yeah, I'm not even going to bother with SSE4, and especially not AVX. Pretty much pointless, at least until LLE gfx plugins get considerably faster. For some reason, RDP seems a lot harder for me to optimize, let alone understand.
#1015 - 6th October 2014, 10:59 PM
HatCat (Alpha Tester, Project Supporter, Senior Member)
Join Date: Feb 2007 | Location: In my hat. | Posts: 16,255

Quote:
Originally Posted by RPGMaster
For some reason, RDP seems a lot harder for me to optimize, let alone understand.
No regrets though. Tangling my way through the jungle of accurate RDP emulation code and tens of optimizations of my own has given me this explosion of ideas to try for the RSP.

Feels good to be working on it again. I'm still in the middle of a massive fundamental commit as we speak.
#1016 - 7th October 2014, 12:09 AM
theboy181 (Alpha Tester, Project Supporter, Senior Member)
Join Date: Aug 2014 | Location: Prince Rupert, British Columbia, Canada | Posts: 426

Exciting times ahead!!

Are you close to a public release? Can't wait to see how far you've gotten since your last one.
#1017 - 7th October 2014, 12:12 AM
HatCat (Alpha Tester, Project Supporter, Senior Member)
Join Date: Feb 2007 | Location: In my hat. | Posts: 16,255

Changes mean direction... most likely good in this case. Although I was hoping for a reason to include a 64-bit build of this plugin that would work somewhere.

Not really any bugs left to fix as far as I'm aware, just better speed... Actually, I may tweak the cycle-timing bypass for WDC/SR64/others and the semaphore fix to vary with the other options a little creatively.
#1018 - 7th October 2014, 02:26 AM
HatCat (Alpha Tester, Project Supporter, Senior Member)
Join Date: Feb 2007 | Location: In my hat. | Posts: 16,255

welp.

I don't always divide by 0, but when I do:


lol, I know which opcode change broke it, though... If it were anything outside of the divides, then the audio wouldn't still be perfect. It's nice to break emulation on purpose every once in a while. XD
#1019 - 7th October 2014, 03:55 AM
HatCat (Alpha Tester, Project Supporter, Senior Member)
Join Date: Feb 2007 | Location: In my hat. | Posts: 16,255

Yay, only 1 bug caused by changing 50 files in a single day.

What a massive change...and for the better!
According to the benchmark results, the complicated algorithms (the ones with a lot of x86 work left to finish each vector instruction) have gotten slower, but that's a temporary concession to keeping the port safe and stable for now while moving my code over to the new method, which passes XMM values on the function call stack. Actually, they're not really __m128i's; they're v16's.

Public Release 6 in this thread:
Code:
VSAW   :  0.096 s
VEQ    :  0.342 s
VNE    :  0.332 s
VMRG   :  0.116 s

VAND   :  0.112 s
VNAND  :  0.104 s
VOR    :  0.121 s
VNOR   :  0.105 s
VXOR   :  0.121 s
VNXOR  :  0.104 s
New algorithm I'm committing to GitHub today:
Code:
VSAW   :  0.081 s
VEQ    :  0.235 s
VNE    :  0.185 s
VMRG   :  0.099 s

VAND   :  0.094 s
VNAND  :  0.101 s
VOR    :  0.091 s
VNOR   :  0.099 s
VXOR   :  0.090 s
VNXOR  :  0.098 s
Everything simple is faster!
The ones that are slower (multiplies, divides, and clip selects) have given me an opportunity to make them faster than before: the same new structure that slowed them down also expands the work fundamentally. It's kind of like solving some math equations... sometimes you have to "anti-simplify" and complicate the expression so that you can simplify it even further than you could in its original form.
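The register-passing idea can be sketched in C along these lines. The `v16` name comes from the post itself; everything else here, including the accumulator stand-in and the function signatures, is hypothetical illustration rather than the plugin's real interface.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* "v16": 8 lanes of 16 bits. Carried in an __m128i here, though as the
 * post notes, the plugin's type need not literally be __m128i. */
typedef __m128i v16;

static v16 VACC_lo;  /* stand-in for the accumulator's low 16-bit slice */

/* Old style (sketch): handlers received register *indices* and had to
 * reload operands from the register file in memory on every call:
 *   void VAND_old(int vd, int vs, int vt);
 *
 * New style (sketch): operands arrive already in XMM registers, so a
 * simple logical opcode collapses to one SSE op plus the ACC store. */
static v16 VAND_new(v16 vs, v16 vt_shuffled)
{
    v16 result = _mm_and_si128(vs, vt_shuffled);
    VACC_lo = result;  /* RSP logical ops also write the low accumulator */
    return result;
}
```

With the operands already in registers, the compiler has almost nothing left to emit for a logical opcode besides the `pand` itself, which is consistent with the shrunken assembly listing in the following post.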
#1020 - 7th October 2014, 04:03 AM
HatCat (Alpha Tester, Project Supporter, Senior Member)
Join Date: Feb 2007 | Location: In my hat. | Posts: 16,255

On the lower-level detail of things, this is how the RSP VAND interpreter came out in the latest public release to this thread, in SSSE3:
Code:
_VAND:
    pushl    %ebp
    movl    %esp, %ebp
    pushl    %ebx
    andl    $-16, %esp
    subl    $16, %esp
    movl    8(%ebp), %eax
    movl    12(%ebp), %edx
    movl    16(%ebp), %ebx
    movl    20(%ebp), %ecx
    sall    $4, %ebx
    movdqu    _VR(%ebx), %xmm0
    sall    $4, %ecx
    movdqu    _smask(%ecx), %xmm1
    pshufb    %xmm1, %xmm0
    sall    $4, %edx
    movdqu    _VR(%edx), %xmm1
    pand    %xmm1, %xmm0
    movdqa    %xmm0, _VACC+32
    sall    $4, %eax
    movdqa    %xmm0, _VR(%eax)
    movl    -4(%ebp), %ebx
    leave
    ret
This is how it looks now:
Code:
_VAND:
    pushl    %ebp
    pand    %xmm2, %xmm1
    movdqa    %xmm1, %xmm0
    movl    %esp, %ebp
    andl    $-16, %esp
    subl    $32, %esp
    movdqa    %xmm1, _VACC+32
    leave
    ret
I'm not really sure what the pushl %ebp op at the top of the latter output is for... All the arguments became either XMMs or known constants, so it shouldn't really be there. Either way, gradual changes to all the opcodes, including any useful intrinsic functions, should make it go away over time.

To be fair, the first version also included the shuffling of VT via the SSSE3 pshufb, unlike the second. I decided to optimize for smaller module size and keep the instruction cache simple... The complicated, bottleneck-ish vector opcodes are certainly the ones that would want to assume the shuffling happens.
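For reference, the VT shuffling being discussed can be written out in plain C like this. This is a sketch of the logical effect only: the function name is made up, and the element-specifier decode follows the RSP's standard 4-bit `e` field semantics, which the smask tables encode as pshufb byte masks.

```c
#include <stdint.h>

/* What the smask/pshufb step accomplishes: permute VT's eight 16-bit
 * lanes according to the instruction's 4-bit element specifier e. */
static void shuffle_vt(uint16_t out[8], const uint16_t vt[8], unsigned e)
{
    for (unsigned i = 0; i < 8; i++) {
        unsigned src;
        if (e < 2)       src = i;                   /* whole vector */
        else if (e < 4)  src = (i & ~1u) | (e - 2); /* quarters 0q, 1q */
        else if (e < 8)  src = (i & ~3u) | (e - 4); /* halves 0h..3h */
        else             src = e - 8;               /* scalar broadcast */
        out[i] = vt[src];
    }
}
```

A pshufb with a per-`e` 16-byte control mask does all eight of these lane moves in a single instruction, which is why the shuffle is cheap enough to keep only in the opcodes that actually need it.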