  #21  
Old 5th August 2013, 02:52 AM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by dsx! View Post
great work both of you, huge performance gains confirmed
Excellent! Glad to see that I'm not the only one experiencing this.
  #22  
Old 5th August 2013, 03:08 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Because most testers here don't have the hardware to test this plugin, but dsx! has this illegal stash of...alienware?

And by Alienware I mean Japanese for "drugs".

And inside of the drugs you find more drugs.
And inside the more drugs you find Easter Eggs.
And inside the easter eggs, you find: dsx!, with some more drugs.
And a bunch of crazy PCs.

Damn, I suck at poetry.

Quote:
Originally Posted by MarathonMan View Post
We should come up with some way to satisfy things for both compilers because I agree that it's stupid to have a dependency that chews up mammaries when it doesn't really serve much.
Well it's just a Windows issue, and fuck Windows.

Seriously though. That's like the easy solution to lots of things in this world.

Indeed though, I would like it to be flexible across multiple compilers, even win32 ones.

One of the things I was going to optimize about this plugin was SLLV/SRLV/SRAV.
I think maybe you don't need the & 31 to clamp the shift amount to the low five bits of the shift-amount register, if the Intel architecture already does this for us...
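
For reference, here is a minimal sketch of the three variable shifts in plain C, assuming the usual MIPS definition where the shift amount is the low five bits of rs. The function names are made up for illustration and are not taken from the plugin. In C the & 31 has to stay in the source, because shifting a 32-bit value by 32 or more is undefined behavior. On x86 the variable-count shift instructions already mask the count to five bits for 32-bit operands, so an optimizing compiler can usually drop the AND by itself.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* SLLV: shift rt left by the low five bits of rs. */
static uint32_t sllv(uint32_t rt, uint32_t rs)
{
    return rt << (rs & 31);
}

/* SRLV: logical (zero-filling) right shift by the low five bits of rs. */
static uint32_t srlv(uint32_t rt, uint32_t rs)
{
    return rt >> (rs & 31);
}

/* SRAV: arithmetic right shift. Right-shifting a negative signed value is
 * implementation-defined in C; this assumes the compiler sign-extends,
 * which GCC, Clang and MSVC do on x86. */
static int32_t srav(int32_t rt, uint32_t rs)
{
    return rt >> (rs & 31);
}
```

With the mask in place, a shift amount of 33 behaves like a shift by 1, which is what the RSP program expects.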

But, that's no big deal compared to you staticizing the SSE stuff in for me like my stubborn ass refused to do.
  #23  
Old 5th August 2013, 03:16 AM
shunyuan shunyuan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Apr 2013
Posts: 491
Default

Quote:
Originally Posted by dsx! View Post
great work both of you, huge performance gains confirmed
Would you mind sharing your settings, the games you tested, and the benchmark statistics both before and after SSE?
__________________
---------------------
CPU: Intel U7300 1.3 GHz
GPU: Mobile Intel 4 Series (on board)
AUDIO: Realtek HD Audio (on board)
RAM: 4 GB
OS: Windows 7 - 32 bit
  #24  
Old 5th August 2013, 03:17 AM
dsx_ dsx_ is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2010
Location: Australia
Posts: 1,105
Default

Quote:
Originally Posted by shunyuan View Post
Would you mind sharing your settings, the games you tested, and the benchmark statistics both before and after SSE?
sure, I'll reply with these soon
  #25  
Old 5th August 2013, 03:25 AM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by shunyuan View Post
Would you mind sharing your settings, the games you tested, and the benchmark statistics both before and after SSE?
Based on my knowledge of Intel's architectural designs, this will vary greatly depending on the processor (and, of course, the ROM).

Newer chips (especially Haswell) have an abundance of vector execution units; this plugin will make use of them. Older chips like C2D do not have as many vector execution units and will likely see a smaller gain.

I am excited to see what dsx!'s benchmarks show. Ivy Bridge should show a decent speedup.
  #26  
Old 5th August 2013, 04:14 AM
dsx_ dsx_ is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2010
Location: Australia
Posts: 1,105
Default

z64gl

Scene                     VI/s non-SSE (min-max)   VI/s SSE (min-max)
mario head                60-79                    75-96
rush 2049 intro           37-77                    46-98
top gear rally demo       82-180                   102-220
world driver intro        135-500                  160-600+
star fox 64 intro         54-320                   67-360
goldeneye 007 intro       55-160                   67-200
cruis'n usa intro         47-70                    57-84
rogue squadron intro      105-145                  125-160
tetrisphere intro         33-230                   40-265
re-volt options screen    82                       100

gains are consistently around 20~25% on the minimum VI/s
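
For anyone who wants to check the summary against the raw figures, here is a small sketch that works out the percentage gain per title. It only uses the minimum VI/s readings listed above (the "600+" maximum is open-ended, so the maxima are left out). The struct and function names are just illustrative.

```c
#include <stdio.h>

/* Percentage change from the non-SSE reading to the SSE reading. */
static double gain_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* One benchmark row: title plus minimum VI/s without and with SSE. */
struct reading {
    const char *title;
    double non_sse;
    double sse;
};

/* Minimum VI/s readings copied from the list above. */
static const struct reading min_vis[] = {
    { "mario head",              60.0,  75.0 },
    { "rush 2049 intro",         37.0,  46.0 },
    { "top gear rally demo",     82.0, 102.0 },
    { "world driver intro",     135.0, 160.0 },
    { "star fox 64 intro",       54.0,  67.0 },
    { "goldeneye 007 intro",     55.0,  67.0 },
    { "cruis'n usa intro",       47.0,  57.0 },
    { "rogue squadron intro",   105.0, 125.0 },
    { "tetrisphere intro",       33.0,  40.0 },
    { "re-volt options screen",  82.0, 100.0 },
};

/* Print one line per title with its percentage gain. */
static void print_gains(void)
{
    size_t i;
    for (i = 0; i < sizeof min_vis / sizeof min_vis[0]; i++)
        printf("%-24s %5.1f%%\n", min_vis[i].title,
               gain_percent(min_vis[i].non_sse, min_vis[i].sse));
}
```

Calling print_gains() from main prints the gain for each of the ten titles.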

Last edited by dsx_; 5th August 2013 at 04:27 AM. Reason: added more info
  #27  
Old 5th August 2013, 05:47 AM
mudlord_ mudlord_ is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Dec 2012
Posts: 383
Default

and AMD is officially shithouse.

nice job using SSE 4.1
  #28  
Old 5th August 2013, 06:33 AM
oddMLan oddMLan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2009
Location: Parappa Town
Posts: 210
Default

MarathonMan, to overcome the SSE support problems with older processors without rewriting much code, you can compile with Intel C++ Composer, and use the Optimized processor code path... although, it is not very... cheap.

Well, of course, you surely already knew this so... If you don't want to rewrite the code to work with an inferior instruction set I guess AMD users like me and haxatax can rot in hell.
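
The code-path idea can also be done by hand without the Intel compiler. Below is a rough sketch, assuming GCC or Clang on x86, that checks for SSE4.1 at run time and falls back to plain C otherwise. The function names are invented for the example, and this is not how Intel's compiler implements its dispatch.

```c
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics; also pulls in the SSE2 ones */

typedef void (*mul4_fn)(int32_t *dst, const int32_t *a, const int32_t *b);

/* Portable fallback: four 32-bit multiplies, keeping the low 32 bits. */
static void mul4_scalar(int32_t *dst, const int32_t *a, const int32_t *b)
{
    int i;
    for (i = 0; i < 4; i++)
        dst[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
}

/* SSE4.1 path. The target attribute (a GCC/Clang extension) lets this one
 * function use SSE4.1 even when the rest of the file is built without it. */
__attribute__((target("sse4.1")))
static void mul4_sse41(int32_t *dst, const int32_t *a, const int32_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_mullo_epi32(va, vb));
}

/* Pick an implementation once, based on what the running CPU reports. */
static mul4_fn select_mul4(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") ? mul4_sse41 : mul4_scalar;
}
```

The plugin would call select_mul4() once at startup and keep the returned pointer, so the check is not repeated per instruction.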
  #29  
Old 5th August 2013, 12:15 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by haxatax View Post
and AMD is officially shithouse.

nice job using SSE 4.1
SSE4.1 was the first to offer packed 4x32-bit signed multiplies.

SSE2 only offers 2x32-bit => 64-bit product multiplies, and only unsigned ones (_mm_mul_epu32); the signed version (_mm_mul_epi32) also needs SSE4.1.

SSE4.1 also offers blends, which save precious instructions while muxing data.
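
Without blends, the mux has to be spelled out as AND / ANDNOT / OR. Here is a minimal SSE2 sketch, assuming the mask lanes are either all ones or all zeros (the usual result of a compare). The helper name is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 */

/* Bitwise select: take bits from a where mask is 1, from b where mask is 0.
 * Three instructions on SSE2, where SSE4.1 can do it with one blend. */
static inline __m128i select_si128(__m128i mask, __m128i a, __m128i b)
{
    /* _mm_andnot_si128(x, y) computes (~x) & y */
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}
```

For all-ones/all-zeros byte masks this gives the same result as _mm_blendv_epi8(b, a, mask) on SSE4.1.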

I rewrote some of the SSE4.1 intrinsics in terms of only SSE2 intrinsics for people who aren't fortunate enough to have SSE4.1-capable hardware yet; I haven't tested, merged, or completed it, though.

e.g. _mm_mullo_epi32 (packed 4x32-bit multiplies) can be done in SSE2 like so:

Code:
#if 0
    /* SSE4.1 versions, kept for reference */
    __prodlo = _mm_mullo_epi32(__vvslo, __vvtlo);
    __prodhi = _mm_mullo_epi32(__vvshi, __vvthi);
#endif

    /* _mm_mul_epu32 is the SSE2 multiply (_mm_mul_epi32 needs SSE4.1).
       The low 32 bits of each product are the same signed or unsigned. */
    __prod1 = _mm_mul_epu32(__vvslo, __vvtlo); /* 64-bit products 0x0, 2x2 */
    __prod3 = _mm_mul_epu32(__vvshi, __vvthi); /* 64-bit products 4x4, 6x6 */

    /* move the odd elements down into the even slots */
    __vvslo = _mm_srli_si128(__vvslo, 4);
    __vvtlo = _mm_srli_si128(__vvtlo, 4);
    __vvshi = _mm_srli_si128(__vvshi, 4);
    __vvthi = _mm_srli_si128(__vvthi, 4);

    __prod2 = _mm_mul_epu32(__vvslo, __vvtlo); /* 64-bit products 1x1, 3x3 */
    __prod4 = _mm_mul_epu32(__vvshi, __vvthi); /* 64-bit products 5x5, 7x7 */

   /* shuffle all the __prods back to a common register */
It's just more instructions, and more loss of sanity for me at the moment.

As long as I can shuffle with the SSE2 halfword instruction, everything should backport to SSE2 cleanly.
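
For completeness, here is one way the whole thing could look for a single register, including the shuffle step left as a comment above. This is a sketch using only SSE2 intrinsics, not the code that went into the plugin, and the function name is invented.

```c
#include <emmintrin.h>  /* SSE2 only */

/* SSE2 stand-in for _mm_mullo_epi32: four 32-bit multiplies, low halves kept. */
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    /* 64-bit products of elements 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);

    /* shift elements 1 and 3 down to slots 0 and 2, then multiply those */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                _mm_srli_si128(b, 4));

    /* pack the low 32 bits of each 64-bit product into slots 0 and 1 */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));

    /* interleave to get products 0, 1, 2, 3 in order */
    return _mm_unpacklo_epi32(even, odd);
}
```

The unsigned multiply is fine for signed inputs here because only the low 32 bits of each product are kept, and those are identical for signed and unsigned multiplication.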

Last edited by MarathonMan; 5th August 2013 at 01:32 PM.
  #30  
Old 5th August 2013, 12:16 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by shunyuan View Post
Bug report: the video output for RR64 and Conker's BFD differs between the SSE version and the original version of FatCat's LLE RSP.

Testing environments:
PJ64 2.1
rsp_pj64 (original version and SSE version)
SoftGraphic v1.2
HleAudio

Testing games:
Ridge Racer 64
Conker's BFD

Results:

original version:


SSE version:


video record of original version

video record of SSE version

Both RR64 and Conker's BFD show similar lighting and color problems, so I have not included screenshots or video recordings for Conker's BFD.
Now this is interesting; thanks.

I never noticed this in my tests.

After talking to FatCat, I know I botched at least one instruction; hopefully that is the one causing these differences in colors.
All times are GMT.