  #931  
Old 27th August 2014, 09:57 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Not really a matter of convenience.

If anything I'd say it's more "convenient" to just say _mm_set1_epi16 than it is to evade arrogant vendor lock-in.

'What is easy is not always what's right.' Seriously, Intel's vector unit is old news. There are other computers that have vector units, and a proper SSE instruction set came way too late to the party to be venerable, looking at the RSP's vector unit. GCC, like you said, pretty much makes SSE intrinsics almost a complete joke now. There are only so many reasons left for me to use explicit intrinsic functions these days, and shuffling is one of them for now.
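
For comparison, a portable counterpart to _mm_set1_epi16 could be sketched as a plain loop over the eight 16-bit lanes, leaving it to the compiler to pick a broadcast instruction. The names LANES and broadcast16 are made up for illustration, and whether a given compiler actually vectorizes the loop depends on its version and flags.

```c
#include <stdint.h>

#define LANES 8 /* eight 16-bit elements, like one 128-bit vector register */

/* Hypothetical helper: copy one scalar into every lane, with no intrinsics. */
void broadcast16(int16_t dst[LANES], int16_t value)
{
    int i;

    for (i = 0; i < LANES; i++)
        dst[i] = value;
}
```

The trade-off is the one described above: the loop builds anywhere ANSI C does, but the quality of the generated code is up to the compiler.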
  #932  
Old 27th August 2014, 10:42 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

True, for some of those one-liners it can be more convenient to just use intrinsics.

I was mostly thinking about the macros you had, like do_mulf. But I guess part of the reason it looks so simple is because you're using macros. I bet even without macros it's easier to read than 50+ lines of intrinsics.

I gotta admit, GCC is genius! I saw that it didn't vectorize functions like set_VCC, but good thing I already did those manually. Part of the reason I'm looking at Intel's compiler output is that it's more readable. I don't see the symbol names in GCC's output, so I have no idea what dword ptr [ebp+12] is. So now I'm looking at GCC's output, and whenever I don't know the stack variable, I look at Intel's output.

I was totally right that looking at compiler output is a good way to learn algorithms.

I need to hurry up and figure out how to effectively use Winapi with GCC. Then, I bet I could learn a lot by simply compiling plugins with GCC and looking at the output. But that will be after I'm done with this. I can't overdo it with the multitasking.
  #933  
Old 27th August 2014, 11:17 PM
HatCat

I would rather use MSVC than GCC with winapi stuff/Microsoft stuff.

In fact if my RSP emulator had the displeasure of being wholly written in intrinsic functions, putting aside a few of Microsoft's problems with logical branch prediction, I'd have used Microsoft Visual Studio to build the RSP, not GCC.

Well, we'll see. Right now I'm stabilizing vertex buffer objects in the RDP plugin. With extensions like these, there's no reason to use DirectDraw even for this little 2-D business ever again. But as usual I'll have a run-time compatibility fallback.
  #934  
Old 28th August 2014, 01:04 AM
RPGMaster

Man, patience is key! I'm taking it slower than I anticipated, and at the same time, I realize that I can finish it quicker than I estimated. I originally intended to do a rush job and just copy-paste code, but now I'm reading it carefully and making sure I understand the algorithms before implementing them.

Looks like checking different compilers is very useful for code gen. The asm output for VMUDL looked worse in GCC than it did in Intel. I might as well take advantage of the good parts of both compilers.

I agree that MSVC is better for winapi stuff. However, for your pixel-accurate fork, I don't think the pros of MSVC outweigh the cons. When I looked at the compiler output, I liked what I saw with Intel, and I did not like Clang's output, yet Clang outperformed both Intel and MSVC. I'm not suggesting you should quit using MSVC for the plugin, but I think at the very least, you could benefit from trying out other compilers to see the kind of code they generate for important functions that you are trying to improve.

Anyway, after I'm done with this RSP recompiler prototype, I'm probably going to work on / learn gfx next.

So I was looking at the Intel compiler's output for VMUDL and saw this
Code:
psrad   xmm3,10h
psrad   xmm2,10h
pslld   xmm3,10h
pslld   xmm2,10h
psrad   xmm3,10h
psrad   xmm2,10h
Wouldn't 1 psrad be the same thing as this sign-right shift, left shift, sign-right shift?

Last edited by RPGMaster; 28th August 2014 at 01:40 AM.
  #935  
Old 28th August 2014, 04:55 AM
HatCat

Finally implemented VBOs!

Wrong thread but what the hell.
GUESS WHAT A HUGE SPEED DIFFERENCE IT MADE IN THE RDP??

From 247 VI/s to about 255, so only 8!

That's right, all you deprecation fans who are so convinced that removing old immediate-mode calls was a "needed evil"! Your obsession with the modernization of OpenGL in this style ain't worth an asshair!

Alas, a speed-up is still a speed-up, so I'll keep this anyway with VBOs implemented on top of my old compatibility functions (which will compile to DisplayError message boxes when using a modern GL/ES SDK where those functions are deprecated, to let you know what support your video card is missing).
I'll try to figure out PBOs now since that's probably the much bigger bottleneck than uploading vertex data (glTexSubImage2D pixel buffers might get that 255 up to a 355 who knows??).

Quote:
Originally Posted by RPGMaster
So I was looking at the Intel compiler's output for VMUDL and saw this
Code:
psrad   xmm3,10h
psrad   xmm2,10h
pslld   xmm3,10h
pslld   xmm2,10h
psrad   xmm3,10h
psrad   xmm2,10h
Wouldn't 1 psrad be the same thing as this sign-right shift, left shift, sign-right shift?
I'll rearrange this so that it makes more sense to both of us.
Code:
psrad   xmm3,10h
pslld   xmm3,10h
psrad   xmm3,10h
psrad   xmm2,10h
pslld   xmm2,10h
psrad   xmm2,10h
Yes, that is retarded. :P I'm fairly sure Intel compiler could have probably just emitted:
Code:
PSRAD   xmm2, 16
PSRAD   xmm3, 16
I'm sure there's a logical reason behind the output the Intel compiler gave you; I just don't know what it is exactly because my understanding of compiler theory isn't that low-level. It does make sense that they interleaved the operations on xmm2 and xmm3 tho ofc, and this also helps symbolize the future possibility to merge them into 1 AVX register.
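
To check the "one PSRAD would do" idea without touching an assembler, the three-shift sequence can be modeled on a single 32-bit lane in plain C. This is only a scalar sketch: sll32, sra32, triple_shift, single_shift and shifts_agree are illustrative names, and the arithmetic shift is built from unsigned operations because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* One lane of PSLLD: logical left shift. */
static uint32_t sll32(uint32_t x, unsigned n) { return x << n; }

/* One lane of PSRAD: arithmetic right shift, valid here for 0 < n < 32. */
static uint32_t sra32(uint32_t x, unsigned n)
{
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (x >> n) | fill;
}

/* psrad 16, pslld 16, psrad 16, as in the quoted listing. */
uint32_t triple_shift(uint32_t x) { return sra32(sll32(sra32(x, 16), 16), 16); }

/* The single psrad 16 proposed as a replacement. */
uint32_t single_shift(uint32_t x) { return sra32(x, 16); }

/* Returns 1 if both forms agree for every possible upper halfword. */
int shifts_agree(void)
{
    uint32_t hi;

    for (hi = 0; hi <= 0xFFFFu; hi++) {
        uint32_t x = (hi << 16) | 0xBEEFu; /* low halfword is arbitrary */
        if (triple_shift(x) != single_shift(x))
            return 0;
    }
    return 1;
}
```

The first shift throws the low halfword away in both forms, so sweeping the upper halfword covers every distinct case.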
  #936  
Old 28th August 2014, 06:26 AM
RPGMaster

I did not take into account that proofreading would take up a good amount of the time spent. I originally just thought I'd copy-paste, but it sure paid off to pay attention. It seems that both Intel and GCC did some weird things. So what I ended up doing is comparing compiler output, one function at a time, and choosing what I think is the best of the two for implementing a specific vector instruction. I ended up commenting out those redundant psrad and pslld.

For GCC I basically used -msse2 and -O3 for optimizations. I may later on try seeing if there are better compiler settings. That can wait till after I'm done though.

That's great news that you've sped up your OpenGL code. I hope I can figure out how to fix my issue where I can't use a full core when playing demos like ABS Crap with frame limiter off.

OMG! This is really confusing me ;/ In some functions, I see that Intel uses fewer instructions, despite the fact that it's doing retarded stuff. It makes me wonder which compiler's output, with minor tweaking, is better. I guess I'll have to profile later on, if I decide to even care about further optimizations.

I see noob code like this in Intel's output
Code:
 psrld      xmm3,10h  
 psrld      xmm5,10h  
 pslld      xmm2,10h  
 pslld      xmm1,10h  
 pslld      xmm3,10h  
 pslld      xmm5,10h  
 psrad      xmm2,10h  
 psrad      xmm1,10h  
 psrad      xmm3,10h  
 psrad      xmm5,10h
For now, anytime I see something majorly retarded, I'll go with GCC, since I don't want to risk changing the algorithm. I just feel like there's a good chance that tweaking Intel's output would be superior to copy-pasting GCC's, for some functions. I've already encountered a few where I picked Intel's output.
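
Read one register at a time, that listing does two different jobs, which a scalar model of a single 32-bit lane can make visible. The sketch below only models the shifts shown above and assumes nothing about the surrounding function; all helper names are invented.

```c
#include <stdint.h>

static uint32_t srl(uint32_t x, unsigned n) { return x >> n; } /* PSRLD lane */
static uint32_t sll(uint32_t x, unsigned n) { return x << n; } /* PSLLD lane */

/* PSRAD lane, built from unsigned operations; valid here for 0 < n < 32. */
static uint32_t sra(uint32_t x, unsigned n)
{
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (x >> n) | fill;
}

/* xmm2/xmm1 path: pslld 16, psrad 16 sign-extends the LOW halfword. */
uint32_t low_half_signed(uint32_t x) { return sra(sll(x, 16), 16); }

/* xmm3/xmm5 path: psrld 16, pslld 16, psrad 16 on the HIGH halfword. */
uint32_t high_half_long(uint32_t x) { return sra(sll(srl(x, 16), 16), 16); }

/* Candidate shortcut for the xmm3/xmm5 path: one psrad 16. */
uint32_t high_half_short(uint32_t x) { return sra(x, 16); }
```

On this model the xmm2/xmm1 pair does real work, while the xmm3/xmm5 triple gives the same result as a lone psrad 16, so only the latter looks like a candidate for trimming.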

Last edited by RPGMaster; 28th August 2014 at 07:56 AM.
  #937  
Old 28th August 2014, 05:12 PM
HatCat

Or you can always sub-divide the ANSI C loops into even smaller ANSI C loops, just like a programmer designs big goals into small ones? Should make the SSE output a little more direct to the actual C. The vector multiplies were the most interesting for me to make even the most subtle of changes to improve the SSE output, although operations like VCL and VCH easily outclass them in complexity.
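
A rough sketch of what splitting one loop into smaller ones might look like, using an unsigned multiply-high over eight lanes as a stand-in. This is not code from the RSP plugin: mul_high_split, LANES, and the simplified semantics (no accumulator, no flags) are assumptions made purely for illustration.

```c
#include <stdint.h>

#define LANES 8 /* eight 16-bit elements per vector */

/* Hypothetical example: high 16 bits of an unsigned 16x16 multiply,
 * written as two small loops instead of one larger one. */
void mul_high_split(uint16_t vd[LANES], const uint16_t vs[LANES],
                    const uint16_t vt[LANES])
{
    uint32_t product[LANES];
    int i;

    /* Step 1: full 32-bit products, nothing else. */
    for (i = 0; i < LANES; i++)
        product[i] = (uint32_t)vs[i] * (uint32_t)vt[i];

    /* Step 2: keep only the upper halfword of each product. */
    for (i = 0; i < LANES; i++)
        vd[i] = (uint16_t)(product[i] >> 16);
}
```

Each loop body is a single element-wise operation, which is the shape auto-vectorizers tend to handle most directly; whether it beats the fused loop is something to confirm in the compiler output.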

Quote:
Originally Posted by RPGMaster
I see noob code like this in Intel's output
Code:
 psrld      xmm3,10h  
 psrld      xmm5,10h  
 pslld      xmm2,10h  
 pslld      xmm1,10h  
 pslld      xmm3,10h  
 pslld      xmm5,10h  
 psrad      xmm2,10h  
 psrad      xmm1,10h  
 psrad      xmm3,10h  
 psrad      xmm5,10h

*switches to MSVC or GCC*

Speaking of Intel, those are the jackasses who are actually agreeing to the OpenGL deprecation model.
NVIDIA and AMD are still holding on to shit like glVertex. The deprecated funcs only get removed if you're on Intel, or if you're a Macintosh user.
And Intel's gl support was always half-assed anyway.

Last edited by HatCat; 28th August 2014 at 05:14 PM.
  #938  
Old 28th August 2014, 11:16 PM
RPGMaster

Wow, all these compiler flaws I'm seeing have convinced me that it's best to not rely on the compiler.

I'm more curious how an assembly version would be now.

Now that I mention compiler problems, I realize you mentioned issues with gcc 4.8.1 or something. I will have to figure out which version to use then.

Here's the asm output of do_abs, using GCC 4.8.1 I think.
Code:
mov	eax, DWORD PTR [ebp+12]
pxor	xmm1, xmm1
mov	DWORD PTR [esp+32], 0
movdqa	xmm3, XMMWORD PTR LC0
mov	DWORD PTR [esp+36], 0
mov	DWORD PTR [esp+40], 0
sal	eax, 4
movdqu	xmm2, XMMWORD PTR _VR[eax]
mov	DWORD PTR [esp+44], 0
movdqa	xmm4, xmm2
mov	eax, DWORD PTR [ebp+8]
psrlw	xmm4, 15
pcmpgtw	xmm2, xmm1
movdqa	xmm1, XMMWORD PTR [esp+32]
psubw	xmm1, xmm4
pand	xmm2, xmm3
paddw	xmm1, xmm2
pmullw	xmm1, xmm0
movdqa	xmm0, xmm1
sal	eax, 4
pcmpeqw	xmm0, XMMWORD PTR LC1
pand	xmm0, xmm3
psubw	xmm1, xmm0
movdqa	XMMWORD PTR _VACC+32, xmm1
movdqa	XMMWORD PTR _VR[eax], xmm1
I may have to put this recompiler work on pause for a bit. I'm starting to worry that I'm doing something wrong here, since I'm basing most of the algorithms for vector instructions off the asm output from your RSP plugin.
  #939  
Old 29th August 2014, 01:47 AM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by RPGMaster
Wow, all these compiler flaws I'm seeing have convinced me that it's best to not rely on the compiler.
For vectorization, yes. I've only really studied Clang in this regard, since gcc's intermediate output is sorcery, but Clang, at least, trips all over code generation when it comes to vectorization. I'm assuming GCC suffers from the same fate.

Intrinsics who? This is after the load and shuffle, which themselves are only a handful of insns:

Code:
## VABS algorithm.
   0:   c4 e2 69 09 c8          vpsignw %xmm0,%xmm2,%xmm1
   5:   c5 e1 65 c0             vpcmpgtw %xmm0,%xmm3,%xmm0
   9:   c5 e1 65 e1             vpcmpgtw %xmm1,%xmm3,%xmm4
   d:   c5 e1 65 d2             vpcmpgtw %xmm2,%xmm3,%xmm2
  11:   c5 e9 db d0             vpand  %xmm0,%xmm2,%xmm2
  15:   c5 e9 db dc             vpand  %xmm4,%xmm2,%xmm3
  19:   c5 f1 ef c3             vpxor  %xmm3,%xmm1,%xmm0

## Store vector back to memory.
  1d:   c5 f8 29 87 30 02 00    vmovaps %xmm0,0x230(%rdi)
  24:   00 
  25:   c3                      retq
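
Read lane by lane, the listing above can be modeled in scalar C. This is an interpretation rather than the original source: it assumes xmm0 holds vs, xmm2 holds vt, and xmm3 holds zero on entry, none of which the snippet shows, and the names neg_mask and vabs_lane are invented.

```c
#include <stdint.h>

/* All-ones if the 16-bit pattern is negative, like vpcmpgtw against zero. */
static uint16_t neg_mask(uint16_t x) { return (x & 0x8000u) ? 0xFFFFu : 0u; }

/* Scalar model of one lane; operands and result are raw 16-bit patterns. */
uint16_t vabs_lane(uint16_t vs, uint16_t vt)
{
    uint16_t r, fix;

    /* vpsignw: negate vt if vs < 0, zero it if vs == 0, else pass it on. */
    if (vs & 0x8000u)
        r = (uint16_t)(0u - vt);
    else if (vs == 0)
        r = 0;
    else
        r = vt;

    /* Three vpcmpgtw and two vpand: set only when vs, vt and r are all
     * negative, i.e. when negating 0x8000 wrapped back to 0x8000. */
    fix = neg_mask(vs) & neg_mask(vt) & neg_mask(r);

    /* vpxor: 0x8000 ^ 0xFFFF = 0x7FFF, saturating the result to +32767. */
    return (uint16_t)(r ^ fix);
}
```

Under those assumptions the compare, and, xor tail exists only to handle the one overflow case, which would explain why the whole operation fits in seven instructions.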

Last edited by MarathonMan; 29th August 2014 at 01:50 AM.
  #940  
Old 29th August 2014, 02:09 AM
RPGMaster

Wow, seeing how small that output is makes me wish my computer had AVX support ;/

Seriously, I'm just amazed at how few instructions it requires! At least now I know that compilers still need work on vectorization.