#931
Not really a matter of convenience.
If anything, I'd say it's more "convenient" to just write _mm_set1_epi16 than it is to evade arrogant vendor lock-in. 'What is easy is not always what's right.' Seriously, Intel's vector unit is old news. Other computers have vector units too, and the proper SSE instruction set came way too late to the party to be venerable, considering the RSP's vector unit. GCC, like you said, pretty much makes SSE intrinsics almost a complete joke now. There are only so many reasons left for me to use explicit intrinsic functions these days, and shuffling is one of them for now.
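To make the portability point concrete, here is a minimal sketch of a broadcast written as a plain C loop instead of _mm_set1_epi16. The names `N` and `splat16` are made up for illustration and are not from the RSP source. Whether a given compiler turns the loop into a single broadcast depends on the compiler and its flags, so treat this as an assumption to check in the asm output.

```c
#include <stdint.h>

#define N 8 /* hypothetical lane count: 8 x 16-bit, matching one 128-bit register */

/* Copy one scalar into every lane. No vendor intrinsics, so it builds anywhere;
 * an auto-vectorizing compiler may or may not emit a single broadcast for it. */
void splat16(int16_t dst[N], int16_t value)
{
    int i;

    for (i = 0; i < N; i++)
        dst[i] = value;
}
```

The trade-off is that the intrinsic guarantees the instruction on x86, while the loop only gives the compiler the chance to pick it.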
__________________
http://theoatmeal.com/comics/cat_vs_internet
#932
True, for some of those one-liners it can be convenient to just use intrinsics.
I was mostly thinking about the macros you had, like do_mulf. But I guess part of the reason it looks so simple is that you're using macros. I bet even without macros it's easier to read than 50+ lines of intrinsics. I gotta admit, GCC is genius! I saw that it didn't vectorize functions like set_VCC, but good thing I already did those manually. Part of the reason I'm looking at Intel's compiler output is that it's more readable. I don't see the symbol names in GCC's output, so I have no idea what dword ptr [ebp +12] is. I was totally right that looking at compiler output is a good way to learn algorithms. I need to hurry up and figure out how to use WinAPI effectively with GCC. Then I bet I could learn a lot by simply compiling plugins with GCC and looking at the output. But that will be after I'm done with this. I can't overdo it with the multitasking.
#933
I would rather use MSVC than GCC with winapi stuff/Microsoft stuff.
In fact if my RSP emulator had the displeasure of being wholly written in intrinsic functions, putting aside a few of Microsoft's problems with logical branch prediction, I'd have used Microsoft Visual Studio to build the RSP, not GCC. Well, we'll see. Right now I'm stabilizing vertex buffer objects in the RDP plugin. With extensions like these, there's no reason to use DirectDraw even for this little 2-D business ever again. But as usual I'll have a run-time compatibility fallback.
#934
Man, patience is key! I'm taking it slower than I anticipated, and at the same time I realize that I can finish it quicker than I estimated. I originally intended to do a rush job and just copy-paste code, but now I'm reading it carefully and making sure I understand the algorithms before implementing them.
Looks like checking different compilers is very useful for code gen. The asm output for VMUDL looked worse in GCC than it did in Intel. I might as well take advantage of the good parts of both compilers. I agree that MSVC is better for WinAPI stuff. However, for your pixel-accurate fork, I don't think the pros of MSVC outweigh the cons. When I looked at the compiler output, I liked what I saw with Intel and did not like Clang's output, yet Clang outperformed both Intel and MSVC. I'm not suggesting you should quit using MSVC for the plugin, but I think at the very least you could benefit from trying out other compilers to see the kind of code they generate for important functions that you are trying to improve. Anyway, after I'm done with this RSP recompiler prototype, I'm probably going to work on / learn gfx next. So I was looking at Intel's compiler output for VMUDL and saw this:
Code:
psrad xmm3,10h
psrad xmm2,10h
pslld xmm3,10h
pslld xmm2,10h
psrad xmm3,10h
psrad xmm2,10h
Last edited by RPGMaster; 28th August 2014 at 01:40 AM.
#935
Finally implemented VBOs!
Wrong thread, but what the hell. GUESS WHAT A HUGE SPEED DIFFERENCE IT MADE IN THE RDP?? From 247 VI/s to about 255, so only 8! That's right, all you deprecation fans who are so convinced that removing old immediate-mode calls was a "needed evil"! Your obsession with the modernization of OpenGL in this style ain't worth an asshair! Alas, a speed-up is still a speed-up, so I'll keep this anyway, with VBOs implemented on top of my old compatibility functions (which will compile to DisplayError message boxes when using a modern GL/ES SDK where those functions are deprecated, to let you know what support your video card is missing). I'll try to figure out PBOs now, since that's probably a much bigger bottleneck than uploading vertex data (glTexSubImage2D pixel buffers might get that 255 up to a 355, who knows??).
Quote:
Code:
psrad xmm3,10h
pslld xmm3,10h
psrad xmm3,10h
psrad xmm2,10h
pslld xmm2,10h
psrad xmm2,10h
Code:
PSRAD xmm2, 16
PSRAD xmm3, 16
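A quick way to see why the three shifts per register reduce to a single PSRAD: after the first arithmetic shift the value already fits in 16 signed bits, so shifting it left 16 and arithmetically right 16 gives back the same number. The sketch below models one 32-bit lane in scalar C on raw bit patterns. The helper names are invented for illustration, and this is a model of the instructions, not compiler output.

```c
#include <stdint.h>

/* One 32-bit lane of PSRAD by 16: shift right and replicate the sign bit. */
static uint32_t psrad16(uint32_t lane)
{
    uint32_t r = lane >> 16;

    if (lane & 0x80000000u)
        r |= 0xFFFF0000u;
    return r;
}

/* One 32-bit lane of PSLLD by 16: shift left, zero-fill the low half. */
static uint32_t pslld16(uint32_t lane)
{
    return lane << 16;
}

/* The sequence from the Intel listing: psrad, pslld, psrad. */
static uint32_t three_shifts(uint32_t lane)
{
    return psrad16(pslld16(psrad16(lane)));
}

/* The reduced form: a single psrad. */
static uint32_t one_shift(uint32_t lane)
{
    return psrad16(lane);
}
```

For example, the lane 0x8001ABCD becomes 0xFFFF8001 after the first shift, 0x80010000 after the left shift, and 0xFFFF8001 again after the last one, which is what the single shift produced in the first place.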
#936
I did not take into account that proofreading would take up a good amount of the time spent. I originally thought I'd just copy-paste, but it sure paid off to pay attention. It seems that both Intel and GCC did some weird things. So what I ended up doing is comparing compiler output one function at a time and choosing what I think is the better of the two for implementing a specific vector instruction. I ended up commenting out those redundant psrad and pslld.
For GCC I basically used -msse2 and -O3 for optimizations. I may try later on to see if there are better compiler settings. That can wait till after I'm done though. That's great news that you've sped up your OpenGL code. I hope I can figure out how to fix my issue where I can't use a full core when playing demos like ABS Crap with the frame limiter off. OMG! This is really confusing me ;/ . In some functions, I see that Intel uses fewer instructions, despite the fact that it's doing dumb stuff. It makes me wonder which compiler's output, with minor tweaking, is better. I guess I'll have to profile later on, if I even decide to care about further optimizations. I see noob code like this in Intel's output:
Code:
psrld xmm3,10h
psrld xmm5,10h
pslld xmm2,10h
pslld xmm1,10h
pslld xmm3,10h
pslld xmm5,10h
psrad xmm2,10h
psrad xmm1,10h
psrad xmm3,10h
psrad xmm5,10h
Last edited by RPGMaster; 28th August 2014 at 07:56 AM.
#937
Or you can always sub-divide the ANSI C loops into even smaller ANSI C loops, just like a programmer breaks big goals into small ones? That should make the SSE output map a little more directly onto the actual C. The vector multiplies were the most interesting for me, since even the most subtle changes improved the SSE output, although operations like VCL and VCH easily outclass them in complexity.
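As a rough illustration of the small-loops idea, here is a hypothetical saturating 16-bit add split into four tiny loops, each doing one simple thing per lane. The function name, the lane count `N`, and the choice of operation are assumptions made for the example, not code from the RSP plugin. The intent is that each loop has an obvious packed counterpart, though what a given compiler actually emits still has to be checked in its output.

```c
#include <stdint.h>

#define N 8 /* hypothetical lane count: 8 x 16-bit elements per vector */

/* Saturating add, written as several one-step loops instead of one big one. */
void add_clamped(int16_t dst[N], const int16_t a[N], const int16_t b[N])
{
    int32_t sum[N];
    int i;

    for (i = 0; i < N; i++)               /* step 1: widen and add */
        sum[i] = (int32_t)a[i] + (int32_t)b[i];
    for (i = 0; i < N; i++)               /* step 2: clamp the high side */
        sum[i] = (sum[i] > 32767) ? 32767 : sum[i];
    for (i = 0; i < N; i++)               /* step 3: clamp the low side */
        sum[i] = (sum[i] < -32768) ? -32768 : sum[i];
    for (i = 0; i < N; i++)               /* step 4: narrow back to 16 bits */
        dst[i] = (int16_t)sum[i];
}
```

Keeping each loop free of branches between lanes and of cross-iteration dependencies is what gives the vectorizer a fair chance.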
Quote:
*switches to MSVC or GCC* Speaking of Intel, those are the jackasses who are actually agreeing to the OpenGL deprecation model. NVIDIA and AMD are still holding on to shit like glVertex. It's only on Intel, or if you're a Macintosh user, that the deprecated funcs get removed. And Intel's GL support was always half-assed anyway.
Last edited by HatCat; 28th August 2014 at 05:14 PM.
#938
Wow, all these compiler flaws I'm seeing have convinced me that it's best not to rely on the compiler.
I'm more curious now how an assembly version would turn out. Now that I mention compiler problems, I realize you mentioned issues with GCC 4.8.1 or something. I will have to figure out which version to use then. Here's the asm output of do_abs, using GCC 4.8.1 I think.
Code:
mov eax, DWORD PTR [ebp+12]
pxor xmm1, xmm1
mov DWORD PTR [esp+32], 0
movdqa xmm3, XMMWORD PTR LC0
mov DWORD PTR [esp+36], 0
mov DWORD PTR [esp+40], 0
sal eax, 4
movdqu xmm2, XMMWORD PTR _VR[eax]
mov DWORD PTR [esp+44], 0
movdqa xmm4, xmm2
mov eax, DWORD PTR [ebp+8]
psrlw xmm4, 15
pcmpgtw xmm2, xmm1
movdqa xmm1, XMMWORD PTR [esp+32]
psubw xmm1, xmm4
pand xmm2, xmm3
paddw xmm1, xmm2
pmullw xmm1, xmm0
movdqa xmm0, xmm1
sal eax, 4
pcmpeqw xmm0, XMMWORD PTR LC1
pand xmm0, xmm3
psubw xmm1, xmm0
movdqa XMMWORD PTR _VACC+32, xmm1
movdqa XMMWORD PTR _VR[eax], xmm1
#939
Quote:
Intrinsics who? This is after the load and shuffle, which themselves are only a handful of insns:
Code:
## VABS algorithm.
0: c4 e2 69 09 c8 vpsignw %xmm0,%xmm2,%xmm1
5: c5 e1 65 c0 vpcmpgtw %xmm0,%xmm3,%xmm0
9: c5 e1 65 e1 vpcmpgtw %xmm1,%xmm3,%xmm4
d: c5 e1 65 d2 vpcmpgtw %xmm2,%xmm3,%xmm2
11: c5 e9 db d0 vpand %xmm0,%xmm2,%xmm2
15: c5 e9 db dc vpand %xmm4,%xmm2,%xmm3
19: c5 f1 ef c3 vpxor %xmm3,%xmm1,%xmm0
## Store vector back to memory.
1d: c5 f8 29 87 30 02 00 vmovaps %xmm0,0x230(%rdi)
24: 00
25: c3 retq
Last edited by MarathonMan; 29th August 2014 at 01:50 AM.
#940
Wow, seeing how small that output is makes me wish my computer had AVX support ;/ .
Seriously, I'm just amazed at how few instructions it requires! At least now I know that compilers still need work on vectorization.