#941
Quote:
http://msdn.microsoft.com/fr-fr/libr...(v=vs.90).aspx The SSSE3 instructions are actually a byte smaller apiece (due to the lack of a ...VEX? prefix).
#942
Nah, Clang's worse than GCC in that sense, according to a couple of RPG's notes compared against mine.
Auto-vectorization would have beaten out the intrinsics he was intent on using, if Clang hadn't screwed it up for him.
__________________
http://theoatmeal.com/comics/cat_vs_internet
#943
Quote:
If you look closely, he's using 6 extra instructions to work around the problem with SSSE3 psignw:

Code:
vpcmpgtw xmm0, xmm0, xmm3
vpcmpgtw xmm4, xmm1, xmm3
vpcmpgtw xmm2, xmm2, xmm3
vpand    xmm2, xmm0, xmm2
vpand    xmm3, xmm4, xmm2
vpxor    xmm0, xmm3, xmm1

Code:
pmullw  xmm1, xmm0              ;# last instruction to touch xmm1; result of doing PSIGN in terms of SSE2
movdqa  xmm0, xmm1
pcmpeqw xmm0, XMMWORD PTR LC1
pand    xmm0, xmm3
psubw   xmm1, xmm0              ;# only 4 instructions to compare to INT16_MIN and, if true, correct the corner case

Don't buy into simple cut-and-paste marketing strategies! You can tell from looking at MM's paste that xmm3 was never defined within the scope of that function; it was an argument defined by a parent procedure. It's not the entire algorithm when comparing it to the size of the messy SSE2 output from my version, which is mostly messy because I pushed 3 ints to the function call stack, not 3 xmm's defined beforehand by an outside parent procedure (such as pre-handled shuffling).
__________________
http://theoatmeal.com/comics/cat_vs_internet

Last edited by HatCat; 29th August 2014 at 03:03 AM.
#944
Well, I can say for sure GCC is better than Clang at vectorization. I just checked the output of VADDC in Clang and saw poor code. Same with MSVC 2013. So that just leaves Intel and GCC. Intel and GCC aren't doing the best job for SSE2, though.

Without cheating, both GCC and Intel generally produce better code than I could, since I don't know SSE too well yet. But when looking at the output, I've noticed a few flaws that I could fix, so it's kinda cool to look at the output from different compilers and then come up with the best solution based on those results.

Regarding MM's paste, I knew it didn't show everything. I was just amazed at how much simpler it is when you have 3 register operands. It's really convenient when you have to write in assembly. I just think it may be cool to eventually add AVX support to the RSP recompiler. But after finding out that my computer doesn't support it, I'm obviously not going to bother.

Anyway, I need to find the best versions of each compiler. I'm really hoping this strange GCC output is the result of using an inferior version. What's a good version of MinGW?
#945
It has nothing to do with having 3 register operands, dude. >.< There are SSE2 opcodes that do that as well anyway.

I just proved that my SSE2 method of wrapping around the corner-case fix was even smaller than his AVX/SSSE3 version of it. It's easy to take PART of an overall AVX algorithm and show off that it's better than the SSE2 version, because in fact it is. That of course can't entirely be blamed on MarathonMan, though, since my overall output is still shitty due to the fact that I pushed 3 scalar ints to the function call stack, rather than using XMM register passing, which is basically transparent. Back then, I had no idea I could do that.

Quote:
NOT AVX intrinsic functions! NOT inline/assembly! (*facepalm*) It's SSE2/SSE3 intrinsics that made them: https://github.com/tj90241/cen64-rsp...ter/CP2.c#L247

AFAICT that's MarathonMan's source to the current VABS, which is mostly still his old algorithm from when he learned of psign's inaccuracy from my RSP source code. You see, AVX operations can be generated as a matter of COMPILER INTELLIGENCE, which neither of you two seems to have a whole lot of faith in.

Quote:
And I'm probably not even going to use MinGW for the next release of my RSP plugin anyway. It should be MSVC with some explicit improvements, maybe some more intrinsics, since I need to push XMM registers to functions anyway (*definitely* my biggest flaw in relying on auto-vectorization from ANSI C). So while you're multitasking across tens of different projects based on what could end up being old code (e.g. you do an RSP recompiler based on my old interpreter; I multiply the performance of my interpreter), I think I'll get back to concentrating on my one single everyday focus for now.
__________________
http://theoatmeal.com/comics/cat_vs_internet
#946
Interesting, I didn't know you could write SSE2/SSE3 intrinsics and have the compiler generate AVX. I get your point, though. Even if that example MM gave wasn't the best one, I can still imagine other vector instructions being simpler to implement at the assembly level, due to having 3 register operands instead of 2.

I don't have much faith in compiler intelligence for a good reason. I constantly observe compiler output and often see results that are not flawless, just from my POV. Since I'm working on a recompiler, which pretty much requires you to write assembly code (unless you're lazy and just copy bytes from interpreter functions), I might as well do the best I can to generate the best assembly code. I'm learning SSE2 instructions and hand-optimizing the output.

Anyway, I'm a lot more focused now after coming to a realization. No more multitasking for me for a while.

I guess I'll just install the newest version of MinGW and hope for the best. I believe you'll have to do some serious work on your interpreter to get MSVC to do a good job.
#947
First off, sue me for pasting the entirety of my do_abs function without the ret. It's the same thing, only I had the freedom of passing through xmm registers, as you later said.

EDIT: Never mind, I even posted the ret. Look at that.

Quote:
Quote:
Quote:
Regardless: we wrote different algorithms. You just compare against an INT16_MIN constant and take advantage of the fact that a compare returns -1 or 0, while I opted for extra compares and masked the signs of the compare masks together. I could have just as easily written _mm_set1_epi16 and gotten the same output from the compiler. Which brings me to my final point:

Quote:
Quote:
Speaking of which, how do you even manage to force it to do that? Quote:
Last edited by MarathonMan; 29th August 2014 at 06:30 AM.
#948
Quote:
But as long as you can put accusations in the other person's mouth, it always makes you look better to defuse them, once you've manufactured the impression that they were ever made to begin with.

Your problem of cut-pasting a small segment of code has nothing to do with whether or not you included a single ret instruction. The problem with your demonstration was that you bypassed all the things that my VABS function had implemented then: shuffling, register decode from scalar specifiers (int vd, int vs....), storage and writeback (_mm_store_si128), calculation of _mm_setzero_si128(), etc., because you MOVED those things to happen elsewhere. You were trying to deceive RPGMaster with false marketing of an algorithm that moved most of the vector shuffling/writeback/other phases to an external procedure, whereas you could plainly see that I had everything done within a single function.

Not saying I prefer it that way. This code was from a year ago. You do realize that?

Quote:
I also find it ironic that you're so convinced that Clang's auto-vectorization codegen looks "awesome", and yet you still insist that it's always inferior to intrinsics. Clearly you have a low standard for awesomeness.

I've known about ANSI auto-vectorization since before you did; I was the one who first told you that C code could be automatically compiled to SSE opcodes, and you were like, "hmm, interesting". But clearly you're still a little bit behind in the amount of experience using it.

Anyone who would write _mm_and_si128, for example, in a place where for (i = 0; i < 8; i++) dst[i] &= src[i]; could have done at least as well, is either totally ignorant or just unaware that they're unnecessarily enforcing vendor lock-in to Intel's exact ISA. You know, "vendor lock-in", that thing you claim to hate so much?

"too"? It's not present in your assembly. That was the whole point. Your pasted output shows no pxor x, x. So you can't say "too". It's only present in my assembly paste, not yours. You basically just got done blindly admitting to leaving out more extra-but-necessary steps of the emulation of VABS, just to make your code paste even smaller than RPG's.

Quote:
Wow, and here I thought most people would have already had the intelligence to see the fine proof that they were wrong about this little thing. Quote:
"Different algorithms!" It's not the ability to use an intrinsic function that counts. It's the algorithm that counts. In that case, mine was more direct than yours, even though the memory load costs extra.

And since you were so kind as to admit that _mm_set1_epi16 easily gets the same output from the compiler as my ANSI version, you also blindly admitted that the _mm_set1_epi16 version is redundant next to my ANSI version if it produces the exact same output. It's like using inline assembly to solve a problem that plain simple C would solve: you're politically obsessed with the lower-level version when it's not necessary. Well said. =]

Quote:
How do you manage to read that as "I'll just willingly traumatize the compiler with scalars"?

Quote:
Anyway, I've known about this since months ago, before you brought it up. Sorry to break your self-flattery, but I was already aware that I missed out on the fact that passing __m128i variables to a function call doesn't do pushes/pops. It was something I learned later on; you didn't have to point it out here. I pointed out first that this was the major reason why RPGMaster's SSE2 codegen was so messy; then you started pointing it out as your "main point", as though I wasn't already aware of it. :S

And just because I realized months ago that I have to pass __m128i's to function calls (actually, I could just drop the function call table and use a switch statement instead), doesn't mean I can't still use ANSI C, and I'm still going to continue to use it.
__________________
http://theoatmeal.com/comics/cat_vs_internet
#949
Quote:
I'll try to spell things out a little more clearly for you in the future, since you seem to have a hard time following along. Also note that in debates, it's informal and classless to label someone based on their medical condition, especially when that condition is no fault of their own.

Quote:
Quote:
- do_abs, as posted, doesn't shuffle either.
- I even mentioned in my original post: "This is after the load and shuffle."
- my function, as posted, also did writeback of the low part of the accumulator (just not vd).

Regardless, I'm sorry for not editing and recompiling my code just to make it conform to your inferior argument-passing convention. I generally dislike using my time in such ways. Furthermore, I assume that RPGMaster is capable of realizing that the argument formats were different and that it wasn't an apples-to-oranges comparison, but you had to comment because you got all butthurt. He's trying to write a recompiler and will have all those registers cached, so he only needs to see the vector-computation component of the algorithm anyways, yes?

Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
And, furthermore, you should know that the amount of time exposed to a topic is not sufficiently indicative of experience in a topic. Maybe I've been doing vast amounts of research into auto-vectorization? I already mentioned that I looked at the IR output of Clang, something which you weren't even aware existed based on your past posts, so who are you to judge me?

Quote:
Anyways, that's only one aspect of intrinsics. What's really great about intrinsics, and why I use them even with the vendor lock-in present, is that I can pass around data efficiently without having my code look like the product of a heroin addict and a hooker. They also give me the ability to readily use intrinsics like _mm_movemask_epi8 in cases where the compiler has a borderline-impossible time selecting them.

Quote:
Quote:
Quote:
Quote:
**** see next post for code ****

Hm... looks like it didn't auto-vectorize. Nor did it with -march=native -O3 -ftree-vectorize. Still wondering why I use those intrinsics?

Quote:
Have fun!

Last edited by MarathonMan; 29th August 2014 at 11:47 PM.
#950
****
Code:
0000000000004460 <do_abs>:
    4460:  31 c0                   xor    %eax,%eax
    4462:  66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
    4468:  0f b7 0c 02             movzwl (%rdx,%rax,1),%ecx
    446c:  66 89 4c 04 e8          mov    %cx,-0x18(%rsp,%rax,1)
    4471:  48 83 c0 02             add    $0x2,%rax
    4475:  48 83 f8 10             cmp    $0x10,%rax
    4479:  75 ed                   jne    4468 <do_abs+0x8>
    447b:  30 c0                   xor    %al,%al
    447d:  0f 1f 00                nopl   (%rax)
    4480:  0f b7 14 06             movzwl (%rsi,%rax,1),%edx
    4484:  66 c1 ea 0f             shr    $0xf,%dx
    4488:  66 89 54 04 a8          mov    %dx,-0x58(%rsp,%rax,1)
    448d:  48 83 c0 02             add    $0x2,%rax
    4491:  48 83 f8 10             cmp    $0x10,%rax
    4495:  75 e9                   jne    4480 <do_abs+0x20>
    4497:  30 c0                   xor    %al,%al
    4499:  0f 1f 80 00 00 00 00    nopl   0x0(%rax)
    44a0:  31 d2                   xor    %edx,%edx
    44a2:  66 83 3c 06 00          cmpw   $0x0,(%rsi,%rax,1)
    44a7:  0f 9f c2                setg   %dl
    44aa:  66 89 54 04 b8          mov    %dx,-0x48(%rsp,%rax,1)
    44af:  48 83 c0 02             add    $0x2,%rax
    44b3:  48 83 f8 10             cmp    $0x10,%rax
    44b7:  75 e7                   jne    44a0 <do_abs+0x40>
    44b9:  48 8d 44 24 c8          lea    -0x38(%rsp),%rax
    44be:  48 8d 50 10             lea    0x10(%rax),%rdx
    44c2:  66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
    44c8:  31 c9                   xor    %ecx,%ecx
    44ca:  48 83 c0 02             add    $0x2,%rax
    44ce:  66 89 48 fe             mov    %cx,-0x2(%rax)
    44d2:  48 39 d0                cmp    %rdx,%rax
    44d5:  75 f1                   jne    44c8 <do_abs+0x68>
    44d7:  48 8d 44 24 a8          lea    -0x58(%rsp),%rax
    44dc:  48 8d 50 10             lea    0x10(%rax),%rdx
    44e0:  66 f7 18                negw   (%rax)
    44e3:  48 83 c0 02             add    $0x2,%rax
    44e7:  48 39 d0                cmp    %rdx,%rax
    44ea:  75 f4                   jne    44e0 <do_abs+0x80>
    44ec:  31 c0                   xor    %eax,%eax
    44ee:  66 90                   xchg   %ax,%ax
    44f0:  0f b7 54 04 a8          movzwl -0x58(%rsp,%rax,1),%edx
    44f5:  66 01 54 04 c8          add    %dx,-0x38(%rsp,%rax,1)
    44fa:  48 83 c0 02             add    $0x2,%rax
    44fe:  48 83 f8 10             cmp    $0x10,%rax
    4502:  75 ec                   jne    44f0 <do_abs+0x90>
    4504:  30 c0                   xor    %al,%al
    4506:  66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    450d:  00 00 00
    4510:  0f b7 54 04 b8          movzwl -0x48(%rsp,%rax,1),%edx
    4515:  66 01 54 04 c8          add    %dx,-0x38(%rsp,%rax,1)
    451a:  48 83 c0 02             add    $0x2,%rax
    451e:  48 83 f8 10             cmp    $0x10,%rax
    4522:  75 ec                   jne    4510 <do_abs+0xb0>
    4524:  30 c0                   xor    %al,%al
    4526:  66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    452d:  00 00 00
    4530:  0f b7 54 04 c8          movzwl -0x38(%rsp,%rax,1),%edx
    4535:  66 31 54 04 e8          xor    %dx,-0x18(%rsp,%rax,1)
    453a:  48 83 c0 02             add    $0x2,%rax
    453e:  48 83 f8 10             cmp    $0x10,%rax
    4542:  75 ec                   jne    4530 <do_abs+0xd0>
    4544:  30 c0                   xor    %al,%al
    4546:  66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    454d:  00 00 00
    4550:  31 d2                   xor    %edx,%edx
    4552:  66 81 7c 04 e8 00 80    cmpw   $0x8000,-0x18(%rsp,%rax,1)
    4559:  0f 95 c2                setne  %dl
    455c:  66 89 54 04 d8          mov    %dx,-0x28(%rsp,%rax,1)
    4561:  48 83 c0 02             add    $0x2,%rax
    4565:  48 83 f8 10             cmp    $0x10,%rax
    4569:  75 e5                   jne    4550 <do_abs+0xf0>
    456b:  30 c0                   xor    %al,%al
    456d:  0f 1f 00                nopl   (%rax)
    4570:  0f b7 54 04 d8          movzwl -0x28(%rsp,%rax,1),%edx
    4575:  66 01 54 04 e8          add    %dx,-0x18(%rsp,%rax,1)
    457a:  48 83 c0 02             add    $0x2,%rax
    457e:  48 83 f8 10             cmp    $0x10,%rax
    4582:  75 ec                   jne    4570 <do_abs+0x110>
    4584:  30 c0                   xor    %al,%al
    4586:  66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    458d:  00 00 00
    4590:  0f b7 54 04 e8          movzwl -0x18(%rsp,%rax,1),%edx
    4595:  48 83 c0 02             add    $0x2,%rax
    4599:  66 89 90 00 00 00 00    mov    %dx,0x0(%rax)
    45a0:  48 83 f8 10             cmp    $0x10,%rax
    45a4:  75 ea                   jne    4590 <do_abs+0x130>
    45a6:  30 c0                   xor    %al,%al
    45a8:  0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
    45af:  00
    45b0:  0f b7 90 00 00 00 00    movzwl 0x0(%rax),%edx
    45b7:  66 89 14 07             mov    %dx,(%rdi,%rax,1)
    45bb:  48 83 c0 02             add    $0x2,%rax
    45bf:  48 83 f8 10             cmp    $0x10,%rax
    45c3:  75 eb                   jne    45b0 <do_abs+0x150>
    45c5:  c3                      retq
    45c6:  66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    45cd:  00 00 00