#951
Quote:
*sigh*, why do I even bother. Quote:
Quote:
I'm observing problems you seem to be having with not skipping over relevant parts of posts, which had always been a problem. Now if you're convinced that for some strange reason, I *know* what your medical diagnosis is, notice that I didn't say you're diagnosed in so-and-so a manner. Probably what I should have said was "being ADHD". It's perfectly valid for a normal person to be an ass and the other way around. Quote:
"If you had used intrinsics, you could have just seen it from the getgo instead of having to refactor your entire CP2 codebase and you wouldn't have the "mess" of SSE2 instructions that you have now. Then I would have never even responded." If you don't care what I implemented, why is it true that my implementation using intrinsics would have caused you to never respond? It's fine to have advice for people, but when you already know someone is adverse to it, it really just starts to become trolling. Notice how I don't pop on CEN64 forums just to talk to everyone there but you about things you're not doing that I think you should, when I know you're opposed to them. And if I keep posting about it in threads (MarathonMan's not using ANSI or zilmar specs!) to various users, I imagine that you as the owner of said forum would eventually find that worth moderating. For someone who's so sure that I "kept arrogantly assuming" (as you said on IRC after cxd4 /quit) that you were interested in adopting my RCP code, I find it interesting that I haven't bitched you out about one single thing of how you personally do things, yet you usually only post on this forum about things you disagree of how I'm doing it. Poor you; I'm sure you must feel on the defensive. Quote:
It's hard to feel butthurt when you actually know you're right, especially if year-old code is the best example you have to criticize. I've been posting modern examples of ANSI C loops all over my RDP/angrylion thread and how that perfectly compiled to optimized SSE2 instructions; why didn't you participate in an argument then when I made those posts? Why only now and not then do you bring it up? Because at the time, I was right and you were wrong. In this case, you're right and I'm wrong, because shitty usage of ANSI C loops from my 1-year-less-experienced-self in old RSP code from back then has given you a false hope of proving that intrinsics always guarantee beating out portable code. That being said, you're right that you commented in your first post, "this is after the load and shuffle." Apparently I missed that part, or possibly forgot. See, maybe I have ADHD. Hell, maybe everyone does. To me, it's more an observation. Quote:
I hardly argued with your code. All of this was in defense of what I didn't know back then about SSE when I wrote that year-old RSP code. None of it was about *your* code, except that time I proved to RPGMaster that auto-vectorization in part of my method potentially beat out a part of your intrinsics code. I'm telling you that the shit that is those ANSI C loops right now just ain't optimized, and all you can keep doing is bashing that as the reason why it supposedly can't result in as good a thing as intrinsics, even though I've been posting ANSI C loops way more recently than those in my RDP thread that DO equate to the output of intrinsics. You're the one arguing with older code here; get the story straight! So now you're saying you're agreeing "the affirmative" with our observation that Clang's vectorized output is worse than GCC's? If that's true, then why did you say this: "Clang's IR all the way up to codegen looks awesome from an auto-vectorization standpoint. I'd be surprised if GCC's was any better;" Also, why did you post GCC output with absolutely no vectorization at all, if you think Clang's is worse, when it at least vectorizes? Quote:
But whatever, I understand you feel so wrongly accused on a number of things that it's only right to seek revenge. Quote:
Quote:
GCC, MSVC, Clang and the Intel compiler. I don't really know that many compilers, probably not compared to you at least. What's an optimizing compiler you trust in the competition for best output that DOESN'T have the capability to auto-vectorize? Is my RSP VABS emulator an accurate reason why intrinsic functions for SSE should be minimized/avoided? No, but that doesn't mean it proves every fucking example is invalid. If realizing that wouldn't have prevented the length of this argument, then I can't guess what it is for you. Quote:
Even if you fully use intrinsics everywhere you can think of, it's still a good idea to pass -msse2 and other relevant flags to signify that this is the limit, and to help vectorize other things you might have missed. If you really want the reassurance that your compiler is always going to spew out optimized code, regardless of what flags are passed, then don't pass -O3 to gcc. Pass -O0 and code in 1990s "I'm better than my compiler" mode. Quote:
Think about it. You're bashing the way I wrote my non-intrinsic code from back then. You even accused me of willingly passing (int vd, int vs, int vt), when really I told you I had no idea back then that passing __m128i's would have avoided pushes or pops and been superior to that. So yes, actually, in the case of being accused of knowing what I was doing while I was doing it, it DOES matter what I did or didn't know. You just don't think it does because you keep pointing out mistakes from the past of which I'm already aware. Quote:
I have GCC 4.9.something (forget) too on that laptop you sent. It auto-vectorizes just fine. Dunno what you're missing.
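For anyone following along, here is a minimal sketch of the kind of plain loop being argued about. The function name `vec_add_wrap` and the 8-lane layout are assumptions for illustration, not code from either emulator. Whether a given compiler vectorizes it depends on version and flags, so the comments only list real switches you can use to check for yourself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* assumption: one RSP vector register = 8 lanes of 16 bits */

/*
 * Lane-wise 16-bit add with wrap-around, written as a plain loop with no
 * intrinsics. Hypothetical helper, not taken from any real RSP plugin.
 *
 * To inspect what the compiler does with it:
 *   gcc   -O3 -msse2 -S file.c                  (read the assembly)
 *   gcc   -O3 -msse2 -fopt-info-vec -c file.c   (report vectorized loops)
 *   clang -O3 -msse2 -Rpass=loop-vectorize -c file.c
 */
static void vec_add_wrap(uint16_t *dst, const uint16_t *a, const uint16_t *b)
{
    size_t i;

    assert(dst != NULL && a != NULL && b != NULL);
    for (i = 0; i < LANES; i++)
        dst[i] = (uint16_t)(a[i] + b[i]); /* truncation gives the wrap */
}
```

With -O0 the loop stays scalar; with -O3 the compiler is free to vectorize it, but the output is worth checking rather than assuming.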
__________________
http://theoatmeal.com/comics/cat_vs_internet
#952
I assume that this type of banter is common among coders, and that it actually helps the progression of the scene.
Anyhow, this is making for a really good read, and I hope it continues.
#953
I knew that the argument formats were different. It's just that seeing the code made me realize that writing AVX code in assembly would probably be simpler. I'd like to emphasize the fact that I'm focused on making a recompiler, so the code gen is all done manually. Later on, I may try making multiple code paths, but I doubt that will be anytime soon. I'm fairly confident this sse2 one will satisfy my goals.
I admit I am surprised by that output MarathonMan posted. I'm guessing perhaps GCC isn't good with 64-bit? I don't see how you could mess up the compiler settings if you put -march=native. I'm pretty sure when I did -march=native, I saw better asm output, although it obviously wasn't limited to SSE2 ;/ . Clang can vectorize fine, still probably not as well as GCC, but it's certainly better than MSVC right now. The only reason it looked so bad in HatCat's fork is that, for some reason, using intrinsics inhibits that compiler's ability to auto-vectorize. When I disabled the intrinsics in Draw Triangle, the asm output looked much better in Clang than when it was mixed. So what I'll have to do for any project I decide to work on / make from scratch is have 2 different implementations. One will be intrinsics and the other will be pure ANSI C, because they don't seem to mix well. I wouldn't be surprised if other compilers have a problem with intrinsics interfering with auto-vectorization too. I've gone through most of the vector instructions now. It's been pretty interesting looking at the output. There were functions where both GCC and Intel did weird stuff, so I optimized them myself. Man, I've been pretty much working like a machine these past few days. The one thing I'm worried about is whether it will even work. It will be a huge hassle figuring out bugs. Other than that, recompilers are very interesting.
#954
Quote:
Still bothers me to all hell I can't convince you of anything regardless of how much evidence I supply without getting lip, so regardless of the fact that I feel like I could respond strongly I'll just save both our efforts for development and ban/register myself from here. Peace.
#955
Well I adopted your SSSE3 shuffle code. I didn't understand pshufb all that well to begin with, so I looked at how you implemented it and even commented credit to you in my shuffle.h source. So when I build my RSP with -DARCH_MIN_SSSE3 and -mssse3, it uses those intrinsics instead.
There are plenty of other places where I had to use intrinsics (your movemask example being another one); I just want to keep it minimal/portable to non-Intel (like, on MIPS, for loops could compile to the SGI/RSP VAND instruction, rather than SSE2 pand!). The days where we both found fixes/corrections for RSP interpreter speed/accuracy were just so long ago that the future just isn't the past I guess. Either way, I'd never even heard of SSE before you brought it up, so like, try not to feel like you've never convinced me of anything before. RPGMaster: I just got done reading that entire post and can't remember a single goddamn thing about what I just read.
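To make the build-switch idea concrete, here is a rough sketch of a 16-bit lane shuffle that uses pshufb only when ARCH_MIN_SSSE3 is defined and otherwise falls back to a plain loop. The function name `shuffle_lanes`, its signature and the `order` table are assumptions for illustration; this is not the contents of the shuffle.h mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef ARCH_MIN_SSSE3
#include <tmmintrin.h> /* _mm_shuffle_epi8 (pshufb); build with -mssse3 */
#endif

/*
 * dst[i] = src[order[i]] for 8 lanes of 16 bits.
 * Hypothetical helper: name and signature are illustrative only.
 */
static void shuffle_lanes(int16_t dst[8], const int16_t src[8],
                          const uint8_t order[8])
{
    int i;

    for (i = 0; i < 8; i++)
        assert(order[i] < 8);

#ifdef ARCH_MIN_SSSE3
    {
        /* pshufb works on bytes, so each 16-bit lane needs two indices. */
        uint8_t mask[16];
        __m128i v, m;

        for (i = 0; i < 8; i++) {
            mask[2 * i + 0] = (uint8_t)(2 * order[i] + 0);
            mask[2 * i + 1] = (uint8_t)(2 * order[i] + 1);
        }
        v = _mm_loadu_si128((const __m128i *)src);
        m = _mm_loadu_si128((const __m128i *)mask);
        _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, m));
    }
#else
    {
        /* Portable path: a temporary keeps dst == src safe. */
        int16_t tmp[8];

        for (i = 0; i < 8; i++)
            tmp[i] = src[order[i]];
        memcpy(dst, tmp, sizeof tmp);
    }
#endif
}
```

Built without the define, only the loop is compiled, so the same source still builds on non-Intel targets.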
__________________
http://theoatmeal.com/comics/cat_vs_internet
#956
Rofl wow I goofed pretty bad. Some of the CP2 functions I implemented have some clear mistakes ;/ .
Taking it slow sure is better. I made a lot of silly mistakes when I was trying to do it fast. And I was staying up till 5 am working on this stuff for a few days. Anyway, is there a convenient way to use different versions of GCC? Or will I have to constantly install and uninstall? Hopefully soon I can get my code working ;/ .
#957
CP2 don't mean shit. MFC2/CTC2/?? V*?
Just start out small. Try doing something like LBV and SBV in recompiler, then try MFC2 and MTC2, including the 8-bit wrap-around possibility exclusive to MFC2. I just have different folders for different versions of GCC. $MinGW/libexec/bin/... tends to show you all the versions of GCC you have installed in folders, not just one.
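A rough sketch of what the MFC2 wrap-around could look like in an interpreter, before worrying about the recompiler. Everything here is an assumption for illustration: the register is modelled as 16 bytes in RSP (big-endian) order, and `mfc2_read` and `vreg_bytes` are made-up names. Check the behaviour against real hardware tests before trusting it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: byte 0 is the high byte of lane 0 (RSP byte order). */
typedef struct {
    uint8_t b[16];
} vreg_bytes;

/*
 * Read 16 bits starting at byte element e and sign-extend to 32 bits.
 * When e == 15 the low byte wraps around to byte 0 of the same register.
 * Hypothetical helper, not code from any existing plugin.
 */
static int32_t mfc2_read(const vreg_bytes *vr, unsigned e)
{
    int32_t value;

    assert(vr != 0 && e < 16);
    value = (vr->b[e] << 8) | vr->b[(e + 1) & 15]; /* wrap at byte 15 */
    if (value & 0x8000)
        value -= 0x10000; /* portable sign extension of a 16-bit value */
    return value;
}
```

A recompiler version would emit the same two byte reads, with the `(e + 1) & 15` index resolved at translation time since e is a constant in the opcode.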
__________________
http://theoatmeal.com/comics/cat_vs_internet
#958
You're right. I should start with LBV and SBV, then finish all the other Lc2 and Sc2. I originally tried to cut corners but that failed miserably.
Alright, I guess I'll go ahead and try installing different versions and do a comparison.
#959
So right now, I'm workin on LDV. Since it's a switch table, I'm wondering the best way to implement the jump table. Is there a good way to allocate a jump table, without using something like malloc? For the time being, I'm using a large static 2d array, but it will need to be changed in the future.
Lol first I was goin too fast, now I'm goin too slow ;/ . Time to pick up the pace
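On the jump table question, one option that avoids malloc is a table with static storage duration, indexed by the low address bits. This is only a sketch: the handler names, their signatures and their return values are placeholders, not how any real LDV implementation is organized.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical handler type: one routine per (addr & 7) alignment case. */
typedef int (*ldv_handler)(uint32_t addr);

/* Placeholder handlers; real ones would do the actual load. */
static int ldv_aligned(uint32_t addr)
{
    (void)addr;
    return 0;
}

static int ldv_unaligned(uint32_t addr)
{
    return (int)(addr & 7); /* just reports the misalignment here */
}

/* Lives in the program's read-only data: no malloc, no free. */
static const ldv_handler ldv_table[8] = {
    ldv_aligned,   ldv_unaligned, ldv_unaligned, ldv_unaligned,
    ldv_unaligned, ldv_unaligned, ldv_unaligned, ldv_unaligned
};

static int ldv_dispatch(uint32_t addr)
{
    ldv_handler fn = ldv_table[addr & 7];

    assert(fn != 0);
    return fn(addr);
}
```

For a recompiler the table entries could instead be addresses of generated code, but the storage idea is the same: a fixed-size table known at build time does not need the heap.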
#960
Quote:
LDV is a static operation...loads 64 bits to a vector register. I only made the interpreter implementation of it a switch table to optimize for all the possible alignments. If you do not assume an alignment (addr & 07), you must do 8 1-byte writes with constant endianness conversion of the byte address every time. I never said to finish LWC2 and SWC2, just to practice on LBV and SBV and M*C2. Little-endian CPU makes writing 64 bits at once break accuracy with LDV.
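A sketch of the "no alignment assumed" path described above: 8 single-byte copies, each with its own endianness fix-up of the byte address. The assumptions are loud ones: DMEM is taken to be 4 KiB kept as host-order 32-bit words on a little-endian CPU (hence the XOR with 3), the vector register is a 16-byte array in RSP byte order, and `ldv_bytewise` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

#define DMEM_SIZE 0x1000u /* assumption: 4 KiB of RSP data memory */
#define BYTE_SWAP 3u      /* assumption: words stored in little-endian host order */

/*
 * Load up to 8 bytes from DMEM at addr into the vector register starting
 * at byte element e, one byte at a time. Hypothetical helper for
 * illustration; it stops at the end of the register instead of wrapping.
 */
static void ldv_bytewise(uint8_t vr[16], const uint8_t *dmem,
                         uint32_t addr, unsigned e)
{
    unsigned i;

    assert(vr != 0 && dmem != 0 && e < 16);
    for (i = 0; i < 8 && e + i < 16; i++) {
        uint32_t byte_addr = (addr + i) & (DMEM_SIZE - 1);

        vr[e + i] = dmem[byte_addr ^ BYTE_SWAP]; /* per-byte endian fix-up */
    }
}
```

The switch on (addr & 07) is an optimization on top of this: once the alignment is known, several of these byte copies can be merged into wider accesses without getting the byte order wrong.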
__________________
http://theoatmeal.com/comics/cat_vs_internet