  #941  
Old 29th August 2014, 02:15 AM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by RPGMaster View Post
Wow seeing how small that output is, makes me wish my computer had AVX support ;/ .

Seriously, I'm just amazed at how few instructions it requires! At least now I know that compilers still need work on vectorization.
Those are just AVX versions of SSSE3 instructions (hence the V prefix).
http://msdn.microsoft.com/fr-fr/libr...(v=vs.90).aspx

The SSSE3 instructions are actually a byte smaller apiece (due to the lack of a ...VEX? prefix).
  #942  
Old 29th August 2014, 02:47 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Nah, Clang's worse than GCC in that sense, according to a couple of RPG's notes compared to mine.

Auto-vectorizing beat out the intrinsics he was intent on using if Clang hadn't screwed it up for him.
  #943  
Old 29th August 2014, 02:59 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Quote:
Originally Posted by RPGMaster View Post
Wow seeing how small that output is, makes me wish my computer had AVX support ;/ .

Seriously, I'm just amazed at how few instructions it requires! At least now I know that compilers still need work on vectorization.
Seriously, you're letting the AVX equivalents of simple SSE2 operations deceive you, among other things. Also, the SSSE3 psign opcode he is using still needs extra instructions to hack around the fact that it is not an accurate mapping of the actual RSP's VABS instruction, which was knowledge formerly known only to zilmar and myself as all other RSP implementations and rspsim never covered it.

If you look closely, he's using 6 extra instructions to work around the problem with SSSE3 psignw:
Code:
vpcmpgtw    xmm0, xmm0, xmm3
vpcmpgtw    xmm4, xmm1, xmm3
vpcmpgtw    xmm2, xmm2, xmm3
vpand       xmm2, xmm0, xmm2
vpand       xmm3, xmm4, xmm2
vpxor       xmm0, xmm3, xmm1
On the other hand, that SSE2 output you pasted from trying to compile my plugin earlier:
Code:
pmullw	xmm1, xmm0
;# last instruction to touch xmm1, result of doing PSIGN in terms of SSE2

movdqa	xmm0, xmm1
pcmpeqw	xmm0, XMMWORD PTR LC1
pand	xmm0, xmm3
psubw	xmm1, xmm0
;# only 4 instructions to compare to INT16_MIN and if true, correct corner case
So his version added 6 instructions to correct the corner case; mine added 4.
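A minimal sketch of the SSE2 sequence above, written with intrinsics so it can be run. It assumes that xmm3 in the paste holds a 0/1 "VS is negative" flag and that the corrected corner-case result is 0x7FFF (0x8000 minus 1); both are inferred from the instruction sequence, not stated in the post. All function names are hypothetical.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* sign(vs) * vt per 16-bit lane: PSIGNW behaviour expressed in SSE2. */
static __m128i psignw_sse2(__m128i vt, __m128i vs)
{
    const __m128i neg  = _mm_srai_epi16(vs, 15);                   /* -1 where vs < 0 */
    const __m128i pos  = _mm_cmpgt_epi16(vs, _mm_setzero_si128()); /* -1 where vs > 0 */
    const __m128i sign = _mm_sub_epi16(neg, pos);                  /* -1, 0 or +1 */
    return _mm_mullo_epi16(vt, sign);                              /* pmullw */
}

/* Corner-case fix: compare to INT16_MIN, mask, subtract. */
static __m128i fix_int16_min(__m128i res, __m128i vs)
{
    const __m128i neg_flag = _mm_srli_epi16(vs, 15); /* assumed xmm3: 1 where vs < 0 */
    __m128i mask = _mm_cmpeq_epi16(res, _mm_set1_epi16(INT16_MIN)); /* pcmpeqw: -1 or 0 */
    mask = _mm_and_si128(mask, neg_flag); /* pand: 1 only where vs < 0 and res == INT16_MIN */
    return _mm_sub_epi16(res, mask);      /* psubw: 0x8000 - 1 wraps to 0x7FFF */
}

/* Array wrapper so the result can be checked without vector types. */
void vabs_i16(int16_t vd[8], const int16_t vs[8], const int16_t vt[8])
{
    const __m128i s = _mm_loadu_si128((const __m128i *)vs);
    const __m128i t = _mm_loadu_si128((const __m128i *)vt);
    _mm_storeu_si128((__m128i *)vd, fix_int16_min(psignw_sse2(t, s), s));
}
```

The fix relies on a vector compare returning all-ones or zero per lane, so a single AND with the 0/1 flag turns it into exactly the value to subtract.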

Don't buy into simple cut-and-paste marketing strategies! You can tell from looking at MM's paste that xmm3 was never defined within the scope of that function; it was an argument defined by a parent procedure. It's not the entire algorithm when comparing it to the size of the messy SSE2 output from my version, which is mostly messy because I pushed 3 ints to the function call stack, not 3 XMM registers defined beforehand by a parent procedure (such as pre-handled shuffling).

Last edited by HatCat; 29th August 2014 at 03:03 AM.
  #944  
Old 29th August 2014, 03:16 AM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

Well I can say for sure GCC is better than Clang at vectorization. Just checked the output of VADDC in Clang and saw poor code. Same with MSVC 2013. So that just leaves Intel and GCC. Intel & GCC aren't doing the best job, for SSE2 though.

Without cheating, both GCC and Intel generally produce better code than I could, since I don't know SSE too well yet. But when looking at the output, I've noticed a few flaws that I could fix, so it's kinda cool to look at the output from different compilers and then come up with the best solution, based on those results.

Regarding MM's paste, I knew it didn't show everything. I was just amazed at how much simpler it is when you have 3 register operands. It's really convenient when you have to write in assembly. I just think it may be cool to eventually add AVX support to the RSP recompiler. But after finding out that my computer doesn't support it, I'm obviously not going to bother.

Anyway, I need to find the best versions of each compiler ;/ . I'm really hoping this strange GCC output is the result of using an inferior version. What's a good version for MinGW?
  #945  
Old 29th August 2014, 04:12 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

It has nothing to do with having 3 register operands, dude. >.< There are SSE2 opcodes that do that as well anyway.

I just proved that my SSE2 method of wrapping around the corner case fix was even smaller than his AVX/SSSE3 version of it.

It's easy to take PART of an overall AVX algorithm, and show off that it's better than the SSE2 version, because in fact it is. That of course can't entirely be blamed on MarathonMan, though, since my overall output is still shitty due to the fact that I pushed 3 scalar ints to the function call stack, rather than using XMM register pushing which is basically transparent. Back then, I had no idea I could do that.

Quote:
Originally Posted by RPGMaster View Post
I just think it may be cool to eventually add AVX support to the RSP recompiler. But after finding out that my computer doesn't support it, I'm obviously not going to bother.
Guess what generated said AVX instructions!

NOT AVX intrinsic functions!
NOT inline/assembly! (*facepalm*)
It's SSE2/SSSE3 intrinsics that made them:
https://github.com/tj90241/cen64-rsp...ter/CP2.c#L247

afaict that's MarathonMan's src for the current VABS, which is mostly still his old algorithm from when he learned of psign's inaccuracy from my RSP source code.

You see, AVX operations can be generated as a matter of COMPILER INTELLIGENCE, which neither of you two seem to have a whole lot of faith in.
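To illustrate the point, here is a small sketch that uses only SSE2 intrinsics. The function names are hypothetical and not taken from either plugin. Built with something like `gcc -O2 -msse2`, a compiler would typically emit `pcmpgtw`/`pand`; built with `-mavx`, GCC and Clang typically emit the VEX-encoded `vpcmpgtw`/`vpand` forms for the same source, without any AVX intrinsic appearing in the code.

```c
#include <emmintrin.h> /* SSE2 intrinsics only; no AVX header is included */
#include <stdint.h>

/* Per-lane mask: all-ones where both a and b are positive, zero elsewhere. */
static __m128i both_positive(__m128i a, __m128i b)
{
    const __m128i zero  = _mm_setzero_si128();
    const __m128i a_pos = _mm_cmpgt_epi16(a, zero); /* pcmpgtw or vpcmpgtw */
    const __m128i b_pos = _mm_cmpgt_epi16(b, zero);
    return _mm_and_si128(a_pos, b_pos);             /* pand or vpand */
}

/* Array wrapper so the result can be inspected without vector types. */
void both_positive_i16(int16_t dst[8], const int16_t a[8], const int16_t b[8])
{
    const __m128i va = _mm_loadu_si128((const __m128i *)a);
    const __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, both_positive(va, vb));
}
```

Which encoding comes out is decided by the target flags, so the result is the same either way; only the instruction selection differs.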

Quote:
Originally Posted by RPGMaster View Post
Anyway, I need to find the best versions of each compiler ;/ . I'm really hoping this strange GCC output is the result of using an inferior version. What's a good version for MinGW?
The both of you have already caused me to use enough time in this thread. Why should I consume even more repeating myself? I talked a lot about GCC versions old/new having pros/cons with some vectorization in the past; that was months ago before GCC was updated to even newer MinGW releases since the time I tested that.

And I'm probably not even going to use MinGW for the next release of my RSP plugin anyway. It should be MSVC with some explicit improvements, maybe some more intrinsics since I need to push XMM registers to functions anyway (*definitely* my biggest flaw in relying on auto-vectorization from ansi C). So while you're multitasking across tens of different projects based on what could end up being old code (e.g. you do a RSP recompiler based on my old interpreter; I multiply the performance of my interpreter), I think I'll get back to work on concentrating on my one single everyday focus for now.
  #946  
Old 29th August 2014, 05:03 AM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

Interesting, I didn't know you could write SSE2/3 intrinsics and generate AVX. I get your point though. Even if that example MM gave wasn't the best one, I can still imagine other vector instructions being simpler to implement at assembly level, due to having 3 register operands instead of 2.

I don't have much faith in compiler intelligence for a good reason. I constantly observe compiler output and often see results that are not flawless, just from my pov. Since I'm working on a recompiler, which pretty much requires you to write assembly code (unless you're lazy and just copy bytes from interpreter functions), I might as well do the best I can to generate the best assembly code. I'm learning sse2 instructions and hand optimizing the output.

Anyway, I'm a lot more focused now after coming to a realization. No more multitasking for me for a while. Now that I finally am able to accomplish something big, I'm really determined to finish it.

I guess I'll just install the newest version of MinGW and hope for the best.

I believe you'll have to do some serious work to your interpreter, to get MSVC to do a good job.
  #947  
Old 29th August 2014, 06:01 AM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

First off, sue me for pasting the entirety of my do_abs function without the ret. It's the same thing, only I had the freedom of passing through xmm registers as you later said.

EDIT: Nevermind, I even posted the ret. Look at that.

Quote:
Originally Posted by HatCat View Post
Nah, Clang's worse than GCC in that sense, according to a couple of RPG's notes compared to mine.

Auto-vectorizing beat out the intrinsics he was intent on using if Clang hadn't screwed it up for him.
As I said in my post, I was talking about intermediate output (IR), not object or final output. Clang's IR all the way up to codegen looks awesome from an auto-vectorization standpoint. I'd be surprised if GCC's was any better; if anything, they're probably similar and differences in instruction selection and code generation algorithms are probably what you're seeing.

Quote:
Originally Posted by HatCat View Post
You can tell from looking at MM's paste that xmm3 was never defined within the scope of that function; it was an argument defined by a parent procedure.
It's zero (pxor x,x). It's present in your assembly, too.

Quote:
Originally Posted by HatCat View Post
On the other hand, that SSE2 output you pasted from trying to compile my plugin earlier:
Yeah, and one of those instructions loads from memory and compares all in one, which gets executed as two instructions anyways. Not to mention: that load is going to have considerable latency vs. a compare, and, or xor.

Regardless: we wrote different algorithms. You just compare an INT16_MIN constant and take advantage of the fact that a compare returns -1 or 0 while I opted for extra compares and masked the signs of the compare masks together. I could have just as easily written _mm_set1_epi16 and gotten the same output from the compiler.

Which brings me to my final point:

Quote:
Originally Posted by HatCat View Post
It's not the entire algorithm when comparing it to the size of the messy SSE2 output from my version, which is mostly messy because I ...
[didn't use]

Quote:
Originally Posted by HatCat View Post
XMM register pushing which is basically transparent. Back then, I had no idea I could do that.
Let's call a duck a duck here. You're going to try and traumatize the compiler into passing what register type you want as a parameter all for the purposes of labeling your code as ANSI C? Dude, just use the ****ing intrinsics. What I was trying to point out was because I use intrinsics, I am afforded this luxury.

Speaking of which, how do you even manage to force it to do that?

Quote:
Originally Posted by HatCat View Post
The both of you have already caused me to use enough time in this thread. Why should I consume even more repeating myself? I talked a lot about GCC versions old/new having pros/cons with some vectorization in the past; that was months ago before GCC was updated to even newer MinGW releases since the time I tested that.
Lemme know when you finish that version that passes via xmm registers. If you had used intrinsics, you could have just seen it from the getgo instead of having to refactor your entire CP2 codebase and you wouldn't have the "mess" of SSE2 instructions that you have now. Then I would have never even responded.

Last edited by MarathonMan; 29th August 2014 at 06:30 AM.
  #948  
Old 29th August 2014, 09:38 PM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Quote:
Originally Posted by MarathonMan View Post
First off, sue me for pasting the entirety of my do_abs function without the ret. It's the same thing, only I had the freedom of passing through xmm registers as you later said.

EDIT: Nevermind, I even posted the ret. Look at that.
I can't believe that you're so ADHD as to misread every single thing, and then begin with, "sue me for not posting the ret", only to find out that you did post the ret. Logically speaking, wouldn't that mean that none of my argument had anything to do with whether or not you posted a ret? Otherwise, I wouldn't have called you out for not doing something that you actually did, right? Hellooooo? Attention span?

But as long as you can construct accusations in the other person's mouth, it always makes yourself look better defusing them when you can shift the impression that they were ever made to begin with.

Your problem of cut-pasting a small segment of code has nothing to do with whether or not you included a single ret instruction. The problem with your demonstration was that you bypassed all the things that my VABS function had implemented then: shuffling, register decode from scalar specifiers (int vd, int vs....), storage and writeback (_mm_store_si128), calculation of _mm_setzero_si128(), etc., because you MOVED those things to happen elsewhere. You were trying to deceive RPGMaster with false marketing of an algorithm that moved most of the vector shuffling/writeback/other phases to an external procedure, whereas you could plainly see that I had everything done within a single function.

Not saying I prefer it that way. This code was from a year ago. You do realize that?

Quote:
Originally Posted by MarathonMan View Post
As I said in my post, I was talking about intermediate output (IR), not object or final output. Clang's IR all the way up to codegen looks awesome from an auto-vectorization standpoint. I'd be surprised if GCC's was any better; if anything, they're probably similar and differences in instruction selection and code generation algorithms are probably what you're seeing.
And as we said in our posts, RPGMaster observed inferior code generation in Clang compared with GCC for auto-vectorization. Read it again if necessary. It has nothing to do with "object" or "final output" versus "intermediate output"; it's plainly an observation of the algorithm both compilers emitted in asm.

I also find it ironic that you're so convinced that Clang's auto-vectorization codegen looks "awesome", and yet you still insist that it's always inferior to intrinsics. Clearly you have a low standard for awesomeness. I've known about ansi auto-vectorization since before you have; I was the one who first told you that C code could be automatically generated to SSE opcodes, and you were like, "hmm, interesting". But clearly you're still a little bit behind in the time of experience using it.

Anyone who would write _mm_and_si128, for example, in a place where for (i = 0; i < 8; i++) dst[i] &= src[i]; could have done at least as well, is either totally ignorant or just unaware that they're unnecessarily enforcing vendor lock-in to Intel's exact ISA. You know, "vendor lock-in", that thing you claim to hate so much?
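For comparison, here are the two forms from the paragraph above side by side, as a sketch with hypothetical function names. Whether the plain loop is actually turned into a single `pand` depends on the compiler, its version, and the flags (for GCC, `-O3` or `-ftree-vectorize` is usually needed, and recent GCC versions accept `-fopt-info-vec` to report which loops were vectorized), so the output should be checked rather than assumed.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Plain C: 8 lanes of 16 bits is exactly one 128-bit vector's worth of work. */
void and_loop(int16_t dst[8], const int16_t src[8])
{
    int i;

    for (i = 0; i < 8; i++)
        dst[i] &= src[i];
}

/* Explicit SSE2: always a vector AND, but tied to x86 intrinsics. */
void and_intrinsic(int16_t dst[8], const int16_t src[8])
{
    const __m128i d = _mm_loadu_si128((const __m128i *)dst);
    const __m128i s = _mm_loadu_si128((const __m128i *)src);

    _mm_storeu_si128((__m128i *)dst, _mm_and_si128(d, s));
}
```

Both functions produce identical results; the trade-off under debate is portability of the loop against the guaranteed instruction selection of the intrinsic.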

Quote:
Originally Posted by MarathonMan View Post
It's zero (pxor x,x). It's present in your assembly, too.
"too"?

It's not present in your assembly.
That was the whole point. Your pasted output shows no pxor x, x. So you can't say "too". It's only present in my assembly paste, not yours. You basically just got done blindly admitting to leaving out more extra but necessary steps to the emulation of VABS just to make your codepaste even smaller than RPG's.

Quote:
Originally Posted by MarathonMan View Post
Yeah, and one of those instructions loads from memory and compares all in one, which gets executed as two instructions anyways. Not to mention: that load is going to have considerable latency vs. a compare, and, or xor.
Nope, your version used *3* compares and 2 ands, not "a compare, and, or xor". Mine used one compare and only one and, never minding the difference of the compare being to memory/movdqa you pointed out. Clearly your version uses more instruction memory with extra SSE operations, although you may argue it's faster due to not having to do the memory load.

Wow, and here I thought most people would have already had the intelligence to see the fine proof that they were wrong about this little thing.

Quote:
Originally Posted by MarathonMan View Post
Regardless: we wrote different algorithms. You just compare an INT16_MIN constant and take advantage of the fact that a compare returns -1 or 0 while I opted for extra compares and masked the signs of the compare masks together. I could have just as easily written _mm_set1_epi16 and gotten the same output from the compiler.
Yes, that's the key!
"Different algorithms!"

It's not the ability to use an intrinsic function that counts.
It's the algorithm that counts. In that case, mine was more direct than yours, even though the memory load costs extra.

And because you were kind enough to admit that _mm_set1_epi16 would easily get the same output from the compiler as my ANSI version, you also blindly admitted that the _mm_set1_epi16 version is made redundant by my ANSI version if it produces the exact same output. It's like using inline assembly to solve a problem that plain simple C would solve: You're politically obsessed with the lower-level version when it's not necessary.

Quote:
Originally Posted by MarathonMan View Post
Which brings me to my final point:

[didn't use]
Well said. =]

Quote:
Originally Posted by MarathonMan View Post
Let's call a duck a duck here. You're going to try and traumatize the compiler into passing what register type you want as a parameter all for the purposes of labeling your code as ANSI C? Dude, just use the ****ing intrinsics.
Um, no? I just got done saying several times that I had no idea that passing XMM registers across functions could be done so transparently. Back then I didn't know it'd be any better than passing scalar registers.

How do you manage to read that as "I'll just willingly traumatize the compiler with scalars?"

Quote:
Originally Posted by MarathonMan View Post
Lemme know when you finish that version that passes via xmm registers. If you had used intrinsics, you could have just seen it from the getgo instead of having to refactor your entire CP2 codebase and you wouldn't have the "mess" of SSE2 instructions that you have now.
Who said anything about a "mess" of instructions that I have "now"? That was RPGMaster's paste when he compiled with -msse2. It's not necessarily the result of auto-vectorization from a compiler that wasn't released 2 years ago.

Anyway, I've known about this since months ago, before you brought it up. Sorry to break your self-flattery, but I was already aware that I missed out on the fact that passing __m128i variables to a function call doesn't do pushes/pops. It was something I learned later on; you didn't have to point it out here.

I pointed it out first as the major reason why RPGMaster's sse2 codegen was so messy, then you started pointing it out as your "main point" as though I wasn't already aware of it. :S

And just because I realized months ago that I have to pass __m128i's to function calls (actually or I could just not use a function call table, one could use a switch statement), doesn't mean I can't still use ANSI C, and I'm still going to continue to use it.
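A rough sketch of the switch-based alternative mentioned above, with operands passed as `__m128i` values. The register file `VR`, the opcode constants, and the helper names are all hypothetical. The assumption is that the small helpers get inlined, or that the ABI passes vector arguments in XMM registers (as the x86-64 System V ABI does), so no stack pushes are needed for the operands.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

int16_t VR[32][8]; /* hypothetical vector register file: 32 registers of 8 lanes */

enum { OP_VAND, OP_VOR };

/* Operands arrive and leave by value, not as scalar register indices. */
static __m128i vand_op(__m128i vs, __m128i vt) { return _mm_and_si128(vs, vt); }
static __m128i vor_op(__m128i vs, __m128i vt)  { return _mm_or_si128(vs, vt); }

/* Decode, load, dispatch and write back in one place. */
void execute(int op, int vd, int vs, int vt)
{
    const __m128i a = _mm_loadu_si128((const __m128i *)VR[vs]);
    const __m128i b = _mm_loadu_si128((const __m128i *)VR[vt]);
    __m128i r;

    switch (op) {
    case OP_VAND: r = vand_op(a, b); break;
    case OP_VOR:  r = vor_op(a, b);  break;
    default:      return; /* unknown opcode: leave the register file untouched */
    }
    _mm_storeu_si128((__m128i *)VR[vd], r);
}
```

The per-op helpers could equally be plain C loops over a small array type; the sketch uses intrinsics only to make the by-value vector passing explicit.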
  #949  
Old 29th August 2014, 11:37 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by HatCat View Post
I can't believe that you're so ADHD as to misread every single thing, and then begin with, "sue me for not posting the ret", only to find out that you did post the ret. Logically speaking, wouldn't that mean that none of my argument had anything to do with whether or not you posted a ret?
Factually speaking, you accused me of deceiving RPGMaster, and did it again in the last post. That comment was made high in the post due to the fact that our two functions, do_abs and RSP_VABS, both take similar input operands -- post-shuffled arguments, etc. -- (albeit in different data formats and levels of indirection) and perform the same task on them. So, on that basis, I said 'sue me' for trying to be "deceptive" as I pasted the entirety of my function which carries out the same task and produces the same output, including the ret.

I'll try to spell things out a little more clearly for you in the future, since you seem to have a hard time following along.

Also note that in debates, it's informal and classless to label someone based on their medical condition, especially when that condition is of no fault of their own.

Quote:
Originally Posted by HatCat View Post
Otherwise, I wouldn't have called you out for not doing something that you actually did, right? Hellooooo? Attention span?
It's hard to have a high attention span with the amount of drivel that comes out of your mouth.

Quote:
Originally Posted by HatCat View Post
The problem with your demonstration was that you bypassed all the things that my VABS function had implemented then: shuffling, register decode from scalar specifiers (int vd, int vs....), storage and writeback (_mm_store_si128), calculation of _mm_setzero_si128(), etc., because you MOVED those things to happen elsewhere.
I don't care what you implemented then. My response was to the function that RPGMaster has posted... duh? You should take your own advice and not "construct accusations in the other person's mouth":
- do_abs, as posted, doesn't shuffle either.
- I even mentioned in my original post... "This is after the load and shuffle."
- my function, as posted, also did writeback of the low part of the accumulator as posted (just not vd).

Regardless, I'm sorry for not editing and recompiling my code just to make my code conform to your inferior argument passing convention. I generally dislike using my time in such ways. Furthermore, I assume that RPGMaster is capable of realizing that the argument formats were different and it wasn't an apples-to-apples comparison, but you had to comment because you got all butthurt.

He's trying to write a recompiler and will have all those registers cached, so he only needs to see the vector-computation component algorithm anyways, yes?

Quote:
Originally Posted by HatCat View Post
You were trying to deceive RPGMaster with false marketing of an algorithm that moved most of the vector shuffling/writeback/other phases to an external procedure
Just to point out again, not only does do_abs also not include the shuffle, but in addition I made mention of the fact of the loads and shuffle in my original post:

Quote:
Originally Posted by MarathonMan View Post
This is after the load and shuffle,
Quote:
Originally Posted by HatCat View Post
Not saying I prefer it that way. This code was from a year ago. You do realize that?
And my code was from over 6 months ago. Do you want a cookie or something? Milk to go with it? Do you get bonus points for arguing with older code?

Quote:
Originally Posted by HatCat View Post
And as we said in our posts, RPGMaster observed inferior code generation in Clang compared with GCC for auto-vectorization. Read it again if necessary. It has nothing to do with "object" or "final output" versus "intermediate output"; it's plainly an observation of the algorithm both compilers emitted in asm.
Sorry for trying to share my observations on a public forum. I must be the devil! I agreed in the affirmative:

Quote:
For vectorization, yes.
and went further to explain WHY it is, to the best of my observations:

Quote:
I've only really studied Clang in this regard, since gcc's intermediate output is sorcery, but Clang, at least, trips all over code generation when it comes to vectorization. I'm assuming GCC suffers from the same fate.
Quote:
Originally Posted by HatCat View Post
I also find it ironic that you're so convinced that Clang's auto-vectorization codegen looks "awesome"
Ahem:

Quote:
Originally Posted by MarathonMan View Post
but Clang, at least, trips all over code generation when it comes to vectorization.
"trips all over code generation" does not sound like 'looks "awesome"' to me. Would you not agree? Would you also not agree that you are going against the very thing you argue in that you are ' construct[ing] accusations in the other person's mouth'?

Quote:
Originally Posted by HatCat View Post
I've known about ansi auto-vectorization since before you have; I was the one who first told you that C code could be automatically generated to SSE opcodes, and you were like, "hmm, interesting". But clearly you're still a little bit behind in the time of experience using it.
I love these terms that you come up with to make yourself sound smart. "ansi auto-vectorization" is definitely a good one, thanks for the laugh. Is there a standards committee for auto-vectorization? It's just "auto-vectorization", buddy.

And, furthermore, you should know that the amount of time exposed to a topic is not a sufficient indicator of experience in it. Maybe I've been doing vast amounts of research in auto-vectorization? I already mentioned I looked at the IR output of Clang, something which you weren't even aware existed, based on your past posts, so who are you to judge me?

Quote:
Originally Posted by HatCat View Post
Anyone who would write _mm_and_si128, for example, in a place where for (i = 0; i < 8; i++) dst[i] &= src[i]; could have done at least as well, is either totally ignorant or just unaware that they're unnecessarily enforcing vendor lock-in to Intel's exact ISA. You know, "vendor lock-in", that thing you claim to hate so much?
I prefer vendor lock-in over assuming capabilities are present in the host compiler. I agree it'd be nice if there was a standardized way of doing all this.

Anyways, that's only one aspect of intrinsics. What's really great about the intrinsics, and why I use them even with the vendor lock-in present, is because I can pass around data efficiently without having my code look like the product of a heroin addict and a hooker. It also gives me the ability to readily use intrinsics like _mm_movemask_epi8 in the event where the compiler has a borderline impossible time selecting them.

Quote:
Originally Posted by HatCat View Post
"too"?

It's not present in your assembly.
That was the whole point. Your pasted output shows no pxor x, x. So you can't say "too". It's only present in my assembly paste, not yours. You basically just got done blindly admitting to leaving out more extra but necessary steps to the emulation of VABS just to make your codepaste even smaller than RPG's.
This was, admittedly, not clear on my part per the last post. I meant "too" in that if I had used your SSE2 version of the psignw I would have already had the pxor'd value for free.

Quote:
Originally Posted by HatCat View Post
You're politically obsessed with the lower-level version when it's not necessary.
Mmmm... not so much. Again, see what I said about passing data and whatnot around above. I also like the reassurance that my compiler is going to spew out vector instructions regardless of flags passed or what have you.

Quote:
Originally Posted by HatCat View Post
Um, no? I just got done saying several times that I had no idea that passing XMM registers across functions could be done so transparently. Back then I didn't know it'd be any better than passing scalar registers.

How do you manage to read that as "I'll just willingly traumatize the compiler with scalars?"
Doesn't matter what you didn't or did know, it's what you did.

Quote:
Originally Posted by HatCat View Post
Who said anything about a "mess" of instructions that I have "now"?
Well! Since you asked, I'll compile that function from your trunk with a bleeding edge release of gcc (4.9.1) and -O2 -march=native:

**** see next post for code ****

Hm... looks like it didn't auto-vectorize. Nor did it with -march=native -O3 -ftree-vectorize. Still wondering why I use those intrinsics?

Quote:
Originally Posted by HatCat View Post
I pointed it out first as the major reason why RPGMaster's sse2 codegen was so messy, then you started pointing it out as your "main point" as though I wasn't already aware of it. :S
I'll throw you a bone and admit that I don't read most of the messages here, or at most, skim through them. All I saw was RPGMaster saying he was unsure about things, so I posted what the algorithm would look like if the registers were cached, loaded, and shuffled (stating the former two things in my post -- one can assume that implementing register caching is a good idea when designing a recompiler).

Quote:
Originally Posted by HatCat View Post
And just because I realized months ago that I have to pass __m128i's to function calls (actually or I could just not use a function call table, one could use a switch statement), doesn't mean I can't still use ANSI C, and I'm still going to continue to use it.
Have fun!

Last edited by MarathonMan; 29th August 2014 at 11:47 PM.
  #950  
Old 29th August 2014, 11:38 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

****

Code:
0000000000004460 <do_abs>:
    4460:	31 c0                	xor    %eax,%eax
    4462:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    4468:	0f b7 0c 02          	movzwl (%rdx,%rax,1),%ecx
    446c:	66 89 4c 04 e8       	mov    %cx,-0x18(%rsp,%rax,1)
    4471:	48 83 c0 02          	add    $0x2,%rax
    4475:	48 83 f8 10          	cmp    $0x10,%rax
    4479:	75 ed                	jne    4468 <do_abs+0x8>
    447b:	30 c0                	xor    %al,%al
    447d:	0f 1f 00             	nopl   (%rax)
    4480:	0f b7 14 06          	movzwl (%rsi,%rax,1),%edx
    4484:	66 c1 ea 0f          	shr    $0xf,%dx
    4488:	66 89 54 04 a8       	mov    %dx,-0x58(%rsp,%rax,1)
    448d:	48 83 c0 02          	add    $0x2,%rax
    4491:	48 83 f8 10          	cmp    $0x10,%rax
    4495:	75 e9                	jne    4480 <do_abs+0x20>
    4497:	30 c0                	xor    %al,%al
    4499:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
    44a0:	31 d2                	xor    %edx,%edx
    44a2:	66 83 3c 06 00       	cmpw   $0x0,(%rsi,%rax,1)
    44a7:	0f 9f c2             	setg   %dl
    44aa:	66 89 54 04 b8       	mov    %dx,-0x48(%rsp,%rax,1)
    44af:	48 83 c0 02          	add    $0x2,%rax
    44b3:	48 83 f8 10          	cmp    $0x10,%rax
    44b7:	75 e7                	jne    44a0 <do_abs+0x40>
    44b9:	48 8d 44 24 c8       	lea    -0x38(%rsp),%rax
    44be:	48 8d 50 10          	lea    0x10(%rax),%rdx
    44c2:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    44c8:	31 c9                	xor    %ecx,%ecx
    44ca:	48 83 c0 02          	add    $0x2,%rax
    44ce:	66 89 48 fe          	mov    %cx,-0x2(%rax)
    44d2:	48 39 d0             	cmp    %rdx,%rax
    44d5:	75 f1                	jne    44c8 <do_abs+0x68>
    44d7:	48 8d 44 24 a8       	lea    -0x58(%rsp),%rax
    44dc:	48 8d 50 10          	lea    0x10(%rax),%rdx
    44e0:	66 f7 18             	negw   (%rax)
    44e3:	48 83 c0 02          	add    $0x2,%rax
    44e7:	48 39 d0             	cmp    %rdx,%rax
    44ea:	75 f4                	jne    44e0 <do_abs+0x80>
    44ec:	31 c0                	xor    %eax,%eax
    44ee:	66 90                	xchg   %ax,%ax
    44f0:	0f b7 54 04 a8       	movzwl -0x58(%rsp,%rax,1),%edx
    44f5:	66 01 54 04 c8       	add    %dx,-0x38(%rsp,%rax,1)
    44fa:	48 83 c0 02          	add    $0x2,%rax
    44fe:	48 83 f8 10          	cmp    $0x10,%rax
    4502:	75 ec                	jne    44f0 <do_abs+0x90>
    4504:	30 c0                	xor    %al,%al
    4506:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    450d:	00 00 00 
    4510:	0f b7 54 04 b8       	movzwl -0x48(%rsp,%rax,1),%edx
    4515:	66 01 54 04 c8       	add    %dx,-0x38(%rsp,%rax,1)
    451a:	48 83 c0 02          	add    $0x2,%rax
    451e:	48 83 f8 10          	cmp    $0x10,%rax
    4522:	75 ec                	jne    4510 <do_abs+0xb0>
    4524:	30 c0                	xor    %al,%al
    4526:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    452d:	00 00 00 
    4530:	0f b7 54 04 c8       	movzwl -0x38(%rsp,%rax,1),%edx
    4535:	66 31 54 04 e8       	xor    %dx,-0x18(%rsp,%rax,1)
    453a:	48 83 c0 02          	add    $0x2,%rax
    453e:	48 83 f8 10          	cmp    $0x10,%rax
    4542:	75 ec                	jne    4530 <do_abs+0xd0>
    4544:	30 c0                	xor    %al,%al
    4546:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    454d:	00 00 00 
    4550:	31 d2                	xor    %edx,%edx
    4552:	66 81 7c 04 e8 00 80 	cmpw   $0x8000,-0x18(%rsp,%rax,1)
    4559:	0f 95 c2             	setne  %dl
    455c:	66 89 54 04 d8       	mov    %dx,-0x28(%rsp,%rax,1)
    4561:	48 83 c0 02          	add    $0x2,%rax
    4565:	48 83 f8 10          	cmp    $0x10,%rax
    4569:	75 e5                	jne    4550 <do_abs+0xf0>
    456b:	30 c0                	xor    %al,%al
    456d:	0f 1f 00             	nopl   (%rax)
    4570:	0f b7 54 04 d8       	movzwl -0x28(%rsp,%rax,1),%edx
    4575:	66 01 54 04 e8       	add    %dx,-0x18(%rsp,%rax,1)
    457a:	48 83 c0 02          	add    $0x2,%rax
    457e:	48 83 f8 10          	cmp    $0x10,%rax
    4582:	75 ec                	jne    4570 <do_abs+0x110>
    4584:	30 c0                	xor    %al,%al
    4586:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    458d:	00 00 00 
    4590:	0f b7 54 04 e8       	movzwl -0x18(%rsp,%rax,1),%edx
    4595:	48 83 c0 02          	add    $0x2,%rax
    4599:	66 89 90 00 00 00 00 	mov    %dx,0x0(%rax)
    45a0:	48 83 f8 10          	cmp    $0x10,%rax
    45a4:	75 ea                	jne    4590 <do_abs+0x130>
    45a6:	30 c0                	xor    %al,%al
    45a8:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
    45af:	00 
    45b0:	0f b7 90 00 00 00 00 	movzwl 0x0(%rax),%edx
    45b7:	66 89 14 07          	mov    %dx,(%rdi,%rax,1)
    45bb:	48 83 c0 02          	add    $0x2,%rax
    45bf:	48 83 f8 10          	cmp    $0x10,%rax
    45c3:	75 eb                	jne    45b0 <do_abs+0x150>
    45c5:	c3                   	retq   
    45c6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    45cd:	00 00 00