Project64 Forums > General Discussion > Open Discussion

  #251  
Old 23rd April 2014, 04:48 AM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

I tl;dr'd over a lot of that, so pardon any points that I make which don't address specific issues.

Quote:
Originally Posted by HatCat View Post
But yeah, otherwise you're right. The MMX method I just pasted uses both unpack low and unpack high to store.

But, later on in your M-m-m-m-MEGA EDIT post you said _mm_loadl_epi64 is easily preferable over MMX. Uh, why is that true when both of them generate MOVQ anyway? Cause that's what your function just did; it generated 2 MOVQ's. I'm currently using a MOVQ myself, but the only difference is that it accesses the MM register file and not the XMM register file.
I'm trying to advocate SSE over MMX. So yes, I do claim that _mm_loadl_epi64 is "easily preferable" over having to deal with the penalties that come along with any MMX instruction (namely using a very legacy part of the instruction set, invalidating the FP tags after MMX is finished, etc.).

BTW: the intrinsics are essentially an API; they just help the compiler generate the most efficient code for your setup. Just because you're using a particular intrinsic -- _mm_loadl_epi64 over _mm_loadu_si128 -- doesn't necessarily mean that the compiler is going to generate different code. If SSE7 is released tomorrow with a movhq and "half" xmm registers, and somebody compiles with -msse7 on gcc, then it will emit code that is more suited to SSE7 and half-xmm registers.

But for now, with the arguments that I fed gcc, _mm_loadl_epi64 tells the compiler that the upper half of the register should be zeroed out. It's just semantics. Maybe it'll generate different code for your K8 setup and compiler version. I don't know, nor do I care, as long as the semantics are maintained. Example:

Code:
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_loadl_epi64 (__m128i const *__P)
{
  return _mm_set_epi64 ((__m64)0LL, *(__m64 *)__P);
}
gcc uses these definitions accordingly. You can see in my prior example how it determined that zeroing the upper half of the vector wasn't even required, because I never referenced that data.

Quote:
Originally Posted by HatCat View Post
What's wrong with being a strict subset of something?
Doesn't that make it more elementary, basic and simple?
I don't have any gripes with the MMX API, instructions, or anything like that -- it's just the fact that Intel has (IMO) more or less deprecated it. The forum link you posted is a great example of this -- somebody asks how to combine AVX with MMX and nobody can give concrete answers.

Quote:
Originally Posted by HatCat View Post
That would just be vendor-lock in.
How is AVX, SSE, etc. any more of a lock-in than MMX? They're all lock-ins. Also, I'm only advocating SSE2 and, as I mentioned, every x86_64 CPU supports SSE2. How many people only have an AMD 32-bit processor in this day and age?

Quote:
Originally Posted by HatCat View Post
What kind of comparison is this?
It's a comparison showing several things. When compiling for x86_64:
  • MMX instructions are larger than SSE equivalents (5 vs 4 bytes, respectively)
  • Simple MMX intrinsics result in disastrous code-bloat.
  • MMX instructions use the legacy opcode space.

So while it might be fine for targeting IA32, as soon as you compile for x86_64, your code is going to be larger and more inefficient.

Quote:
Originally Posted by HatCat View Post
I don't mean execution overhead; I do mean size--size of the executable basically. Shouldn't MMX byte-code be smaller than SSE byte-code, at least sometimes?
Yes. MMX instructions are 1 byte smaller than their SSE2 equivalents on IA32. However, MMX instructions are 2 bytes larger in x86_64 mode due to the legacy prefix. That being said, how often are you computing using only 64 bits of data? Can you give examples?

Quote:
Originally Posted by HatCat View Post
I was just saying, even in cases where MMX is indisputably slower than the SSE solution to a specific problem, if it's not at all performance-intensive, maybe we can hope that the bytecode of the MMX solution is smaller than that of the SSE one?
If it's not performance critical, who cares about size? If I press the screenshot button, I don't mind waiting the extra 200us to fetch a page from RAM and expending the extra 4KB of disk space required. After I'm done taking the screenshot, the performance critical code will get cached in again.

Quote:
Originally Posted by HatCat View Post
I can only hope not since they should be aware that AMD screwed up my K8 for doing certain 128-bit SSE ops.
Core 2 Duos have 128-bit wide SSE units.
  #252  
Old 23rd April 2014, 05:47 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Quote:
Originally Posted by MarathonMan View Post
I don't have any gripes with the MMX API, instructions, or anything like that -- it's just the fact that Intel has (IMO) more or less deprecated it. The forum link you posted is a great example of this -- somebody asks how to combine AVX with MMX and nobody can give concrete answers.
This does seem like a matter of opinion, because Intel does continue to support MMX in their optimization manuals, which discuss not only MMX operations but also SSSE3 and SSE4 ones.

Likewise, the forum link I posted shows that its use is still being recommended by some (although the thread wasn't really about specific examples), so I don't see any evidence that Intel has deprecated it by any means; otherwise they wouldn't keep documenting instructions as accepting either MMX or XMM operands in their instruction manuals.

Quote:
Originally Posted by MarathonMan View Post
How is AVX, SSE, etc. any more of a lock-in than MMX? They're all lock-ins.
Really? How is an instruction set enhancement 20 years newer than the subset ISA from decades ago, any more of a lock-in? You just asked that?

It's obvious that AVX+SSE is more of a lock-in than MMX is. That shouldn't need explaining because you know that one is a superset encompassing the other.

Besides, that wasn't really the level of definition you used in your earlier reasoning as to why you wrote CEN64 to support non-SSE machines, when you said that you "dislike vendor lock-in as much as the next guy" as your reason for including pure C code for people who didn't have the machines for Intel SSE, in some simple macro logic. If you can call that "vendor lock-in", so can I.

Quote:
Originally Posted by MarathonMan View Post
Also, I'm only advocating SSE2 and, as I mentioned, every x86_64 CPU supports SSE2. How many people only have a AMD 32-bit processor in this day and age?
And as I mentioned, 900% (yes two 0's) of CPUs could support SSE2, but only a measly 90% of CPUs would continue to support MMX, and I still wouldn't care. I would use commented-out MMX anyway on the side of pure C code, because I am not going to use a 128-bit register when I only need to store 64 bits. That's just too abstract, and only worth doing in a performance-relevant section. Otherwise, I'd sooner do 32-bit writes in C before I'd ever access a 128-bit RF for the sole purpose of moving just 64 bits over.

And since the N64 CPU is 32-bit and not 64-bit, I'm close to but not quite 100% inclined to take advantage of the fact that any 64-bit x86 CPU (amd64 / Intel 64) always throws in support for SSE2. Then again, in many ways the N64 is certainly like a 64-bit system, so yes, making a 64-bit emulator for the N64 but not taking advantage of SSE2 is pure masochism, because a 64-bit x86 CPU always has SSE2, right? I had already got that much. I don't think there is any logical strategy to using MMX when compiling in 64-bit, but my code here is 32-bit. :P

Quote:
Originally Posted by MarathonMan View Post
It's a comparison showing several things. When compiling for x86_64:
  • MMX instructions are larger than SSE equivalents (5 vs 4 bytes, respectively)
  • Simple MMX intrinsics result in disastrous code-bloat.
  • MMX instructions use the legacy opcode space.
Your comparison shows me maybe 2 of those 3 things. (Looks like you were definitely right about the bytecode...how unfortunate.)

Your comparison shows me that you wrote two functions with one intrinsic each -- an SSE2 version and an MMX version -- and that the MMX version somehow does extra memory loads to move the value from an XMM register into an MMX register, whereas your SSE version already had it in the XMM and did it in one instruction. This "code bloat" has never happened for me when simply working with the __m64 data type as I've done; it seems to come from GCC presuming the value lives in an XMM register and redundantly migrating it into an MMX register, for some strange reason. It really looks more like a problem with GCC than with MMX.

Quote:
Originally Posted by MarathonMan View Post
That being said, how often are you computing using only 64 bits of data? Can you give examples?
I know you had to tl;dr my last post (probably this one too xD), so I'll repost my example:
Code:
    if (*GET_GFX_INFO(DPC_STATUS_REG) & DP_STATUS_XBUS_DMA)
        do
        {
            offset &= 0xFFF / sizeof(i64);
            cmd_data[cmd_ptr + length].W = *(i64 *)(SP_DMEM + 8*offset);
            offset -= 0x001 * sizeof(i8);
        } while (--length >= 0);
You see, not long after process_RDP_list() starts, you have to buffer the RDP instructions from either RDRAM or SP data memory, depending on whether the XBUS_DMA bit was set.

You know that the RDP instructions are always divisible by 64-bit words.

You know that you could possibly check whether the number of 64-bit words to read is even, so that the copy divides into 128-bit chunks and you could use SSE instead, but this involves making an assumption, throwing in a branch, and probably just delaying the worst-case scenario anyway for games where it is not an even number.

Quote:
Originally Posted by MarathonMan View Post
If it's not performance critical, who cares about size?
Correct me if I'm wrong, but my understanding is that making size go down gives you more space elsewhere in performance-critical sections to allow THEIR size to go up. A balancing scale.

Basically I think that performance-critical functions (like the emulation thread) should sacrifice size for speed, and that non-performance-critical functions (like the screenshot example) should sacrifice speed for size. If you make some functions smaller and take up less code space, doesn't that give more code space to the rest of your program? Plus, a smaller overall binary size isn't too much of a loss anyway.

Last edited by HatCat; 23rd April 2014 at 06:16 AM.
  #253  
Old 23rd April 2014, 09:22 AM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by HatCat View Post
This does seem like a matter of opinion because Intel does continue to support MMX, also in their optimization manuals, which discusses not only MMX operations but SSSE3 and SSE4 ones.
Maybe, but a lot of people seem to have the same opinion as I.

MSDN:

Quote:
The x87, MMX, and 3DNow! instruction sets are deprecated in 64-bit modes. The instruction sets are still present for backward compatibility for 32-bit mode; however, to avoid compatibility issues in the future, their use in current and future projects is discouraged.
AMD: http://i50.tinypic.com/oua4qb.png

Quote:
Originally Posted by HatCat View Post
Really? How is an instruction set enhancement 20 years newer than the subset ISA from decades ago, any more of a lock-in? You just asked that?

It's obvious that AVX+SSE is more of a lock-in than MMX is. That shouldn't need explaining because you know that one is a superset encompassing the other.
I wholly disagree that the longer an extension has been around, or the smaller it is, the less of a lock-in it becomes. Regardless, this is just a difference in opinion.

Quote:
Originally Posted by HatCat View Post
Besides, that wasn't really the level of definition you used in your earlier reasoning as to why you wrote CEN64 to support non-SSE machines, when you said that you "dislike vendor lock-in as much as the next guy" as your reason for including pure C code for people who didn't have the machines for Intel SSE, in some simple macro logic. If you can call that "vendor lock-in", so can I.
I don't see how my reasoning differs? SSE2 is a lock-in. SSSE3 is a lock-in. MMX is a lock-in. So I also try to provide strict, conformant C, as it is not a lock-in.

Quote:
Originally Posted by HatCat View Post
And as I mentioned, 900% (yes two 0's) of CPUs could support SSE2, but only a measly 90% of CPUs would continue to support MMX, and I still wouldn't care. I would use commented-out MMX anyway on the side of pure C code, because I am not going to use a 128-bit register when I only need to store 64 bits. That's just too abstract, and only worth doing if in a performance-relevant section. Otherwise, I'd sooner do 32-bit writes in C before I'd ever access a 128-bit RF for the sole purposes of moving just 64 bits over.
Physical registers at the microarchitectural level can be (and are, in many implementations that I've studied) larger than their architectural sizes. Sometimes, several instances of architectural registers are contained in one physical register. That being said, I'm not sure why the size of the abstraction matters, so long as the abstraction produces the correct result.

Quote:
Originally Posted by HatCat View Post
And since the N64 CPU is 32-bit and not 64-bit.
Which N64 CPU? You should know that "N64 CPU" is vague. The RSP is 32-bit, but the VR4300 is 64-bit internal with a 32-bit bus interface. And I'm not sure what you would call the RDP.

Quote:
Originally Posted by HatCat View Post
Your comparison shows me maybe 2 of those 3 things. (Looks like you were definitely right about the bytecode...how unfortunate.)

Your comparison shows me that you are able to write 2 functions using 1 intrinsic in each, one with the SSE2 version and one with the MMX version, and that somehow the MMX version does extra memory loads to fit it from an XMM into the MMX, whereas your SSE version already had it in the XMM and did it in one instruction.
Imma let you finish, but... I passed in an __m64, not an __m128i, to the MMX function. If you target IA-32, gcc assumes the arguments are in mm0-mm7 (the MMX registers). So yes, MMX intrinsics really are just that grossly inefficient (at least in GCC) when you target x86_64, which should raise some flags.

Quote:
Originally Posted by HatCat View Post
This "code bloat" in your MMX version has never happened for me when simply working with the __m64 data type as I've done, and that "code bloat" seems to demonstrate the migration from an XMM presumed by GCC, redundantly into an MMX, for some strange reason. It really looks more like a problem with GCC than with MMX.
Have you only been targeting IA-32?

Quote:
Originally Posted by HatCat View Post
I know you had to tl;dr my last post (probably this one too xD), so I'll repost my example:
Code:
    if (*GET_GFX_INFO(DPC_STATUS_REG) & DP_STATUS_XBUS_DMA)
        do
        {
            offset &= 0xFFF / sizeof(i64);
            cmd_data[cmd_ptr + length].W = *(i64 *)(SP_DMEM + 8*offset);
            offset -= 0x001 * sizeof(i8);
        } while (--length >= 0);
You see, not long after process_RDP_list() starts, you have to buffer the RDP instructions from either RDRAM or SP data memory, depending on whether the XBUS_DMA bit was set.

You know that the RDP instructions are always divisible by 64-bit words.
Copying a 64-bit quantity is child's play. Do you have any specific examples that actually involve some computation on 64-bit quantities?

Quote:
Originally Posted by HatCat View Post
Correct me if I'm wrong, but my understanding is that making size go down, gives you more space elsewhere in performance-critical sections to allow THEIR size to go up. A balancing scale.

Basically I think that performance-critical functions (like the emulation thread) should sacrifice size for speed, and that non-performance-critical functions (like the screenshot example) should sacrifice speed for size. If you make some functions smaller and take up less code space, doesn't that give more code space to the rest of your program? Plus, smaller overal binary size isn't too much of a loserar anyway.
The user address space is 2GiB on Windows IA-32; your only concern is cacheable space. Which is something that you should be aware of -- yes. However, the code that gets executed when a screenshot is requested will fall down the memory hierarchy within microseconds of going unused. I would worry more about conflict misses than capacity misses if you're squeezing out every last drop.

EDIT:

Regardless, I see that you're unwilling to drop MMX, so I'll just leave it at this post. I think you're foolish for using MMX over SSE2 for the several reasons that I have stressed. We clearly have different opinions on the matter, so I see no reason to continue to debate it when it seems I am unable to convince you. I have stated my point.

Last edited by MarathonMan; 23rd April 2014 at 09:29 AM.
  #254  
Old 23rd April 2014, 10:08 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Quote:
Originally Posted by MarathonMan View Post
Maybe, but a lot of people seem to have the same opinion as I.

MSDN:



AMD: http://i50.tinypic.com/oua4qb.png
No no, this is not the subject we were discussing.

The subject was not "MMX is still being encouraged for 64-bit systems". You're inserting a criterion where it wasn't specified.

I said it's still being supported *period*. Microsoft Visual Studio is basically dropping the intrinsic syntax functionality for MMX when using the intrinsics environment in 64-bit mode only, but it doesn't matter, because MMX is pointless when compiling in 64-bit. Why use an MMX movq opcode with some MM register, when you could just use rax, rcx, rdx ... ? MMX is basically obsolete then.

In the world of opinions anyone can be "right", but neither AMD saying SSE is the future (invariably on 64-bit CPUs) nor Microsoft saying they're dropping C-level intrinsic syntax for MMX (only on 64-bit CPUs) really seems to justify your statement that other people believe that MMX, in and of itself, is deprecated. I don't need a damn lawyer who's good with word games to trick me into thinking to somebody else's benefit; I simply needed a logical discussion with truth. So far it just seems you'd rather change my questions around.

Quote:
Originally Posted by MarathonMan View Post
I don't see how my reasoning differs? SSE2 is a lock-in. SSSE3 is a lock-in. MMX is a lock-in. So I also try to provide strict, conformant C, as it is not a lock-in.
Need I also remind you that you included macros like #define SSE2_ONLY or SSSE3_ONLY to once again reduce what you referred to as "vendor lock-in"? I only said "strict C" as just one of the examples of things you do; that's not the only one.

Quote:
Originally Posted by MarathonMan View Post
I passed in an __m64, not an __m128i, to the MMX function. If you target IA-32, gcc assumes the arguments are in _mm* or whatever the MMX registers are. So yes, MMX intrinsics really are just that grossly inefficient (at least in GCC) when you target x86_64, which should raise some flags.
First, I've repeatedly insisted that I have no intention of using any MMX code whatsoever in 64-bit software, but since it seems you prefer to pay attention to the parts that strike your opinions I guess my words are drowned out.

Second, I know that you passed in an __m64, not an __m128i. That's exactly why the problem there was with GCC. For the reason you just admitted, GCC used an XMM followed by several memory moves when no __m128i data type was requested.

Quote:
Originally Posted by MarathonMan View Post
Have you only been targeting IA-32?
It's not so religious as you appear to fear.

I think you have been targeting x86_64 because, like so many others, you naturally prefer to do things your own way. (And that's a positive reason; don't misinterpret me.) But since the only common N64 plugin specification still only supports 32-bit, I really don't see much point in sacrificing backwards compatibility for nit-pick gains in 64-bit code from the very start.

64-bit is something I will have to do entirely on my own later, when I'm ready to free myself from the resources of the plugin specifications.

Quote:
Originally Posted by MarathonMan View Post
Copying a 64-bit quantity is child's play. Do you have an specific examples that actually involve some computation on 64-bit quantities?
All of that is in SSE. Why would I use MMX for anything heavily computational?
MMX is only for the very basic problems, like what I just provided you.

And once again, you're changing my questions around so you don't have to answer them.
So I'll suppose the correct answer is: yes, in the incidental presence of a pre-existing EMMS operation, it is faster on x86_32 to use a MOVQ in conjunction with an MM register than it is to use, say, two 32-bit mov's.

Quote:
Originally Posted by MarathonMan View Post
Regardless, I see that you're unwilling to drop MMX so I'll just leave it at this post. I think you're foolish for using MMX over SSE2 for the several reasons that I have stressed. We clearly have different opinions in the matter, so I see no reason to continue to debate over it when I seem as though I am unable to convince you. I have stated my point.
There is a simple reason why you are unable to convince me.
It's because you can't exactly "convince" anybody of something they already know.

I've been vectorizing things to arrays for SSE and SSE2 in this plugin. I want those things to use SSE and not MMX. I only plan to mix in certain MMX code for simple problems, like that 64-bit copying example I gave you, but you continue to misperceive me as someone who is dropping SSE and wants to use only MMX and not anything else if feasible.

All I had been asking, was if for the more simple problems MMX could perform on equal terms as with SSE, but this whole time you continue to put that off and pessimistically assume that I'm trying to use MMX "over" SSE. A lawyer is not always honest. I simply like to get advice on possibilities where both of them can be tag-teamed. (Edit, which again is never in 64-bit code imo.) If this is all just because you're bent on the extra teamwork of me handing in CEN64-friendly code then I think you're unable to focus honestly on my questions, so I guess I no longer really see much point in getting you to answer them.

Last edited by HatCat; 23rd April 2014 at 04:59 PM.
  #255  
Old 23rd April 2014, 10:54 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Now, getting back on track, I'm about ready to look into posting up another source tree and build soon. Not much about the VI has changed, but I've removed a huge amount of abstract and/or slow RDP code distractions.

It took me a while because I had to re-plot in my head how to re-create _mm_mullo_epi32 from SSE4, and started writing it over from scratch in SSE2 again. I came up with this blindly and voila, it worked on the first try!

Code:
    span[j].rgba[0]
      = ((rgba[0] & ~0x1ff) + d_rgba_diff[0] - xfrac*d_rgba_dxh[0]) & ~0x3FF;
    span[j].rgba[1]
      = ((rgba[1] & ~0x1ff) + d_rgba_diff[1] - xfrac*d_rgba_dxh[1]) & ~0x3FF;
    span[j].rgba[2]
      = ((rgba[2] & ~0x1ff) + d_rgba_diff[2] - xfrac*d_rgba_dxh[2]) & ~0x3FF;
    span[j].rgba[3]
      = ((rgba[3] & ~0x1ff) + d_rgba_diff[3] - xfrac*d_rgba_dxh[3]) & ~0x3FF;
... is now ...

Code:
        xmm_frac = _mm_set1_epi32(xfrac);

        delta_x_high = _mm_load_si128((__m128i *)d_rgba_dxh);
        prod_lo = _mm_mul_epu32(delta_x_high, xmm_frac);
        delta_x_high = _mm_srli_epi64(delta_x_high, 32);
        prod_hi = _mm_mul_epu32(delta_x_high, xmm_frac);
        prod_lo = _mm_shuffle_epi32(prod_lo, _MM_SHUFFLE(3, 1, 2, 0));
        prod_hi = _mm_shuffle_epi32(prod_hi, _MM_SHUFFLE(3, 1, 2, 0));
        delta_x_high = _mm_unpacklo_epi32(prod_lo, prod_hi);

        delta_diff = _mm_load_si128((__m128i *)d_rgba_diff);
        result = _mm_load_si128((__m128i *)rgba);
        result = _mm_srli_epi32(result, 9);
        result = _mm_slli_epi32(result, 9);
        result = _mm_add_epi32(result, delta_diff);
        result = _mm_sub_epi32(result, delta_x_high);
        result = _mm_srli_epi32(result, 10);
        result = _mm_slli_epi32(result, 10);
        _mm_store_si128((__m128i *)span[j].rgba, result);
I might have to delay fixing the Mario no Photopie / GoldenEye VI issue for a later build, but the screenshots problem will be fixed.
  #256  
Old 23rd April 2014, 11:21 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Ah, I get it now. The reason the SSE2 wrapper for _mm_mullo_epi32 is, like, 50% of the size of the complete one I installed in my previous branch of the plugin is because this isn't a complete mullo_epi32 ... it's a special case that correctly assumes the second multiplier is a constant splat from _mm_set1_epi32.
  #257  
Old 23rd April 2014, 03:48 PM
ReyVGM ReyVGM is offline
Project Supporter
Senior Member
 
Join Date: Mar 2014
Posts: 212
Default

Quote:
Originally Posted by HatCat View Post
I might have to delay fixing the Mario no Photopie / GoldenEye VI issue for a later build, but the screenshots problem will be fixed.
No no, take your time
  #258  
Old 23rd April 2014, 04:41 PM
oddMLan's Avatar
oddMLan oddMLan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2009
Location: Parappa Town
Posts: 210
Default

Quote:
Originally Posted by ReyVGM View Post
No no, taco your time
fix' d
  #259  
Old 23rd April 2014, 04:46 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Actually, thanks for the input, as I probably could have ended up content either way. I'm pretty indecisive about things; I just figured I should clean up the remaining C++-ish-ness of the code so RPGMaster could have a keener look at it.

Well anyway, regardless of that, I noticed a pattern with angrylion's sign-extension macros. (RPGMaster I'm not sure but I can't seem to recall ever talking about this specific subject.) He used to have tons of macros for sign-extending 1-bit, 2-bit, ... n-bit integer sizes, until I helped him create this static universal macro for it which angrylion did adopt in the master branch:
Code:
#define SIGN(x, numb) ((((x) & ((1 << (numb)) - 1)) | -((x) & (1 << ((numb) - 1)))))
Lately, I've found 2 even cleaner (and definitely smaller in code size) alternatives to this as well, which I made extra macros for.

Code:
#if (~0 >> 1 < 0)
#define SRA(exp, sa)    ((signed)(exp) >> (sa))
#else
#define SRA(exp, sa)    (SE((exp) >> (sa), (sa) ^ 31))
#endif

/*
 * Virtual register sign-extensions and clamps using b-bit immediates:
 */
#define CLAMP(i, b)     ((i) & ~(~0x00000000 << (b)))
#define SB(i, b)        ((i) &  ( 0x00000001 << ((b) - 1)))
#define SE(i, b)        ((i) | -SB((i), (b)))

/*
 * Forces to clamp immediate bit width AND sign-extend virtual register:
 */
#if (0)
#define SIGN(i, b)      SE(CLAMP((i), (b)), (b))
#else
#define SIGN(i, b)      SRA((i) << (32 - (b)), (32 - (b)))
#endif
I prefer the latter one because a) it's much smaller code and b) it's easier for the compiler to auto-vectorize to SIMD code. Using a bunch of AND and OR bitmasks makes that somewhat harder from a logical point of view, since you can't currently use immediate operands in those intrinsics. Rather than allocate that memory, I just use shifts to zero-extend things.
  #260  
Old 23rd April 2014, 05:32 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

Thanks HatCat, I appreciate your work. It's good to simplify macros, so good job. The more complex a macro, the more likely the compiler will screw up, especially if you use the macro a lot. Bit shifting is pro when it comes to sign stuff.

When you update the plugin, I'll be sure to examine and profile again. Hopefully this time, I'll be able to accomplish more.


Powered by vBulletin® Version 3.7.3
Copyright ©2000 - 2019, Jelsoft Enterprises Ltd.