#211  
Old 19th June 2013, 01:31 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
That really gets to me. Makes me not want to depend on SSSE3.
What I eventually want to do with the RSP plugin is to write concise inline functions that map well to SSE, and then write efficient non-SSE replacements:

Like [https://github.com/tj90241/cen64-rsp...ter/CP2.c#L74].

It's the prologue (whoops, I said prologue in my last post when I meant epilogue) to a lot of those blasted RSP instructions where the element value determines which bytes of one of the source vectors get broadcast into the actual operation. The "pure C" implementation is just a (for (i = 0; i < n; i++) ... = ...;) loop.

That way, you get the best of both worlds, and it's amenable to future vectorized architectures -- if you don't have SSE, or don't want to include it, the code simply gets compiled out.

I don't like the idea of lock-in, but I can't look away from the performance when I need every little last bleeding cycle for simulation.
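A minimal sketch of that split (the function names and layout here are illustrative, not the actual CP2.c code) might look like:

```c
#include <stdint.h>

/* Scalar fallback: broadcast lane `element` of vt across all 8 lanes.
 * This is the "pure C" for-loop form; when SSE is unavailable or
 * unwanted, this is what gets compiled. */
static void broadcast_lane(int16_t dst[8], const int16_t vt[8],
                           unsigned element)
{
    unsigned i;
    for (i = 0; i < 8; i++)
        dst[i] = vt[element & 7];
}

#if defined(__SSSE3__)
#include <tmmintrin.h>
/* SSSE3 path: the whole broadcast collapses into one pshufb, given a
 * per-element control mask (building the eight masks is omitted here). */
static __m128i broadcast_lane_ssse3(__m128i vt, __m128i ctrl)
{
    return _mm_shuffle_epi8(vt, ctrl);
}
#endif
```

With the scalar body kept this simple, the `#if` block costs nothing on machines without SSSE3: the intrinsic path is simply compiled out.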
  #212  
Old 19th June 2013, 01:36 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by MarathonMan View Post
What I eventually want to do with the RSP plugin is to write concise inline functions that map well to SSE, and then write efficient non-SSE replacements:

Like [https://github.com/tj90241/cen64-rsp...ter/CP2.c#L74].

It's the prologue (whoops, I said prologue in my last post when I meant epilogue) to a lot of those blasted RSP instructions where the element value determines which bytes of one of the source vectors get broadcast into the actual operation. The "pure C" implementation is just a (for (i = 0; i < n; i++) ... = ...;) loop.

That way, you get the best of both worlds, and it's amenable to future vectorized architectures -- if you don't have SSE, or don't want to include it, the code simply gets compiled out.

I don't like the idea of lock-in, but I can't look away from the performance when I need every little last bleeding cycle for simulation.
Believe it or not!!

When I use GCC, and only GCC (not MSVS),
to compile my RSP plugin with

#define PARALLELIZE_VECTOR_TRANSFERS
#define EMULATE_VECTOR_RESULT_BUFFER // zilmar does this; I have it off by default

^ Both of those defined,
it generates SSSE3 code for me on VAND, VOR, VXOR, VNAND, VNOR, VNXOR.

And I think it does the same for some of the other opcodes, like the adds (the VADD family), but not too many others.

So your idea was already implemented by me.

Obviously though I cannot run SSSE3 on my machine, so I didn't release that version of my RSP DLL into my thread, as I can't test it... somebody else can compile it themselves in the meantime.
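That is plausible for the logical ops in particular: an element-wise kernel shaped like the sketch below (illustrative, not HatCat's actual source) is exactly what GCC's auto-vectorizer can collapse into a single 128-bit pand when the vector registers live in flat arrays:

```c
#include <stdint.h>

/* VAND-style kernel: with source and destination kept as flat 8-lane
 * arrays, GCC -O3 can turn this loop into one pand (and the VOR/VXOR
 * variants into por/pxor, with the VNAND/VNOR/VNXOR forms adding a NOT). */
static void vand_lanes(int16_t vd[8], const int16_t vs[8],
                       const int16_t vt[8])
{
    int i;
    for (i = 0; i < 8; i++)
        vd[i] = vs[i] & vt[i];
}
```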
  #213  
Old 19th June 2013, 01:40 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Hmmm interesting... I'll have to look at the assembly it generates and see how it compares.
  #214  
Old 19th June 2013, 01:47 PM
angrylion angrylion is offline
Member
 
Join Date: Oct 2008
Location: Moscow, Russia
Posts: 36
Default

Quote:
Originally Posted by FatCat View Post
It's easy for compiler theory to understand that (var & ~03) >> 6 is redundant.
If however you also consider that more readable than just saying (var >> 6) then I think you have a very strong opinion of readability vs. compiler theory.
It was my mistake; what I meant was ((var >> 6) & ~3). ((var >> 8) << 2) is not equivalent to (var >> 6), as you sort of suggested in post #193; it's equivalent to ((var >> 6) & ~3).

Quote:
Originally Posted by FatCat View Post
It was a valid example of how to remove a branch, but it was not a good example of removing a branch in a situation where you should.
So mid-way, I cancelled the multiply code and commented it out, instead showing that it could be more pre-optimized even in the case that you must branch.
You wrote "Okay so the last example kind of sucked, but, it's still an improvement". I demonstrated that on MSVC2002 it's not necessarily an improvement. Obviously, the assembly listing in my previous post referred to your variant that's without multiplies. I compared my if/else code with your example with just one if. So on MSVC2002 your example is not necessarily "more pre-optimized", unless this comparison of yours referred to your own code with multiplies.
  #215  
Old 19th June 2013, 02:13 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by angrylion View Post
It was my mistake, what I meant was ((var >> 6) & ~3). ((var >> 8) << 2) is not equivalent to (var >> 6), like you sort of suggested in post #193, it's equivalent to ((var >> 6) & ~3).
In that case, if you really meant ((var >> 6) & ~3), I think it's already optimized.

The shift-only equivalent of that, in that case, would be ((var >> 8) << 2) == ((var >> 6) & ~3) .
And AND-masking is probably faster than shifting right.

Then again... we have a special case here.
Shifting right by 8 bits means you could do a byte-indexed fetch using movzbl (movzx eax, dh) like I said earlier.
After all, like I said, you have to do a MOV regardless:
the compiler mov's the value into a register first, and then shifts right by 8.
It would be more direct to just movzx eax, dh than to mov eax, edx; shr eax, 8;, but like you said, the manual says DO NOT DO that (partial-register access).
So tl;dr, not that important.

To be consistent with the rest of your code, which does not shift right by 6 or 8, I would probably keep the AND-mask method like you did, rather than use one potential micro-optimization in a special case to do >> 8, << 2.

So that would be an example where I might deliberately sacrifice code efficiency/performance micro-boosts for the consistency of the code.
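The equivalence under discussion is easy to check exhaustively over small inputs; a standalone pair of helpers (names mine) makes the claim concrete:

```c
#include <stdint.h>

/* Both expressions isolate bits 8 and up of var and park them at
 * bit 2, so they agree for every unsigned input. */
static uint32_t via_shifts(uint32_t var)
{
    return (var >> 8) << 2;
}

static uint32_t via_mask(uint32_t var)
{
    return (var >> 6) & ~3u;
}
```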

Quote:
Originally Posted by angrylion View Post
You wrote "Okay so the last example kind of sucked, but, it's still an improvement". I demonstrated that on MSVC2002 it's not necessarily an improvement. Obviously, the assembly listing in my previous post referred to your variant that's without multiplies. I compared my if/else code with your example with just one if. So on MSVC2002 your example is not necessarily "more pre-optimized", unless this comparison of yours referred to your own code with multiplies.
True. The thing I don't understand though is why it would be even worse, compiled under MSVC 2002, than your original code.

Why should it insert 8 additional moves?

It makes 8 moves unconditional.
Right now you have it set to,

if (!flip) { dest = +src; dest2 = +src2; dest3 = +src3; /* etc. */ }
else { dest = -src; dest2 = -src2; dest3 = -src3; /* etc. */ }

I just made some of the code unconditional...

dest = src; dest2 = src2; dest3 = src3; /* etc. */
if (flip) { dest *= -1; dest2 *= -1; dest3 *= -1; /* etc. */ }

By hoisting the shared code out of the if-else branch frame and leaving only the unique, case-specific parts behind, I would think that my code should be optimized either the same or better, but not worse?

But, oh well.
I'm not going to spend an entire post arguing about it.

The key thing that I wanted to prove
is that, if it wasn't for all those color increment variables you had:

Code:
void render_spans_1cycle_complete(int start, int end, int tilenum, int flip)
{
...
 // int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
    int xinc;

    if (flip)
    {
/*
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        dsinc = spans_ds;
        dtinc = spans_dt;
        dwinc = spans_dw;
*/
        xinc = 1;
    }
    else
    {
/*
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        dsinc = -spans_ds;
        dtinc = -spans_dt;
        dwinc = -spans_dw;
*/
        xinc = -1;
    }
....
... that the unquestionable branch-removal optimization should be:

Code:
void render_spans_1cycle_complete(int start, int end, int tilenum, int flip)
{
...
    const int xinc = flip ^ (flip - 1);
...
But yes, with all that other data in there it makes you want to keep the if-else, so I NVM'd.

There still must be hundreds of these examples.
I had no idea MarathonMan and yourself were ready to push back on this stuff; otherwise I would have come with better examples, which I will do soon.
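For what it's worth, the identity behind that one-liner does hold, provided flip is strictly 0 or 1; a standalone helper (name mine) shows why:

```c
/* flip == 1: 1 ^ (1 - 1) == 1 ^ 0  ==  1
 * flip == 0: 0 ^ (0 - 1) == 0 ^ -1 == -1
 * A branchless +1/-1 select -- valid only for flip in {0, 1}. */
static int xinc_from_flip(int flip)
{
    return flip ^ (flip - 1);
}
```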

Last edited by HatCat; 19th June 2013 at 02:16 PM.
  #216  
Old 19th June 2013, 02:35 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
Why should it insert 8 additional moves?

It makes 8 moves unconditional.
Right now you have it set to,

if (!flip) { dest = +src; dest2 = +src2; dest3 = +src3; /* etc. */ }
else { dest = -src; dest2 = -src2; dest3 = -src3; /* etc. */ }

I just made some of the code unconditional...

dest = src; dest2 = src2; dest3 = src3; /* etc. */
if (flip) { dest *= -1; dest2 *= -1; dest3 *= -1; /* etc. */ }

By hoisting the shared code out of the if-else branch frame and leaving only the unique, case-specific parts behind, I would think that my code should be optimized either the same or better, but not worse?

But, oh well.
If you think about how it breaks down to assembly, his is going to win every time... because he semantically advised the compiler about what he intended to do.

(As if I haven't forced that down your throat enough)

Really, though, if you look at what the compiler is probably generating in this case:

Code:
if (something_zero_or_not_zero)

Can just break down into:

testl %eax, %eax
je eax_is_zero

eax_is_nonzero:
 add; add; add;
 jmp prologue

eax_is_zero:
  sub; sub; sub;

prologue:
OTOH, your branch will result in executing either the same number of instructions, or (if it branches) twice the instructions. And the taken branch uses the inferior IMUL (TBH, I don't know why a negate isn't applicable here?)

My point is, if the compiler knew that ADDs and IMULs were faster than the conditional branch to ADD/SUB blocks, or it could avoid a branch in some manner, it would have emitted that code.

Trust the compiler. Let the compiler flow through you: http://blog.heartland.org/wp-content...d-300x225.jpeg
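The "trust the compiler" argument can be seen at a small scale. Stated directly, as in the illustrative function below (not from either plugin), the intent leaves the compiler free to pick a branch or a neg/cmov sequence based on its own cost model for the target CPU, which is exactly the decision that hand-hoisted moves take away from it:

```c
/* Copy, negating when flip is set.  Whether this compiles to a
 * conditional branch or to neg + cmovne is the compiler's call;
 * stating the intent this directly lets it choose. */
static int cond_negate(int src, int flip)
{
    return flip ? -src : src;
}
```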
  #217  
Old 19th June 2013, 02:45 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Actually this is not the best example either.

But I am curious and do not have a profiler I know how to use.

Starting at line 792 of n64video.cpp in r75:
Code:
if (tile[num].mask_t)
{
    if (tile[num].mt)
    {
        wrap = *T >> tile[num].f.masktclamped;
        wrap &= 1;
        *T ^= (-wrap);
    }

    *T &= maskbits_table[tile[num].mask_t];
}
This branch elimination might, or might not, be better than the original code.
I'm somewhat asking, but I think it probably is.

Code:
if (tile[num].mask_t)
{
    wrap = *T >> (tile[num].f.masktclamped & -tile[num].mt);
    *T ^= -(wrap &= 1);
    *T &= maskbits_table[tile[num].mask_t];
}
I think it could be faster, but I'm definitely wrong if either:
  • It is extremely common in RDP cycles for `tile[num].mt` to be FALSE, so that the original code would not have taken the branch at least a third of the time.
  • `tile[num].mt` is not strictly a 0 or a 1 (it's hard to tell; right now angrylion has it declared as type `int` in the TILE struct)--it could be 2 or greater. If so, my code must become even slower: it would have to be "... & -(tile[num].mt != 0)" and not "... & -tile[num].mt".
So it depends.
Again not a good example compared to other stuff I've seen.


But I want to get this question off my head first.
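Since the post is "somewhat asking", the question can be settled mechanically by running both forms over small inputs. In the harness below the tile[num] fields are flattened into parameters and the maskbits value is made up; it shows the two forms already disagree when mt is 0 and *T is odd, because the rewrite applies the XOR unconditionally:

```c
#include <stdint.h>

/* Original form: the XOR is guarded by mt.  Both helpers assume the
 * outer `if (tile[num].mask_t)` test was already true. */
static int32_t wrap_original(int32_t T, int mt, int masktclamped,
                             int32_t maskbits)
{
    if (mt) {
        int32_t wrap = (T >> masktclamped) & 1;
        T ^= -wrap;
    }
    return T & maskbits;
}

/* Branch-eliminated form from the post: the XOR happens even when
 * mt == 0, which changes the result for odd T. */
static int32_t wrap_branchless(int32_t T, int mt, int masktclamped,
                               int32_t maskbits)
{
    int32_t wrap = T >> (masktclamped & -mt);
    T ^= -(wrap & 1);
    return T & maskbits;
}
```

For mt == 1 the two agree; for mt == 0 with odd T they diverge, e.g. wrap_original(1, 0, 0, 1) is 1 while wrap_branchless(1, 0, 0, 1) is 0.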

Last edited by HatCat; 19th June 2013 at 02:49 PM.
  #218  
Old 19th June 2013, 02:54 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by MarathonMan View Post
If you think about how it breaks down to assembly, his is going to win every time... because he semantically advised the compiler about what he intended to do.

(As if I haven't forced that down your throat enough)

Really, though, if you look at what the compiler is probably generating in this case:

Code:
if (something_zero_or_not_zero)

Can just break down into:

testl %eax, %eax
je eax_is_zero

eax_is_nonzero:
 add; add; add;
 jmp prologue

eax_is_zero:
  sub; sub; sub;

prologue:
OTOH, your branch will result in executing either the same number of instructions, or (if it branches) twice the instructions. And the taken branch uses the inferior IMUL (TBH, I don't know why a negate isn't applicable here?)

My point is, if the compiler knew that ADDs and IMULs were faster than the conditional branch to ADD/SUB blocks, or it could avoid a branch in some manner, it would have emitted that code.

Trust the compiler. Let the compiler flow through you: http://blog.heartland.org/wp-content...d-300x225.jpeg
I was using neither.

In case you weren't following, I was already using NEGate.

There is no extra code with my method: it just unconditionally copies the data over, and then, if (flip) is set, uses the NEG op on all 8 of them.

What the hell.
You must not be following me. This can't possibly be slower than his way of writing it--maybe the same.
  #219  
Old 19th June 2013, 02:55 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
What the hell.
You must not be following me.
Probably not. For some reason, I have the hardest time digesting your posts.
  #220  
Old 19th June 2013, 03:13 PM
shunyuan's Avatar
shunyuan shunyuan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Apr 2013
Posts: 491
Default

Quote:
Originally Posted by FatCat View Post

Code:
if (tile[num].mask_t)
{
    if (tile[num].mt)
    {
        wrap = *T >> tile[num].f.masktclamped;
        wrap &= 1;
        *T ^= (-wrap);
    }

    *T &= maskbits_table[tile[num].mask_t];
}
Code:
if (tile[num].mask_t)
{
    wrap = *T >> (tile[num].f.masktclamped & -tile[num].mt);
    *T ^= -(wrap &= 1);
    *T &= maskbits_table[tile[num].mask_t];
}
Your optimized version is hard to understand and debug in comparison with angrylion's original version.
__________________
---------------------
CPU: Intel U7300 1.3 GHz
GPU: Mobile Intel 4 Series (on board)
AUDIO: Realtek HD Audio (on board)
RAM: 4 GB
OS: Windows 7 - 32 bit


Powered by vBulletin® Version 3.7.3
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.