#191  
Old 18th June 2013, 07:12 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

On the topic of more optimizing: here are some techniques I have not yet talked about.

I fear I should have discussed this sooner, as this appears to be an extremely common problem with this RDP:
static condition codes

There must be thousands of unneeded branches that I've come across already in plugins.

Here is an example of how to eliminate one:
Code:
			color->a = (c & 1) ? 0xFF : 0x00;
If we define `color->a` to be a signed char (or INT8) data type, we can instead do this:

Code:
			color->a = -(c & 1); /* -TRUE is 0xFF; -FALSE is 0x00 */
This is one example of probably hundreds!

I swear, give me an if-else thing in this source code, and I can probably find a branch-free static work around.

Here, another example from the render_spans functions:
Code:
void render_spans_1cycle_complete(int start, int end, int tilenum, int flip)
{
...
	int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
	int xinc;
	if (flip)
	{
		drinc = spans_dr;
		dginc = spans_dg;
		dbinc = spans_db;
		dainc = spans_da;
		dzinc = spans_dz;
		dsinc = spans_ds;
		dtinc = spans_dt;
		dwinc = spans_dw;
		xinc = 1;
	}
	else
	{
		drinc = -spans_dr;
		dginc = -spans_dg;
		dbinc = -spans_db;
		dainc = -spans_da;
		dzinc = -spans_dz;
		dsinc = -spans_ds;
		dtinc = -spans_dt;
		dwinc = -spans_dw;
		xinc = -1;
	}
...
You could eliminate the cost of weighing these branches just by removing them:

Code:
void render_spans_1cycle_complete(int start, int end, int tilenum, int flip)
{
...
    int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
    int xinc;

 // flip = (flip != 0); // x86 SETNE, force bool type to 0 or 1 if needed
    xinc = flip ^ (flip - 1);
/* At this point, we have forced xinc to either +1 or -1.
 * xinc is now the sign coefficient of whether to negate each var.
 */

    drinc = spans_dr;
    dginc = spans_dg;
    dbinc = spans_db;
    dainc = spans_da;
    dzinc = spans_dz;
    dsinc = spans_ds;
    dtinc = spans_dt;
    dwinc = spans_dw;
/*
 *  drinc *= xinc;
 *  dginc *= xinc;
 *  dbinc *= xinc;
 *  ... etc., using xinc == -1 || +1 as a multiplier.
 *  This is branch-free but maybe slower, as NEG ops are way faster than IMUL.
 *  So in this case we'll keep the branch laid out preferably in a self-mutable manner:
 */
    if (!flip)
        goto BRANCH_NEXT;
    else
    {
        drinc = -drinc;
        dginc = -dginc;
        dbinc = -dbinc;
        dainc = -dainc;
        dzinc = -dzinc;
        dsinc = -dsinc;
        dtinc = -dtinc;
        dwinc = -dwinc;
// Remember that x86 NEG opcode takes only one operand.
// Try saying only var1 = -var1, not var1 = -var2.
    }
BRANCH_NEXT:
...
Okay, so the last example kind of sucked, but it's still an improvement.
Too many examples to post here. If I succeeded at teaching this thing I'll have a shitload of revisions to merge in.

Last edited by HatCat; 18th June 2013 at 07:17 PM.
  #192  
Old 18th June 2013, 08:49 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Code:

c0 = tc16[taddr0 & 0x3ff];
c0 = tlut[((c0 >> 8) << 2) ^ WORD_ADDR_XOR];
c1 = tc16[taddr1 & 0x3ff];
c1 = tlut[((c1 >> 8) << 2) ^ WORD_ADDR_XOR];
c2 = tc16[taddr2 & 0x3ff];
c2 = tlut[((c2 >> 8) << 2) ^ WORD_ADDR_XOR];
c3 = tc16[taddr3 & 0x3ff];
c3 = tlut[((c3 >> 8) << 2) ^ WORD_ADDR_XOR];
c0 = tc16[taddr0 & 0x3ff];
c0 = tlut[((c0 & ~3) >> 6) ^ WORD_ADDR_XOR];
c1 = tc16[taddr1 & 0x3ff];
c1 = tlut[((c1 & ~3) >> 6) ^ WORD_ADDR_XOR];
c2 = tc16[taddr2 & 0x3ff];
c2 = tlut[((c2 & ~3) >> 6) ^ WORD_ADDR_XOR];
c3 = tc16[taddr3 & 0x3ff];
c3 = tlut[((c3 & ~3) >> 6) ^ WORD_ADDR_XOR];
Why on earth would anybody want to do either of these techniques?

shr ax, 8
shl ax, 2
# is redundant

and ax, -4; # ~3 = 0b1111111111111 [...] 00
shr ax, 6
# is redundant

It should be just plain (cx >> 6) ^ WORD_ADDR_XOR.
I won't bother changing it myself though right now because I expect GCC is intelligent enough to figure that one out for us.
If I optimize too much of the code prematurely I lose sync with RDP updates.
  #193  
Old 18th June 2013, 09:07 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Another one of r67's before/after comparisons:

Code:
    color0->r = color0->g = color0->b = (c0 >> 8) & 0xff; 
    color0->r = color0->g = color0->b = c0 >> 8;
Why is it not agreeable to just say `(unsigned char)(c0 >> 8)` ?
This tells the compiler that we're trying to read the upper byte of a 16-bit word on the Intel architecture.
As such, (unsigned char)(var >> 8) is more likely to get compiled down to something like `mov ecx, ah` than either of the two methods above is.

Saying "& 0xFF" buys nothing over the cast, because (unsigned char) means the same thing on effectively every system supporting the C language: 8 bits wide.
It's that way because it's convenient. Indicating the byte data type explicitly is more concrete, and feeds the compiler better, than writing 0xFF all the time.

Even if we're already certain that the value held in a 32-bit register is never >= 0x0100, you can never go wrong with forcing an 8-bit data transfer when we need exactly 8 bits, so there is no loss of efficiency in including the type expression here.

Alright, I hope I'm finished ranting now. Need to concentrate.
  #194  
Old 18th June 2013, 10:58 PM
MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
another one of r67's before-after compares:

Code:
    color0->r = color0->g = color0->b = (c0 >> 8) & 0xff; 
    color0->r = color0->g = color0->b = c0 >> 8;
Why is it not agreeable to just say `(unsigned char)(c0 >> 8)` ?
This tells the compiler that we're trying to read the upper byte of a 16-bit word on the Intel architecture.
As such, (unsigned char)(var >> 8) is more likely to get compiled down to something like `mov ecx, ah` than either of the two methods above is.

Saying "& 0xFF" buys nothing over the cast, because (unsigned char) means the same thing on effectively every system supporting the C language: 8 bits wide.
It's that way because it's convenient. Indicating the byte data type explicitly is more concrete, and feeds the compiler better, than writing 0xFF all the time.

Even if we're already certain that the value held in a 32-bit register is never >= 0x0100, you can never go wrong with forcing an 8-bit data transfer when we need exactly 8 bits, so there is no loss of efficiency in including the type expression here.

Alright, I hope I'm finished ranting now. Need to concentrate.
Erm... a (good) optimizing compiler would generate the same code either way.

Code:
0000000000000000 <func>:
   0:	c1 ff 08             	sar    $0x8,%edi
   3:	89 f8                	mov    %edi,%eax
   5:	c3                   	retq   
   6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
   d:	00 00 00 

0000000000000010 <func2>:
  10:	c1 ff 08             	sar    $0x8,%edi
  13:	89 f8                	mov    %edi,%eax
  15:	c3                   	retq
  16:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1d:	00 00 00
Same goes for many of your other examples:

Code:
color->a = (c & 1) ? 0xFF : 0x00;
color->a = -(c & 1); /* -TRUE is 0xFF; -FALSE is 0x00 */
gcc-4.8 generates:

Code:
0000000000000020 <func3>:
  20:	83 e7 01             	and    $0x1,%edi
  23:	89 f8                	mov    %edi,%eax
  25:	f7 d8                	neg    %eax
  27:	c3                   	retq   
  28:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  2f:	00 

0000000000000030 <func4>:
  30:	83 e7 01             	and    $0x1,%edi
  33:	89 f8                	mov    %edi,%eax
  35:	f7 d8                	neg    %eax
  37:	c3                   	retq
Maybe Micro$haft's products can't figure it out...

Either way, I much prefer angrylion's code as it's much clearer to read and is no slower!

OTOH, many of your examples which eliminate branches could be converted into SSE without any effort...

Last edited by MarathonMan; 18th June 2013 at 11:09 PM.
  #195  
Old 18th June 2013, 11:52 PM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by MarathonMan View Post
Erm... a (good) optimizing compiler would generate the same code either way.
You're not reading me correctly.
He's not forcing an 8-bit data transfer.

Because of this the code output is not optimized, much like those two code examples you just pasted:

Code:
0000000000000000 <func>:
   0:	c1 ff 08             	sar    $0x8,%edi
   3:	89 f8                	mov    %edi,%eax
   5:	c3                   	retq   
   6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
   d:	00 00 00 

0000000000000010 <func2>:
  10:	c1 ff 08             	sar    $0x8,%edi
  13:	89 f8                	mov    %edi,%eax
  15:	c3                   	retq
  16:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1d:	00 00 00
Neither of those are as direct as what I said.

The direct solution is MOV EAX, DH, not,
SAR EDX, 8; and then MOV EAX, EDX (or edi in your case, which does not even compare).

Why do a shift and then a move, when you could have just done one move?

If you FORCE an 8-bit transfer with the (unsigned char) data type, which is the 8 bits he is aiming to extract, then the compiler will know that he is requesting the upper 8 bits of a 16-bit sub-register word of a 32-bit resource.

Your method just shifts to the right, which is hazardous if the upper 16 bits of the 32-bit source ever end up filled; the programmer has had to take pains to ensure that never happens so that an AND mask isn't needed.
My method is both less maintenance-hazardous and more direct.

Quote:
Originally Posted by MarathonMan View Post
Same goes for many of your other examples:

Code:
color->a = (c & 1) ? 0xFF : 0x00;
color->a = -(c & 1); /* -TRUE is 0xFF; -FALSE is 0x00 */
gcc-4.8 generates:

Code:
0000000000000020 <func3>:
  20:	83 e7 01             	and    $0x1,%edi
  23:	89 f8                	mov    %edi,%eax
  25:	f7 d8                	neg    %eax
  27:	c3                   	retq   
  28:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  2f:	00 

0000000000000030 <func4>:
  30:	83 e7 01             	and    $0x1,%edi
  33:	89 f8                	mov    %edi,%eax
  35:	f7 d8                	neg    %eax
  37:	c3                   	retq
That's cool that GCC figures it out, but not everybody has the latest version of GCC.
As of yet I can only use MinGW GCC.

And it's no excuse to not explicitly write the correct optimized algorithm for the compiler.
I find it no less readable to say = (unsigned char)(-(var & 1)), over saying the less optimized = (var & 1) ? 0xFF : 0x00.

Unless SSE isn't the only thing you want to cause a lock-in (i.e. writing deliberately inefficient C code only for an intended selection of compilers to figure it out).

Quote:
Originally Posted by MarathonMan View Post
Either way, I much prefer angrylion's code as it's much clearer to read and is no slower!

OTOH, many of your examples which eliminate branches could be converted into SSE without any effort...
Well I only posted one-and-a-half examples of how to eliminate branches (only 1 technically, but 2 if you count my commented-out idea of calculating the -1 / +1 sign multiplier and statically multiplying the variables instead of using an if-else branch frame), so I think this suggests maybe you weren't really paying attention.

I'm not sure why it would be easier for another human to read, as I am not another human (I am only me). But I do know that explicitly coding for efficiency is more readable to the compiler, encourages non-SSE intrinsics, and helps produce the correct code without assuming the compiler will figure it out.
In practice, when possible, what is more readable to the compiler should be more readable to you, not just what one man finds readable.

Last edited by HatCat; 18th June 2013 at 11:59 PM.
  #196  
Old 19th June 2013, 01:02 AM
MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
He's not forcing an 8-bit data transfer.

...

Neither of those are as direct as what I said.

The direct solution is MOV EAX, DH, not,
SAR EDX, 8; and then MOV EAX, EDX (or edi in your case, which does not even compare).

Why do a shift and then a move, when you could have just done one move?

The MOV EAX, DH is the same as SAR EDX, 8 in this case. The compiler knows the type returned by the function is an 8-bit type and will zero- (or sign-) extend it later if it's upcast back to a 32-bit type. Otherwise, it'll only use the lower 8 bits.

What I was trying to get at was the compiler is smart enough to realize that (unsigned char) (word >> 8) and (unsigned char) ((word >> 8) & 0xFF) are the same operation; the latter just serves as syntactic sugar to the reader and makes them aware that they are only concerned with the lower byte, even if the types change.

Quote:
Originally Posted by FatCat View Post
Your method just shifts to the right, which is hazardous if the upper 16 bits of the 32-bit source ever end up filled; the programmer has had to take pains to ensure that never happens so that an AND mask isn't needed.

My method is both less maintenance-hazardous and more direct.
If i used unsigned int instead of int (like the source probably does), then it would have emitted a srl and it would have been the same thing anyways.

Right shifts are not guaranteed to be arithmetic or logical by the C standard; bitwise AND-ing with 0xFF guarantees, regardless of the type and shift used, that only the low 8 bits of the result are used. And it results in no extra operations on a good compiler.

IMHO, at a bare minimum, you should cast the result to a UINT8 if you want to remove the & 0xFF (as you did in your post, but not the code block).

Quote:
Originally Posted by FatCat View Post
And it's no excuse to not explicitly write the correct optimized algorithm for the compiler.
I find it no less readable to say = (unsigned char)(-(var & 1)), over saying the less optimized = (var & 1) ? 0xFF : 0x00.

Unless SSE isn't the only thing you want to cause a lock-in (i.e. writing deliberately inefficient C code only for an intended selection of compilers to figure it out).
If you do the former, you're going to completely drive away most developers from looking at your code. It's incredibly difficult to process the mask over a ternary operation in this case.

Quote:
Originally Posted by FatCat View Post
Well I only posted one-and-a-half examples of how to eliminate branches (only 1 technically, but 2 if you count my commented-out idea of calculating the -1 / +1 sign multiplier and statically multiplying the variables instead of using an if-else branch frame), so I think this suggests maybe you weren't really paying attention.
No, I read everything. I just thought they were really poor examples seeing as most compilers are smart enough to figure these things out and it's incredibly difficult to read your changes.

Quote:
Originally Posted by FatCat View Post
I'm not sure why it would be easier for another human to read, as I am not another human (I am only me). But I do know that explicitly coding for efficiency is more readable to the compiler, encourages non-SSE intrinsics, and helps produce the correct code without assuming the compiler will figure it out.

In practice, when possible, what is more readable to the compiler should be more readable to you, not just what one man finds readable.
That is a 90s mentality. Embrace modern compilers.

Look at the ternary operator case: the compiler clearly knows that it should produce 0xFF if the least significant bit is set to 1. Otherwise it should produce 0x00. If somebody releases a future instruction (like an extendBitToByte -- completely theoretical), the compiler's instruction scheduler will clearly identify that it's appropriate to use that instruction. In the meantime, it'll be smart enough to figure out what you've ultimately written.

OTOH, it might not be able to go backwards (from your more ambiguous code to the one displaying clear semantics). Semantic analysis is really, really hard -- think about how incredibly difficult it is to "disassemble" ASM back into C without losing any of the semantics. When you negate the result of an and, the compiler might not be able to piece together what you originally intended to do.
  #197  
Old 19th June 2013, 01:12 AM
MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Also, you don't need a recent version of gcc to get these kinds of optimizations:

Code:
Disassembly of section .text:

0000000000000000 <foo>:
   0:	83 e7 01             	and    $0x1,%edi
   3:	89 f8                	mov    %edi,%eax
   5:	f7 d8                	neg    %eax
   7:	c3                   	retq   
xxxxxx@alpha:~  
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
...
gcc version 4.4.6 (Debian 4.4.6-11)
Couldn't find anything older, but I would be surprised if it didn't generate the same thing as well.
  #198  
Old 19th June 2013, 02:15 AM
HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by MarathonMan View Post
The MOV EAX, DH is the same as SAR EDX, 8 in this case. The compiler knows the type returned by the function is an 8-bit type and will zero- (or sign-) extend it later if it's upcast back to a 32-bit type. Otherwise, it'll only use the lower 8 bits.
Not quite. COLOR is defined here:

Code:
typedef struct
{
	INT32 r, g, b, a;
} COLOR;
So there is no indication at all of an 8-bit data type anywhere with your shift method.
Your hazard of shifting right by 8 bits isn't altered.
You also chose, for some reason, to say "SAR" instead of "SHR" (a simple zero-extending shift), perhaps to purposely show off how maintenance-hazard-prone the algorithm can be while attempting to match the directness of MOV EAX, DH. That move is, at any rate, still faster than doing an extra shift instruction on top of the MOV that's already required.

I wasn't questioning whether your way of doing it reaches the same result (provided, in your case, that the programmer makes sure 32-bit params don't get passed to the procedure call with trash upper bits), just that it wasn't as optimized.

Quote:
Originally Posted by MarathonMan View Post
What I was trying to get at was the compiler is smart enough to realize that (unsigned char) (word >> 8) and (unsigned char) ((word >> 8) & 0xFF) are the same operation; the latter just serves as syntactic sugar to the reader and makes them aware that they are only concerned with the lower byte, even if the types change.
I was not complaining when he said
Code:
(word >> 8) & 0xFF
I was more criticizing that it got changed to:
Code:
word >> 8
Which was neither as fast as the former, nor as secure.
(word >> 8) & 0xFF was better, but I would have said (unsigned char)(word >> 8).

And I'm not sure on this, but maybe you didn't mean to say "(unsigned char) ((word >> 8) & 0xFF)" because that's not what he wrote originally, so that couldn't be what I was talking about. Also, it's redundant. Picking either 0xFF or (unsigned char) is fine, but no need for both.

Why would you criticize readability to someone else yet post redundant code expressions as "being the same thing"? Not sure which point you're trying to make.

Quote:
Originally Posted by MarathonMan View Post
If i used unsigned int instead of int (like the source probably does), then it would have emitted a srl and it would have been the same thing anyways.
You mean SHR, as srl is MIPS

SHR EAX, 8 and MOV EAX, DH are not the same thing.
Like I said, you can arrange it so that they accomplish the same effect, but the operations are still different enough that one is less hazard-prone than the other, and faster as well, granted that the SHR method already requires an extra MOV operation to be done first anyway.

This was never a bug report, just optimization critiques.
But, I think you knew that. :/

Quote:
Originally Posted by MarathonMan View Post
Right shifts are not guaranteed to be arithmetic or logical by the C standard; bitwise AND-ing with 0xFF guarantees, regardless of the type and shift used, that only the low 8 bits of the result are used. And it results in no extra operations on a good compiler.
So then actually, you're agreeing that (unsigned char) (identical of course to & 0xFF) is better than shifting right by 8 I take it?

... VVV

Quote:
Originally Posted by MarathonMan View Post
No, I read everything. I just thought they were really poor examples seeing as most compilers are smart enough to figure these things out and it's incredibly difficult to read your changes.
^^^ ...

Then you're also agreeing that your compiler read your change as an obsolete shift to the right by 8 bits.

Because the compiler would not have generated that code if you used the correct expression: (unsigned char)(var >> 8)
Code:
MOV eax, dh
So your assumption about the modernness of compilers saving you did not work there.

Quote:
Originally Posted by MarathonMan View Post
If you do the former, you're going to completely drive away most developers from looking at your code. It's incredibly difficult to process the mask over a ternary operation in this case.
As a fellow systems programmer, you should understand that the ability to read bit-wise masks is more important than the ability to read a ternary if-else operator.

Think like the computer, not some human with his own opinion on readability.

And if anyone disagrees with me, I have no complaint if they make no modifications to the code and don't look at it.

Personally I find none of the shuffling/masking SSSE3 version of your RSP implementation readable in the least, but that didn't stop me from the anticipation of supporting it over the more readable, non-SSE method.

Quote:
Originally Posted by MarathonMan View Post
That is a 90s mentality. Embrace modern compilers.

Look at the ternary operator case: the compiler clearly knows that it should produce 0xFF if the least significant bit is set to 1. Otherwise it should produce 0x00. If somebody releases a future instruction (like an extendBitToByte -- completely theoretical), the compiler's instruction scheduler will clearly identify that it's appropriate to use that instruction. In the meantime, it'll be smart enough to figure out what you've ultimately written.
Hm. =[
This is a 2000s mentality.

Look at the mask instead
Code:
color->a = -(c & 1) & 0xFF;
The compiler clearly knows that (c & 1) has only two possible values, 0 and 1. Regardless, -1 & 0xFF is 0xFF and 0 & 0xFF is 0x00.
It's perfectly readable, unlike the if-else ternary, which is slower for the compiler to process and takes guesswork to compile down to the exact algorithm I just wrote.

Therefore my form is more readable to the compiler and more portable across compilers that don't all agree with the way you think (about guessing things out for you).

Last edited by HatCat; 19th June 2013 at 02:27 AM.
  #199  
Old 19th June 2013, 02:56 AM
angrylion is offline
Member
 
Join Date: Oct 2008
Location: Moscow, Russia
Posts: 36
Default

Quote:
Originally Posted by FatCat View Post
The direct solution is MOV EAX, DH, not,
SAR EDX, 8; and then MOV EAX, EDX (or edi in your case, which does not even compare).
The Pentium 4 Optimization Manual says that I should avoid the use of AH, etc., because it involves an internal shift micro-op, so these two code sequences should be almost equal on what I have.

Quote:
Originally Posted by FatCat View Post
That's cool that GCC figures it out, but not everybody has the latest version of GCC.
My MSVC2002 figures this out too, so I don't care, I prefer to preserve what's readable to me.

Quote:
Originally Posted by FatCat View Post
if you count my commented-out idea of calculating the -1 / +1 sign multiplier and statically multiplying the variables instead of using an if-else branch frame
IMUL has a latency of 10 on my Prescott. I am not going to consider replacing simpler branches with multiplies, even less so replacing 1 conditional and 1 unconditional branch with 8 multiplies. MooglyGuy already advised me of this once, and multiplies were always slower in my profiler.

Quote:
Originally Posted by FatCat
I won't bother changing it myself though right now because I expect GCC is intelligent enough to figure that one out for us.
MSVC2002 doesn't insert AND opcode here either.

Quote:
Originally Posted by FatCat
Okay, so the last example kind of sucked, but it's still an improvement.
On MSVC2002 you transformed this sequence:
Code:
cmp
jz noflip
mov reg, mem
mov mem, reg
jmp next
noflip: mov reg, mem
neg reg
mov mem, reg
next:
to this one:

Code:
cmp
mov reg, mem
mov mem, reg
jnz next
neg reg
mov mem, reg
next:
So you eliminated one unconditional jump, but you inserted 8 additional mov mem, reg in the worst case, which will happen 50% of the time. I'm not going to include this without profiling results.
  #200  
Old 19th June 2013, 03:04 AM
MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
Not quite. COLOR is defined here:

Code:
typedef struct
{
	INT32 r, g, b, a;
} COLOR;
Why in the name of all that is holy...

Yeah, in that case, you can't do anything short of a movzbl.

Quote:
Originally Posted by FatCat View Post

I was not complaining when he said
...
Because I totally read your post as claiming that something we both clearly agree is inferior was the better solution, duh.

Man, I feel like an idiot.

I misread your post as comparing angrylion's HEAD against my revisions... turns out I'm just a dumb idiot!

Quote:
Originally Posted by FatCat View Post
This was never a bug report, just optimization critiques.
But, I think you knew that. :/
If only I had known the type of COLOR. I thought you were advocating something as better even though it's clearly inferior and not any more optimized.

Quote:
Originally Posted by FatCat View Post
As a fellow systems programmer, you should understand that the ability to read bit-wise masks is more important than the ability to read a ternary if-else operator.

Think like the computer, not some human with his own opinion on readability.

And if anyone disagrees with me I have no complaint that they make no modifications to the code and do not look at it.

Personally I find none of the shuffling/masking SSSE3 version of your RSP implementation readable in the least, but that didn't stop me from the anticipation of supporting it over the more readable, non-SSE method.
Oh, I completely agree. That has got to be one of the most illegible pieces of muck work that I've ever written. But since compilers can't vectorize for beans (at least not well, anyway) in the current form, and SSE is uber cool and fast, I didn't have much of a choice.

I still disagree on the ternary operation, however. My opinion stands on that. Really, a program is just a giant tree to the compiler. If you write something that clearly describes a conditional operation, the compiler is aware of the semantics of that operation.

OTOH, if you try to obscure it with a mask, the compiler will have no idea what you're doing without semantic analysis to determine that the mask is really being used as a conditional operation.

I really need to get working on my emulator so I can do this cool stuff.

Last edited by MarathonMan; 19th June 2013 at 03:07 AM.