#201  
Old 19th June 2013, 03:16 AM
angrylion
Member
 
Join Date: Oct 2008
Location: Moscow, Russia
Posts: 36

What's wrong with the type of COLOR? You can't make the color components 8-bit; the color combiner operates on sign.9 values. If you attempt that, you'll break the color of the big rotating "N" in the Zelda OoT intro.
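(A minimal sketch of what sign.9 storage implies -- illustrative only, not the plugin's actual code; the struct and helper names are made up:)

Code:
#include <stdint.h>

/* Components wide enough for sign.9 combiner values, i.e. roughly
 * -256..+255 under a 9-bit two's-complement reading -- a range an
 * unsigned 8-bit component cannot hold. */
typedef struct {
    int32_t r, g, b, a;
} COLOR;

/* Sign-extend the low 9 bits of a combiner input. */
static int32_t sign9(int32_t x)
{
    x &= 0x1FF;                         /* keep 9 bits           */
    return (x & 0x100) ? x - 0x200 : x; /* bit 8 set -> negative */
}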

Last edited by angrylion; 19th June 2013 at 03:19 AM.
  #202  
Old 19th June 2013, 03:36 AM
MarathonMan
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454

Quote:
Originally Posted by angrylion
What's wrong with the type of COLOR? You can't make the color components 8-bit; the color combiner operates on sign.9 values. If you attempt that, you'll break the color of the big rotating "N" in the Zelda OoT intro.
Hmm... I forgot about the 9/9/9 RGB color mode even though I just mentioned it the other day. I have to study the architecture of the RDP more.

Even so, a UINT16 is better than a UINT32, no? On 64-bit archs, you could pass around a full RGBA value instead of a pointer and save yourself the cost of the dereference.
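Something like this is what I have in mind (a sketch assuming the SysV x86_64 calling convention; the names are made up):

Code:
#include <stdint.h>

typedef struct {
    uint16_t r, g, b, a;   /* 8 bytes total: fits in one GPR */
} COLOR;

/* Under the SysV x86_64 ABI, an 8-byte integer struct is passed in
 * %rdi and returned in %rax, so nothing is ever dereferenced. */
static COLOR halve_red(COLOR c)
{
    c.r >>= 1;
    return c;
}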

Last edited by MarathonMan; 19th June 2013 at 03:42 AM.
  #203  
Old 19th June 2013, 03:48 AM
angrylion
Member
 
Join Date: Oct 2008
Location: Moscow, Russia
Posts: 36

Maybe; I don't recommend trusting anything without experimenting with a profiler. Last I heard, 32-bit memory accesses and arithmetic were generally faster than 16-bit ones on x86-32. If you want to avoid dereferencing pointers for the combiner, does that mean you want a set of "combiner input" COLORs and to copy the same colors into them twice in 2-cycle mode?
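(To be concrete, I mean an arrangement like this -- hypothetical names, not code from any plugin:)

Code:
#include <stdint.h>

typedef struct { int32_t r, g, b, a; } COLOR;

static COLOR combiner_input[2];   /* one input set per cycle */

/* In 2-cycle mode the same source color would have to be copied into
 * both cycles' input slots, instead of both slots pointing to it. */
static void set_combiner_inputs(const COLOR* src)
{
    combiner_input[0] = *src;
    combiner_input[1] = *src;
}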
  #204  
Old 19th June 2013, 03:52 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Quote:
Originally Posted by MarathonMan
Hmm... I forgot about the 9/9/9 RGB color mode even though I just mentioned it the other day. I have to study the architecture of the RDP more.

Even so, a UINT16 is better than a UINT32, no? On 64-bit archs, you could pass around a full RGBA value instead of a pointer and save yourself the cost of the dereference.

EDIT: Might also enable cwtl to be emitted on x86, which could increase performance. Forgot if that was a macro-op or micro-op though.
Oh, is UINT16 better than UINT32?

I thought that was just to save space/program size.

AFAIK the fastest data type is just `int`, not long, short, char, w/e.
On Win32 that's the same size as the INT32 or UINT32 macro, so maybe that's why MESS chose UINT32?

Not sure, but I didn't think UINT16 performed any better than UINT32 on a 32-bit system.

I don't mind if he uses UINT32, but if he wants it optimized, it needs to be the movzbl like you said or, if doing register-to-register communication instead, the MOV EAX, DH like I said (move the 8-bit upper half of 16-bit DX into 32-bit EAX as the resulting color1.r structure member).

Quote:
Originally Posted by MarathonMan
If I had known the type of COLOR... I thought you were advocating something as better even though it's clearly inferior and not any more optimized.
Well, even so, let's pretend the COLOR structure used UINT8's, not UINT32's.
(Btw, I never say "UINT8" instead of "unsigned char" because I prefer relying on the pure, built-in C keywords--yes, you're more than welcome to accuse me of making ugly code to scare people away; I won't argue with that. I sometimes want to maximize portable maintenance and independence from external macros, leaving interpretation to the compiler, rather than use the neater, smaller, human-readable "UINT8" name.)

This is perfectly fine, ultimately: you're still shifting a 32-bit int to the right and capturing the resulting low 8 bits.
I don't like to write it this way, however, because it issues an ANSI C warning (an insignificant warning, but a warning message nonetheless): shifting 32 bits right by 8 and storing the result into an 8-bit target without explicitly converting it to the (unsigned char) type.

It will be just as optimized, but to get rid of the ANSI C warning some compilers might brag about, I would write byte = (unsigned char)(shit >> 8) instead of just byte = (shit >> 8), purely to be careful/explicit.

You don't have to though.
It does make it even more strongly emphasized that we're moving 8 bits, though (and even more strongly that we're moving SHIT!), so in that sense the intended optimization is more visible to the reader as well as to the compiler.
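In other words (a throwaway example, not from the plugin):

Code:
void example(unsigned int shit)        /* 32-bit source, as above */
{
    unsigned char byte;

    byte = shit >> 8;                  /* compiles, but some compilers
                                          warn about the implicit
                                          32-to-8-bit truncation */
    byte = (unsigned char)(shit >> 8); /* same machine code, explicit,
                                          no warning */
    (void)byte;                        /* silence unused warning */
}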

Quote:
Originally Posted by MarathonMan
Oh, I completely agree. That has got to be one of the most illegible pieces of muck work that I've ever written. But since compilers can't vectorize for beans (at least not well, anyway) in their current form, and SSE is uber cool and fast, I didn't have much of a choice.

I still disagree on the ternary operation, however. My opinion stands on that. Really, a program is just a giant tree to the compiler. If you write something that clearly describes a conditional operation, the compiler is aware of the semantics of that operation.

OTOH, if you try to obscure it with a mask, the compiler will have no idea what you're doing without semantic analysis to determine that the mask is really being used as a conditional operation.

I really need to get working on my emulator so I can do this cool stuff.
I wrote the mask because, ideally, it is the assembly-language output.

As you were saying,
target = (bitstring & 1) ? 0xFF : 0x00

... compiles to that assembly code you posted earlier.

I wrote the C version of that asm code.
That's why I find it more readable.

My argument is really as simple as that, but it doesn't prove that I'm right or that you're wrong in your view of it.

It reminds humans of the intended, necessary optimization the compiler SHOULD make--whether or not it's one of the modern ones that actually does. I see it as a habit/practice; I'm not saying everyone likes it.
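Side by side, the two spellings (a sketch; a good compiler should emit the same code for both):

Code:
/* conditional form: states the semantics outright */
unsigned char expand_ternary(unsigned int bitstring)
{
    return (bitstring & 1) ? 0xFF : 0x00;
}

/* mask form: the C version of the branchless assembly */
unsigned char expand_mask(unsigned int bitstring)
{
    return (unsigned char)(0 - (bitstring & 1));   /* 0x00 or 0xFF */
}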
  #205  
Old 19th June 2013, 04:02 AM
angrylion
Member
 
Join Date: Oct 2008
Location: Moscow, Russia
Posts: 36

Quote:
Originally Posted by FatCat
I don't mind if he uses UINT32, but if he wants it optimized, it needs to be the movzbl like you said or, if doing register-to-register communication instead, the MOV EAX, DH like I said (move the 8-bit upper half of 16-bit DX into 32-bit EAX as the resulting color1.r structure member).
As I said, AH accesses are not faster than SHR on the Pentium 4. Secondly, there's nothing preventing an abstract compiler from being smart enough to use AH accesses if I right-shift an unsigned 16-bit value by 8, like I do. And there is a marginal risk that an abstract compiler is stupid enough to insert an AND if I leave one in my C code.
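That is, given something like this (a minimal example, not the actual texel fetch code):

Code:
/* With a 16-bit source, (v >> 8) leaves only 8 significant bits, so
 * a compiler may read the high byte directly (an AH-style access or
 * a movzbl) without any masking. */
unsigned char hi8(unsigned short v)
{
    return v >> 8;
}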

Last edited by angrylion; 19th June 2013 at 04:09 AM.
  #206  
Old 19th June 2013, 04:24 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Quote:
Originally Posted by angrylion
Secondly, there's nothing preventing an abstract compiler from being smart enough to use AH accesses if I right-shift an unsigned 16-bit value by 8, like I do.
That certainly is wrong.

You're shifting a 32-bit value by 8.

The compiler would be DUMB to use an AH access, because AH is an 8-bit register access.
You only specified: take a 32-bit value and shift it right by 8.

That does not say to discard the upper 24 (or 16) bits.
For the compiler to make that assumption for you would be wrong.

So no, you must specify (unsigned char) or & 0xFF if you want it to use AH access.

If you prefer to go by the manual and not do that, fine.

Also, I didn't see your post on the last page before now, and I don't have time to address the other replies just yet.
  #207  
Old 19th June 2013, 04:41 AM
angrylion
Member
 
Join Date: Oct 2008
Location: Moscow, Russia
Posts: 36

Quote:
Originally Posted by FatCat
You're shifting a 32-bit value by 8.
Ah, you're partially correct; for some reason I shift a 16-bit unsigned value in some texel fetching functions and a 32-bit unsigned value in others. I'll see if I have an inclination to change this to always shift a 16-bit unsigned value.

Last edited by angrylion; 19th June 2013 at 05:09 AM.
  #208  
Old 19th June 2013, 01:09 PM
HatCat
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

This is just the thing I was talking about.

Relying on the compiler's built-in smarts is not always everything.

Sometimes it's just better for the maintainability of the code to always make sure of the type.
Code:
UINT8 upper8_from_16(UINT16 uint16) {
    return ((unsigned char)(uint16 >> 8));
}

UINT8 upper8_from_32(int uint32) {
    return ((unsigned char)(uint32 >> 8));
}
Only one of those functions works correctly without the (unsigned char) in it.
Either way, why remove it?
You know that you want exactly 8 bits.
The more information you explicitly give the compiler, the less likely it is to miss an optimization, since it knows all the facts.

Even for the first of those two functions, where the (unsigned char) isn't needed, it's less hazardous to future maintenance of the code (maybe later you change the parameter from UINT16 to a fast int type) and it minimizes the changes needed to keep your code portable.

It's just safer design IMO, and no less readable.
And if I'm wrong about the MOV EAX, DH 8-bit register access, OK, the compiler won't generate that code. It will pick something better if I'm wrong. Maybe the MOVZBL like MarathonMan said.

Quote:
Originally Posted by angrylion
The Pentium 4 Optimization Manual says I should avoid the use of AH, etc., because it involves an internal shift micro-op, so these two code sequences should be almost equal on what I have.
I miss 16-bit Intel programming.

Back then MOV AX, DH was beautiful.
You didn't need to shift by 8 because you had this internal workaround.

There also was no "E"AX, "E"BX...it was just the base register names.

If things have changed since then, fine; the compiler should know I'm wrong.
That doesn't change my C advice, as it should increase the likelihood of correct output as well as code safety.

Quote:
Originally Posted by angrylion
IMUL has a latency of 10 on my Prescott. I am not going to consider replacing simpler branches with multiplies, even less so replacing 1 conditional and 1 unconditional branch with 8 multiplies. MooglyGuy already advised me this once, and multiplies were always slower in my profiler.
I think that's why, as you can see in my code, I commented out the IMUL method.
It was just to demonstrate that branch management/weighing could be eliminated; as I already admitted at the end of the post, it was better to use a branch with the NEG opcode, not IMUL.

Quote:
Originally Posted by angrylion
MSVC2002 doesn't insert an AND opcode here either.
[^ in reply to me predicting this would happen with the (var & ~03) >> 8 thing, it being really just the same thing as (var >> 6)]

Which is one reason why I didn't change it.
It's easy for compiler theory to see that the mask in (var & ~03) >> 8 is redundant.
If, however, you also consider that more readable than just writing (var >> 6), then I think you hold a very strong opinion of readability vs. compiler theory.

Ultimately, either way, the real reason I'm not making my own suggested change to that code is that it's wisest to keep it in sync with yours, letter for letter, to make merging those more colossal source updates on Google Code easier.

Quote:
Originally Posted by angrylion
On MSVC2002 you transformed this sequence:
Code:
cmp
jz noflip
mov reg, mem
mov mem, reg
jmp next
noflip: mov reg, mem
neg reg
mov mem, reg
next:
to this one:

Code:
cmp
mov reg, mem
mov mem, reg
jnz next
neg reg
mov mem, reg
next:
So you eliminated one unconditional jump, but you inserted 8 additional mov mem, reg instructions in the worst case, which will happen 50% of the time. I'm not going to include this without profiling results.
Like I told you already, that example sucked.

It was a valid example of how to remove a branch, but not a good example of a situation where you should.
So midway through, I cancelled the multiply code and commented it out, showing instead that it could be better pre-optimized even in the case where you must branch.
  #209  
Old 19th June 2013, 01:19 PM
MarathonMan
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454

Quote:
Originally Posted by FatCat
Oh, is UINT16 better than UINT32?

I thought that was just to save space/program size.

...

Not sure, but I didn't think UINT16 performed any better than UINT32 on a 32-bit system.
Not what I was trying to get at - UINT16 is no faster; if anything, as AL mentioned, it has the possibility of being slower.

I was saying that, on x86_64 (because IDGAF about IA-32 anymore), you could do the following:

Code:
struct COLOR {
   UINT16 r, g, b, a;
};
When you do a function call, instead of passing the color through a pointer and having to dereference it immediately inside the function:

Code:
foo:
   movzwl 0(%rdi), %eax   # red component
   movzwl 2(%rdi), %ebx   # green component
   ...
   # epilogue: write the (modified parts of the) color back out
   movw %ax, 0(%rdi)
   movw %bx, 2(%rdi)
You can just pass it directly:

Code:
foo:
   # COLOR is already in %rdi

   # epilogue is cheap now!
   movq %rdi, %rax
   ret
But it could be slower. Depends on 16-bit operations and whatnot.

Quote:
Originally Posted by FatCat
The more information you explicitly give the compiler, the less likely it is to miss an optimization, since it knows all the facts.

Even for the first of those two functions, where the (unsigned char) isn't needed, it's less hazardous to future maintenance of the code (maybe later you change the parameter from UINT16 to a fast int type) and it minimizes the changes needed to keep your code portable.

It's just safer design IMO, and no less readable.
And if I'm wrong about the MOV EAX, DH 8-bit register access, OK, the compiler won't generate that code. It will pick something better if I'm wrong. Maybe the MOVZBL like MarathonMan said.
This, this, and this again. Compilers <3 semantics.

Last edited by MarathonMan; 19th June 2013 at 01:25 PM.
  #210  
Old 19th June 2013, 01:25 PM
HatCat
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Crazy.

You are wild with SSE ideas.

Well, 64-bit programming might benefit some things.
But I think of it this way:
the MIPS CPU is a 32-bit processor.
The RCP is 32-bit.

A Win32 machine is 32-bit.
I would just do the emulator in 32-bit.

There are lots of things, though, like the RSP vector accumulators and that color struct idea you mentioned, that could in fact benefit from 64-bit programming and the SSE/SSE2 support it entails.

The problem I have with SSSE3 is that the Nintendo 64 has supported those complex vector operations since 1995.
PCs with SSSE3 only started supporting *some* of those RSP vector ops, like, 20 years later?

That really gets to me. It makes me not want to depend on SSSE3.
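(For reference, the kind of mapping I mean -- a hedged sketch with SSE2/SSSE3 intrinsics, not code from any RSP emulator:)

Code:
#include <emmintrin.h>   /* SSE2 */
#include <tmmintrin.h>   /* SSSE3 */

/* Eight 16-bit lanes at once, roughly the shape of one RSP vector op. */
static __m128i mulhi_16x8(__m128i vs, __m128i vt)
{
    return _mm_mulhi_epi16(vs, vt);   /* SSE2: high halves of products */
}

static __m128i shuffle_bytes(__m128i v, __m128i control)
{
    /* SSSE3 pshufb: the sort of byte shuffle the RSP's element
     * selects have needed since the '90s. */
    return _mm_shuffle_epi8(v, control);
}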