  #1301  
Old 1st August 2014, 01:41 AM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

Well, since you're focusing on optimizations, I might as well discuss them and ask some questions.

I was looking at set_combine() and saw that it was assigning values to global variables that don't appear to be used outside of that function. Do you know why it's using global variables? Idk who wrote that part of the code, I'm just curious. I believe I saw the same thing with draw_triangle(). I should learn intrinsics ;/ Then I'd be able to write some SSE code.

Perhaps the things I brought up aren't too important, but I think they add up, imo. You'd also be able to reduce the file size with these micro-optimizations. I know you like to reduce size.
  #1302  
Old 1st August 2014, 01:50 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Because they are declared in rdp.h, but referenced in n64video.c.

Edit: for the combiner, you're actually right. They don't even need to be instantiated in n64video.c...

Last edited by HatCat; 1st August 2014 at 02:05 AM.
  #1303  
Old 1st August 2014, 02:04 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

I made the SET_COMBINE struct a local variable to the set_combine DP command, but actually the number of statements goes up, not down, due to having to allocate the space dynamically.

I guess I can play around with it some more. Combiner changes don't currently amount to much with the slow triangle rasterizer algorithm in place though.

Quote:
Originally Posted by RPGMaster View Post
I believe I saw the same thing with draw_triangle(). I should learn intrinsics ;/ Then I'd be able to write some SSE code.
With or without intrinsics, all the spans render methodology is still in n64video.c and feeds from huge heaps of global data administered by draw_triangle, so all of that stuff is needed.
  #1304  
Old 1st August 2014, 03:00 AM
GeneralGiantPanda GeneralGiantPanda is offline
Member
 
Join Date: Jun 2013
Posts: 53
Default

Since only higher end processors can even begin to run this semi smoothly, could you possibly add AVX instead of just SSE?
  #1305  
Old 1st August 2014, 03:18 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

I can't test AVX on this machine. I only have up to SSE3, so basically I can use up to SSE2 for optimization in this plugin.

And anyway, it wouldn't work on most computers if I did that. The benefits of SSE are already overestimated; many things still need to be done to the RDP core, independent of both AVX and SSE, that would make a much bigger difference.
  #1306  
Old 1st August 2014, 03:22 AM
GeneralGiantPanda GeneralGiantPanda is offline
Member
 
Join Date: Jun 2013
Posts: 53
Default

Is there source code for this plugin? I may just toy with an AVX implementation for giggles. If I DID decide to release my AVX edits, assuming I get it working, there's no copyrighted stuff in the code from Nintendo, right?
  #1307  
Old 1st August 2014, 03:30 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by GeneralGiantPanda View Post
Is there source code for this plugin? I may just toy with an AVX implementation for giggles.
Yes, it's attached with each zip in the OP of this thread.
When there's a binary download, there's a source download.

Be careful with adding SSE/AVX to some functions, as it often makes them slower instead of faster when there isn't a long-running stretch of vector operations. I was experimenting with vi_vl_lerp the other day, for example: I rewrote it as SSE, which resulted in fewer instructions and fewer data element moves/reads, but also only about 50% of the performance that function had before, so I undid that experiment.

Quote:
Originally Posted by GeneralGiantPanda View Post
If I DID decide to release my AVX edits, assuming I get it working, there's no copyrighted stuff in the code from Nintendo, right?
It comes licensed under the MAME license. I normally release my source code either unlicensed or public domain, but that was not an option here, as angrylion's source code is still lightly based on MAME source. So to comply with the MAME licensing rules, we may not post binaries or release the plugin without also supplying the relevant source. It's similar to GPL.

As for where the code came from: the MAME team's and angrylion's reverse-engineering of the RDP hardware, and also angrylion's time-and-time-again checked interpretation of some copyrighted hardware information (but no copy pasta...it is completely modernized). If you feel in doubt, don't worry. You're welcome to play around with it yourself without throwing source code out there if you feel at risk, as long as you also don't do binary releases.
  #1308  
Old 1st August 2014, 04:32 AM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

Quote:
Originally Posted by HatCat View Post
I made the SET_COMBINE struct a local variable to the set_combine DP command, but actually the number of statements goes up, not down, due to having to allocate the space dynamically.

I guess I can play around with it some more. Combiner changes don't currently amount to much with the slow triangle rasterizer algorithm in place though.
Lol my bad, it looks like the compiler produced better output than I thought. Still not perfect, but you're right that combiner changes don't amount to much. When I tried a local variable instead, it reduced the function by 48 bytes and 3 instructions. Lol, I tried posting both assembly outputs, but it exceeded the character limit ;/

Basically, I don't like using global variables if they don't have a preset value and are only used inside of one function. The exception, of course, is a large array or structure of data.

My problem with the set_combine function is that it's storing values into the combine structure, which never gets used outside of that function. It would be better to just assign the value to a local variable and call the function right after it is assigned the value, so that the compiler only uses registers and never redundantly stores data to a memory location.
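
A minimal sketch of the two layouts being compared, not the plugin's actual code: `combine_fields`, `apply_input` and the bit positions are made up for illustration, and the real structure has over 20 members. Whether the local version really ends up in registers depends on the compiler and on whether the helpers get inlined.

```c
#include <stdint.h>

/* Hypothetical subset of the combiner fields (the real struct is much larger). */
struct combine_fields {
    unsigned mul_rgb0;
    unsigned add_a1;
};

/* Hypothetical consumer standing in for helpers like SET_MUL_RGB_INPUT(). */
static unsigned apply_input(unsigned code)
{
    return code & 0x1F;
}

/* Global version: every field is stored to memory that outlives the call. */
static struct combine_fields g_combine;

unsigned set_combine_global(uint32_t w0, uint32_t w1)
{
    g_combine.mul_rgb0 = (w0 >> 15) & 0x1F; /* bit positions are illustrative only */
    g_combine.add_a1   = (w1 >>  0) & 0x07;
    return apply_input(g_combine.mul_rgb0) + apply_input(g_combine.add_a1);
}

/* Local version: nothing else can observe `c`, so the compiler is free
 * to keep its fields in registers and skip the stores entirely. */
unsigned set_combine_local(uint32_t w0, uint32_t w1)
{
    struct combine_fields c;

    c.mul_rgb0 = (w0 >> 15) & 0x1F;
    c.add_a1   = (w1 >>  0) & 0x07;
    return apply_input(c.mul_rgb0) + apply_input(c.add_a1);
}
```

Both functions compute the same result; the only difference is where the intermediate fields live, which is what the size and instruction-count comparison above is measuring.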

Then there's the fact that functions like SET_SUBA_RGB_INPUT() redundantly AND that code variable. I realized this after looking at Intel's compiler output, which is what made me look closer at the source code. You may not notice a CPU performance difference after making changes, but at least it will reduce size.

The compiler won't even inline functions like SET_SUBA_RGB_INPUT(), unless you convert the switch statements to LUTs. If the compiler doesn't inline your INLINE functions, is it better to just erase INLINE from those functions?
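
To illustrate the switch-to-LUT conversion mentioned above, here is a rough sketch. The sources `src_a`, `src_b`, `src_c`, `src_zero` and both function names are hypothetical; the real SET_*_INPUT() helpers pick among many more combiner inputs.

```c
#include <stdint.h>

/* Hypothetical input sources, standing in for the combiner's real inputs. */
static int32_t src_a = 10, src_b = 20, src_c = 30, src_zero = 0;

/* switch-based selector, similar in shape to the SET_*_INPUT() helpers */
void select_switch(int32_t **out, unsigned code)
{
    switch (code & 0x3) {
    case 0:  *out = &src_a;    break;
    case 1:  *out = &src_b;    break;
    case 2:  *out = &src_c;    break;
    default: *out = &src_zero; break;
    }
}

/* LUT-based selector: one indexed load instead of a branch or jump table */
void select_lut(int32_t **out, unsigned code)
{
    static int32_t *const table[4] = { &src_a, &src_b, &src_c, &src_zero };

    *out = table[code & 0x3];
}
```

The LUT body is small enough that a compiler is more likely to accept an inline hint for it, though that decision is still up to the compiler.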

Quote:
Originally Posted by HatCat View Post
With or without intrinsics, all the spans render methodology is still in n64video.c and feeds from huge heaps of global data administered by draw_triangle, so all of that stuff is needed.
I just wanted to convert code like
Code:
d_rgba_deh[0] = d_rgba_de[0] & ~0x000001FF;
d_rgba_deh[1] = d_rgba_de[1] & ~0x000001FF;
d_rgba_deh[2] = d_rgba_de[2] & ~0x000001FF;
d_rgba_deh[3] = d_rgba_de[3] & ~0x000001FF;
d_stwz_deh[0] = d_stwz_de[0] & ~0x000001FF;
d_stwz_deh[1] = d_stwz_de[1] & ~0x000001FF;
d_stwz_deh[2] = d_stwz_de[2] & ~0x000001FF;
d_stwz_deh[3] = d_stwz_de[3] & ~0x000001FF;
to SSE. At the very least, it should save space since it takes fewer instructions.

Anyway, since my current goal is to simply become better at writing optimized code, I want to see results so that I know I've accomplished something. That being said, would you mind telling me which functions I should focus on?
  #1309  
Old 1st August 2014, 04:55 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by RPGMaster View Post
I just wanted to convert code like
Code:
d_rgba_deh[0] = d_rgba_de[0] & ~0x000001FF;
d_rgba_deh[1] = d_rgba_de[1] & ~0x000001FF;
d_rgba_deh[2] = d_rgba_de[2] & ~0x000001FF;
d_rgba_deh[3] = d_rgba_de[3] & ~0x000001FF;
d_stwz_deh[0] = d_stwz_de[0] & ~0x000001FF;
d_stwz_deh[1] = d_stwz_de[1] & ~0x000001FF;
d_stwz_deh[2] = d_stwz_de[2] & ~0x000001FF;
d_stwz_deh[3] = d_stwz_de[3] & ~0x000001FF;
to SSE. At the very least, it should save space since it takes fewer instructions.
Actually if you look at MSVC 2013 output I think you'll find that this already was SSE code.

Code:
; Line 1583
	movdqu	xmm7, XMMWORD PTR _d_rgba_de$[esp+320]
; Line 1587
	movdqu	xmm5, XMMWORD PTR _d_stwz_de$[esp+320]
	pand	xmm7, xmm0
	pand	xmm5, xmm0
That's what that C code of mine you just quoted compiles to. :3 So there is no need to write it as explicit, hard-coded and non-portable, non-ANSI-compliant SSE code.
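
For comparison, a sketch of what the hand-written version would look like next to the plain C. The function names are made up, and the intrinsics path is only compiled when the compiler reports SSE2 support; otherwise it falls back to the scalar loop.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Plain C: an auto-vectorizing compiler may turn this into a single PAND. */
void mask_low9_scalar(int32_t dst[4], const int32_t src[4])
{
    int i;

    for (i = 0; i < 4; i++)
        dst[i] = src[i] & ~0x000001FF;
}

/* Explicit SSE2 intrinsics: same result, but tied to x86. */
void mask_low9_sse2(int32_t dst[4], const int32_t src[4])
{
#ifdef HAVE_SSE2
    const __m128i mask = _mm_set1_epi32(~0x000001FF);
    const __m128i v = _mm_loadu_si128((const __m128i *)src);

    _mm_storeu_si128((__m128i *)dst, _mm_and_si128(v, mask));
#else
    mask_low9_scalar(dst, src); /* portable fallback when SSE2 is unavailable */
#endif
}
```

If the compiler already emits `pand` for the scalar version, the intrinsics version buys nothing and only costs portability.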

Quote:
Originally Posted by RPGMaster View Post
My problem with the set_combine function is that it's storing values into the combine structure, which never gets used outside of that function. It would be better to just assign the value to a local variable and call the function right after it is assigned the value, so that the compiler only uses registers and never redundantly stores data to a memory location.
You are talking about over 20 different struct members here.
What makes you think register allocation isn't already being done to cycle between so much memory?

And if you look further, functions like SET_HELL_DAMN_ASS(***) take pointers to those data as arguments, and you can't take the address of a register, can you? So register allocation wouldn't be an accurate depiction of what's happening.
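
A small sketch of the point about addresses, using made-up names (`decode_by_pointer`, `decode_by_value`, `caller`). Once a variable's address is passed to a function that isn't inlined, the variable needs a real memory location; a value returned directly does not.

```c
/* Writes its result through a pointer: the caller's variable needs an address. */
void decode_by_pointer(unsigned *out, unsigned word)
{
    *out = (word >> 5) & 0x7;
}

/* Returns its result by value: the result can stay in a register throughout. */
unsigned decode_by_value(unsigned word)
{
    return (word >> 5) & 0x7;
}

unsigned caller(unsigned word)
{
    unsigned a; /* address is taken below, so declaring it `register` would not compile */
    unsigned b;

    decode_by_pointer(&a, word);
    b = decode_by_value(word);
    return a + b;
}
```

If the compiler does inline `decode_by_pointer`, it can often eliminate the memory location again, so the cost depends on the inlining decision.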

Quote:
Originally Posted by RPGMaster View Post
The compiler won't even inline functions like SET_SUBA_RGB_INPUT(), unless you convert the switch statements to LUTs. If the compiler doesn't inline your INLINE functions, is it better to just erase INLINE from those functions?
No, not really, because the INLINE macro still serves as a hint.

If, in the future, changes to that function make it smaller and more worth in-lining, the compiler will read in the INLINE advice and decide to in-line it. If there is a conceptual reason why we *would* in some scenarios want it in-lined, then writing INLINE is symbolic to both programmer and compiler analysis, even though in its current state, it shouldn't be in-lined.
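
For readers unfamiliar with this kind of macro, a hedged sketch of how an INLINE hint is commonly defined. This is an assumption about the general pattern, not the plugin's actual definition, and `low3` / `use_low3` are hypothetical.

```c
/* Hypothetical definition; the plugin's real INLINE macro may differ. */
#if defined(_MSC_VER)
#define INLINE __inline
#elif defined(__GNUC__)
#define INLINE __inline__
#else
#define INLINE /* no inline keyword available: the hint expands to nothing */
#endif

/* Small helper marked with the hint; the compiler may still decline to inline it. */
static INLINE unsigned low3(unsigned code)
{
    return code & 0x7;
}

unsigned use_low3(unsigned x)
{
    return low3(x) + low3(x >> 3);
}
```

Either way the program behaves the same; the macro only changes what the compiler is encouraged to do.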

Last edited by HatCat; 1st August 2014 at 04:58 AM.
  #1310  
Old 1st August 2014, 05:19 AM
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

Quote:
Originally Posted by HatCat View Post
Actually if you look at MSVC 2013 output I think you'll find that this already was SSE code.

Code:
; Line 1583
	movdqu	xmm7, XMMWORD PTR _d_rgba_de$[esp+320]
; Line 1587
	movdqu	xmm5, XMMWORD PTR _d_stwz_de$[esp+320]
	pand	xmm7, xmm0
	pand	xmm5, xmm0
That's what that C code of mine you just quoted compiles to. :3 So there is no need to write it as explicit, hard-coded and non-portable, non-ANSI-compliant SSE code.
Geez, either I looked at the assembly output of a different piece of code, or my memory is failing me ;/

Quote:
Originally Posted by HatCat View Post
You are talking about over 20 different struct members here. What makes you think register allocation isn't already being done to cycle between so much memory?
To some extent it does, but not for all of them. If you prefer to organize it with readability in mind, then I understand leaving it the way it is. If you'd prefer to go for optimal compiler output, then it should be rearranged.

Quote:
Originally Posted by HatCat View Post
And if you look further, functions like SET_HELL_DAMN_ASS(***) take pointers to those data as arguments, and you can't take the address of a register, can you? So register allocation wouldn't be an accurate depiction of what's happening.
Pushing the addresses isn't a problem. Take this piece of code, for example:
Code:
SET_MUL_RGB_INPUT(
		&combiner_rgbmul_r[0], &combiner_rgbmul_g[0], &combiner_rgbmul_b[0],
		combine.mul_rgb0);
//assembly output
17D4ACDF  push        dword ptr ds:[18352E28h]  
17D4ACE5  mov         edx,183D3858h  
17D4ACEA  mov         ecx,183DC050h  
17D4ACEF  push        183DDD20h  
17D4ACF4  call        SET_MUL_RGB_INPUT (17D334A0h)
Since the compiler doesn't seem to arrange the code optimally (although other, more advanced compilers may do so), combine.mul_rgb0 was pushed straight from memory instead of from a register. I'm surprised the compiler pushes a memory operand instead of moving the value to a register and pushing the register. I heard pushing from memory is extra slow, but I guess the compiler felt that size was more important. Also, I don't like how it pushed &combiner_rgbmul_r[0] onto the stack when it could have just been moved to eax. I feel like these problems add up, so that's why I like to do everything I can, no matter how small it is.

Quote:
Originally Posted by HatCat View Post
No, not really, because the INLINE macro still serves as a hint.

If, in the future, changes to that function make it smaller and more worth in-lining, the compiler will read in the INLINE advice and decide to in-line it. If there is a conceptual reason why we *would* in some scenarios want it in-lined, then writing INLINE is symbolic to both programmer and compiler analysis, even though in its current state, it shouldn't be in-lined.
Oh ok. That makes sense now.


I should start benchmarking code more. I think that will help me understand things better.

I forgot to elaborate on some redundancy.
Code:
combine.add_a1     = (cmd_data[cmd_cur + 0].UW32[1] & 0x00000007) >> 0;
You see how it does & 0x7, yet the function also does (code & 0x7) for the switch statement.
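
A tiny sketch of why the second mask is redundant: applying `& 0x7` twice gives the same result as applying it once. The names below are hypothetical, with `callee_masks` standing in for a SET_*_INPUT() helper.

```c
#include <stdint.h>

/* Hypothetical callee mirroring a helper that masks its argument itself. */
unsigned callee_masks(unsigned code)
{
    return code & 0x7; /* mask applied inside the function */
}

/* Caller that also masks before the call, as in the quoted line. */
unsigned double_masked(uint32_t word)
{
    unsigned field = (word & 0x00000007) >> 0;

    return callee_masks(field);
}

/* Caller that leaves the masking to the callee. */
unsigned single_masked(uint32_t word)
{
    return callee_masks((unsigned)word);
}
```

For the fields that sit higher up in the word, the shift is still needed; only the duplicated AND can go.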

Last edited by RPGMaster; 1st August 2014 at 08:37 AM.