#131  
Old 24th June 2014, 10:47 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

bump..................

Quote:
Originally Posted by RPGMaster View Post
Kinda odd it's like 6 VI/s slower than your build.
orly? Might be the port to the plugin or just stuff cen64 didn't do that I did.
  #132  
Old 25th June 2014, 12:03 AM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

Quote:
Originally Posted by HatCat View Post
orly? Might be the port to the plugin or just stuff cen64 didn't do that I did.
Ya, you've prolly gone a long way and I just didn't notice. I do admit, I don't remember getting 40 VI/s in SSB64 with filters off until recently. Although part of it is because of the RSP recompiler and 1964. Still, I can already do 60 if I strip backgrounds. You've proven to me the importance of the overall algorithm, rather than micro-optimizing everything xD. I do like to periodically revise my programs and optimize along the way.

I wonder how fast an HLE implementation that didn't reduce accuracy would be.

Sigh, idk about using z64gl anymore. I honestly felt like angrylion's code was easier to follow. Plus I have no idea how to fix the bugs in z64gl ;/ . I'll have to learn a lot about APIs before I can tweak z64gl and most other gfx plugins.
  #133  
Old 9th July 2014, 06:52 AM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

I'm wondering what to do lol, seeing how your version 3 is faster than the cen64 fork, despite the fact that there's a good amount of optimizations done in the cen64 fork. Do you guys even think we could get considerable speed gains with SSE code? Maybe I'll try profiling again.

I think it'd be awesome basing an HLE plugin off of an accurate plugin! I'd like to make a plugin similar to z64gl, but also with an HLE implementation. I really need to learn some APIs though.
  #134  
Old 9th July 2014, 02:19 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

The CEN64 plugin is mostly constrained by the RSP/VR4300 simulation.

Quote:
Originally Posted by RPGMaster View Post
Do you guys even think we could get considerable speed gains with SSE code? Maybe I'll try profiling again.
I've told Mr. Cat my views on this before. The SSE in CEN64 isn't all that optimized because it's constantly pulling values out from memory, operating on them once, and then flushing them back to memory. I saw a spot where multiplies were being done and could be trivially vectorized, for example, and wrote a quick snippet of code to do it. It releases pressure on the decoders and other microprocessor resources due to the compaction of the instructions, but doesn't offer much more than that.

SSE, really, yields the best results when you load a bunch of data from memory (x86_64 has 16 %xmm registers) and operate on those registers as much as humanly possible before pushing them back out to memory. In order to optimize the plugin, you really need to understand what all the algorithms are doing (which I do not) and rewrite portions to be vector friendly.
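
A minimal sketch of the "load once, work in registers, store once" pattern described above. The function name and the arithmetic are hypothetical and are not taken from CEN64 or the plugin; the scalar branch is only there so the sketch also builds where SSE2 is unavailable.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical kernel: out[i] = max(a[i] * b[i] + c[i], 0) on 8 signed
 * 16-bit lanes. Arithmetic wraps at 16 bits, as the SSE2 instructions do. */
void mul_add_clamp8(int16_t out[8], const int16_t a[8],
                    const int16_t b[8], const int16_t c[8])
{
#if defined(__SSE2__)
    /* Load everything up front so the loads can overlap. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);

    /* Three operations back to back, with no trip through memory. */
    __m128i v = _mm_mullo_epi16(va, vb);
    v = _mm_add_epi16(v, vc);
    v = _mm_max_epi16(v, _mm_setzero_si128());

    /* One store at the very end. */
    _mm_storeu_si128((__m128i *)out, v);
#else
    /* Scalar fallback with the same wrapping behaviour. */
    int i;
    for (i = 0; i < 8; i++) {
        uint16_t t = (uint16_t)((unsigned)(uint16_t)a[i] * (uint16_t)b[i]
                                + (uint16_t)c[i]);
        int16_t s = (int16_t)t;
        out[i] = s < 0 ? 0 : s;
    }
#endif
}
```

The version being criticized would store the result of each step to an array and reload it for the next one; here the intermediate values never leave the register.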

By vectorizing the plugin, you could load up all the xmms you might possibly need before the specific RDP function and allow the loads to be serviced while the indirect jump and tail end of the RDP decoding mechanism are doing their business. Not to mention it would get rid of all that bloated memcpy code everywhere. This is just one example of the benefits of fully vectorizing it. But it's not an overnight, trivial task.
  #135  
Old 9th July 2014, 04:02 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

memcpy is one of the things that gets optimized to SSE2 by the compiler (like you once told me so long ago, "Let the compiler flow through you." ). The whole problem with angrylion's use of memcpy wasn't that he wrote "memcpy" instead of SSE intrinsics like you'd prefer; it was that it was copying command buffer data for the RDP operations to memory, to a second batch of memory, and then into variables anyway. Even with SSE/vectorization rewrites, that would still be redundant. So I unified memory access between RDP command storage and the edge walker data into a simple union type to remove that redundancy.
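
A rough sketch of what a union like that can look like. The type name, field names and four-word layout are assumptions made for illustration; the real RDP triangle command packs its fields differently, and this is not the plugin's actual `cmd_data` definition.

```c
#include <stdint.h>

/* Hypothetical layout: four whole 32-bit words, not the real RDP bit packing. */
typedef union {
    uint32_t W[4];              /* raw words, as the command loop writes them */
    struct {
        int32_t xl, dxldy;      /* named view of the very same storage */
        int32_t xh, dxhdy;
    } edge;
} cmd_view;

/* The two views must cover exactly the same bytes. */
_Static_assert(sizeof(cmd_view) == 4 * sizeof(uint32_t), "unexpected padding");

/* The consumer reads the fields in place: no staging array, no memcpy. */
int32_t span_width(const cmd_view *cmd)
{
    return cmd->edge.xh - cmd->edge.xl;
}
```

Once the command loop has written `W[0]` through `W[3]`, a call such as `span_width(&cmd)` works on that storage directly, which is the redundancy being removed.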

Similarly, memset is almost always called with a fill value of 0 in angrylion's code, so it was extremely easy for the compiler to generate SSE2 code to zero an xmm and mov it into memory a few times, so there wasn't much reason to use SSE intrinsics because memset(x, 0, const) was pretty easy for the compiler to figure out.

Last edited by HatCat; 9th July 2014 at 05:37 PM.
  #136  
Old 9th July 2014, 06:35 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

Thanks for the reassurance guys. I did feel like it was incomplete, based on what I saw. I'd have to go back and find the old angrylion plugin that it was based on to see the difference. I did like how a lot of switch statements got replaced by LUTs.
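
For anyone unfamiliar with the switch-to-LUT change mentioned above, here is a small sketch. The size-code-to-bit-depth mapping and both function names are illustrative assumptions, not code from either plugin.

```c
/* Branching version: one case per 2-bit size code. */
unsigned bits_per_pixel_switch(unsigned size_code)
{
    switch (size_code & 3u) {
    case 0:  return 4;
    case 1:  return 8;
    case 2:  return 16;
    default: return 32;
    }
}

/* Table version: the same mapping as a single indexed load. */
static const unsigned char bpp_lut[4] = { 4, 8, 16, 32 };

unsigned bits_per_pixel_lut(unsigned size_code)
{
    return bpp_lut[size_code & 3u];
}
```

Compilers often turn a dense switch like this into a table on their own, so the gain from doing it by hand depends on the compiler and on how irregular the cases are.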

I think you'll have to reorganize some variables too, to make it work better with SSE.

I honestly don't see a problem with memcpy, but then again the output is dependent on the compiler. I do like to avoid memset though, since it is usually a function call, unless it's only a small amount of data being set. For applications that require SSE, I'd rather do my own implementation than use memset, simply because of the runtime checks it does.

I've always wanted to try coding for x64, but never had a real opportunity for that yet. I love the fact that you get more registers. It doesn't make sense how people make x64 programs that are slower than their x86 equivalent.
  #137  
Old 9th July 2014, 06:56 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by HatCat View Post
The whole problem with angrylion's use of memcpy wasn't that he wrote "memcpy" instead of SSE intrinsics like you'd prefer; it was that it was copying command buffer data for the RDP operations to memory, to a second batch of memory, and then into variables anyway. Even with SSE/vectorization rewrites, that would still be redundant. So I unified memory access between RDP command storage and the edge walker data into a simple union type to remove that redundancy.
I'm not very good at English, just C. What I meant was JD's/angrylion's (and, by extension, CEN64's), had a separate memcpy for every different type of RDP instruction. You can hoist up the memcpy before the indirect branch (for starters). Not only does it give the load units a head start on fetching the data, but the data gets to all the functions for free in registers instead of an array on the stack or in BSS.
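
A sketch of the hoisting idea, under stated assumptions: the handler names, the two-entry table and the four-word argument block are all hypothetical. Whether the arguments really travel in registers depends on the ABI; the System V x86-64 ABI can pass a 16-byte struct like this in two integer registers, while other ABIs may pass a pointer instead.

```c
#include <stdint.h>
#include <string.h>

/* Small argument block passed by value to every handler. */
typedef struct { uint32_t w[4]; } cmd_args;

typedef uint32_t (*cmd_handler)(cmd_args args);

/* Two stand-in handlers; real RDP commands would rasterize something. */
static uint32_t handler_sum(cmd_args args)
{
    return args.w[0] + args.w[1] + args.w[2] + args.w[3];
}

static uint32_t handler_first(cmd_args args)
{
    return args.w[0];
}

static const cmd_handler handlers[2] = { handler_sum, handler_first };

uint32_t dispatch(const uint32_t *cmd_buffer, unsigned opcode)
{
    cmd_args args;

    /* Single copy, issued before the indirect call rather than in each handler. */
    memcpy(&args, cmd_buffer, sizeof args);
    return handlers[opcode & 1u](args);
}
```

The copy now appears once, ahead of the indirect call, instead of being repeated at the top of every handler.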

Also:

I'm not asking him or anyone to rewrite memcpy using SSE. Rather, I'm trying to say that you should fill up SSE registers (where the memcpys currently are) and don't spit data back out of the SSE registers until you've done some actual data processing on them (i.e., the edgewalker_for_prims and whatnot should take a crapton of __m128is as arguments).

Last edited by MarathonMan; 9th July 2014 at 06:59 PM.
  #138  
Old 22nd July 2014, 06:08 AM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

Lol my mistake. I didn't realize the port was unfinished. What I did was mostly look at the source code of the 1st file posted in this thread, then after realizing the plugin didn't work, I quickly downloaded the 2nd one and compiled that.

I started looking at the 2nd file's source and saw that a good amount of the code was changed back to the original.

So much work to be done lol.
  #139  
Old 22nd July 2014, 04:19 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Quote:
Originally Posted by MarathonMan View Post
JD's/angrylion's
Pretty much just angrylion's...the whole edgewalker unit and command buffering system was interpreted from the holy repo, not MAME.

Quote:
Originally Posted by MarathonMan View Post
(and, by extension, CEN64's), had a separate memcpy for every different type of RDP instruction. You can hoist up the memcpy before the indirect branch (for starters).
Not every different type of RDP instruction, only the RDP triangle commands or anything drawing [textured] rectangles for calling render_spans* with that data.

Anyway there's no need to do that either, because to rephrase what I was just saying before, the data was already written by the core RDP interpreter loop, but is redundantly memcpy'd to a secondary buffer...using SSE or XMM registers like you're saying hardly makes it less redundant. Really, it makes more sense just to grab the data straight from the original command buffer in the first place. No more ewdata[] arrays, and no added copying to other things like SSE registers.

Quote:
Originally Posted by MarathonMan View Post
Rather, I'm trying to say that you should fill up SSE registers (where the memcpys currently are) and don't spit data back out of the SSE registers until you've done some actual data processing on them (i.e., the edgewalker_for_prims and whatnot should take a crapton of __m128is as arguments).
Well if you're only targeting 64-bit, then you would have up to 16 XMM registers across function calls. Anything past that would just be the same as using a global C array soo....

Anyway, all my RDP command functions are static void func(void).

There is no more edgewalker_for_prims function. The only use for edgewalker_for_prims, if you haven't seen the pattern by now, is the triangle RDP commands, so I renamed it to draw_triangle():
Code:
static NOINLINE void draw_triangle(int shade, int texture, int zbuffer)
{
    static unsigned char triangle_count;
    register int base;
    int lft, level, tile;
    s32 yl, ym, yh; /* triangle edge y-coordinates */
    s32 xl, xh, xm; /* triangle edge x-coordinates */
    s32 DxLDy, DxHDy, DxMDy; /* triangle edge inverse-slopes */
    int tilenum, flip;

    i32 rgba[4]; /* RGBA color components */
    i32 d_rgba_dx[4]; /* RGBA delta per x-coordinate delta */
    i32 d_rgba_de[4]; /* RGBA delta along the edge */
    i32 d_rgba_dy[4]; /* RGBA delta per y-coordinate delta */
    i16 rgba_int[4], rgba_frac[4];
    i16 d_rgba_dx_int[4], d_rgba_dx_frac[4];
    i16 d_rgba_de_int[4], d_rgba_de_frac[4];
    i16 d_rgba_dy_int[4], d_rgba_dy_frac[4];

    i32 stwz[4];
    i32 d_stwz_dx[4];
    i32 d_stwz_de[4];
    i32 d_stwz_dy[4];
    i16 stwz_int[4], stwz_frac[4];
    i16 d_stwz_dx_int[4], d_stwz_dx_frac[4];
    i16 d_stwz_de_int[4], d_stwz_de_frac[4];
    i16 d_stwz_dy_int[4], d_stwz_dy_frac[4];

    i32 d_rgba_dxh[4];
    i32 d_stwz_dxh[4];
    i32 d_rgba_diff[4], d_stwz_diff[4];
    i32 xlr[2], xlr_inc[2];
    u8 xfrac;
#ifdef USE_SSE_SUPPORT
    __m128i xmm_d_rgba_de, xmm_d_stwz_de;
#endif
    int sign_dxhdy;
    int ycur, ylfar;
    int yllimit, yhlimit;
    int ldflag;
    int invaly;
    int curcross;
    int allover, allunder, curover, curunder;
    int allinval;
    register int j, k;
    const i32 clipxlshift = clip.xl << 1;
    const i32 clipxhshift = clip.xh << 1;
Here all the SSE work is done locally to this NOINLINE function. The 3 int args to this function are Booleans signaling whether to skip some or all of the SIMD work (SSE/MMX) of a Z-buffered, textured and shaded triangle, where certain RDP commands don't use that data.

It grabs the data straight from my global `cmd_data' union, which makes data fetching so much easier than a normal UINT32 array.
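
To illustrate how `void func(void)` command handlers can share one worker that takes three Boolean flags, here is a minimal sketch. Every name in it is hypothetical, the stage split is a guess, and the worker only records which stages it would have run instead of doing any rasterization.

```c
/* Bit flags recording which stages the sketch executed (illustration only). */
enum { DID_EDGES = 1, DID_SHADE = 2, DID_TEXTURE = 4, DID_ZBUFFER = 8 };

static unsigned stages_run;

/* Stand-in for the shared worker: each flag gates one block of setup work. */
static void draw_triangle_sketch(int shade, int texture, int zbuffer)
{
    stages_run = DID_EDGES;             /* edge setup is always needed */
    if (shade)
        stages_run |= DID_SHADE;        /* color deltas */
    if (texture)
        stages_run |= DID_TEXTURE;      /* texture coordinate deltas */
    if (zbuffer)
        stages_run |= DID_ZBUFFER;      /* depth deltas */
}

/* Command entry points keep the uniform void(void) signature. */
static void tri_noshade(void)     { draw_triangle_sketch(0, 0, 0); }
static void tri_shade_z(void)     { draw_triangle_sketch(1, 0, 1); }
static void tri_tex_shade_z(void) { draw_triangle_sketch(1, 1, 1); }
```

Because each wrapper passes constants, a compiler that does inline or clone the worker can drop the untaken branches; with the worker kept out of line, the flags are simply cheap runtime checks.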
  #140  
Old 5th August 2014, 06:40 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

HatCat, iirc, the compiler doesn't automatically align those int32 arrays you have to 16 bytes. Do you plan on using compiler-specific features to align them?
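
One way to request that alignment is sketched below, assuming a C11 compiler; the struct and helper names are hypothetical. Older toolchains would need the compiler-specific spellings instead, such as `__attribute__((aligned(16)))` on GCC or `__declspec(align(16))` on MSVC.

```c
#include <stdalign.h>
#include <stdint.h>

/* Each array is exactly 16 bytes, so it fills one XMM register. */
typedef struct {
    alignas(16) int32_t rgba[4];
    alignas(16) int32_t d_rgba_dx[4];
} aligned_deltas;

/* Returns nonzero when p sits on a 16-byte boundary. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

With that guarantee in place, aligned loads such as `_mm_load_si128` become safe on those arrays; without it, only the unaligned variants are.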