#131
bump..................
orly? Might be the port to the plugin or just stuff cen64 didn't do that I did.
__________________
http://theoatmeal.com/comics/cat_vs_internet
#132
Quote:
I wonder how fast an HLE implementation that didn't reduce accuracy would be. Sigh, idk about using z64gl anymore. I honestly felt like angrylion's code was easier to follow. Plus I have no idea how to fix the bugs in z64gl ;/ I'll have to learn a lot about APIs before I can tweak z64gl and most other gfx plugins.
#133
I'm wondering what to do lol, seeing how your version 3 is faster than the cen64 fork despite the good amount of optimizations done in the cen64 fork. Do you guys even think we could get considerable speed gains with SSE code? Maybe I'll try profiling again.
I think it'll be awesome basing an HLE plugin off of an accurate plugin! I'd like to make a plugin similar to z64gl, but also with an HLE implementation. I really need to learn some APIs though.
#134
The CEN64 plugin is mostly constrained by the RSP/VR4300 simulation.
Quote:
SSE, really, yields the best results when you load a bunch of data from memory (x86_64 has 16 %xmm registers) and operate on those registers as much as humanly possible before pushing them back out to memory. In order to optimize the plugin, you really need to understand what all the algorithms are doing (which I do not) and rewrite portions to be vector friendly. By vectorizing the plugin, you could load up all the xmms you might possibly need before the specific RDP function and allow the loads to be serviced while the indirect jump and tail end of the RDP decoding mechanism are doing their business. Not to mention it would get rid of all that bloated memcpy code everywhere. This is just one example of the benefits of fully vectorizing it. But it's not an overnight, trivial task.
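To make the "load, operate as much as possible, then store" idea concrete, here is a minimal sketch. The helper `step_channels` and its parameters are made up for illustration and are not taken from the plugin; only the SSE2 intrinsics are real.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/*
 * Hypothetical helper (not from the plugin): advance four 32-bit channels,
 * e.g. R, G, B and A, by the same per-step delta `steps` times.
 */
static void step_channels(int32_t value[4], const int32_t delta[4], int steps)
{
    assert(steps >= 0);

    /* Load once. Unaligned loads, since nothing here guarantees alignment. */
    __m128i v = _mm_loadu_si128((const __m128i *)value);
    const __m128i d = _mm_loadu_si128((const __m128i *)delta);

    /* All of the arithmetic stays in xmm registers. */
    for (int i = 0; i < steps; i++)
        v = _mm_add_epi32(v, d);

    /* Store once, at the very end. */
    _mm_storeu_si128((__m128i *)value, v);
}
```

Calling `step_channels(rgba, d_rgba_dx, n)` touches memory twice on the way in and once on the way out, no matter how many steps run in between. Whether that pattern pays off in the real plugin would still need profiling.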
#135
memcpy is one of the things that gets optimized to SSE2 by the compiler (like you once told me so long ago, "Let the compiler flow through you.")
Similarly, memset is almost always called with a fill value of 0 in angrylion's code, so it was extremely easy for the compiler to generate SSE2 code to zero an xmm and mov it into memory a few times. There wasn't much reason to use SSE intrinsics, because memset(x, 0, const) was pretty easy for the compiler to figure out.
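For reference, the memset(x, 0, const) pattern looks roughly like the sketch below. The `span_state` struct and `clear_span` are hypothetical names, not angrylion's. With optimization enabled on an SSE2 target, compilers such as GCC typically inline a call like this into a couple of vector stores, but that is worth confirming in the generated assembly rather than taking on faith.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-span state; the field names are made up for illustration. */
typedef struct {
    int32_t rgba[4];
    int32_t stwz[4];
} span_state;

static void clear_span(span_state *s)
{
    assert(s != NULL);

    /* Fill value 0 and a compile-time-constant size (32 bytes here),
       which is the easy case for the compiler to expand inline. */
    memset(s, 0, sizeof *s);
}
```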
Last edited by HatCat; 9th July 2014 at 05:37 PM.
#136
Thanks for the reassurance guys. I did feel like it was incomplete, based on what I saw. I'd have to go back and find the old angrylion plugin that it was based on to see the difference. I did like how a lot of switch statements got replaced by LUTs.
I think you'll have to reorganize some variables too, to make it work better with SSE. I honestly don't see a problem with memcpy, but then again the output is dependent on the compiler. I do like to avoid memset though, since it is usually a function call, unless it's only a small amount of data being set. For applications that require SSE, I'd rather do my own implementation than use memset, simply because of the runtime checks it does. I've always wanted to try coding for x64, but have never had a real opportunity for that yet. I love the fact that you get more registers. It doesn't make sense how people make x64 programs that are slower than their x86 equivalent.
#137
Quote:
Also: I'm not telling him or anyone else to rewrite memcpy using SSE. Rather, I'm trying to say that you should fill up SSE registers (where the memcpys currently are) and not spit data back out of the SSE registers until you've done some actual data processing on them (i.e., the edgewalker_for_prims and whatnot should take a crapton of __m128is as arguments).
Last edited by MarathonMan; 9th July 2014 at 06:59 PM.
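A rough sketch of what "take __m128is as arguments" could look like: small helpers that accept and return `__m128i`, so intermediate results never go back to memory between steps. Every function name here is hypothetical, and the 16.16 fixed-point layout is an assumption made for the example, not something confirmed about the plugin.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Integer part of four signed 16.16 fixed-point values (assumed format). */
static inline __m128i fixed_int_part(__m128i v)
{
    return _mm_srai_epi32(v, 16);
}

/* Fractional part (low 16 bits) of four 16.16 fixed-point values. */
static inline __m128i fixed_frac_part(__m128i v)
{
    return _mm_and_si128(v, _mm_set1_epi32(0xFFFF));
}

/* Hypothetical caller: one load per input, helpers work on registers,
   and results are stored only once everything has been computed. */
static void step_and_split(const int32_t value[4], const int32_t delta[4],
                           int32_t out_int[4], int32_t out_frac[4])
{
    assert(value && delta && out_int && out_frac);

    const __m128i v = _mm_add_epi32(_mm_loadu_si128((const __m128i *)value),
                                    _mm_loadu_si128((const __m128i *)delta));

    _mm_storeu_si128((__m128i *)out_int, fixed_int_part(v));
    _mm_storeu_si128((__m128i *)out_frac, fixed_frac_part(v));
}
```

Since the helpers are `static inline` and pass vectors by value, the compiler is free to keep everything in xmm registers across the calls.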
#138
Lol my mistake. I didn't realize the port was unfinished. What I did was mostly look at the source code of the 1st file posted in this thread, then after realizing the plugin didn't work, I quickly downloaded the 2nd one and compiled that.
I started looking at the 2nd file's source and saw that a good amount of the code was changed back to the original. So much work to be done lol.
#139
Pretty much just angrylion's...the whole edgewalker unit and command buffering system was interpreted from the holy repo, not MAME.
Quote:
Anyway there's no need to do that either, because to rephrase what I was just saying before, the data was already written by the core RDP interpreter loop, but is redundantly memcpy'd to a secondary buffer...using SSE or XMM registers like you're saying hardly makes it less redundant. Really, it makes more sense just to grab the data straight from the original command buffer in the first place. No more ewdata[] arrays, and no added copying to other things like SSE registers.
Quote:
Anyway, all my RDP command functions are static void func(void). There is no more edgewalker_for_prims function. The only use for edgewalker_for_prims, if you haven't seen the pattern by now, is the triangle RDP commands, so I renamed it to void triangle():
Code:
    static NOINLINE void draw_triangle(int shade, int texture, int zbuffer)
    {
        static unsigned char triangle_count;
        register int base;
        int lft, level, tile;
        s32 yl, ym, yh; /* triangle edge y-coordinates */
        s32 xl, xh, xm; /* triangle edge x-coordinates */
        s32 DxLDy, DxHDy, DxMDy; /* triangle edge inverse-slopes */
        int tilenum, flip;

        i32 rgba[4]; /* RGBA color components */
        i32 d_rgba_dx[4]; /* RGBA delta per x-coordinate delta */
        i32 d_rgba_de[4]; /* RGBA delta along the edge */
        i32 d_rgba_dy[4]; /* RGBA delta per y-coordinate delta */
        i16 rgba_int[4], rgba_frac[4];
        i16 d_rgba_dx_int[4], d_rgba_dx_frac[4];
        i16 d_rgba_de_int[4], d_rgba_de_frac[4];
        i16 d_rgba_dy_int[4], d_rgba_dy_frac[4];

        i32 stwz[4];
        i32 d_stwz_dx[4];
        i32 d_stwz_de[4];
        i32 d_stwz_dy[4];
        i16 stwz_int[4], stwz_frac[4];
        i16 d_stwz_dx_int[4], d_stwz_dx_frac[4];
        i16 d_stwz_de_int[4], d_stwz_de_frac[4];
        i16 d_stwz_dy_int[4], d_stwz_dy_frac[4];

        i32 d_rgba_dxh[4];
        i32 d_stwz_dxh[4];
        i32 d_rgba_diff[4], d_stwz_diff[4];
        i32 xlr[2], xlr_inc[2];
        u8 xfrac;
    #ifdef USE_SSE_SUPPORT
        __m128i xmm_d_rgba_de, xmm_d_stwz_de;
    #endif
        int sign_dxhdy;
        int ycur, ylfar;
        int yllimit, yhlimit;
        int ldflag;
        int invaly;
        int curcross;
        int allover, allunder, curover, curunder;
        int allinval;
        register int j, k;
        const i32 clipxlshift = clip.xl << 1;
        const i32 clipxhshift = clip.xh << 1;
It grabs the data straight from my global `cmd_data' union, which makes data fetching so much easier than a normal UINT32 array.
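To illustrate why a union can make fetching easier than shifting and masking a 32-bit array, here is a minimal sketch. The `cmd_word` type below is a hypothetical stand-in, not the real layout of `cmd_data`, and the halfword indexing assumes a little-endian host such as x86.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit command word; NOT the plugin's actual `cmd_data`. */
typedef union {
    uint64_t d;    /* whole command word */
    uint32_t w[2]; /* two 32-bit words */
    uint16_t h[4]; /* four 16-bit halfwords */
    uint8_t  b[8]; /* eight bytes */
} cmd_word;

/* Array style: fetch the low 16 bits of word i with a mask. */
static uint32_t low_half_by_mask(const uint32_t *words, int i)
{
    return words[i] & 0xFFFFu;
}

/* Union style: read the same 16 bits directly as a halfword.
   The index 2*i is only correct on a little-endian host. */
static uint32_t low_half_by_union(const cmd_word *c, int i)
{
    assert(i == 0 || i == 1);
    return c->h[2 * i];
}
```

Both functions return the same value on x86. The union version simply moves the shifting and masking into the type definition.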
#140
HatCat, IIRC the compiler doesn't automatically align those int32 arrays of yours to 16 bytes. Do you plan on using compiler-specific features to align them?
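For what it's worth, here is a small sketch of forcing 16-byte alignment on a local array. It uses C11 `alignas` from `<stdalign.h>` instead of a compiler-specific extension; the rough equivalents are `__attribute__((aligned(16)))` on GCC and `__declspec(align(16))` on MSVC. The function names are hypothetical. The alignment matters because aligned SSE loads such as `_mm_load_si128` require a 16-byte-aligned address.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Returns nonzero when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical example: a local int32 array forced to 16-byte alignment. */
static int32_t sum_aligned(void)
{
    alignas(16) int32_t rgba[4] = {1, 2, 3, 4};

    /* Without alignas, only the natural 4-byte alignment is guaranteed. */
    assert(is_aligned16(rgba));

    return rgba[0] + rgba[1] + rgba[2] + rgba[3];
}
```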