#211
What I eventually want to do with the RSP plugin is write concise inline functions that map well to SSE, and then write efficient non-SSE replacements, like [https://github.com/tj90241/cen64-rsp...ter/CP2.c#L74].
It's the prologue (whoops, I said prologue in my last post when I meant epilogue) to a lot of those blasted RSP instructions, where the element value determines which bytes of one of the source vectors get broadcast into the actual operation. The "pure C" implementation is just a loop of the form for (i = 0; i < n; i++) ... = ...;
That way, you get the best of both worlds, and it's amenable to future vectorized architectures -- if you don't have SSE, or don't want to include it, the code simply gets compiled out. I don't like the idea of lock-in, but I can't look away from the performance when I need every last bleeding cycle for simulation.
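A minimal sketch of that pattern, assuming a vector register is modeled as eight 16-bit lanes. The helper name `broadcast_lane` and the `USE_SSE` macro are made up for this illustration and are not taken from cen64-rsp, and the sketch only covers the case where a single lane is broadcast to the whole vector.

```c
#include <assert.h>
#include <stdint.h>

/* USE_SSE is a hypothetical build switch: leave it undefined and the
   intrinsic path is compiled out, leaving only the plain C loop. */
#if defined(USE_SSE) && defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy lane `lane` (0..7) of src into all 8 lanes of dst. */
static inline void broadcast_lane(int16_t dst[8], const int16_t src[8], unsigned lane)
{
    assert(lane < 8);
    const int16_t value = src[lane];   /* read once so dst may alias src */

#if defined(USE_SSE) && defined(__SSE2__)
    /* One SSE2 splat plus an unaligned 128-bit store. */
    _mm_storeu_si128((__m128i *)dst, _mm_set1_epi16(value));
#else
    /* "Pure C" fallback: the loop a vectorizing compiler can still pick up. */
    for (int i = 0; i < 8; i++)
        dst[i] = value;
#endif
}
```

Both paths write the same eight values, so callers do not need to know which one was compiled in.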
#212
Quote:
When I use GCC, and only GCC (not MSVS), to compile my RSP plugin with
#define PARALLELIZE_VECTOR_TRANSFERS
#define EMULATE_VECTOR_RESULT_BUFFER // zilmar does this; I have it off by default
both defined, it generates SSSE3 code for me on VAND, VOR, VXOR, VNAND, VNOR, VNXOR. And I think on some of the other opcodes like the adds (VADD components), but not too many others. So your idea was already implemented by me.
Obviously though I cannot run SSSE3 on my machine, so I didn't release that version of my RSP dll into my thread as I can't test it...somebody else can compile it themselves in the meantime.
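For anyone wondering what kind of loop GCC can turn into vector code here, a rough sketch follows. The function name `vand_lanes`, the `LANES` macro and the local result buffer are stand-ins for illustration, not the plugin's actual source; whether GCC really vectorizes it depends on the optimization level and target flags (for example `-O3 -mssse3`).

```c
#include <stdint.h>
#include <string.h>

#define LANES 8  /* one RSP vector register modeled as eight 16-bit lanes */

/* Hypothetical VAND-style op: vd = vs & vt, lane by lane.
   The result goes through a temporary buffer first, so vd may alias
   vs or vt without one lane's write disturbing a later lane's read. */
void vand_lanes(int16_t vd[LANES], const int16_t vs[LANES], const int16_t vt[LANES])
{
    int16_t result[LANES];

    /* Fixed trip count and no cross-lane dependency: a good
       candidate for the compiler's auto-vectorizer. */
    for (int i = 0; i < LANES; i++)
        result[i] = (int16_t)(vs[i] & vt[i]);

    memcpy(vd, result, sizeof(result));
}
```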
__________________
http://theoatmeal.com/comics/cat_vs_internet
#213
Hmmm interesting... I'll have to look at the assembly it generates and see how it compares.
#214
Quote:
Quote:
#215
Quote:
The shift-only equivalent of that, in that case, would be ((var >> 8) << 2) == ((var >> 6) & ~3). And AND-masking is probably faster than shifting right.
Then again...we have a special case here. Shifting right by 8 bits means you could do a byte-indexed fetch using movzbl (movzx eax, dh in Intel syntax), like I said earlier. After all, like I said, you have to do a MOV regardless. If it MOVs into a register first and then shifts right by 8, it would be more direct to just movzx eax, dh than to mov eax, edx; shr eax, 8. But like you said, the manual says DO NOT DO, so tl;dw, not that important.
To be consistent with the rest of your code, which does not shift right by 6 or 8, I would probably keep the AND-mask method like you did rather than use one potential micro-optimization in a special case to >> 8, << 2. So that would be an example where I might deliberately sacrifice code efficiency/performance micro-boosts for the consistency of the code.
Quote:
Why should it insert 8 additional moves? It makes 8 moves unconditional. Right now you have it set to:
    if (!flip) { dest = +src; dest2 = +src2; dest3 = +src3; /* etc. */ }
    else { dest = -src; dest2 = -src2; dest3 = -src3; /* etc. */ }
I just made some of the code unconditional:
    dest = src; dest2 = src2; dest3 = src3; /* etc. */
    if (flip) { dest *= -1; dest2 *= -1; dest3 *= -1; /* etc. */ }
By moving the shared code outside of the if-else branch and leaving only the unique case-specific parts behind, I would think that my code should be optimized either the same, or better, but not worse? But, oh well. I'm not going to spend an entire post arguing about it.
The key thing that I wanted to prove is that if it wasn't for all those color INC key variables you had:
Code:
void render_spans_1cycle_complete(int start, int end, int tilenum, int flip)
{
    ...
    // int drinc, dginc, dbinc, dainc, dzinc, dsinc, dtinc, dwinc;
    int xinc;

    if (flip)
    {
        /*
        drinc = spans_dr;
        dginc = spans_dg;
        dbinc = spans_db;
        dainc = spans_da;
        dzinc = spans_dz;
        dsinc = spans_ds;
        dtinc = spans_dt;
        dwinc = spans_dw;
        */
        xinc = 1;
    }
    else
    {
        /*
        drinc = -spans_dr;
        dginc = -spans_dg;
        dbinc = -spans_db;
        dainc = -spans_da;
        dzinc = -spans_dz;
        dsinc = -spans_ds;
        dtinc = -spans_dt;
        dwinc = -spans_dw;
        */
        xinc = -1;
    }
    ....
Code:
void render_spans_1cycle_complete(int start, int end, int tilenum, int flip)
{
    ...
    const int xinc = flip ^ (flip - 1);
    ...
There still must be hundreds of these examples. I had no idea MarathonMan and yourself were ready to pick back about this stuff.
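Both tricks in this post are easy to sanity-check with a few lines of C. This is only a sketch: the function names are made up, `flip` is assumed to be exactly 0 or 1 (any other value breaks the XOR trick), the machine is assumed to be two's complement, and `var` is taken as unsigned so the shifts are well defined.

```c
#include <stdint.h>

/* Branch-free form from the post: 1 when flip == 1, -1 when flip == 0.
   flip == 1: 1 ^ 0 = 1.   flip == 0: 0 ^ -1 = -1 (two's complement). */
static int xinc_branchless(int flip)
{
    return flip ^ (flip - 1);
}

/* Reference form with the explicit branch, as in the first listing. */
static int xinc_branchy(int flip)
{
    return flip ? 1 : -1;
}

/* The two spellings of "take bits 8 and up, scaled by 4". */
static uint32_t shift_only(uint32_t var)
{
    return (var >> 8) << 2;
}

static uint32_t shift_and_mask(uint32_t var)
{
    return (var >> 6) & ~UINT32_C(3);
}
```

If `flip` can ever hold something other than 0 or 1, normalizing it first with `!!flip` keeps the XOR form valid.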
Last edited by HatCat; 19th June 2013 at 02:16 PM.
#216
Quote:
(As if I haven't forced that down your throat enough)
Really, though, if you look at what the compiler is probably generating in this case:
Code:
if (something_zero_or_not_zero)
Can just break down into:
    testl %eax, %eax
    je eax_is_zero
eax_is_nonzero:
    add; add; add;
    jmp prologue
eax_is_zero:
    sub; sub; sub;
prologue:
My point is, if the compiler knew that ADDs and IMULs were faster than the conditional branch to ADD/SUB blocks, or that it could avoid a branch in some manner, it would have emitted that code. Trust the compiler. Let the compiler flow through you: http://blog.heartland.org/wp-content...d-300x225.jpeg
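To make the comparison concrete, here are the two shapes being argued about, cut down to three increments. The struct and function names are invented for this sketch and are not from the plugin source, and it follows the pseudocode earlier in the thread where a set `flip` means negate. Either version may compile to a branch, conditional moves or plain negations depending on compiler and flags, so comparing the generated assembly (for example with `gcc -O2 -S`) is the way to settle it.

```c
/* Hypothetical cut-down set of per-span increments. */
typedef struct {
    int dr, dg, db;
} Increments;

/* Shape 1: everything inside the if/else, one assignment per arm. */
Increments incs_branch(Increments s, int flip)
{
    Increments d;
    if (!flip) {
        d.dr = s.dr;
        d.dg = s.dg;
        d.db = s.db;
    } else {
        d.dr = -s.dr;
        d.dg = -s.dg;
        d.db = -s.db;
    }
    return d;
}

/* Shape 2: unconditional copy, then negate only when flip is set. */
Increments incs_copy_then_negate(Increments s, int flip)
{
    Increments d = s;
    if (flip) {
        d.dr = -d.dr;
        d.dg = -d.dg;
        d.db = -d.db;
    }
    return d;
}
```

The two functions return the same values for every input, which is what lets an optimizer pick whichever code shape it considers cheaper.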
#217
Actually this is not the best example either.
But I am curious and do not have a profiler I know how to use. Starting at line 792 of n64video.cpp in r75:
Code:
if (tile[num].mask_t)
{
    if (tile[num].mt)
    {
        wrap = *T >> tile[num].f.masktclamped;
        wrap &= 1;
        *T ^= (-wrap);
    }
    *T &= maskbits_table[tile[num].mask_t];
}
Would this be any faster? I'm somewhat asking, but I think it probably is.
Code:
if (tile[num].mask_t)
{
    wrap = *T >> (tile[num].f.masktclamped & -tile[num].mt);
    *T ^= -(wrap &= 1);
    *T &= maskbits_table[tile[num].mask_t];
}
Again not a good example compared to other stuff I've seen. But I want to get this question off my head first.
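One thing worth checking before timing anything: the two versions above do not appear to compute the same result when `mt` is 0. With `mt == 0` the shift count becomes 0, so `wrap` ends up as `*T & 1` and an odd coordinate still gets inverted. The sketch below reproduces both forms on plain integers, plus a branch-free variant that masks the wrap bit with `mt` instead. It assumes `mt` is always 0 or 1 and leaves out the final `maskbits_table` AND, which is the same in every version; the function names are invented for the sketch.

```c
#include <stdint.h>

/* Form from n64video.cpp as quoted: mirror only when mt is set. */
static int32_t mirror_original(int32_t t, int mt, int shift)
{
    if (mt) {
        int32_t wrap = (t >> shift) & 1;
        t ^= -wrap;            /* wrap == 1 inverts every bit of t */
    }
    return t;
}

/* Branch-free form as posted: with mt == 0 the shift count is 0,
   so wrap becomes t & 1 and odd values are still inverted. */
static int32_t mirror_posted(int32_t t, int mt, int shift)
{
    int32_t wrap = t >> (shift & -mt);
    t ^= -(wrap &= 1);
    return t;
}

/* Alternative branch-free form (an assumption of this sketch):
   AND the wrap bit with mt so mt == 0 forces wrap to 0. */
static int32_t mirror_masked(int32_t t, int mt, int shift)
{
    int32_t wrap = (t >> shift) & mt;
    t ^= -wrap;
    return t;
}
```

For example, with `t = 5`, `mt = 0` and `shift = 3`, the quoted and masked forms leave `t` at 5 while the posted form returns `~5`.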
Last edited by HatCat; 19th June 2013 at 02:49 PM.
#218
Quote:
In case you weren't following, I was already using NEGate. There is no extra code with my method. It just copies the data over unconditionally. Additionally, if (flip) is set, it uses the NEG op on all 8 of them. What the hell. You must not be following me.
#219
Probably not. For some reason, I have the hardest time digesting your posts.
#220
Quote:
__________________
--------------------- CPU: Intel U7300 1.3 GHz GPU: Mobile Intel 4 Series (on board) AUDIO: Realtek HD Audio (on board) RAM: 4 GB OS: Windows 7 - 32 bit