Project64 Forums > General Discussion > Open Discussion
  #1171  
26th January 2015, 11:39 PM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Quote:
Originally Posted by RPGMaster
Could you post an example of vectorizing CFC2 using zilmar's endian?
You can't. CFC2 is a scalar operation (like much of LWC2 and SWC2, even).

The RSP vector control flag registers (sometimes called VCF[] in the documentation for other vector units, but not usually SGI's) are literally just bit masks of several Boolean flags. They are scalar values, like (true << 3) | (false << 2) | (true << 1) | (true << 0), or 1011_2.

I only store them as arrays of 16-bit shorts so that the COP2::C2 vector operations can be vectorized, without having to do all the scalar bit-masking into RSP_Flags[0, 1, 2] as in pj64, for instance. However, when it comes time to execute CFC2, you need to convert the eight 16-bit shorts into an 8-bit mask, and there is not necessarily a portable way to vectorize that. (There is, however, a non-portable, SSE-specific intrinsic solution, which I have used here in the function for CFC2 reading from $vco: https://github.com/cxd4/rsp/blob/master/vu/vu.c#L144)
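A minimal sketch of the conversion described above, not the code behind the link. The function names are hypothetical, and the mapping "element i becomes bit i" is an assumption made for illustration. The scalar loop is portable; the SSE2 path is only compiled on x86 targets that define `__SSE2__`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Portable scalar version: element i of the flag array becomes bit i. */
static uint8_t flags_to_mask_scalar(const int16_t flags[8])
{
    uint8_t mask = 0;
    int i;

    assert(flags != NULL);
    for (i = 0; i < 8; i++)
        mask |= (uint8_t)((flags[i] != 0) << i);
    return mask;
}

#if defined(__SSE2__)
/* SSE2 version (x86 only): same result without a per-element loop. */
static uint8_t flags_to_mask_sse2(const int16_t flags[8])
{
    __m128i zero, v, is_zero, bytes;

    assert(flags != NULL);
    zero = _mm_setzero_si128();
    v = _mm_loadu_si128((const __m128i *)flags);
    /* Each 16-bit lane: 0xFFFF if the flag is clear, 0x0000 if it is set. */
    is_zero = _mm_cmpeq_epi16(v, zero);
    /* Signed saturating pack narrows every lane to one byte (0xFF or 0x00). */
    bytes = _mm_packs_epi16(is_zero, zero);
    /* movemask collects the top bit of each byte; invert to get "set" bits. */
    return (uint8_t)(~_mm_movemask_epi8(bytes) & 0xFF);
}
#endif
```

Both functions accept any nonzero value as "true", so they work whether the flag arrays hold 1 or 0xFFFF for a set flag.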

  #1172  
27th January 2015, 12:06 AM
RPGMaster
Alpha Tester
Project Supporter
Super Moderator
Join Date: Dec 2013
Posts: 2,008

I know that the original algorithm doesn't use arrays for the flags. Since I want to use 16bit arrays for the flags, CFC2 is more of a hassle to implement. I'm aware of your solution, but I have to account for the fact that the vector elements aren't in the same order. So it requires extra shuffling.

I bet I just made some silly mistake somewhere. I will keep reviewing my code.

If I can manage to use arrays for flags and split the accumulators into separate arrays, I could easily plug in my current code and PJ64's recompiler could become very fast.

  #1173  
27th January 2015, 12:49 AM
HatCat

Don't forget Azimer's advice--working code is a higher priority than fast code.
Start small and keep a few things as messy scalar C, with a minor root vectorization or two.

Quote:
Originally Posted by RPGMaster
I'm aware of your solution, but I have to account for the fact that the vector elements aren't in the same order. So it requires extra shuffling.
No I think better yet it requires somebody to make the endianness in pj64 recompiler/interpreter for RSP more accurate.

A) On actual N64 it looks like:
Code:
[0 1] [2 3] [4 5] [6 7] [8 9] [A B] [C D] [E F]
B) On my RSP plugin it looks like:
Code:
[1 0] [3 2] [5 4] [7 6] [9 8] [B A] [D C] [F E]
C) In zilmar's RSP module it looks like:
Code:
[F E] [D C] [B A] [9 8] [7 6] [5 4] [3 2] [1 0]
A) Obviously is the most accurate, but it is too difficult to write out some of the operations...you have to byte-swap all the time on x86 little-endian.

B) Mine is semi-accurate, semi-fast. You can instantly write each 16-bit vector computation on every element with this solution; you just have to account for the scalar differences in byte endian within a scalar opcode, like, LBV/SBV, accessing them.

C) zilmar's is mostly written for a form of speed, even in the interpreter. The way that the union handles attempts at neutralizing segment access, it makes it easier to access more than 16 bits at once in a single hit in some cases...but I personally prefer limiting myself to just 16 bits as it's more accurate, as this is the size of the vectors' elements. (LWC2 and SWC2 are not vector operations; they're scalar.)
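The three layouts above can be read as a mapping from an N64 byte number (0 to F) to a byte offset inside the host's 16-byte register storage. The sketch below is only an illustration of that reading; the function names are hypothetical and do not come from either plugin.

```c
#include <assert.h>
#include <stdint.h>

/* Host storage offset of N64 byte n (0..15) within one 128-bit register. */
static unsigned offset_a(unsigned n) { assert(n < 16); return n; }      /* A) as on hardware          */
static unsigned offset_b(unsigned n) { assert(n < 16); return n ^ 1; }  /* B) swapped per 16-bit slot */
static unsigned offset_c(unsigned n) { assert(n < 16); return 15 - n; } /* C) whole register reversed */

/* Scalar byte access (LBV/SBV style) under layout B) needs the XOR. */
static uint8_t read_byte_b(const uint8_t vr[16], unsigned n)
{
    return vr[offset_b(n)];
}

/*
 * Under layout B), element el sits in 16-bit slot el, low byte first.
 * This is the value a little-endian host gets by reading the slot directly.
 */
static uint16_t read_element_b(const uint8_t vr[16], unsigned el)
{
    assert(el < 8);
    return (uint16_t)(vr[2 * el] | (vr[2 * el + 1] << 8));
}
```

With N64 bytes 00..0F stored under B), `read_element_b(vr, 1)` yields 0x0203, the same value the big-endian hardware holds in element 1, while a single-byte access has to go through `offset_b`.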

  #1174  
27th January 2015, 01:38 AM
RPGMaster

Quote:
Originally Posted by HatCat
Don't forget Azimer's advice--working code is a higher priority than fast code.
I already did take his advice. I'm simply sad that I got stuck on something as seemingly simple as this ;/ The only thing I haven't got working, afaik, is Rogue Squadron (when using CPU<->RSP sync or Semaphore). There may be some other minor things that need to be checked. The rest is just finishing optimizations.
Quote:
Originally Posted by HatCat
No I think better yet it requires somebody to make the endianness in pj64 recompiler/interpreter for RSP more accurate.
Well if zilmar wants to do B, that would actually be easier for me, since that's the only one I've got working ;/ .
Quote:
Originally Posted by HatCat
A) On actual N64 it looks like:
Code:
[0 1] [2 3] [4 5] [6 7] [8 9] [A B] [C D] [E F]
B) On my RSP plugin it looks like:
Code:
[1 0] [3 2] [5 4] [7 6] [9 8] [B A] [D C] [F E]
C) In zilmar's RSP module it looks like:
Code:
[F E] [D C] [B A] [9 8] [7 6] [5 4] [3 2] [1 0]
A) Obviously is the most accurate, but it is too difficult to write out some of the operations...you have to byte-swap all the time on x86 little-endian.

B) Mine is semi-accurate, semi-fast. You can instantly write each 16-bit vector computation on every element with this solution; you just have to account for the scalar differences in byte endian within a scalar opcode, like, LBV/SBV, accessing them.

C) zilmar's is mostly written for a form of speed, even in the interpreter. The way that the union handles attempts at neutralizing segment access, it makes it easier to access more than 16 bits at once in a single hit in some cases...but I personally prefer limiting myself to just 16 bits as it's more accurate, as this is the size of the vectors' elements. (LWC2 and SWC2 are not vector operations; they're scalar.)
For me, the ideal endian for recompiler would be
D)
Code:
[3 2] [1 0] [7 6] [5 4] [B A] [9 8] [F E] [D C]
I want to switch from B to D for my own recompiler implementation. I think it'll give me the boost I need to achieve my goal for LLE audio's speed. I also had a silly idea of testing with C) to compare algorithms at run-time, but now I realize I'd be better off just reading zilmar's interpreter to look for any flaws.
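Read as a byte mapping, layout D) is a byte swap within each 32-bit word, which also moves every 16-bit element to the neighbouring slot. A small sketch of that reading follows; the function names are hypothetical.

```c
#include <assert.h>

/* Layout D): host storage offset of N64 byte n (0..15), swapped per 32-bit word. */
static unsigned offset_d(unsigned n)
{
    assert(n < 16);
    return n ^ 3;
}

/* Layout D): 16-bit storage slot that holds logical element el (0..7). */
static unsigned slot_d(unsigned el)
{
    assert(el < 8);
    return el ^ 1;
}
```

So 32-bit accesses line up with a word-swapped memory image, but every per-element vector loop has to index with `el ^ 1` instead of `el`.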

  #1175  
27th January 2015, 02:17 AM
HatCat

Quote:
Originally Posted by RPGMaster
For me, the ideal endian for recompiler would be
D)
Code:
[3 2] [1 0] [7 6] [5 4] [B A] [9 8] [F E] [D C]
I want to switch from B to D for my own recompiler implementation. I think it'll give me the boost I need to achieve my goal for LLE audio's speed. I also had a silly idea of testing with C) to compare algorithms at run-time, but now I realize I'd be better off just reading zilmar's interpreter to look for any flaws.
Rather pointless, really.

The C) method zilmar is already using allows him to read 8-, 16-, 32-, 64-, or 128-bit storage out at once.

Yours only allows writing 8, 16, or 32 bits at once. His is better than that for speed currently. That has nothing to do with accuracy and little to do with speed, because, what does 32-bit endian have to do with the endian in 16-, 64- or 128-bit storage??

It also misses the overall point of my post. Currently zilmar has to do something like this, excluding the shuffles:
Code:
for (el = 0; el < 8; el++)
{
    VR[vd][7 - el] = VR[vs][7 - el] <operation> VR[vt][7 - el];
    bit = compare_flag_operation(VR[vs][7 - el], VR[vt][7 - el]);
    RSP_Flags[...] |= (bit << (8 + el)) | (bit << (0 + el));
}
With your D) method, you have to make it even harder, with this:
Code:
for (el = 0; el < 8; el++)
{
    VR[vd][1 ^ el] = VR[vs][1 ^ el] <operation> VR[vt][1 ^ el];
    bit = compare_flag_operation(VR[vs][1 ^ el], VR[vt][1 ^ el]);
    RSP_Flags[...] |= (bit << (8 + (1 ^ (7 - el)))) | (bit << (0 + (1 ^ (7 - el))));
}
I'm telling you, you're getting way too worked up about optimizing LWC2/SWC2 edge cases, when a lot of games don't even enforce alignment (unaligned accesses are actually at least as common as aligned ones for several ops), and you're making COP2 even slower in the process. That stuff doesn't mean shit compared to what goes on under COP2. Keep it simple, readable, maintainable, accurate, and uh...oh fast might not hurt.

Working code is always better when it works. And I can guarantee you, there's no way you're ever going to get that D) method to work without realizing that it's ultimately going to be slower than even C).
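To make the comparison concrete, here is a hedged sketch of one per-element operation (a 16-bit add that records the carry-out of element el in bit el of a mask) written for layout B) and for layout C). The operation and the function names are chosen only for illustration and are not taken from either plugin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 8

/* Layout B): logical element el is stored at index el. */
static uint8_t add_with_carry_b(uint16_t vd[N], const uint16_t vs[N],
                                const uint16_t vt[N])
{
    uint8_t carry = 0;
    unsigned el;

    assert(vd != NULL && vs != NULL && vt != NULL);
    for (el = 0; el < N; el++) {
        uint32_t sum = (uint32_t)vs[el] + vt[el];

        vd[el] = (uint16_t)sum;
        carry |= (uint8_t)((sum >> 16) << el); /* carry of element el -> bit el */
    }
    return carry;
}

/* Layout C): logical element el is stored at index 7 - el. */
static uint8_t add_with_carry_c(uint16_t vd[N], const uint16_t vs[N],
                                const uint16_t vt[N])
{
    uint8_t carry = 0;
    unsigned el;

    assert(vd != NULL && vs != NULL && vt != NULL);
    for (el = 0; el < N; el++) {
        unsigned s = 7 - el; /* storage index differs from the flag bit index */
        uint32_t sum = (uint32_t)vs[s] + vt[s];

        vd[s] = (uint16_t)sum;
        carry |= (uint8_t)((sum >> 16) << el);
    }
    return carry;
}
```

In the B) version the storage index and the flag bit index are the same number, which is the form a compiler has the easiest time turning into a plain vector add; every extra index transformation has to be undone somewhere when the flags are produced.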

Last edited by HatCat; 27th January 2015 at 02:39 AM.

  #1176  
27th January 2015, 02:48 AM
RPGMaster

I admit I don't fully understand endianness ;/ I assumed 32-bit little endian was the fastest, because I believe DMEM is 32-bit little endian (for zilmar spec emulators). Whether I decide to change endian or not, I want to investigate it, just to understand it better.

I've already pretty much optimized the other instructions as best I could already ;/ . Only thing left to do is clean up code, finish other recompiler-specific optimizations, and continue checking accuracy.

I did get carried away though, since I could have finished up a good amount of the remaining optimizations by now. Really, I've been taking it slow these days. I mostly just experiment and review my code.

I can definitely agree with keeping it simple, readable and maintainable. That's why I plan to do some serious code cleaning, once I'm done with optimizations and accuracy checking.

  #1177  
27th January 2015, 03:30 AM
HatCat

No you're right that DMEM on Win32 plugin specs is 32-bit segment barrier for endianness...in fact thinking back about what I said I guess I worded that a bit hastily when I presented the argument that C) can read 64-, 32-, 16-bit at once and yours only 32-. Because technically, zilmar's method does let him READ a 32-, 64-, or 128-bit "number" out at once with that endian, but, when actually moving the store to DMEM swapped on a 32-bit interval, then actually no. Yours would have this slight fekkin' microoptimization from not having to XOR an address a few times for LQV, maybe LDV...but honestly man, even on the off chance a game is doing this shit with alignment, COP2 performance is way more important. And that's why I'm sticking with method B).
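The "XOR an address" cost mentioned above can be sketched as follows, assuming DMEM is held byte-swapped within each 32-bit word and the vector register uses layout B). The helper names are hypothetical, and the 16-byte copy is a simplified stand-in rather than a full LQV implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMEM_SIZE 0x1000u /* RSP data memory is 4 KiB */

/* Read one N64 byte from a DMEM image that is byte-swapped per 32-bit word. */
static uint8_t dmem_read_u8(const uint8_t *dmem, unsigned addr)
{
    assert(dmem != NULL);
    return dmem[(addr & (DMEM_SIZE - 1)) ^ 3];
}

/*
 * Copy 16 consecutive N64 bytes starting at addr into a layout-B register.
 * Source offsets are XORed with 3 (memory), destination offsets with 1
 * (register), so each byte pays for two address fix-ups.
 */
static void load16_layout_b(uint8_t vr[16], const uint8_t *dmem, unsigned addr)
{
    unsigned n;

    assert(vr != NULL);
    for (n = 0; n < 16; n++)
        vr[n ^ 1] = dmem_read_u8(dmem, addr + n);
}
```

If the register used the same per-32-bit swap as memory, an aligned load could be a straight copy, which is the micro-optimization conceded in the post; unaligned addresses still need byte-wise handling either way.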

I don't think the plugin spec should be saying, PluginInfo.MemoryBswaped means memory is byte-swapped on a 32-bit boundary. I think it should be saying, memory is in the EXACT same endianness as your native machine's CPU...which may or may not mean byte-swapped on a 32-bit boundary...it could be on a 64-bit boundary, a 16-bit boundary, or a word-swapped/mixed endian. It should just be, MemoryBswaped = TRUE for "native endian" (not Intel x86's ugly ass CPU w/ endian), or just MIPS-native endian. So your 32-bit DMEM assumption might have escaped my assessment and be a little faster in that sense...but this shit should be portable and fast I think.

Last edited by HatCat; 27th January 2015 at 03:33 AM.

  #1178  
28th January 2015, 07:08 PM
HatCat

I discovered that the 4.8.2 version of GCC isn't vectorizing zero'd arrays.
Code:
    for (i = 0; i < N; i++)
        ne[i] = 0;
    for (i = 0; i < N; i++)
        co[i] = 0;
... produces ...
Code:
    mov     DWORD PTR _ne, 0
    mov     DWORD PTR _ne+4, 0
    mov     DWORD PTR _ne+8, 0
    mov     DWORD PTR _ne+12, 0
    mov     DWORD PTR _co, 0
    mov     DWORD PTR _co+4, 0
    mov     DWORD PTR _co+8, 0
    mov     DWORD PTR _co+12, 0
Which I find to be lulzworthy, since like everything else in the entire source EXCEPT for something as trivial as zeroing an entire vector, was getting vectorized to SSE, and usually quite tolerably. This is some bug in that version of GCC, and I don't know how long it will last.

So I decided to fix my vector_wipe macro to be bi-compatible, to accept either __m128i XMM's or C arrays/pointers as parameters and still produce the same output:
Code:
#define vector_wipe(vd) { \
    *(v16 *)&(vd) = _mm_cmpgt_epi16(*(v16 *)&(vd), *(v16 *)&(vd)); }
(_mm_xor_si128 or _mm_setzero_si128 would work as well. The gt comparison always fails because (x > x) is never true.)

So now I can just use the vector_wipe macro for both SSE and non-SSE to get the optimal output.
Code:
    vector_wipe(ne);
    vector_wipe(co);
... produces ...
Code:
    pxor    xmm0, xmm0
    movdqa  XMMWORD PTR _ne, xmm0
    movdqa  XMMWORD PTR _co, xmm0
There, see? If it feels like generating PCMPGT, it can generate PCMPGT instead of PXOR. (There has to be an analogous vector_fill macro which fills the entire vector with all ones, using PCMPEQ xmm0, xmm0, as (x == x) always passes.) And there are always non-SSE definitions of these macros anyway.
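A hedged sketch of what a wipe/fill pair with SSE and non-SSE definitions could look like. These are written as functions with hypothetical names rather than as the macros from the post, and the `__SSE2__` check is an assumption about how the two paths would be selected.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Set all eight 16-bit elements to 0x0000. */
static void vector_wipe_sketch(int16_t v[8])
{
    assert(v != NULL);
#if defined(__SSE2__)
    _mm_storeu_si128((__m128i *)v, _mm_setzero_si128());
#else
    memset(v, 0x00, 8 * sizeof(v[0]));
#endif
}

/* Set all eight 16-bit elements to 0xFFFF. */
static void vector_fill_sketch(int16_t v[8])
{
    assert(v != NULL);
#if defined(__SSE2__)
    __m128i x = _mm_setzero_si128();

    /* (x == x) is true in every lane, so the compare yields all ones. */
    _mm_storeu_si128((__m128i *)v, _mm_cmpeq_epi16(x, x));
#else
    memset(v, 0xFF, 8 * sizeof(v[0]));
#endif
}
```

The unaligned store is used so the sketch works on any array; storage that is known to be 16-byte aligned could use `_mm_store_si128` instead.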

DLL also lost 1 KB in file size from that change. The idea is this may also improve speed in the interpreter but I guess who cares. More importantly, it's more accurate to zero the whole thing at once, than using 4 x86 DWORD writes.

Last edited by HatCat; 28th January 2015 at 07:10 PM.

  #1179  
28th January 2015, 07:51 PM
RPGMaster

Quote:
Originally Posted by HatCat
I don't think the plugin spec should be saying, PluginInfo.MemoryBswaped means memory is byte-swapped on a 32-bit boundary. I think it should be saying, memory is in the EXACT same endianness as your native machine's CPU...which may or may not mean byte-swapped on a 32-bit boundary...it could be on a 64-bit boundary, a 16-bit boundary, or a word-swapped/mixed endian. It should just be, MemoryBswaped = TRUE for "native endian" (not Intel x86's ugly ass CPU w/ endian), or just MIPS-native endian.
You brought up a good point. Now I'm wondering what endian DMEM is for 64 bit emulators ;/ . Since I want to eventually move onto 64bit emulation. I'll probably start using more macros now, to make things easier. Another thing about zilmar spec is, I think it should have a variable to let the gfx plugin know whether expansion pack is enabled ;/ .

Quote:
Originally Posted by HatCat
I discovered that the 4.8.2 version of GCC isn't vectorizing zero'd arrays.
Code:
    for (i = 0; i < N; i++)
        ne[i] = 0;
    for (i = 0; i < N; i++)
        co[i] = 0;
... produces ...
Code:
    mov     DWORD PTR _ne, 0
    mov     DWORD PTR _ne+4, 0
    mov     DWORD PTR _ne+8, 0
    mov     DWORD PTR _ne+12, 0
    mov     DWORD PTR _co, 0
    mov     DWORD PTR _co+4, 0
    mov     DWORD PTR _co+8, 0
    mov     DWORD PTR _co+12, 0
Which I find to be lulzworthy, since like everything else in the entire source EXCEPT for something as trivial as zeroing an entire vector, was getting vectorized to SSE, and usually quite tolerably. This is some bug in that version of GCC, and I don't know how long it will last.
I noticed that too, a few days ago. Turns out, I have GCC 4.8.1. Macros ftw. I think I'll do more experimenting with compiler output today.

  #1180  
28th January 2015, 11:04 PM
RPGMaster

Well I experimented with VMULF using intrinsics. For some reason, the compiler is bad with
Code:
round = _mm_cmpeq_epi16(vs, vs);
so I tried
Code:
round = _mm_set1_epi16(0xFFFF);
and it worked better. I couldn't be bothered figuring out the reason for the other compiler flaws in VMULF, but when I converted my asm algorithm to intrinsics, the compiler output was the same length as the code I pasted earlier. So you just have to be extra careful with the way you write intrinsics. If you want to improve your VMULF, the sign clamp algorithm can be improved.
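The two ways of producing an all-ones vector give the same bits; the sketch below only demonstrates that equivalence and says nothing about which one a given compiler handles better. One plausible difference, stated as a guess, is that the compare form names `vs` as an input while the constant form has no inputs at all. The function names are hypothetical, and `-1` is used instead of `0xFFFF` to avoid an implicit int-to-short conversion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* All ones by comparing a register with itself: (vs == vs) in every lane. */
static void all_ones_by_compare(int16_t out[8], const int16_t vs_in[8])
{
    assert(out != NULL && vs_in != NULL);
#if defined(__SSE2__)
    {
        __m128i vs = _mm_loadu_si128((const __m128i *)vs_in);

        _mm_storeu_si128((__m128i *)out, _mm_cmpeq_epi16(vs, vs));
    }
#else
    memset(out, 0xFF, 8 * sizeof(out[0])); /* portable stand-in, ignores vs_in */
#endif
}

/* All ones by broadcasting the constant -1 (0xFFFF) to every lane. */
static void all_ones_by_constant(int16_t out[8])
{
    assert(out != NULL);
#if defined(__SSE2__)
    _mm_storeu_si128((__m128i *)out, _mm_set1_epi16(-1));
#else
    memset(out, 0xFF, 8 * sizeof(out[0]));
#endif
}
```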