  #921  
Old 24th August 2014, 01:40 AM
RPGMaster
Alpha Tester
Project Supporter
Super Moderator

Join Date: Dec 2013
Posts: 2,008

Sorry, I forgot that I accidentally didn't update shuffle.h. I knew about the mistake I made, but forgot to update it xD. What I did was simply drag the files from the zip into the old folder I had, and I somehow didn't drag in shuffle.h.

VirtualProtect is a one-time call that grants or restricts the calling process's access to a chunk of memory, depending on the flags you pass. The 0x100 I wrote is arbitrary; it just has to be large enough to cover the area you want to change. The OS decides what you can do with a given memory region (read, write, execute, etc.).

I don't even know for sure whether I need to call VirtualProtect again if I CloseDll() and then InitiateRSP again. So the doOnce = 0 inside of CloseDll() might not even be needed. I'm not 100% sure how that all works yet. On CloseDll(), you may even want to call VirtualProtect again to restore the old protection setting, for safety.

  #922  
Old 25th August 2014, 04:09 AM
RPGMaster

I've decided to practice SSE more before finishing the recompiler. So I was looking at your set_VCO function, trying to figure out how to write that in SSE. After some time looking at the instruction set and trying out ideas, I believe I found a formula, but it doesn't seem to be faster. Then again, idk a good way to benchmark, especially since I haven't even seen the function get called ;/

So I'm wondering, what game uses that function?

I basically did something like this
Code:
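ALIGNED __m128i xmm, xmm1;
	int value = VCO | (VCO << 16); // two copies of the 16-bit VCO in one dword
	xmm = _mm_cvtsi32_si128(value);
	// num1[8] = { 0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080 };
	xmm1 = _mm_load_si128((const __m128i *)num1); // num1 must be 16-byte aligned
	xmm = _mm_shuffle_epi32(xmm, 0); // broadcast the low dword to all four dwords
	xmm = _mm_and_si128(xmm, xmm1); // isolate one flag bit per 16-bit element
	xmm = _mm_cmpeq_epi16(xmm, xmm1); // 0xFFFF where the bit was set, else 0
	*(__m128i *)(co) = _mm_srli_epi16(xmm, 15); // reduce 0xFFFF to 1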
Pardon me if it's written poorly. Idk intrinsics yet lol. I still use assembly when testing out SSE instructions. Well, it was a good try, despite it not being as fast as I'd like. I know I only posted half the formula. Took me a while to figure out how to properly write intrinsics.

Edit: Hmm, looks like measuring cycles is hard to do right. When I spammed a few cpuid and rdtsc instructions, I saw different results: the SSE version was a bit faster.

Last edited by RPGMaster; 25th August 2014 at 04:53 AM.

  #923  
Old 25th August 2014, 05:02 AM
HatCat
Alpha Tester
Project Supporter
Senior Member

Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

set $vco means you're setting a vector control register, so it would be CTC2 on the RSP, under the COP2 primary opcode. What "game" uses it? What game doesn't use it?

I'm out of time tonight, so I can't look at your code closely enough to reply about it in full right now.

Last edited by HatCat; 25th August 2014 at 01:36 PM.

  #924  
Old 25th August 2014, 10:03 AM
RPGMaster

Quote:
Originally Posted by HatCat View Post
set $vco means you're setting a vector control register, so it would be CTC2 on the RSP, under the COP2 primary opcode. What "game" uses it? What game doesn't use it?
I'll have to double-check and see if maybe my project settings are to blame. Kinda weird how my breakpoint never got hit ;/

I've come up with a bright idea: when I'm mentally exhausted, I'll work on the grunt work, aka the code gen.

I've run into a problem though. I'm not sure exactly which SSE instructions to implement. For instance, idk if movaps is better, or movdqa. My cycle-measuring tests have shown that movaps is a little better, even though I was working with integer data. Then there are instructions like xorps vs pxor. Should I just always go for the one with the shorter encoding?

I never thought of using pcmpeqd xmm0, xmm0 to set all of the register's bits to 1. This is really interesting, learning SSE algorithms.

Edit: Now that I think about it, I've actually had trouble with breakpoints recently ;/ I remember trying to see how often SLTIU was being used while debugging Ziggy's PJ64 RSP, and it was strange. I had to set a breakpoint in the disassembly window for some reason. Maybe I should start using MSVC 2010 more again; I don't recall having debugging issues on that version.
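
The pcmpeqd xmm0, xmm0 idiom mentioned above can be sketched with intrinsics as follows. This is only an illustration: the function names are hypothetical, and whether the compiler really emits pcmpeqd for the self-compare is up to the compiler. The second function shows one common follow-up, deriving a 0x0001-per-element constant from the all-ones register without loading anything from memory.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* All 128 bits set: a register compared with itself is "equal" in every
 * lane, so every lane becomes 0xFFFFFFFF (the pcmpeqd xmm, xmm idiom). */
static __m128i all_ones(void)
{
    __m128i x = _mm_setzero_si128();
    return _mm_cmpeq_epi32(x, x);
}

/* 0x0001 in each 16-bit element, made from all-ones with one logical shift. */
static __m128i ones_epi16(void)
{
    return _mm_srli_epi16(all_ones(), 15);
}

/* Hypothetical helper: copy a vector out to eight 16-bit values. */
static void store_elems(uint16_t out[8], __m128i v)
{
    _mm_storeu_si128((__m128i *)out, v);
}
```

Calling store_elems() on each result is an easy way to inspect the elements in a debugger or a quick test.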

Last edited by RPGMaster; 25th August 2014 at 10:14 AM.

  #925  
Old 25th August 2014, 01:52 PM
HatCat

Quote:
Originally Posted by RPGMaster View Post
I basically did something like this
Code:
ALIGNED __m128i xmm, xmm1;
	int value = VCO | (VCO << 16);
	xmm = _mm_cvtsi32_si128(value);
...
	xmm = _mm_shuffle_epi32(xmm, 0);
...
Don't do this.
If you're going to resign yourself to SSE intrinsic functions, then just use the high-level _mm_set1_epi16 intrinsic.

It can set all 8 of the 16-bit elements in your XMM register to VCO. It's a pseudo-instruction, of course, not one that exists physically on Intel hardware, but it should compile at least as well as the _mm_shuffle_epi32(xmm, 0x00) method you wrote, if not better, and less hazardously, not to mention that you don't have to go through the ugly cvtsi32_si128 operation yourself.

The rest of your algorithm seems acceptable for the task, but I still wouldn't vectorize it like that. CTC2 is rather rarely called.
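
For illustration only, here is roughly how the whole expansion might look with _mm_set1_epi16, covering both halves of the register. It is a sketch under assumptions, not the plugin's actual set_VCO: it assumes bits 0..7 of VCO are the carry flags and bits 8..15 the not-equal flags, and the arrays co and ne as declared here are hypothetical stand-ins that hold 0 or 1 per element. A scalar version is included as a reference to compare against.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical stand-ins for the flag arrays: one 0/1 value per element. */
static int16_t co[8]; /* assumed: VCO bits 0..7  */
static int16_t ne[8]; /* assumed: VCO bits 8..15 */

static void set_vco_sse2(uint16_t vco)
{
    /* One distinct bit per 16-bit element. */
    const __m128i lo_bits = _mm_setr_epi16(0x0001, 0x0002, 0x0004, 0x0008,
                                           0x0010, 0x0020, 0x0040, 0x0080);
    const __m128i hi_bits = _mm_slli_epi16(lo_bits, 8);
    /* Broadcast VCO to all eight elements in one step. */
    const __m128i all = _mm_set1_epi16((short)vco);
    /* 0xFFFF where the element's bit is set in VCO, else 0. */
    __m128i c = _mm_cmpeq_epi16(_mm_and_si128(all, lo_bits), lo_bits);
    __m128i n = _mm_cmpeq_epi16(_mm_and_si128(all, hi_bits), hi_bits);
    /* Reduce 0xFFFF to 1 and store (unaligned stores, to be safe). */
    _mm_storeu_si128((__m128i *)co, _mm_srli_epi16(c, 15));
    _mm_storeu_si128((__m128i *)ne, _mm_srli_epi16(n, 15));
}

/* Scalar reference with the same assumed bit layout. */
static void set_vco_scalar(uint16_t vco, int16_t co_out[8], int16_t ne_out[8])
{
    int i;
    for (i = 0; i < 8; i++) {
        co_out[i] = (int16_t)((vco >> i) & 1);
        ne_out[i] = (int16_t)((vco >> (i + 8)) & 1);
    }
}
```

Since VCO is only 16 bits wide, the two versions can be checked against each other exhaustively over all 65536 inputs.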

  #926  
Old 25th August 2014, 01:56 PM
HatCat

There is no floating-point unit on the RSP, so I wouldn't bother substituting MOVDQA with some weird floating-point move from SSE1.

xorps is sort of a bit-wise XOR, except it's more for floating-point numbers. I'm not even sure that would work with integers the way you'd expect? Honestly I'd just stick to SSE2; even an MMX fallback makes more sense than trying to use SSE1 on the RSP.
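
On the "would it work with integers" question: xorps XORs all 128 bits without interpreting them as numbers, so the result should be bit-identical to pxor. The sketch below checks that with intrinsics. The function names are hypothetical, and the compiler is free to pick either instruction for either function, so this demonstrates the semantics rather than the generated code. In the legacy encoding xorps is one byte shorter than pxor (no 0x66 prefix); whether mixing it with integer instructions costs extra latency depends on the CPU.

```c
#include <emmintrin.h> /* SSE2; also declares the SSE1 intrinsics */
#include <stdint.h>

/* XOR two integer vectors with the integer-domain intrinsic (pxor). */
static __m128i xor_int(__m128i a, __m128i b)
{
    return _mm_xor_si128(a, b);
}

/* XOR the same bits with the float-domain intrinsic (xorps).
 * The casts only reinterpret the register; they convert nothing. */
static __m128i xor_ps_on_ints(__m128i a, __m128i b)
{
    return _mm_castps_si128(
        _mm_xor_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* Hypothetical helper: 1 if all 128 bits match, else 0. */
static int same_bits(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

The same reasoning applies to movaps versus movdqa: both copy 128 bits unchanged, and the difference is encoding size and, on some hardware, which execution domain the data is considered to be in.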

  #927  
Old 25th August 2014, 06:26 PM
MarathonMan
Alpha Tester
Project Supporter
Senior Member

Join Date: Jan 2013
Posts: 454

Quote:
Originally Posted by RPGMaster View Post
I get an error message saying "error C2057: expected constant expression" and the line it's referring to is
Code:
static __m128i shuffle_0q(__m128i xmm)
{
    const int order = simm[0x2];

    xmm = _mm_shufflehi_epi16(xmm, order);//this line
    xmm = _mm_shufflelo_epi16(xmm, order);
    return (xmm);
}
order needs to be a compile-time constant; a const int type may not suffice. To prevent issues, I'd just use either a macro or an immediate value. You might be able to get away with a little more, you might not; I'm not sure. In the following example, I marked the int 'const', but its value isn't known at compile time, as the intrinsic requires. So MSVC complains:

Code:
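#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdlib.h>    /* atoi */

__m128i this_wont_work(__m128i xmm, const int foo) {
   /* foo is const-qualified, but still not a compile-time constant */
   return _mm_shufflehi_epi16(xmm, foo);
}

int main(int argc, const char *argv[]) {
   this_wont_work(_mm_setzero_si128(), atoi(argv[1]));
   return 0;
}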
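
A minimal sketch of the macro route, assuming the goal is to apply the same order to both halves as in shuffle_0q. SHUFFLE_HILO and broadcast_elem2_per_half are hypothetical names, and the order used here is only an example since the contents of simm[] aren't shown. Because the macro pastes the order in textually, the intrinsic receives a literal:

```c
#include <emmintrin.h> /* SSE2 intrinsics, plus _MM_SHUFFLE */
#include <stdint.h>

/* Shuffle the high four and the low four 16-bit elements with the same
 * order. The order must be written as a constant at each use site. */
#define SHUFFLE_HILO(xmm, order) \
    _mm_shufflelo_epi16(_mm_shufflehi_epi16((xmm), (order)), (order))

/* Example order: copy element 2 of each 64-bit half across that half.
 * _MM_SHUFFLE(2, 2, 2, 2) expands to the constant 0xAA. */
static __m128i broadcast_elem2_per_half(__m128i xmm)
{
    return SHUFFLE_HILO(xmm, _MM_SHUFFLE(2, 2, 2, 2));
}
```

With the input elements 0 through 7, this yields 2, 2, 2, 2, 6, 6, 6, 6. The cost of this approach is one macro use (or one small function) per distinct order, rather than a single function indexed by a table.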
Quote:
Originally Posted by HatCat View Post
It's a pseudo-instruction, of course, not one physically on Intel hardware, but
I'mma let you finish!

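This is the power of intrinsics: AVX/AVX2 bring the vbroadcast*/vpbroadcast* instructions, which map directly onto this intrinsic (for 16-bit elements, that's AVX2's vpbroadcastw). When compiling with -mavx/-mavx2, the compiler can emit a broadcast instruction for set1 instead of a move followed by shuffles.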

Either way, I agree.

Last edited by MarathonMan; 25th August 2014 at 06:30 PM.

  #928  
Old 25th August 2014, 07:08 PM
HatCat

I'ma let me finish and say that most of C programming itself is pretty much pseudo-instructions in a portable assembly language.

I have not yet tested whether:
Code:
for (i = 0; i < 8; i++)
    target[i] = source[0];
... would be a successful substitute for the _mm_set1_epi16 pseudo-instruction (that is, whether the compiler would vectorize it on its own), but if it isn't, I'm happy to macro that out and use the latter SSE intrinsic function as the primary solution until compilers improve.

Code:
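#ifdef ARCH_MIN_SSE2
/* target is a __m128i here */
#define mm_set1_epi16(target, source) \
    ((target) = _mm_set1_epi16(source))
#else
/* target is an array of N 16-bit elements here (N = 8 for one vector) */
#define mm_set1_epi16(target, source) \
    do { \
        int i_; \
        for (i_ = 0; i_ < N; i_++) \
            (target)[i_] = (source); \
    } while (0)
#endif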
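
The same idea can also be sketched as a function instead of a macro, which gives both build paths one target type and avoids multi-statement macro pitfalls. Everything named here is an assumption for illustration: broadcast16 and RSP_VEC_ELEMS are hypothetical, and HAVE_SSE2 stands in for ARCH_MIN_SSE2, derived from the usual compiler-defined macros.

```c
#include <stdint.h>

/* Detect SSE2 from common predefined macros (GCC/Clang, then MSVC). */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1 /* hypothetical stand-in for ARCH_MIN_SSE2 */
#endif

#define RSP_VEC_ELEMS 8 /* 16-bit elements in one vector register */

/* Fill every element of target with source. */
static void broadcast16(int16_t target[RSP_VEC_ELEMS], int16_t source)
{
#ifdef HAVE_SSE2
    /* Unaligned store, so target needs no special alignment. */
    _mm_storeu_si128((__m128i *)target, _mm_set1_epi16(source));
#else
    int i;
    for (i = 0; i < RSP_VEC_ELEMS; i++)
        target[i] = source;
#endif
}
```

Both paths produce the same array contents, so callers don't need to know which one was compiled in.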

Last edited by HatCat; 25th August 2014 at 07:12 PM.

  #929  
Old 25th August 2014, 07:30 PM
RPGMaster

Quote:
Originally Posted by HatCat View Post
Don't do this.
If you're going to resign yourself to SSE intrinsic functions, then just use the high-level _mm_set1_epi16 intrinsic.

It can set all 8 of the 16-bit elements in your XMM register to VCO. It's a pseudo-instruction, of course, not one that exists physically on Intel hardware, but it should compile at least as well as the _mm_shuffle_epi32(xmm, 0x00) method you wrote, if not better, and less hazardously, not to mention that you don't have to go through the ugly cvtsi32_si128 operation yourself.

The rest of your algorithm seems acceptable for the task, but I still wouldn't vectorize it like that. CTC2 is rather rarely called.
Oh nice! I had a feeling there was a better way to set all 8 of the 16-bit elements. I really do need to devote a whole day to learning all the SSE instructions. I can understand not implementing SSE for that function. The SSE code is smaller in size though, so I'm going to use it for the recompiler. Lol, I also need to learn intrinsics ;/ I like how IntelliSense can help me with intrinsics.
Quote:
Originally Posted by HatCat View Post
There is no floating-point unit on the RSP, so I wouldn't bother substituting MOVDQA with some weird floating-point move from SSE1.

xorps is sort of a bit-wise XOR, except it's more for floating-point numbers. I'm not even sure that would work with integers the way you'd expect? Honestly I'd just stick to SSE2; even an MMX fallback makes more sense than trying to use SSE1 on the RSP.
I'm aware that the RSP doesn't use floating point. It just weirds me out that there are instructions in SSE that pretty much do the same thing. I guess I'll have to experiment to figure out which ones are better. I'd like to use the instructions that take up fewer bytes; I think that would be beneficial for a recompiler. Apparently there's some penalty when you use an instruction of the wrong type for the data (a delay for crossing between the integer and floating-point domains), but I guess it's hardware-specific.
Quote:
Originally Posted by MarathonMan View Post
order needs to be a compile-time constant; a const int type may not suffice. To prevent issues, I'd just use either a macro or an immediate value. You might be able to get away with a little more, you might not; I'm not sure. In the following example, I marked the int 'const', but its value isn't known at compile time, as the intrinsic requires. So MSVC complains:

Code:
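#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdlib.h>    /* atoi */

__m128i this_wont_work(__m128i xmm, const int foo) {
   /* foo is const-qualified, but still not a compile-time constant */
   return _mm_shufflehi_epi16(xmm, foo);
}

int main(int argc, const char *argv[]) {
   this_wont_work(_mm_setzero_si128(), atoi(argv[1]));
   return 0;
}
Macros ftw.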

  #930  
Old 27th August 2014, 09:24 PM
RPGMaster

Quote:
Originally Posted by HatCat View Post
I'ma let me finish and say that most of C programming itself is pretty much pseudo-instructions in a portable assembly language.

I have not yet tested whether:
Code:
for (i = 0; i < 8; i++)
    target[i] = source[0];
... would be a successful substitute for the _mm_set1_epi16 pseudo-instruction (that is, whether the compiler would vectorize it on its own), but if it isn't, I'm happy to macro that out and use the latter SSE intrinsic function as the primary solution until compilers improve.
I bet it depends on the compiler. GCC is KING!!! One of these days, I should try out the real Clang, instead of this MSVC frontend ;/ The MSVC frontend for Clang is pretty good, but it seems to struggle when it comes to auto-vectorization. I now remember why I didn't use GCC for a while: WinAPI is a hassle ;/ I still haven't even figured out how to use resource files; I don't think the MSVC ones are compatible. That probably explains why Ziggy didn't package the resource file with his modified PJ64 1.4 RSP source.

What I'm doing now is writing the code gen for the vector instructions. Man, are they long! I can see why Jabo and Zilmar didn't bother with register caching, although it would still help. At first I was reading the Intel output, until I got to a part where I felt the compiler didn't do a good job, since not all of the output was in SSE. So I looked at the GCC version and, sure enough, it was all vectorized for VMULF. I can't wait to see the results when I'm done.

I can also see why you prefer ANSI C over intrinsics. I personally hate relying on the compiler, since I've seen some output I don't like, but it sure is a lot more convenient to write 3 small loops instead of 60 lines of intrinsics.

All times are GMT.