#911
23rd August 2014, 09:01 AM
RPGMaster
Alpha Tester
Project Supporter
Super Moderator
Join Date: Dec 2013
Posts: 2,008

Well, I figured out why my SMC (self-modifying code) didn't work. It's because I didn't pay attention to the fact that HatCat's SSE2 shuffle implementation uses different instructions depending on the value of e. I was just assuming SMC is unstable.

Is there a convenient way to do SMC in C? So far, I'm stuck with writing code like *(char*)0x0A951923 = simm[e]; . It's way easier for me to do SMC code in assembly.

After some trial and error, I've discovered that the performance penalty for SMC is too great. I put rdtsc at the beginning and end of the function, since the benchmark program I used didn't work for some reason. Maybe my SMC code is poorly written. I wrote the SMC in assembly, just because it was too much of a hassle for me to do in C.

On my machine, the original SSE2 version of VAND in HatCat's RSP took around 0x6C7497A cycles to run 0x300000 times.
When I used SMC inside of the VAND function, it took around 0x42A0613A cycles. Since the instructions were different, I needed to overwrite 12 bytes, so the 0x42A0613A figure was with a single SSE write. When I did three 4-byte writes instead, it was about 1.4x slower than the SSE write.

If I cheat and put the SMC outside of the function, I get 0x42C07F8 cycles. If I simply put the SMC before the first rdtsc, inside of VAND, just to see what the penalty is after doing the write, I get around 0x1B818D2C cycles. So that means putting the SMC too close to the modified code has a larger penalty. The last thing I tested was timing only the SMC itself, outside of the function. Doing three 4-byte writes took around 0x8A87415 cycles, and the SSE write took around the same for some odd reason. I guess measuring only a few instructions may not be too accurate.

So it doesn't seem like SMC will help for SSE2 shuffling.
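
For a sense of scale, the totals above can be divided by the 0x300000 calls to get an average cost per call. A minimal sketch in C that uses only the figures quoted in this post; the helper names cycles_per_call and print_averages are made up for illustration:

```c
#include <stdio.h>

#define ITERATIONS 0x300000u  /* number of VAND calls timed in the post */

/* Average cycles per call for a measured total (hypothetical helper). */
static double cycles_per_call(unsigned long total_cycles)
{
    return (double)total_cycles / (double)ITERATIONS;
}

/* Print the per-call average for each total quoted in the post. */
static void print_averages(void)
{
    printf("original SSE2 VAND:         %.1f\n", cycles_per_call(0x6C7497AUL));
    printf("SMC inside VAND, SSE write: %.1f\n", cycles_per_call(0x42A0613AUL));
    printf("SMC outside the function:   %.1f\n", cycles_per_call(0x42C07F8UL));
    printf("SMC before the first rdtsc: %.1f\n", cycles_per_call(0x1B818D2CUL));
    printf("SMC writes timed alone:     %.1f\n", cycles_per_call(0x8A87415UL));
}
```

Calling print_averages() puts the original at roughly 36 cycles per call and the in-function SSE write at roughly 355, so that variant is nearly ten times slower per call.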

#912
23rd August 2014, 02:05 PM
MarathonMan
Alpha Tester
Project Supporter
Senior Member
Join Date: Jan 2013
Posts: 454

Can you paste your code somewhere public?

AFAIK, there's no really pretty way to do SMC in C. A couple months ago, I wrote a library hoping to use dynamic codegen for CEN64, but couldn't figure out an effective way: https://github.com/tj90241/ecg

The other thing I wonder about is whether a static dispatch-based shuffling solution would be faster. You'd only need 15 functions and a LUT with 16 pointers.
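
A rough sketch of the static-dispatch idea, for comparison with the SMC approach. It assumes the usual RSP element encoding (e = 0 and 1 leave the vector alone, 2 to 7 shuffle within each 64-bit half, 8 to 15 broadcast a single element) and reuses the immediates from the compiler output posted later in the thread. Names such as shuffle_lut and shuffle_vector are made up and are not taken from HatCat's source.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* e = 2..7: shuffle both 64-bit halves with the same immediate. */
#define DEF_HALVES(name, imm)                         \
    static __m128i name(__m128i v)                    \
    {                                                 \
        v = _mm_shufflehi_epi16(v, imm);              \
        return _mm_shufflelo_epi16(v, imm);           \
    }

/* e = 8..11: broadcast one of the four low words. */
#define DEF_BCAST_LO(name, imm)                       \
    static __m128i name(__m128i v)                    \
    {                                                 \
        v = _mm_shufflelo_epi16(v, imm);              \
        return _mm_unpacklo_epi16(v, v);              \
    }

/* e = 12..15: broadcast one of the four high words. */
#define DEF_BCAST_HI(name, imm)                       \
    static __m128i name(__m128i v)                    \
    {                                                 \
        v = _mm_shufflehi_epi16(v, imm);              \
        return _mm_unpackhi_epi16(v, v);              \
    }

static __m128i sh_none(__m128i v) { return v; }  /* shared by e = 0 and e = 1 */
DEF_HALVES(sh_0q, 0xA0)
DEF_HALVES(sh_1q, 0xF5)
DEF_HALVES(sh_0h, 0x00)
DEF_HALVES(sh_1h, 0x55)
DEF_HALVES(sh_2h, 0xAA)
DEF_HALVES(sh_3h, 0xFF)
DEF_BCAST_LO(sh_0w, 0x00)
DEF_BCAST_LO(sh_1w, 0x55)
DEF_BCAST_LO(sh_2w, 0xAA)
DEF_BCAST_LO(sh_3w, 0xFF)
DEF_BCAST_HI(sh_4w, 0x00)
DEF_BCAST_HI(sh_5w, 0x55)
DEF_BCAST_HI(sh_6w, 0xAA)
DEF_BCAST_HI(sh_7w, 0xFF)

typedef __m128i (*shuffle_fn)(__m128i);

/* 15 distinct functions behind 16 pointers, indexed by e. */
static const shuffle_fn shuffle_lut[16] = {
    sh_none, sh_none, sh_0q, sh_1q, sh_0h, sh_1h, sh_2h, sh_3h,
    sh_0w,   sh_1w,   sh_2w, sh_3w, sh_4w, sh_5w, sh_6w, sh_7w
};

/* Shuffle eight 16-bit elements from src into dst according to e. */
static void shuffle_vector(short *dst, const short *src, int e)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, shuffle_lut[e & 15](v));
}
```

Every immediate stays a literal, so nothing needs patching at run time. The price is an indirect call per shuffle, and whether that beats a switch or SMC would have to be measured.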

#913
23rd August 2014, 06:00 PM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Quote:
Originally Posted by RPGMaster View Post
it's because I didn't pay attention to the fact that HatCat's SSE2 shuffle implementation uses different instructions, depending on the value of e.
So we need a LUT for not only the 8-bit shuffle immediate, but probably the shuffle opcode as well (for the 2 SSE2 shuffle instructions). So we have 4 memory writes. It's not really possible to use the same shuffle opcode for all 16 cases AFAIK.

Quote:
Originally Posted by RPGMaster View Post
Is there a convenient way to do SMC in C? So far, I'm stuck with writing code like *(char*)0x0A951923 = simm[e]; .
There's no ANSI-portable way; that much I can guarantee you.
But I'm sure there are better methods than that.

Rather than saying *(char *)0x0A951923 you should try to make the pointer an offset relative to the start of the no-inline shuffle function you are calling. This, of course, will still break with compilers that succeed in implicitly optimizing anything, inline the function, or re-order it with other things that go on in the function, or simply with non-Intel compilers that don't have the same function prologue. However, it could still be way, way more portable than what you gave, at least for Intel machines supporting SSE2.

Quote:
Originally Posted by RPGMaster View Post
Since the instructions were different, I needed to overwrite 12 bytes, so 0x42A0613A cycles was with the SSE write. When I did three 4-byte writes, it was about 1.4x slower than the SSE one.
Should only need to do 4 writes I think.
1) SSE2 64-bit high shuffle op-code
2) SSE2 64-bit low shuffle op-code
3) SSE2 64-bit high shuffle immediate 8 bits
4) SSE2 64-bit low shuffle immediate 8 bits

Maybe when I have time off the RDP to get off my ass and try it myself I'll show you.

Quote:
Originally Posted by RPGMaster View Post
So it doesn't seem like SMC will help for sse2 shuffling.
Better than using a massive switch statement / function pointer table for SSE2 shuffles within shuffle funcs.

#914
23rd August 2014, 10:38 PM
RPGMaster
Alpha Tester
Project Supporter
Super Moderator
Join Date: Dec 2013
Posts: 2,008

Quote:
Originally Posted by HatCat View Post
There's no ANSI-portable way; that much I can guarantee you.
But I'm sure there are better methods than that.

Rather than saying *(char *)0x0A951923 you should try to make the pointer an offset relative to the start of the no-inline shuffle function you are calling. This, of course, will still break compilers that succeed in implicitly optimizing anything, inline the function, re-order it with other things that go on in the function, or simply non-Intel compilers that don't have the same function prologue. However, it could still be way, way more portable than what you gave, at least for Intel machines supporting SSE2.
I figured there was no ANSI-portable way. Even in MSVC, I don't think there's any other way. But who cares about MSVC for this. I haven't tried doing fancy stuff with GCC yet. That's one of the reasons I used inline assembly. Also, part of the reason I did *(char *)0x0A951923 = was because I was being a noob and didn't pay attention to the fact that the instructions were different. So for this implementation, I don't have to worry about doing arithmetic with labels, since I have to change the entire instruction rather than just the last byte of it. So with Intel, GCC, etc., you could assign a pointer to a label and just do something like *pointer = simm[e]. Interesting! While writing this giant post, I figured out you can do arithmetic with labels. You just can't do it at initialization. I had to do
Code:
char *pointer = &&label;
pointer += 4;
So now I know that assembly is not needed at all, unless you want to change the instructions or order of instructions. What a relief to know I can set the address to any arbitrary number, using labels as a base address, without having to use assembly. Maybe GCC is even more flexible and allows you to do label arithmetic during initialization.
Quote:
Originally Posted by HatCat View Post
Should only need to do 4 writes I think.
1) SSE2 64-bit high shuffle op-code
2) SSE2 64-bit low shuffle op-code
3) SSE2 64-bit high shuffle immediate 8 bits
4) SSE2 64-bit low shuffle immediate 8 bits
Ya I ended up doing 3 writes or 1 SSE write. But that was after converting the function to assembly, so that I could arrange the instructions in a more convenient way for SMC.
Quote:
Originally Posted by HatCat View Post
Better than using a massive switch statement / function pointer table for SSE2 shuffles within shuffle funcs.
For size, sure. Performance is questionable though. It will take a good amount of creativity to find a way to improve this function via SMC, assuming it's even possible in this case.
Quote:
Originally Posted by MarathonMan View Post
Can you paste your code somewhere public?

AFAIK, there's no really pretty way to do SMC in C. A couple months ago, I wrote a library hoping to use dynamic codegen for CEN64, but couldn't figure out an effective way: https://github.com/tj90241/ecg

The other thing that I wonder if it would be faster is a static dispatch-based shuffling solution. You'd only need 15 functions and a LUT with 16 pointers.
Here's what I did. I downloaded HatCat's latest source, since he made a recent minor update. In rsp.c, I added
Code:
#include <Windows.h>
unsigned char doOnce;
DWORD oldProtect; /* VirtualProtect takes a PDWORD, i.e. the address of a DWORD */
I had to comment out this code in Rsp_#1.1.h after including windows.h
Code:
struct HWND__ {int unused;};
typedef struct HWND__ *HWND;
struct HINSTANCE__ {int unused;};
typedef struct HINSTANCE__ *HINSTANCE;
struct HMENU__ {int unused;};
typedef struct HMENU__ *HMENU;
struct HDC__ {int unused;};
typedef struct HDC__ *HDC;
Then I added doOnce = 0; into CloseDll() and
Code:
if (!doOnce)
{
	VirtualProtect(&VAND, 0x100, PAGE_EXECUTE_READWRITE, &oldProtect);
	doOnce = 1;
}
inside of Initiate RSP. I'm still new to SMC so bear with me. In fact, I've just realized a few ways to make my SMC more convenient. Yesterday, I used a constant address, instead of &VAND. I also didn't even think to check if there were intrinsics for rdtsc. I just happened to figure that out after searching to see if there's a better method. So in VAND I did this to measure the cycles used in the original code
Code:
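#include <intrin.h> /* __rdtsc */
#include <stdio.h>  /* sprintf */

static void VAND(int vd, int vs, int vt, int e)
{
	static unsigned int cycles = 0, counter = 0;
	static char output[32];
	unsigned int temp;
	short ST[N];

	temp = (unsigned int)(__rdtsc()); /* low 32 bits of the time-stamp counter */
	SHUFFLE_VECTOR(ST, VR[vt], e);
	do_and(VR[vd], VR[vs], ST);
	cycles += (unsigned int)(__rdtsc()) - temp; /* unsigned wrap-around keeps the delta valid */

	/* Report and reset after exactly 0x300000 calls. */
	if (++counter == 0x300000){
		sprintf(output, "Cycles = %X", cycles);
		MessageBoxA(0, output, "", 0);
		counter = cycles = 0;
	}
	return;
}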
You could go for higher precision and do 64 bits, but I felt that 32 bits was enough. Maybe rdtscp is better, I'll need to look into that. As for the SMC, I'm going to try and see how to improve my current method before posting it. I was using inline assembly which is a bad idea. What I did for SMC was look at the compiler output of the different cases. This is what the compiler generated
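
The 64-bit variant mentioned above could look something like this. It is only a sketch: the tsc_stats struct and the two helper names are invented for illustration, and it assumes an x86 compiler that provides the __rdtsc intrinsic.

```c
#if defined(_MSC_VER)
#include <intrin.h>     /* __rdtsc on MSVC */
#else
#include <x86intrin.h>  /* __rdtsc on GCC and Clang */
#endif

/* Running totals for one timed region (hypothetical type). */
typedef struct {
    unsigned long long cycles;  /* accumulated time-stamp counter ticks */
    unsigned long calls;        /* number of timed calls */
} tsc_stats;

/* Read the full 64-bit counter at the start of the region. */
static unsigned long long tsc_begin(void)
{
    return __rdtsc();
}

/* Add the elapsed ticks since `start` to the running totals. */
static void tsc_end(tsc_stats *stats, unsigned long long start)
{
    stats->cycles += __rdtsc() - start;
    stats->calls++;
}
```

Keeping the totals in a struct rather than in function-local statics lets the same two helpers time VAND and the SMC write separately.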
Code:
F3 0F 70 C8 E4       pshufhw     xmm1,xmm0,0E4h
F2 0F 70 C1 E4       pshuflw     xmm0,xmm1,0E4h

F3 0F 70 C8 E4       pshufhw     xmm1,xmm0,0E4h
F2 0F 70 C1 E4       pshuflw     xmm0,xmm1,0E4h

F3 0F 70 C8 A0       pshufhw     xmm1,xmm0,0A0h
F2 0F 70 C1 A0       pshuflw     xmm0,xmm1,0A0h

F3 0F 70 C8 F5       pshufhw     xmm1,xmm0,0F5h
F2 0F 70 C1 F5       pshuflw     xmm0,xmm1,0F5h

F3 0F 70 C8 00       pshufhw     xmm1,xmm0,0
F2 0F 70 C1 00       pshuflw     xmm0,xmm1,0

F3 0F 70 C8 55       pshufhw     xmm1,xmm0,55h
F2 0F 70 C1 55       pshuflw     xmm0,xmm1,55h

F3 0F 70 C8 AA       pshufhw     xmm1,xmm0,0AAh
F2 0F 70 C1 AA       pshuflw     xmm0,xmm1,0AAh

F3 0F 70 C8 FF       pshufhw     xmm1,xmm0,0FFh
F2 0F 70 C1 FF       pshuflw     xmm0,xmm1,0FFh

F2 0F 70 C0 00       pshuflw     xmm0,xmm0,0
66 0F 61 C0          punpcklwd   xmm0,xmm0

F2 0F 70 C0 55       pshuflw     xmm0,xmm0,55h
66 0F 61 C0          punpcklwd   xmm0,xmm0

F2 0F 70 C0 AA       pshuflw     xmm0,xmm0,0AAh
66 0F 61 C0          punpcklwd   xmm0,xmm0

F2 0F 70 C0 FF       pshuflw     xmm0,xmm0,0FFh
66 0F 61 C0          punpcklwd   xmm0,xmm0

F3 0F 70 C0 00       pshufhw     xmm0,xmm0,0
66 0F 69 C0          punpckhwd   xmm0,xmm0

F3 0F 70 C0 55       pshufhw     xmm0,xmm0,55h
66 0F 69 C0          punpckhwd   xmm0,xmm0

F3 0F 70 C0 AA       pshufhw     xmm0,xmm0,0AAh
66 0F 69 C0          punpckhwd   xmm0,xmm0

F3 0F 70 C0 FF       pshufhw     xmm0,xmm0,0FFh
66 0F 69 C0          punpckhwd   xmm0,xmm0
Since they aren't all the same size, I had to add a nop to the shorter ones in the LUT I had. Since the compiler may rearrange instructions, the LUT will have to be specific to your compiler and to the current implementation inside of the function. So after making the LUT, I just wrote the entry for the current e to the specified address.
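
As an illustration of the nop-padded LUT described above, here is a sketch built from the byte sequences in the listing. It assumes the 16 listed pairs correspond to e = 0..15 in order and pads the 9-byte pairs to 10 bytes with a one-byte nop (0x90). The names shuffle_code and patch_shuffle are made up, and the function only copies bytes into a caller-supplied buffer. Making real code writable and choosing the patch address are left out.

```c
#include <string.h>

#define PATCH_SIZE 10  /* longest pair in the listing: two 5-byte shuffles */
#define NOP 0x90       /* one-byte x86 nop used as padding */

/* pshufhw xmm1,xmm0,imm / pshuflw xmm0,xmm1,imm */
#define HALVES(imm)   {0xF3,0x0F,0x70,0xC8,(imm), 0xF2,0x0F,0x70,0xC1,(imm)}
/* pshuflw xmm0,xmm0,imm / punpcklwd xmm0,xmm0 / nop */
#define BCAST_LO(imm) {0xF2,0x0F,0x70,0xC0,(imm), 0x66,0x0F,0x61,0xC0, NOP}
/* pshufhw xmm0,xmm0,imm / punpckhwd xmm0,xmm0 / nop */
#define BCAST_HI(imm) {0xF3,0x0F,0x70,0xC0,(imm), 0x66,0x0F,0x69,0xC0, NOP}

/* One fixed-size encoding per value of e, in listing order (assumed). */
static const unsigned char shuffle_code[16][PATCH_SIZE] = {
    HALVES(0xE4),   HALVES(0xE4),   HALVES(0xA0),   HALVES(0xF5),
    HALVES(0x00),   HALVES(0x55),   HALVES(0xAA),   HALVES(0xFF),
    BCAST_LO(0x00), BCAST_LO(0x55), BCAST_LO(0xAA), BCAST_LO(0xFF),
    BCAST_HI(0x00), BCAST_HI(0x55), BCAST_HI(0xAA), BCAST_HI(0xFF)
};

/* Overwrite PATCH_SIZE bytes at `site` with the encoding for element e. */
static void patch_shuffle(unsigned char *site, int e)
{
    memcpy(site, shuffle_code[e & 15], PATCH_SIZE);
}
```

The table is tied to the exact registers the compiler picked (xmm0 and xmm1 here), which is the compiler-specific dependency the post warns about.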

I think SMC would have been a much smaller penalty if the instructions were the same and you only had to change the last byte. Also there seems to be less of a penalty when the SMC is not in the same function as the code which is being modified.

During this long post, I've already improved the SMC and profiling implementation a lot, compared to yesterday.

#915
23rd August 2014, 11:28 PM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Quote:
Originally Posted by RPGMaster View Post
So now I know that assembly is not needed at all, unless you want to change the instructions or order of instructions.
Or if an optimizing compiler wants to do it for you.

I wonder if doing a small FORCEINLINE function with nothing more than just the 2 shuffle intrinsic functions, and then addressing by pointer to the function would still target it correctly even though inline is forced.

Quote:
Originally Posted by RPGMaster View Post
What a relief to know I can set the address to any arbitrary number, using labels as a base address, without having to use assembly. Maybe GCC is even more flexible and allows you to do label arithmetic during initialization.
Good, now let's see you make up your mind about never using a compiler other than GCC again for the rest of your life. If that seems out of the question, you might wanna think again about relying on non-standard extensions that only GCC supports, like taking the address of a label. Only compilers like GCC can do that, not MSVC or Intel.

I would rather continue to support testing and building it with Microsoft Visual Studio for testing/better linking.

Quote:
Originally Posted by RPGMaster View Post
Here's what I did . I downloaded HatCat's latest source, since he made a recent minor update. In rsp.c, i added
Code:
#include <Windows.h>
unsigned char doOnce;
DWORD oldProtect; /* VirtualProtect takes a PDWORD, i.e. the address of a DWORD */
I had to comment out this code in Rsp_#1.1.h after including windows.h
Code:
struct HWND__ {int unused;};
typedef struct HWND__ *HWND;
struct HINSTANCE__ {int unused;};
typedef struct HINSTANCE__ *HINSTANCE;
struct HMENU__ {int unused;};
typedef struct HMENU__ *HMENU;
struct HDC__ {int unused;};
typedef struct HDC__ *HDC;
#include <windows.h>
comprises mostly of,
#include <Eminem.h>


So I don't include it. Albeit in an inexperienced, semi-hazardous way, I made zilmar's #Rsp 1.1.h header free of dependencies on <windows.h> since RSP plugin has nothing to do with operating system APIs, let alone Windows. So you don't really have to include windows.h to do what you wanted.

#916
24th August 2014, 12:01 AM
RPGMaster
Alpha Tester
Project Supporter
Super Moderator
Join Date: Dec 2013
Posts: 2,008

I'm pretty sure you need windows.h for using VirtualProtect, which is a function needed for allowing self-modifying code. Perhaps you could just call the exact address of VirtualProtect, using a pointer, but idk if the function address will be the same for everyone.

Intel and Clang also support arithmetic with labels and computed goto. For MSVC, I'm not sure what you can do for SMC.
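
For reference, a small sketch of the GNU C "labels as values" extension being discussed, used in its documented form as a computed goto. It needs GCC, Clang, or another compiler that implements the extension, and it will not build with MSVC. The function name shuffle_kind and its return codes are made up for illustration. Note that jumping through a label address is a different thing from writing through it: the compiler makes no promise about which instruction bytes sit at or after a label, so pointer arithmetic on &&label for SMC remains compiler-specific.

```c
/* GNU C extension: &&label yields a void * that `goto *ptr` can jump to. */

/* Classify element specifier e: 0 = whole vector (e = 0..1),
 * 1 = per-half shuffle (e = 2..7), 2 = single-element broadcast (e = 8..15). */
static int shuffle_kind(int e)
{
    /* Table of label addresses, indexed by category. */
    static void *const target[3] = { &&whole, &&halves, &&broadcast };
    int kind;

    e &= 15;
    kind = (e < 2) ? 0 : (e < 8) ? 1 : 2;
    goto *target[kind];  /* computed goto */

whole:
    return 0;
halves:
    return 1;
broadcast:
    return 2;
}
```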

I really do need to start using GCC more. I just need to stop procrastinating and learn its features. I also need to find a good debugger to use with it.

Have you actually tested your latest rsp with MSVC? I can't even compile in MSVC when I define ARCH_MIN_SSE2 in rsp.h.

I get an error message saying "error C2057: expected constant expression" and the line it's referring to is
Code:
static __m128i shuffle_0q(__m128i xmm)
{
    const int order = simm[0x2];

    xmm = _mm_shufflehi_epi16(xmm, order);//this line
    xmm = _mm_shufflelo_epi16(xmm, order);
    return (xmm);
}
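
The C2057 message is consistent with how these intrinsics work: the shuffle control of _mm_shufflehi_epi16 and _mm_shufflelo_epi16 is encoded into the instruction as an immediate byte, so it has to be a compile-time constant, and in C a const int loaded from an array is not a constant expression. A sketch of one workaround follows. It assumes simm[0x2] holds 0xA0, which is the immediate shown for the third case in the disassembly posted earlier in the thread, and the macro name SIMM_0Q is made up.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Assumed value of simm[0x2]: selects words 0,0,2,2 within each 64-bit half. */
#define SIMM_0Q 0xA0

static __m128i shuffle_0q(__m128i xmm)
{
    /* A literal (or macro expanding to one) is a valid immediate operand. */
    xmm = _mm_shufflehi_epi16(xmm, SIMM_0Q);
    xmm = _mm_shufflelo_epi16(xmm, SIMM_0Q);
    return (xmm);
}
```

An optimizing compiler can sometimes fold a constant table entry into the immediate, which may be why the same source builds elsewhere. A literal does not depend on that.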

#917
24th August 2014, 12:14 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Clearly I didn't need windows.h for anything, or I would have included it.
In fact, no one does.

The only problem with the way I rewrote the RSP spec header include was that it breaks compilation if someone DOES include <windows.h>, which, eh, not really any loss. I can amend for this later when revising the specs for 64-bit compatibility.

#918
24th August 2014, 12:48 AM
RPGMaster
Alpha Tester
Project Supporter
Super Moderator
Join Date: Dec 2013
Posts: 2,008

I meant to say, you will need to include windows.h for SMC, unless you decide to call the exact address for VirtualProtect, using a pointer. I know that your plugin doesn't need windows.h, when not using SMC.

So any idea why that intrinsic doesn't work with msvc?

Edit: So I did some reading on SMC and found an interesting post on MSDN.
"Note: When executing self-modifying code the use of FlushInstructionCache is required on CPU architectures that do not implement a transparent (self-snooping) I-cache. These include PPC, MIPS, Alpha, and Itanium. FlushInstructionCache is not necessary on x86 or x64 CPU architectures as these have a transparent cache. According to the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, a jump instruction is sufficient to serialize the instruction prefetch queue when executing self-modifying code."
That probably explains why having the SMC outside of the function had a smaller penalty.

Last edited by RPGMaster; 24th August 2014 at 01:01 AM.

#919
24th August 2014, 01:14 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

You will need to include <helloworld.h> in order to write a hello world program in C for my commercial operating system.

#920
24th August 2014, 01:19 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

As for the rest of your comments, I don't really know where to begin.

First, why would you expect it to work with MSVC? I said I purposely broke building on MSVC to stop improper builds on what was then an inferior compiler for this type of project.

You said something about minor updates I made recently...that would imply you've seen my Git repository already and are not just downloading off the release attached to the OP of this thread. If that's the case, you did not correctly install my latest source code. You are still using old source code. I don't know why there is even a simm[] array anywhere in those code pastes you gave anymore, because if you were correct in what you said about installing my latest source tree/Git commits then there is no way it should exist.

Second, what the hell is VirtualProtect? Do I need to call that function every single interpreter instruction? If so, screw that. I'll just make the shuffle function a no-inline procedure and modify the instructions in it before calling/entering the function.