#911
Well, I figured out why my SMC didn't work: I hadn't paid attention to the fact that HatCat's SSE2 shuffle implementation uses different instructions depending on the value of e. I was just assuming SMC is unstable.

Is there a convenient way to do SMC in C? So far, I'm stuck with writing code like *(char*)0x0A951923 = simm[e]; . It's way easier for me to do SMC in assembly.

After some trial and error, I've discovered that the performance penalty for SMC is too great. I put rdtsc at the beginning and end of the function, since the benchmark program I used didn't work for some reason. Maybe my SMC code is poorly written. I wrote the SMC in assembly, just because it was too much of a hassle for me to do in C.

On my machine, the original SSE2 version of VAND in HatCat's RSP took around 0x6C7497A cycles to run 0x300000 times. When I used SMC inside of the VAND function, it took around 0x42A0613A cycles. Since the instructions were different, I needed to overwrite 12 bytes, so the 0x42A0613A figure was with a single SSE write. When I did three 4-byte writes instead, it was about 1.4x slower than the SSE one. If I cheat and put the SMC outside of the function, I get 0x42C07F8 cycles. If I simply put the SMC behind the rdtsc, inside of VAND, just to see what the penalty is after doing the write, I get around 0x1B818D2C cycles. So putting the SMC too close to the code being modified has a larger penalty.

The last thing I tested was timing only the SMC itself, outside of the function. Doing three 4-byte writes took around 0x8A87415 cycles, and the SSE write took around the same for some odd reason. I guess measuring only a few instructions may not be too accurate. So it doesn't seem like SMC will help for SSE2 shuffling.
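For scale, the hex totals above can be turned into per-call figures. This is only a minimal sketch of the arithmetic, assuming each total is the accumulated rdtsc delta over 0x300000 calls as described in the post; the helper names `cycles_per_call` and `slowdown` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of VAND calls covered by each total quoted in the post. */
#define RUNS 0x300000u

/* Average cycles per call for a total accumulated over RUNS calls. */
double cycles_per_call(uint32_t total_cycles)
{
    return (double)total_cycles / (double)RUNS;
}

/* Ratio of a variant's total to the baseline total.
 * Above 1.0 means the variant is slower, below 1.0 means faster. */
double slowdown(uint32_t variant_total, uint32_t baseline_total)
{
    assert(baseline_total != 0);
    return (double)variant_total / (double)baseline_total;
}
```

With the numbers from the post, `cycles_per_call(0x6C7497A)` comes out at roughly 36 cycles for the original SSE2 VAND, and `cycles_per_call(0x42A0613A)` at roughly 355 with the SMC inside the function, so `slowdown(0x42A0613A, 0x6C7497A)` is about 9.8. The "SMC behind rdtsc" total is about 4.1 times the baseline, and the "SMC outside the function" total is about 0.62 times the baseline.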
#912
Quote:
AFAIK, there's no really pretty way to do SMC in C. A couple months ago, I wrote a library hoping to use dynamic codegen for CEN64, but couldn't figure out an effective way: https://github.com/tj90241/ecg

The other thing I wonder is whether a static dispatch-based shuffling solution would be faster. You'd only need 15 functions and a LUT with 16 pointers.
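A rough sketch of what such a static dispatch could look like, not anyone's actual implementation. It assumes the sixteen element patterns map onto the instruction pairs shown in the disassembly posted later in this thread (e = 0, 1 identity, then 0xA0, 0xF5, 0x00, 0x55, 0xAA, 0xFF on both halves, then eight single-word broadcasts), and that vector element i sits in 16-bit lane i of the register. All function and macro names here are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */

typedef __m128i (*shuffle_fn)(__m128i);

/* e = 0 and e = 1: nothing to shuffle, so both LUT slots share this one. */
static __m128i shuf_none(__m128i v) { return v; }

/* Same compile-time immediate applied to the high and low 64-bit halves. */
#define DEF_HALVES(name, imm)                   \
    static __m128i name(__m128i v)              \
    {                                           \
        v = _mm_shufflehi_epi16(v, imm);        \
        return _mm_shufflelo_epi16(v, imm);     \
    }

/* Broadcast one word of the low half to all eight lanes. */
#define DEF_BCAST_LO(name, imm)                 \
    static __m128i name(__m128i v)              \
    {                                           \
        v = _mm_shufflelo_epi16(v, imm);        \
        return _mm_unpacklo_epi16(v, v);        \
    }

/* Broadcast one word of the high half to all eight lanes. */
#define DEF_BCAST_HI(name, imm)                 \
    static __m128i name(__m128i v)              \
    {                                           \
        v = _mm_shufflehi_epi16(v, imm);        \
        return _mm_unpackhi_epi16(v, v);        \
    }

DEF_HALVES(shuf_0q, 0xA0)
DEF_HALVES(shuf_1q, 0xF5)
DEF_HALVES(shuf_0h, 0x00)
DEF_HALVES(shuf_1h, 0x55)
DEF_HALVES(shuf_2h, 0xAA)
DEF_HALVES(shuf_3h, 0xFF)
DEF_BCAST_LO(shuf_0w, 0x00)
DEF_BCAST_LO(shuf_1w, 0x55)
DEF_BCAST_LO(shuf_2w, 0xAA)
DEF_BCAST_LO(shuf_3w, 0xFF)
DEF_BCAST_HI(shuf_4w, 0x00)
DEF_BCAST_HI(shuf_5w, 0x55)
DEF_BCAST_HI(shuf_6w, 0xAA)
DEF_BCAST_HI(shuf_7w, 0xFF)

/* 15 distinct functions, 16 pointers: slots 0 and 1 are the same function. */
static const shuffle_fn shuffle_lut[16] = {
    shuf_none, shuf_none, shuf_0q, shuf_1q,
    shuf_0h,   shuf_1h,   shuf_2h, shuf_3h,
    shuf_0w,   shuf_1w,   shuf_2w, shuf_3w,
    shuf_4w,   shuf_5w,   shuf_6w, shuf_7w
};

/* One indirect call per shuffle instead of patching instruction bytes. */
__m128i shuffle_vector(__m128i v, int e)
{
    assert(e >= 0 && e < 16);
    return shuffle_lut[e](v);
}
```

Because every immediate is a literal inside its own function, the compiler can encode it directly, at the cost of an indirect call per shuffle. Whether that call is cheaper than the SMC write would have to be measured.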
#913
Quote:
Quote:
But I'm sure there are better methods than that. Rather than saying *(char *)0x0A951923 you should try to make the pointer an offset relative to the start of the no-inline shuffle function you are calling. This, of course, will still break compilers that succeed in implicitly optimizing anything, inline the function, re-order it with other things that go on in the function, or simply non-Intel compilers that don't have the same function prologue. However, it could still be way, way more portable than what you gave, at least for Intel machines supporting SSE2.
Quote:
1) SSE2 64-bit high shuffle op-code
2) SSE2 64-bit low shuffle op-code
3) SSE2 64-bit high shuffle immediate 8 bits
4) SSE2 64-bit low shuffle immediate 8 bits

Maybe when I have time off the RDP to get off my ass and try it myself I'll show you. Better than using a massive switch statement / function pointer table for SSE2 shuffles within shuffle funcs.
__________________
http://theoatmeal.com/comics/cat_vs_internet |
#914
Quote:
Code:
    char *pointer = &&label;
    pointer += 4;
Quote:
Quote:
Quote:
Code:
    #include <Windows.h>

    unsigned char doOnce;
    DWORD oldProtect; /* VirtualProtect writes the previous protection flags here */
Code:
    struct HWND__ {int unused;};
    typedef struct HWND__ *HWND;
    struct HINSTANCE__ {int unused;};
    typedef struct HINSTANCE__ *HINSTANCE;
    struct HMENU__ {int unused;};
    typedef struct HMENU__ *HMENU;
    struct HDC__ {int unused;};
    typedef struct HDC__ *HDC;
Code:
    if (!doOnce) {
        VirtualProtect(&VAND, 0x100, PAGE_EXECUTE_READWRITE, &oldProtect);
        doOnce = 1;
    }
Code:
    static void VAND(int vd, int vs, int vt, int e)
    {
        static unsigned int cycles = 0, counter = 0;
        static char output[32];
        unsigned int temp;
        short ST[N];

        temp = (int)(__rdtsc());
        SHUFFLE_VECTOR(ST, VR[vt], e);
        do_and(VR[vd], VR[vs], ST);
        cycles = (int)(__rdtsc()) - temp + cycles;
        if (counter++ == 0x300000) {
            sprintf(output, "Cycles = %X", cycles);
            MessageBoxA(0, output, "", 0);
            counter = cycles = 0;
        }
        return;
    }
Code:
    F3 0F 70 C8 E4    pshufhw   xmm1,xmm0,0E4h
    F2 0F 70 C1 E4    pshuflw   xmm0,xmm1,0E4h
    F3 0F 70 C8 E4    pshufhw   xmm1,xmm0,0E4h
    F2 0F 70 C1 E4    pshuflw   xmm0,xmm1,0E4h
    F3 0F 70 C8 A0    pshufhw   xmm1,xmm0,0A0h
    F2 0F 70 C1 A0    pshuflw   xmm0,xmm1,0A0h
    F3 0F 70 C8 F5    pshufhw   xmm1,xmm0,0F5h
    F2 0F 70 C1 F5    pshuflw   xmm0,xmm1,0F5h
    F3 0F 70 C8 00    pshufhw   xmm1,xmm0,0
    F2 0F 70 C1 00    pshuflw   xmm0,xmm1,0
    F3 0F 70 C8 55    pshufhw   xmm1,xmm0,55h
    F2 0F 70 C1 55    pshuflw   xmm0,xmm1,55h
    F3 0F 70 C8 AA    pshufhw   xmm1,xmm0,0AAh
    F2 0F 70 C1 AA    pshuflw   xmm0,xmm1,0AAh
    F3 0F 70 C8 FF    pshufhw   xmm1,xmm0,0FFh
    F2 0F 70 C1 FF    pshuflw   xmm0,xmm1,0FFh
    F2 0F 70 C0 00    pshuflw   xmm0,xmm0,0
    66 0F 61 C0       punpcklwd xmm0,xmm0
    F2 0F 70 C0 55    pshuflw   xmm0,xmm0,55h
    66 0F 61 C0       punpcklwd xmm0,xmm0
    F2 0F 70 C0 AA    pshuflw   xmm0,xmm0,0AAh
    66 0F 61 C0       punpcklwd xmm0,xmm0
    F2 0F 70 C0 FF    pshuflw   xmm0,xmm0,0FFh
    66 0F 61 C0       punpcklwd xmm0,xmm0
    F3 0F 70 C0 00    pshufhw   xmm0,xmm0,0
    66 0F 69 C0       punpckhwd xmm0,xmm0
    F3 0F 70 C0 55    pshufhw   xmm0,xmm0,55h
    66 0F 69 C0       punpckhwd xmm0,xmm0
    F3 0F 70 C0 AA    pshufhw   xmm0,xmm0,0AAh
    66 0F 69 C0       punpckhwd xmm0,xmm0
    F3 0F 70 C0 FF    pshufhw   xmm0,xmm0,0FFh
    66 0F 69 C0       punpckhwd xmm0,xmm0

I think SMC would have had a much smaller penalty if the instructions were the same and you only had to change the last byte. Also, there seems to be less of a penalty when the SMC is not in the same function as the code being modified. During this long post, I've already improved the SMC and profiling implementation a lot, compared to yesterday.
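In the dump, the fifth byte of each pshufhw/pshuflw encoding is the 8-bit immediate, which is the byte an SMC patch would have to rewrite. As a small sketch of how that byte is read (helper names are hypothetical): destination word i of the affected 64-bit half takes the source word selected by bits 2i+1..2i of the immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a pshuflw/pshufhw immediate into four 2-bit word selectors.
 * sel[i] is the source word (0..3, within the same 64-bit half) that
 * ends up in destination word i. */
void decode_shuffle_imm(uint8_t imm, int sel[4])
{
    int i;

    for (i = 0; i < 4; i++)
        sel[i] = (imm >> (2 * i)) & 3;
}

/* Pack four selectors (each 0..3) back into an immediate byte. */
uint8_t encode_shuffle_imm(const int sel[4])
{
    uint8_t imm = 0;
    int i;

    for (i = 0; i < 4; i++) {
        assert(sel[i] >= 0 && sel[i] <= 3);
        imm |= (uint8_t)(sel[i] << (2 * i));
    }
    return imm;
}
```

Decoding 0xE4 gives selectors 0, 1, 2, 3, which is the identity and explains why the first two pairs in the dump leave the vector unchanged. 0xA0 gives 0, 0, 2, 2 and 0xF5 gives 1, 1, 3, 3.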
#915
Quote:
I wonder if doing a small FORCEINLINE function with nothing more than just the 2 shuffle intrinsic functions, and then addressing by pointer to the function would still target it correctly even though inline is forced.
Quote:
I would rather continue to support testing and building it with Microsoft Visual Studio for testing/better linking.
Quote:
comprises mostly of, #include <Eminem.h> So I don't include it. Albeit in an inexperienced, semi-hazardous way, I made zilmar's #Rsp 1.1.h header free of dependencies on <windows.h> since RSP plugin has nothing to do with operating system APIs, let alone Windows. So you don't really have to include windows.h to do what you wanted.
#916
I'm pretty sure you need windows.h for using VirtualProtect, which is a function needed for allowing self modifying code. Perhaps you could just call the exact address of VirtualProtect, using a pointer, but idk if the function address will be the same for everyone
Intel and Clang also support arithmetic with labels and computed goto. For MSVC, I'm not sure what you can do for SMC. I really do need to start using GCC more. I just need to stop procrastinating and learn its features. I also need to find a good debugger to use with it.

Have you actually tested your latest rsp with MSVC? I can't even compile in MSVC when I define ARCH_MIN_SSE2 in rsp.h. I get an error message saying "error C2057: expected constant expression" and the line it's referring to is
Code:
    static __m128i shuffle_0q(__m128i xmm)
    {
        const int order = simm[0x2];

        xmm = _mm_shufflehi_epi16(xmm, order); //this line
        xmm = _mm_shufflelo_epi16(xmm, order);
        return (xmm);
    }
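One plausible reading of C2057 here, offered as a guess rather than a diagnosis of the actual source tree: the second argument of `_mm_shufflehi_epi16` is encoded as an immediate byte inside the instruction, so the compiler wants an integer constant expression, and in C a `const int` initialized from an array element is not one. A minimal sketch of a variant that sidesteps this, assuming `simm[0x2]` holds 0xA0 (the value that appears for this pattern in the disassembly above); `ORDER_0Q` and `shuffle_0q_const` are made-up names.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Assumed value of simm[0x2]; a macro expands to a literal, which is a
 * genuine integer constant expression in C. */
#define ORDER_0Q 0xA0

__m128i shuffle_0q_const(__m128i xmm)
{
    xmm = _mm_shufflehi_epi16(xmm, ORDER_0Q); /* words 4..7 become 4,4,6,6 */
    xmm = _mm_shufflelo_epi16(xmm, ORDER_0Q); /* words 0..3 become 0,0,2,2 */
    return xmm;
}
```

The trade-off is that the immediate can no longer come from a runtime table, which is exactly why the thread is weighing SMC against one function per pattern.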
#917
Clearly I didn't need windows.h for anything, or I would have included it.
In fact, no one does. The only problem with the way I rewrote the RSP spec header include was that it breaks compilation if someone DOES include <windows.h>, which, eh, not really any loss. I can amend for this later when revising the specs for 64-bit compatibility.
#918
I meant to say, you will need to include windows.h for SMC, unless you decide to call the exact address for VirtualProtect, using a pointer. I know that your plugin doesn't need windows.h, when not using SMC.
So any idea why that intrinsic doesn't work with MSVC?

Edit: So I did some reading on SMC and found an interesting post on MSDN. "Note: When executing self-modifying code the use of FlushInstructionCache is required on CPU architectures that do not implement a transparent (self-snooping) I-cache. These include PPC, MIPS, Alpha, and Itanium. FlushInstructionCache is not necessary on x86 or x64 CPU architectures as these have a transparent cache. According to the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, a jump instruction is sufficient to serialize the instruction prefetch queue when executing self-modifying code." That probably explains why having the SMC outside of the function had a smaller penalty.

Last edited by RPGMaster; 24th August 2014 at 01:01 AM.
#919
You will need to include <helloworld.h> in order to write a hello world program in C for my commercial operating system.
#920
As for the rest of your comments, I don't really know where to begin.
First, why would you expect it to work with MSVC? I said I purposely broke building on MSVC to stop improper builds on what was then an inferior compiler for this type of project. You said something about minor updates I made recently... that would imply you've seen my Git repository already and are not just downloading off the release attached to the OP of this thread. If that's the case, you did not correctly install my latest source code. You are still using old source code. I don't know why there is even a simm[] array anywhere in those code pastes you gave anymore, because if you were correct in what you said about installing my latest source tree/Git commits, then there is no way it should exist.

Second, what the hell is VirtualProtect? Do I need to call that function every single interpreter instruction? If so, screw that. I'll just make the shuffle function a no-inline procedure and modify the instructions in it before calling/entering the function.