#761
Quote:
Make sure you pass -msse4 instead of -msse2 to the GNU compiler, and it will auto-vectorize to SSE4 instead of SSE2.
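In other words, something along these lines; the flags are standard GCC switches, and the function is just a made-up example of a loop the vectorizer can pick up: Code:
/* gcc -O3 -msse4 -c vec_add.c     (the -O3 level enables the auto-vectorizer;  */
/* gcc -O3 -msse2 -c vec_add.c      -msse4 vs. -msse2 only changes which SIMD   */
/*                                   instructions it is allowed to emit)        */
void vec_add(short *a, const short *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] += b[i]; /* simple elementwise loop GCC can vectorize with paddw */
}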
__________________
http://theoatmeal.com/comics/cat_vs_internet
#762
I've been testing MSVC's SSE code generation (not for this plugin), so I thought maybe I should try the benchmark feature you made for the RSP plugin, to see the difference between compilers. Maybe I'll try learning intrinsics instead of casting variables, lol.
#763
My RSP benchmark module might not be totally reliable. It uses simple wall-clock time to measure execution time, and the times seem to change significantly just from changing other parts of the RSP plugin unrelated to those functions (link-time code alignment possibly affecting the results?). Still, it's a cool feature and works well enough for me to rely on when judging whether a change I just made to a particular opcode is for better or worse.
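The timing loop itself is nothing fancy; roughly this shape (a minimal sketch with a made-up fake_op standing in for a real vector opcode, not the actual module source): Code:
#include <stdio.h>
#include <time.h>

/* dummy stand-in for one vector opcode; the real implementations live in the plugin */
static void fake_op(void) { }

#define TRIALS 100000000L /* arbitrary iteration count, just for illustration */

int main(void)
{
    long i;
    clock_t start = clock(); /* coarse wall-clock-style stamp */

    for (i = 0; i < TRIALS; i++)
        fake_op();
    printf("fake_op: %.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);
    /* Note: an optimizing compiler may delete a loop like this outright, which is
     * part of why the real module calls the opcodes through a function pointer table. */
    return 0;
}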
Currently my RSP benchmark results amount to this on my 1.90 GHz AMD K8: Code:
RSP Vector Benchmarks Log

VMULF : 0.625 s    VMACF : 0.898 s
VMULU : 0.633 s    VMACU : 0.966 s
VMUDL : 0.582 s    VMADL : 1.099 s
VMUDM : 0.657 s    VMADM : 1.114 s
VMUDN : 0.559 s    VMADN : 1.025 s
VMUDH : 0.446 s    VMADH : 0.580 s
VADD  : 0.311 s    VSUB  : 0.393 s
VABS  : 0.474 s    VADDC : 0.366 s
VSUBC : 0.398 s    VSAW  : 0.151 s
VEQ   : 0.618 s    VNE   : 0.519 s
VLT   : 0.608 s    VGE   : 0.528 s
VCH   : 1.133 s    VCL   : 1.286 s
VCR   : 0.788 s    VMRG  : 0.240 s
VAND  : 0.232 s    VNAND : 0.242 s
VOR   : 0.232 s    VNOR  : 0.241 s
VXOR  : 0.257 s    VNXOR : 0.260 s
VRCPL : 0.679 s    VRSQL : 0.708 s
VRCPH : 0.225 s    VRSQH : 0.234 s
VMOV  : 0.226 s    VNOP  : 0.071 s

Total time spent: 20.604 s
__________________
http://theoatmeal.com/comics/cat_vs_internet
#764
Lol, that's strange about the benchmark. I started writing benchmark code in assembly because of the compiler messing with the results.
It's definitely a cool feature. I'll try benchmarking on my computer soon.
#765
That's actually not too shameful of an idea. Maybe I should rewrite my benchmark code in assembly... at least then I could get rid of the function pointer table I had to make to keep compilers from inlining the entire functions all over again.
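For context, the table trick I'm talking about is basically this shape (a rough sketch with made-up opcode names, not the actual plugin source): Code:
/* made-up stand-ins for the real opcode implementations */
static void op_vmulf(void) { /* ... */ }
static void op_vmacf(void) { /* ... */ }

/* Calling through this table of pointers makes the compiler keep each
 * opcode as a real out-of-line function instead of inlining its whole
 * body into the benchmark loop. */
static void (*const op_table[])(void) = {
    op_vmulf,
    op_vmacf,
    /* ...one entry per vector opcode... */
};

void bench_one(unsigned index, long trials)
{
    void (*op)(void) = op_table[index % 2]; /* keep the index in bounds for this sketch */
    long i;

    for (i = 0; i < trials; i++)
        op(); /* indirect call the optimizer is far less likely to flatten */
}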
Thing is though, with assembly languages, portability gets even more complicated. Have a look at this assembly language source file I wrote a couple weeks ago; it's written to assemble under both the GNU assembler (GAS, which by default uses the annoying AT&T syntax) and Microsoft Macro Assembler: Code:
;/*
; * written for compatibility with the GNU assembler (GAS), which uses AT&T
; * assembly language syntax and non-semicolon (;) comment notations
; */
;.intel_syntax noprefix
;/*
.code
;*/
.text

;.include "opsd.inc" /* GNU assembler syntax
INCLUDE opsd.inc; */ # Microsoft MASM syntax

;/* invalid PROC; */invalid:
    RET;/* invalid ENDP; */

;/* fs_add PROC; */fs_add:
    ADDSD xmm0, xmm1
    RET;/* fs_add ENDP; */

;/* fs_sub PROC; */fs_sub:
    SUBSD xmm0, xmm1
    RET;/* fs_sub ENDP; */

;/* fs_mul PROC; */fs_mul:
    MULSD xmm0, xmm1
    RET;/* fs_mul ENDP; */

;/* fs_div PROC; */fs_div:
    DIVSD xmm0, xmm1
    RET;/* fs_div ENDP; */

;/* fs_sqrt PROC; */fs_sqrt:
;# Since the SIMD square root operation wastes one of the two operands,
;# the extraneous operand is put to use as a multiplier coefficient.
    SQRTSD xmm1, xmm1;# f(x) = sqrt(x)
    MULSD xmm0, xmm1;# g(x, y) = y * f(x)
    RET;/* fs_sqrt ENDP; */

;/* fs_min PROC; */fs_min:
    MINSD xmm0, xmm1
    RET;/* fs_min ENDP; */

;/* fs_max PROC; */fs_max:
    MAXSD xmm0, xmm1
    RET;/* fs_max ENDP; */

;/* fs_and PROC; */fs_and:
    ANDPD xmm0, xmm1
    RET;/* fs_and ENDP; */

;/* fs_or PROC; */fs_or:
    ORPD xmm0, xmm1
    RET;/* fs_or ENDP; */

;/* fs_xor PROC; */fs_xor:
    XORPD xmm0, xmm1
    RET;/* fs_xor ENDP; */

;/* fs_andn PROC; */fs_andn:
    ANDNPD xmm0, xmm1
    RET;/* fs_andn ENDP; */

;/* fs_eq PROC; */fs_eq:
    CMPSD xmm0, xmm1, 0;# EQ
    RET;/* fs_eq ENDP; */

;/* fs_lt PROC; */fs_lt:
    CMPSD xmm0, xmm1, 1;# LT
    RET;/* fs_lt ENDP; */

;/* fs_gt PROC; */fs_gt:
    CMPLTSD xmm1, xmm0;# pseudo-operation meaning: CMPSD xmm1, xmm0, 1
    MOVSD xmm0, xmm1;# Return slot is always xmm0.
    RET;/* fs_gt ENDP; */

;/* fs_unord PROC; */fs_unord:
    CMPSD xmm0, xmm1, 3;# UNORD
    RET;/* fs_unord ENDP; */

;.macro END
;.end # using GNU assembler's `.macro' directive to directly map to MASM's `END'
;.endm
END
__________________
http://theoatmeal.com/comics/cat_vs_internet
#766
I've been using macros a lot more lately. Isn't there a compiler option for not inlining specific functions in GCC? I know with MSVC, there's __declspec(noinline). It's pretty neat to take advantage of a compiler's features. Too bad not every compiler supports the same features. Like, idk the replacement for __assume().
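For what it's worth, the GCC spellings I'd reach for are __attribute__((noinline)) in place of __declspec(noinline), and a guarded __builtin_unreachable() in place of __assume(). A rough portability-macro sketch (the macro names themselves are just made up): Code:
#if defined(_MSC_VER)
#define NOINLINE     __declspec(noinline)
#define ASSUME(cond) __assume(cond)
#elif defined(__GNUC__)
#define NOINLINE     __attribute__((noinline))
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#define NOINLINE
#define ASSUME(cond) ((void)0)
#endif

/* never inlined on either compiler; the ASSUME lets the optimizer drop range checks */
NOINLINE int slow_path(int x)
{
    ASSUME(x >= 0);
    return x * 2;
}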
My favorite ASM syntax is MASM. Just curious, why did you make functions for 1 instruction? Was this a test or something?
#767
Quote:
I always wondered if Intel ever wrote their own official assembler, because I was worried that MASM was too over-featured and that if I wrote asm code for it, it would be hard to go back and make the same code work with other assemblers. I even looked around for some Intel blogs/articles on x86 assembly language, but surprisingly some of them used the AT&T syntax like GCC does. Haha... guess they didn't have a full syntax set in stone for full x86 source files.

Quote:
What I gave you is the assembly portion of my program; the rest is in C. GCC compiled the C code to recognize that the extern function pointer table did not require a CALL opcode... it gets indexed using a computed JMP, as with a switch statement.

As a whole, my program is a simple calculator:

main.exe 63.3 / 51.4

The '/' char is read in as the index to my function pointer table and jumps to the SSE2 divide function, fs_div, as I defined in the asm source. The operands 63.3 and 51.4 are double-precision 64-bit floating-point values, loaded in as SSE2 __m128d doubles (the lower halves of the otherwise unmodified XMM registers). In the end it prints out the result: xmm0 <operation> xmm1.

I implemented all of the operations that were directly mappable to the SSE2 instruction set (interestingly enough, 16 in total, divisible evenly into 4 noticeable patterns of groups of ops).
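If it helps to picture the C side, it's roughly this shape; this is a sketch from memory, so treat everything except the fs_* names (which come from the asm file) as made up: Code:
#include <stdio.h>
#include <stdlib.h>

/* SSE2 routines from the .asm file; with the x64 convention the two doubles
 * arrive in xmm0/xmm1 and the result comes back in xmm0 */
extern double fs_add(double x, double y);
extern double fs_sub(double x, double y);
extern double fs_mul(double x, double y);
extern double fs_div(double x, double y);
extern double invalid(double x, double y);

/* function pointer table indexed by the operator character (C99 designated
 * initializers); anything not listed stays NULL and falls back to `invalid` */
static double (*const op_table[256])(double, double) = {
    ['+'] = fs_add,
    ['-'] = fs_sub,
    ['*'] = fs_mul,
    ['/'] = fs_div,
};

/* usage:  main.exe 63.3 / 51.4 */
int main(int argc, char *argv[])
{
    double (*op)(double, double);
    double x, y;

    if (argc < 4)
        return 1;
    x = atof(argv[1]);
    y = atof(argv[3]);
    op = op_table[(unsigned char)argv[2][0]];
    if (op == NULL)
        op = invalid;
    printf("%g %s %g = %g\n", x, argv[2], y, op(x, y));
    return 0;
}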
__________________
http://theoatmeal.com/comics/cat_vs_internet
#768
Wow, after messing around and testing MSVC's automatic SSE generation, I'm quite disappointed. In fact, I won't even bother compiling your RSP plugin with it. I tried seeing what happens when moving an array of floats to another array. It was using movss... Looks like I'll have to do things explicitly. Maybe PGO screws up optimization after making a lot of changes ._.
One thing I'd like to try doing is completely removing FPU instructions from the emulator. Are there any important features that are only available with the FPU?

To some extent, I agree with you about compiler features. If I were writing something open source, I wouldn't want to force people to use MSVC just to compile my program. Maybe I should though, xD. I've honestly never linked asm files with C before. I wonder how it works. Lol, for some reason, even though I love assembly, I just can't go back to using an assembler ;/ . I still do inline assembly xD. I don't think I'd go the extra mile to be fully ANSI compliant though. #if's are sufficient for me when possible.
#769
Quote:
So, instead, I found a way to build assembly source from within a Visual Studio project. I'd tell you if I knew what the hell I did though.
__________________
http://theoatmeal.com/comics/cat_vs_internet
#770
I do remember trying assembly through MSVC, since it had some MASM thing built in. I didn't know you could use that to mix with C code though. I bet it would be better to just use MASM for the asm file, since it supports nice macros like declaring bytes in the code section. Although it's not as useful as it would be in a C compiler, since my main reason for doing that would be to keep the compiler from interfering with my inline assembly. I'm pretty sure there are other uses for declaring bytes in the code section. Oh yeah, maybe for optimized storage of variables xD, placing them at the end of functions instead of padding with int3. It's also nice for aligning loops if the assembler doesn't support those obscure multi-byte NOPs. Lol, you know... I never found a convenient way to do stuff like *(int*)0x404040 = 24 in assembly.
What I don't get is why the compiler did better on a different source file of the same project. Maybe I should turn off PGO since I made a ton of changes. I need to figure out how to do stuff like MOVAPS in intrinsics or something. I'd prefer doing things like *(__m128*)(var1) = *(__m128*)(var2); though, but it sometimes uses MOVUPS, which is annoying because I made sure it's aligned. How do I do byteswap with intrinsics? Then I could get rid of some inline assembly in PJ64 that's messing up the functions it's in. I noticed a compiler warning and got rid of one of the inline ASM pieces of code that had already commented out the bswap. Dunno why he left it there lol. This just makes it seem more like Zilmar was super busy back then, and probably still is today lol.

I want to get rid of FPU stuff in the recompiler for sure, since it converts MIPS floating point to x86 FPU code. I bet I could also speed up the interpreter core using manual SSE instead of auto-generated code lol. For the neg instruction in the interpreter I explicitly xor'd the sign bit.

Idk if I'll even bother with PJ64's RSP, unless I completely understand how recompilers work and am able to improve it. It would be interesting to see how much faster your graphics plugin would run with a recompiler RSP. If I never get a full understanding of how recompilers work, I'd rather look at your RSP plugin, since it's more accurate and the interpreter is waaay faster xD. I wonder how fast 2.1's RSP interpreter is though. I should test that.

Edit: I decided to start abusing compiler-specific features. Since PJ64 is pretty much stuck with MSVC, I might as well optimize it for MSVC. Also, for 1964 0.85, I'm getting a constant 61.0 VI/s. Is there a way to fix it and set it to 60?
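For reference, the intrinsic spellings that cover both of those questions look roughly like this; the buffers and function names are made up for illustration, but the intrinsics themselves are the standard ones (_mm_load_ps/_mm_store_ps for the aligned MOVAPS form, _byteswap_ulong on MSVC, __builtin_bswap32 on GCC): Code:
#include <xmmintrin.h>          /* _mm_load_ps / _mm_store_ps (SSE) */

#ifdef _MSC_VER
#include <stdlib.h>             /* _byteswap_ulong */
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* made-up 16-byte-aligned buffers, just for illustration */
static ALIGN16 float src[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
static ALIGN16 float dst[4];

void copy_aligned(void)
{
    /* the aligned load/store pair maps to MOVAPS; it's the unaligned
     * _mm_loadu_ps / _mm_storeu_ps pair that turns into MOVUPS */
    _mm_store_ps(dst, _mm_load_ps(src));
}

unsigned int bswap32(unsigned int x)
{
#ifdef _MSC_VER
    return _byteswap_ulong(x);   /* MSVC intrinsic, compiles to BSWAP */
#else
    return __builtin_bswap32(x); /* GCC/Clang equivalent */
#endif
}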