  #761  
Old 18th May 2014, 03:25 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by RPGMaster View Post
Would SSE4 make a significant improvement? My laptop supports SSE4 so I'd like to use it if possible.
https://dl.dropboxusercontent.com/u/...anual.html#4.2

Make sure you pass -msse4 instead of -msse2 to the GNU compiler, and it will auto-vectorize to SSE4 instead of SSE2.
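For anyone who wants to see the effect of the flag, here is a minimal sketch of a loop that GCC's auto-vectorizer can pick up. The function name `clamp_lanes` and the file name in the comment are made up for illustration and are not from the RSP plugin. With `-msse4` the compiler is allowed to (but does not have to) use SSE4.1 instructions such as PMINSD/PMAXSD for the clamp; comparing the `-S` output of both builds is the easiest way to check what it actually chose.

```c
/*
 * Hypothetical test file, e.g. clamp.c. Assumed build lines:
 *   gcc -O3 -msse2 -S clamp.c
 *   gcc -O3 -msse4 -S clamp.c
 * GCC releases from around the time of this thread only attempt
 * vectorization at -O3 or with -ftree-vectorize.
 */
#include <stdint.h>

/* Clamp each 32-bit lane to the signed 16-bit range. */
void clamp_lanes(int32_t *restrict dst, const int32_t *restrict src, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        int32_t v = src[i];

        if (v > 32767)
            v = 32767;
        if (v < -32768)
            v = -32768;
        dst[i] = v;
    }
}
```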
  #762  
Old 21st May 2014, 02:34 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

Since I've been testing MSVC's SSE code generation (not for this plugin), I thought maybe I should try using the benchmark thing you made lol for the RSP plugin, to see the difference between compilers. Maybe I'll try learning intrinsics instead of casting variables lol.
  #763  
Old 21st May 2014, 03:09 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

My RSP benchmark module might not be totally reliable. It uses simple wall clock time to compute execution time spent, and the times seem to change significantly just by changing certain other things about the RSP plugin not related to those functions (link-time code alignment possibly affecting the benchmark results?). Still, it's a cool feature and works well enough for me to rely on it for studying whether a change I just made to a particular opcode is for better or worse.

Currently my RSP benchmark results amount to this on my 1.90 GHz AMD K8:
Code:
RSP Vector Benchmarks Log

VMULF  :  0.625 s
VMACF  :  0.898 s
VMULU  :  0.633 s
VMACU  :  0.966 s
VMUDL  :  0.582 s
VMADL  :  1.099 s
VMUDM  :  0.657 s
VMADM  :  1.114 s
VMUDN  :  0.559 s
VMADN  :  1.025 s
VMUDH  :  0.446 s
VMADH  :  0.580 s
VADD   :  0.311 s
VSUB   :  0.393 s
VABS   :  0.474 s
VADDC  :  0.366 s
VSUBC  :  0.398 s
VSAW   :  0.151 s
VEQ    :  0.618 s
VNE    :  0.519 s
VLT    :  0.608 s
VGE    :  0.528 s
VCH    :  1.133 s
VCL    :  1.286 s
VCR    :  0.788 s
VMRG   :  0.240 s
VAND   :  0.232 s
VNAND  :  0.242 s
VOR    :  0.232 s
VNOR   :  0.241 s
VXOR   :  0.257 s
VNXOR  :  0.260 s
VRCPL  :  0.679 s
VRSQL  :  0.708 s
VRCPH  :  0.225 s
VRSQH  :  0.234 s
VMOV   :  0.226 s
VNOP   :  0.071 s
Total time spent:  20.604 s
  #764  
Old 21st May 2014, 03:19 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

Lol that's strange about the benchmark. I started writing benchmark code in assembly because of the compiler messing with the results. Maybe I'll try converting your benchmark code to assembly.

It's definitely a cool feature. I'll try benchmarking on my computer soon.
  #765  
Old 21st May 2014, 03:30 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

That's actually not too shameful of an idea. Maybe I should rewrite my benchmark code in an assembly language...at least then I could get rid of having to make a function pointer table to prevent compilers from inlining the entire functions all over again.
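To illustrate the function pointer table trick mentioned above, here is a rough sketch, not the actual benchmark module. The names `op_add`, `op_xor`, `bench_table`, `time_op` and `sink` are all hypothetical, and the `volatile` pointer is an extra precaution added here rather than something the plugin is known to do. `clock()` stands in for the timer: it reports processor time on most platforms and elapsed time on Microsoft's C runtime.

```c
#include <time.h>

typedef void (*op_func)(void);

/* Written to by the dummy ops so their work cannot be optimized away. */
volatile unsigned long sink;

static void op_add(void) { sink += 1; }
static void op_xor(void) { sink ^= 0x5555UL; }

/* Calling through a table keeps the compiler from inlining the ops
 * straight into the timing loop. */
static op_func const bench_table[] = { op_add, op_xor };

/* Returns the seconds clock() measured for `count` calls of one op. */
double time_op(unsigned index, unsigned long count)
{
    /* volatile: the pointer is reloaded on every call, so the call
     * stays indirect even if the index is a known constant. */
    op_func volatile f = bench_table[index];
    clock_t start, stop;
    unsigned long i;

    start = clock();
    for (i = 0; i < count; i++)
        f();
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Something like `time_op(0, 10000000UL)` would then time the first entry; repeating each measurement a few times helps show how noisy the numbers are.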

Thing is though, with assembly languages, portability gets even more complicated. Have a look at this assembly language source file I wrote a couple weeks ago; it's written to assemble under both the GNU assembler (GAS, which by default uses the annoying AT&T syntax) and the Microsoft Macro Assembler:

Code:
;/*
; * written for compatibility with the GNU assembler (GAS), which uses AT&T
; * assembly language syntax and non-semicolon (;) comment notations
; */

;.intel_syntax noprefix

;/*
.code
;*/ .text

;.include "opsd.inc" /* GNU assembler syntax
INCLUDE opsd.inc; */ # Microsoft MASM syntax

;/*
invalid	PROC; */invalid:
    RET;/*
invalid	ENDP; */
;/*
fs_add PROC; */fs_add:
    ADDSD   xmm0, xmm1
    RET;/*
fs_add ENDP; */
;/*
fs_sub PROC; */fs_sub:
    SUBSD   xmm0, xmm1
    RET;/*
fs_sub ENDP; */
;/*
fs_mul PROC; */fs_mul:
    MULSD   xmm0, xmm1
    RET;/*
fs_mul ENDP; */
;/*
fs_div PROC; */fs_div:
    DIVSD   xmm0, xmm1
    RET;/*
fs_div ENDP; */
;/*
fs_sqrt PROC; */fs_sqrt:
;# Since the SIMD square root operation wastes one of the two operands,
;# the extraneous operand is put to use as a multiplier coefficient.
    SQRTSD  xmm1, xmm1;# f(x) = sqrt(x)
    MULSD   xmm0, xmm1;# g(x, y) = y * f(x)
    RET;/*
fs_sqrt ENDP; */
;/*
fs_min PROC; */fs_min:
    MINSD   xmm0, xmm1
    RET;/*
fs_min ENDP; */
;/*
fs_max PROC; */fs_max:
    MAXSD   xmm0, xmm1
    RET;/*
fs_max ENDP; */
;/*
fs_and PROC; */fs_and:
    ANDPD   xmm0, xmm1
    RET;/*
fs_and ENDP; */
;/*
fs_or PROC; */fs_or:
    ORPD    xmm0, xmm1
    RET;/*
fs_or ENDP; */
;/*
fs_xor PROC; */fs_xor:
    XORPD   xmm0, xmm1
    RET;/*
fs_xor ENDP; */
;/*
fs_andn PROC; */fs_andn:
    ANDNPD  xmm0, xmm1
    RET;/*
fs_andn ENDP; */
;/*
fs_eq PROC; */fs_eq:
    CMPSD   xmm0, xmm1, 0;# EQ
    RET;/*
fs_eq ENDP; */
;/*
fs_lt PROC; */fs_lt:
    CMPSD   xmm0, xmm1, 1;# LT
    RET;/*
fs_lt ENDP; */
;/*
fs_gt PROC; */fs_gt:
    CMPLTSD xmm1, xmm0;# pseudo-operation meaning:  CMPSD xmm1, xmm0, 1
    MOVSD   xmm0, xmm1;# Return slot is always xmm0.
    RET;/*
fs_gt ENDP; */
;/*
fs_unord PROC; */fs_unord:
    CMPSD   xmm0, xmm1, 3;# UNORD
    RET;/*
fs_unord ENDP; */

;.macro END
;.end # using GNU assembler's `.macro' directive to directly map to MASM's `END'
;.endm

END
Believe it or not, this .asm file will assemble safely on both the GNU assembler and MASM (not the Netwide Assembler though, because NASM does too many rebellious things with Intel's original syntax, and I don't care about supporting NASM since the GNU toolchain is at least as widely ported).
  #766  
Old 21st May 2014, 03:57 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

I've been using macros a lot more lately. Isn't there a compiler option for not inlining specific functions in GCC? I know with MSVC, there's __declspec(noinline). It's pretty neat to take advantage of a compiler's features. Too bad not every compiler supports the same features. Like idk the replacement for __assume(). It would be nice to be able to make perfect jump tables using switch WHILE still being compatible with all the major compilers. Lol I was looking at 1964's audio source, and saw they used inline assembly for a jump table. It sure pays off to have a lot of free time, I can't wait for summer to start. MSVC does a terrible job with inline assembly so I end up resorting to using _asm _emit lol. I have #ifdefs to switch between C code and ASM code in case anything bad happens, or I want to test the difference.

My favorite ASM syntax is MASM. I should probably learn some others in case I ever decide to program for a different OS. I hear FASM is really good.

Just curious, why did you make functions for 1 instruction? Was this a test or something?
  #767  
Old 21st May 2014, 04:12 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by RPGMaster View Post
I've been using macros a lot more lately. Isn't there a compiler option for not inlining specific functions in GCC? I know with MSVC, there's __declspec(noinline).
Yeah I know you can see that already in my source to the angrylion RDP. It's in z64.h I think...#ifdef _MSC_VER then #define NOINLINE __declspec(noinline), like you said, else if it's GCC then something else I wrote.
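A sketch of what such a NOINLINE macro can look like. This is not copied from z64.h: the GCC branch uses `__attribute__((noinline))`, which is GCC's documented equivalent, but whether that is the "something else" in the real header is an assumption. `add_lanes` is just a placeholder function.

```c
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE /* unknown compiler: expands to nothing */
#endif

/* Placeholder function: stays a real call even at high optimization. */
NOINLINE int add_lanes(int a, int b)
{
    return a + b;
}
```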

Quote:
Originally Posted by RPGMaster View Post
It's pretty neat to take advantage of a compiler's features. Too bad not every compiler supports the same features. Like idk the replacement for __assume().
That's why I avoid them. I don't use compiler-specific extensions unless they make a significant difference in several places. Partly stubbornness of my own, but I just like to stay compliant with the ANSI standard for predictable, defined behavior and to minimize hazardous practices. I think C should be meant as a portable assembly language, not a fusion of assembly code with high-level code. I would rather have a mix of .c and .asm source files than use inline assembly.

Quote:
Originally Posted by RPGMaster View Post
My favorite ASM syntax is MASM. I should probably learn some others in case I ever decide to program for a different OS. I hear FASM is really good.
FASM's alright. I don't know much about it. The UI felt kind of strange to me, but it was backward-compatible with MASM enough that I was able to write portable assembly language that worked on both MASM and FASM simultaneously, I think. It's only NASM that proved challenging.

I always wondered if Intel ever wrote their own official assembler, because I was worried that MASM was too over-featured and that if I wrote asm code for it, it would be hard to go back and make the same code build with other assemblers. I even looked around for some Intel blogs/articles on x86 assembly language, but surprisingly some of them used the AT&T syntax like GCC does. Haha...guess they didn't have a full syntax set in stone for full x86 source files.

Quote:
Originally Posted by RPGMaster View Post
Just curious, why did you make functions for 1 instruction? Was this a test or something?
It's like a jump table, or a switch.

What I gave you is the assembly portion of my program; the rest is in C. GCC compiled the C code to recognize that the extern func ptr table did not require a CALL opcode...it would be indexed using a computed JMP, as with a switch statement.

As a whole, my program is a simple calculator.
main.exe 63.3 / 51.4

... the '/' char is read in as the index to my function pointer table and jumps to the SSE2 divide function, or fs_div, as I defined in the asm source.

The operands 63.3 and 51.4 are double-precision 64-bit floating point values, loaded in as SSE2 __m128d doubles (the lower halves of the otherwise unmodified XMM registers).

In the end it prints out the result: xmm0 <operation> xmm1
I implemented all of the operations that were directly mappable to the SSE2 instruction set (interestingly enough, 16 in total, divisible evenly into 4 noticeable patterns of groups of ops).
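The dispatch idea can be sketched in plain C. This is only an illustration: the real program calls the assembly routines listed earlier, while the functions below are C stand-ins with similar names, and the 256-entry table indexed directly by the operator character is an assumed layout (`calc_table`, `init_ops` and `calculate` are hypothetical names).

```c
#include <stddef.h>

typedef double (*binop)(double, double);

/* C stand-ins for the assembly routines. Like `invalid' in the asm
 * file, fs_invalid returns its first operand untouched. */
static double fs_invalid(double x, double y) { (void)y; return x; }
static double fs_add(double x, double y) { return x + y; }
static double fs_sub(double x, double y) { return x - y; }
static double fs_mul(double x, double y) { return x * y; }
static double fs_div(double x, double y) { return x / y; }

/* Assumed layout: one slot per possible operator character. */
static binop calc_table[256];

void init_ops(void)
{
    size_t i;

    for (i = 0; i < 256; i++)
        calc_table[i] = fs_invalid;
    calc_table['+'] = fs_add;
    calc_table['-'] = fs_sub;
    calc_table['*'] = fs_mul;
    calc_table['/'] = fs_div;
}

/* The operator character selects the function, much like a switch. */
double calculate(double x, char op, double y)
{
    return calc_table[(unsigned char)op](x, y);
}
```

After `init_ops()`, a call such as `calculate(63.3, '/', 51.4)` goes through the table to `fs_div`, and any unassigned character falls through to `fs_invalid`.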
  #768  
Old 21st May 2014, 04:34 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

Wow after messing around and testing MSVC's automatic SSE generation, I'm quite disappointed. In fact, I won't even bother compiling your RSP plugin with it. I tried seeing what happens when moving an array of floats to another array. It was using movss... Looks like I'll have to do things explicitly. Maybe PGO screws up optimization after making a lot of changes ._.

One thing I'd like to try doing is completely removing FPU instructions from the emulator. Are there any important features that are only available with the FPU?

To some extent, I agree with you about compiler features. If I was writing something open source, I wouldn't want to force people to use MSVC just to compile my program. Maybe I should though xD. I've honestly never linked asm files with C before. I wonder how it works. Lol for some reason, even though I love assembly, I just can't go back to using an assembler ;/. I still do inline assembly xD. I don't think I'd go the extra mile to be fully ANSI compliant though. #ifs are sufficient for me when possible.
  #769  
Old 21st May 2014, 04:41 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by RPGMaster View Post
Wow after messing around and testing MSVC's automatic SSE generation, I'm quite disappointed. In fact, I won't even bother compiling your RSP plugin with it. I tried seeing what happens when moving an array of floats to another array. It was using movss... Looks like I'll have to do things explicitly. Maybe PGO screws up optimization after making a lot of changes ._.
That's what I suspected. You can see why I insist that people compile with GCC -msse2 or -mssse3 or -msse4.

Quote:
Originally Posted by RPGMaster View Post
One thing I'd like to try doing is completely removing FPU instructions from the emulator. Are there any important features that are only available with the FPU?
Maybe some, like sin/cos or w/e...you mean the main PJ64 core or the RSP plugin? I don't think I have floats/doubles anywhere in RSP emu.

Quote:
Originally Posted by RPGMaster View Post
To some extent, I agree with you about compiler features. If I was writing something open source, I wouldn't want to force people to use MSVC just to compile my program. Maybe I should though xD. I've honestly never linked asm files with C before. I wonder how it works. Lol for some reason, even though I love assembly, I just can't go back to using an assembler ;/. I still do inline assembly xD. I don't think I'd go the extra mile to be fully ANSI compliant though. #ifs are sufficient for me when possible.
Yeah, before, I was trying to, like, compile the C code with CL.EXE, assemble the asm code with ML64.EXE, then link the compiled OBJ file with the assembled OBJ file using LINK.EXE. It's a hassle indeed if you don't have a nice batch/commands script set up for it.

So, instead, I found a way to build assembly source from within a Visual Studio project. I'd tell you if I knew what the hell I did though. Some long-ass tutorial. Maybe I can send you the files sometime when that's finished.
  #770  
Old 21st May 2014, 05:07 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,008
Default

I do remember trying assembly through MSVC, since it had some MASM thing built in. I didn't know you could use that to mix with C code though. I bet it would be better to just use MASM for the asm file since it supports nice macros like declaring bytes in the code section. Although it's not as useful as it would be in a C compiler, since my main reason for doing that would be to have the compiler not interfere with my inline assembly. I'm pretty sure there are other uses for declaring bytes in the code section. Oh ya, maybe for optimized storage of variables xD. Placing them at the end of functions instead of padding with int3. It's also nice for aligning loops if the assembler doesn't support those obscure multibyte NOPs. Lol you know... I never found a convenient way to do stuff like *(int*)0x404040 = 24 in assembly lol.

What I don't get is why the compiler did better on a different source file of the same project. Maybe I should turn off PGO since I made a ton of changes. I need to figure out how to do stuff like MOVAPS in intrinsics or something. I'd prefer doing things like *(__m128*)(var1) = *(__m128*)(var2); though, but it sometimes uses MOVUPS which is annoying because I made sure it's aligned.
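On the MOVAPS question: the aligned load/store intrinsics `_mm_load_ps` and `_mm_store_ps` from `<xmmintrin.h>` are the ones documented as the aligned forms, so a compiler normally emits MOVAPS for them (it is still free to fold a load into another instruction). A minimal sketch, with the hypothetical helper name `copy_aligned`:

```c
#include <xmmintrin.h> /* SSE: __m128, _mm_load_ps, _mm_store_ps */

/*
 * Copies n floats, four at a time. Assumes n is a multiple of 4 and
 * that both pointers are 16-byte aligned; the aligned intrinsics
 * fault on misaligned addresses.
 */
void copy_aligned(float *dst, const float *src, int n)
{
    int i;

    for (i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i); /* aligned load */

        _mm_store_ps(dst + i, v);        /* aligned store */
    }
}
```

`_mm_loadu_ps` and `_mm_storeu_ps` are the unaligned counterparts, so spelling out the aligned ones at least documents the intent, which a plain pointer cast does not.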

How do I do byteswap with intrinsics? Then I could get rid of some inline assembly in pj64 that's messing up the functions they're in. I noticed a compiler warning and got rid of one of the inline ASM pieces of code that had already commented out the bswap. Dunno why he left it there lol. This just makes it seem more like Zilmar was super busy back then, probably still is today lol.
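One possible answer to the byteswap question, as a sketch: MSVC provides `_byteswap_ulong` (declared in `<stdlib.h>`) and GCC provides `__builtin_bswap32`, and both normally compile down to a BSWAP. The wrapper name `swap32` and the shift-based fallback are additions for illustration, not code from PJ64.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <stdlib.h> /* _byteswap_ulong */
#endif

/* Reverses the byte order of a 32-bit value. */
uint32_t swap32(uint32_t x)
{
#if defined(_MSC_VER)
    return (uint32_t)_byteswap_ulong(x);
#elif defined(__GNUC__)
    return __builtin_bswap32(x);
#else
    /* Portable fallback built from shifts and masks. */
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
#endif
}
```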

I want to get rid of FPU stuff in the recompiler for sure, since it converts MIPS floating point to x86 FPU code. I bet I could also speed up the interpreter core using manual SSE instead of auto generated lol. Lol for the neg instruction in the interpreter I explicitly xor'd the sign bit and or'd the sign bit for ABS.
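For reference, the sign-bit trick looks roughly like this with SSE intrinsics. This is a sketch with hypothetical names (`neg_ss`, `abs_ss`), not the PJ64 interpreter code. Negation flips the sign bit with an XOR; for the absolute value the version below clears the sign bit by AND-ing with the inverted mask.

```c
#include <xmmintrin.h> /* SSE: _mm_set_ss, _mm_xor_ps, _mm_andnot_ps */

/* -0.0f has only the sign bit set, so it doubles as the sign mask. */

float neg_ss(float x)
{
    const __m128 sign = _mm_set_ss(-0.0f);

    /* XOR with the mask flips the sign bit. */
    return _mm_cvtss_f32(_mm_xor_ps(_mm_set_ss(x), sign));
}

float abs_ss(float x)
{
    const __m128 sign = _mm_set_ss(-0.0f);

    /* _mm_andnot_ps(a, b) computes (~a) & b: the sign bit is cleared. */
    return _mm_cvtss_f32(_mm_andnot_ps(sign, _mm_set_ss(x)));
}
```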

Idk if I'll even bother with PJ64's RSP, unless I completely understand how recompilers work and am able to improve it. It would be interesting to see how much faster your graphics plugin would run with a recompiler RSP. If I never get a full understanding of how recompilers work, I'd rather look at your RSP plugin, since it's more accurate and the interpreter is waaay faster xD. I wonder how fast 2.1's RSP interpreter is though. I should test that.

Edit: I decided to start abusing compiler-specific features. Since PJ64 is pretty much stuck with MSVC, I might as well optimize it for MSVC. My initial problem with using __assume(0), aside from it not being portable, is that I'd have to write out all the cases. Well what I do now is have a program generate all the cases! Then the ones that are duplicates will be highlighted in red, then I delete those and BOOM I have a perfect jump table!
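A small sketch of the `__assume(0)` pattern with a fallback for other compilers. The `UNREACHABLE` macro name and the 2-bit `decode` function are hypothetical; GCC's `__builtin_unreachable()` is used here as the closest counterpart to `__assume(0)`. When every case is listed and the default is marked unreachable, the compiler may drop the range check in front of the jump table.

```c
#if defined(_MSC_VER)
#define UNREACHABLE() __assume(0)
#elif defined(__GNUC__)
#define UNREACHABLE() __builtin_unreachable()
#else
#define UNREACHABLE() ((void)0) /* no hint available */
#endif

/* Hypothetical decoder: the mask guarantees op is 0..3, and all four
 * cases are written out, so the default can never be taken. */
int decode(unsigned op)
{
    switch (op & 3) {
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    case 3: return 40;
    default:
        UNREACHABLE();
        return 0; /* only matters on compilers without a hint */
    }
}
```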

Also, for 1964 0.85, I'm getting a constant 61.0 VI/s. Is there a way to fix it and set it to 60?

Last edited by RPGMaster; 21st May 2014 at 11:11 PM.