|
#1081
Anyway, I decided to just do some benchmarks out of curiosity. Gotta say I'm somewhat surprised about Conker. I guess it was hard for me to measure because the VI/s fluctuates like crazy in the intro. So I compiled your latest RSP and used your latest gfx plugin. Here are the results, using No Audio with HLE audio enabled, refresh set to 2 and triangles set to 0. I left Conker on for about 3 minutes and 45 seconds, if that matters. For Kirby, I just loaded a savestate and stood still. For F-Zero, I loaded a savestate and played the first level.
Conker ![]()
Kirby ![]()
F-Zero ![]()
As for Kirby, I'm not surprised that it's using ~40%, since it's the game where I noticed the most significant difference. For Conker, it looks like it's heavy on both RDP and RSP. After some careful eyeballing, I can definitely confirm RSP makes a big difference. It's just harder to see because of all the VI/s fluctuation! As for F-Zero, no surprise at all.
Also, I tried manually using your config to enable HLE audio. Dunno what happened, but I couldn't get it working, so I just edited the source to always enable HLE audio.
Perhaps I should actually play Conker to see how important RSP is in gameplay. Intros aren't too important to me xD, plus they're just too hard to benchmark! I wish I knew a good way to benchmark the recompiler lol. Honestly, I used to not like relying on triangle skip, but with focus it can prove to be useful. For benchmarking RSP, though, I still think that dll should be made. No Audio has been a great benchmarking plugin for LLE audio.
Edit: went back and fixed the RSP config. I just added in
Code:
(mode[0] == 'r') ? OPEN_EXISTING : CREATE_ALWAYS,
(mode[0] == 'r') ? FILE_ATTRIBUTE_NORMAL : FILE_FLAG_WRITE_THROUGH,
Last edited by RPGMaster; 24th October 2014 at 11:26 AM. |
#1082
|
||||
|
||||
![]() |
#1083
This is one polite way of putting it.
__________________
http://theoatmeal.com/comics/cat_vs_internet |
#1084
After finishing up all my optimizations, I decided to benchmark WDC. That game is odd ;/ . For some reason my recompiler was extra slow, same with Stunt Racer. Turns out it's something I did with LQV and SDV ;/ . Time for me to move on to another project though. RSP was a fun project to work on.
Anyway, here's a benchmark for WDC using your latest source. ![]()
Lol, I'm thinking I should probably have benchmarked using your older source ;/ . I may go back and do that later on. Anyone interested in more game benchmarks?
Last edited by RPGMaster; 28th October 2014 at 12:38 AM. |
#1085
I was just working with VMADN and saw that all the games I checked behaved exactly the same way even if I do not emulate signed clamping. It goes as an unexploited bug (possibly even unexploitable, with the way these games do shit).
VMADN without signed clamping:
Code:
_VMADN:
    movdqa xmm3, xmm0
    movdqa xmm2, xmm0
    pmullw xmm3, xmm1
    pmulhuw xmm2, xmm1
    psraw xmm1, 15
    pxor xmm4, xmm4
    pand xmm0, xmm1
    psubw xmm2, xmm0
    movdqa xmm0, XMMWORD PTR _VACC+32
    paddw xmm0, xmm3
    movdqa xmm1, xmm0
    psubusw xmm1, xmm3
    pcmpeqw xmm3, xmm0
    pcmpeqw xmm1, xmm4
    pandn xmm3, xmm1
    psubw xmm2, xmm3
    movdqa xmm3, XMMWORD PTR _VACC+16
    paddw xmm3, xmm2
    movdqa XMMWORD PTR _VACC+32, xmm0
    movdqa XMMWORD PTR _VACC+16, xmm3
    movdqa xmm1, xmm3
    psubusw xmm1, xmm2
    pcmpeqw xmm3, xmm2
    psraw xmm2, 15
    paddw xmm2, XMMWORD PTR _VACC
    pcmpeqw xmm1, xmm4
    pandn xmm3, xmm1
    psubw xmm2, xmm3
    movdqa XMMWORD PTR _VACC, xmm2
    ret
Code:
_VMADN:
    movdqa xmm3, xmm0
    movdqa xmm2, xmm0
    pmullw xmm3, xmm1
    pmulhuw xmm2, xmm1
    psraw xmm1, 15
    pxor xmm4, xmm4
    pand xmm0, xmm1
    psubw xmm2, xmm0
    movdqa xmm0, XMMWORD PTR _VACC+32
    paddw xmm0, xmm3
    movdqa xmm1, xmm0
    psubusw xmm1, xmm3
    pcmpeqw xmm3, xmm0
    pcmpeqw xmm1, xmm4
    pandn xmm3, xmm1
    psubw xmm2, xmm3
    movdqa xmm1, XMMWORD PTR _VACC+16
    paddw xmm1, xmm2
    movdqa XMMWORD PTR _VACC+32, xmm0
    movdqa XMMWORD PTR _VACC+16, xmm1
    movdqa xmm3, xmm1
    psubusw xmm3, xmm2
    movdqa xmm6, xmm1
    pcmpeqw xmm3, xmm4
    movdqa xmm4, xmm2
    psraw xmm2, 15
    paddw xmm2, XMMWORD PTR _VACC
    pcmpeqw xmm4, xmm1
    movdqa xmm5, xmm4
    pandn xmm5, xmm3
    movdqa xmm3, xmm1
    psubw xmm2, xmm5
    punpckhwd xmm3, xmm2
    movdqa XMMWORD PTR _VACC, xmm2
    punpcklwd xmm6, xmm2
    movdqa xmm2, xmm6
    packssdw xmm2, xmm3
    pcmpeqw xmm2, xmm1
    pand xmm0, xmm2
    ret
Code:
_VMADN PROC
    push ebp
    mov ebp, esp
    and esp, -8
    movdqa xmm7, XMMWORD PTR _VACC+32
    movdqa xmm2, xmm0
    movdqa xmm6, XMMWORD PTR _VACC+16
    movdqa xmm4, xmm0
    pmullw xmm2, xmm1
    xorps xmm5, xmm5
    pmulhuw xmm4, xmm1
    psraw xmm1, 15 ; 0000000fH
    pand xmm1, xmm0
    paddw xmm7, xmm2
    psubw xmm4, xmm1
    movdqa XMMWORD PTR _VACC+32, xmm7
    movdqa xmm0, xmm7
    movdqa xmm1, xmm7
    psubusw xmm0, xmm2
    pcmpeqw xmm1, xmm2
    pcmpeqw xmm0, xmm5
    pandn xmm1, xmm0
    psubw xmm4, xmm1
    paddw xmm6, xmm4
    movdqa xmm3, xmm4
    movdqa xmm0, xmm6
    psraw xmm3, 15
    psubusw xmm0, xmm4
    movdqa XMMWORD PTR _VACC+16, xmm6
    paddw xmm3, XMMWORD PTR _VACC
    pcmpeqw xmm0, xmm5
    movdqa xmm1, xmm6
    movdqa xmm2, xmm6
    pcmpeqw xmm1, xmm4
    movdqa xmm4, xmm6
    pandn xmm1, xmm0
    psubw xmm3, xmm1
    punpckhwd xmm4, xmm3
    punpcklwd xmm2, xmm3
    packssdw xmm2, xmm4
    pcmpeqw xmm4, xmm4
    pcmpeqw xmm6, xmm2
    movdqa XMMWORD PTR _VACC, xmm3
    pxor xmm4, xmm6
    movdqa xmm1, xmm6
    movdqa xmm0, xmm4
    pand xmm1, xmm7
    pand xmm0, xmm2
    psllw xmm4, 15
    por xmm0, xmm1
    pxor xmm0, xmm4
    mov esp, ebp
    pop ebp
    ret 0
_VMADN ENDP
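For anyone following along, here is a scalar, one-lane C sketch of what the clamp seems to amount to, as far as it can be read off the last listing above (the packssdw / pcmpeqw / psllw tail). The function names, the parameter names and the 48-bit-accumulator-in-a-uint64_t layout are all made up for illustration; none of it is taken from the plugin source, and the semantics are an assumption inferred from the disassembly.

```c
#include <stdint.h>

/* Assumption: one accumulator lane is 48 bits wide (high, mid and low words). */
#define ACC_MASK 0xFFFFFFFFFFFFull

/* acc += unsigned_src * signed_src, wrapped to 48 bits.
 * One operand is treated as unsigned and the other as signed, which is what
 * the pmulhuw plus sign fix-up (psraw / pand / psubw) sequence suggests. */
uint64_t vmadn_accumulate(uint64_t acc, uint16_t unsigned_src, int16_t signed_src)
{
    int64_t product = (int64_t)unsigned_src * (int64_t)signed_src;
    /* Converting a negative int64_t to uint64_t is well-defined (modulo 2^64). */
    return (acc + (uint64_t)product) & ACC_MASK;
}

/* Result without clamping: just the low word of the accumulator. */
uint16_t vmadn_result_unclamped(uint64_t acc)
{
    return (uint16_t)(acc & 0xFFFF);
}

/* Result with clamping: the low word is returned only while the upper 32 bits
 * (high:mid) still fit in a signed 16-bit value. Otherwise the result
 * saturates to 0x0000 (negative overflow) or 0xFFFF (positive overflow). */
uint16_t vmadn_result_clamped(uint64_t acc)
{
    uint16_t hi = (uint16_t)((acc >> 32) & 0xFFFF);
    uint16_t md = (uint16_t)((acc >> 16) & 0xFFFF);
    /* If high:mid fits in int16, the high word is just the sign fill of mid. */
    uint16_t sign_fill = (md & 0x8000) ? 0xFFFF : 0x0000;

    if (hi == sign_fill)
        return (uint16_t)(acc & 0xFFFF);
    return (hi & 0x8000) ? 0x0000 : 0xFFFF;
}
```

Starting from a zero accumulator, two back-to-back accumulations of 0xFFFF * 0x7FFF leave 0x0000FFFD0002 in the lane. The upper 32 bits no longer fit in a signed 16-bit value, so the clamped result is 0xFFFF while the unclamped low word is 0x0002. A game would only show a difference if it actually drove the accumulator that far.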
|
#1086
Did some more profiling, using the original z64gl r17 build in MinGW to test.
Even with VMADL and VMADN (the latter being by far at the top of the latent opcodes list, in about the same position as SHUFFLE_VECTOR) optimized the living crap out of, it's still incredible how often VMADN and VMADH are used, to the point where they're still bottlenecks even after fixing GCC's inability to auto-vectorize them.
1024x768 screenshot of profiler results: http://ft.trillian.im/785abe041074fd...c3ir2zJdmG.jpg
It even lists some other familiar friends, like VMUDL, which this time if I may say for certain for those who remember me already claiming this honestly a year ago ![]()
|
#1087
Are you saying that there just won't be anything worth optimizing at this point because of these bottlenecks?
How is your RSP? Is it faster than the last BIN on page one, and will we see full speed on more games with a 4GHz system?
Last edited by theboy181; 29th October 2014 at 01:33 AM. |
#1088
Wondering why rsp.dll crashes with an exception, but rsp_sse2.dll works fine.
My processor is supposed to support SSE3.
Code:
Processor 1 ID = 0
Number of cores 2 (max 2)
Number of threads 2 (max 2)
Name Intel Pentium D 820
Codename SmithField
Specification Intel(R) Pentium(R) D CPU 2.80GHz
Package (platform ID) Socket 775 LGA (0x4)
CPUID F.4.7
Extended CPUID F.4
Core Stepping B0
Technology 90 nm
Core Speed 2792.8 MHz
Multiplier x Bus Speed 14.0 x 199.5 MHz
Rated Bus speed 798.0 MHz
Stock frequency 2800 MHz
Instructions sets MMX, SSE, SSE2, SSE3, EM64T
L1 Data cache 2 x 16 KBytes, 8-way set associative, 64-byte line size
Trace cache 2 x 12 Kuops, 8-way set associative
L2 cache 2 x 1024 KBytes, 8-way set associative, 64-byte line size
FID/VID Control no |
#1089
rsp.dll actually requires SSSE3. It needs it for instructions like pshufb.
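If you want to check this from code rather than from a CPU-Z dump, here is a minimal C sketch. It assumes GCC or Clang (MinGW included), since it relies on `__get_cpuid` from `<cpuid.h>`; MSVC would need a different intrinsic. The helper names are made up for illustration. The feature bits themselves are the documented ones: CPUID leaf 1 reports SSE3 in ECX bit 0 and SSSE3 in ECX bit 9.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang only: provides __get_cpuid() */
#endif

/* CPUID leaf 1, ECX feature bits. */
#define CPUID1_ECX_SSE3  (1u << 0)
#define CPUID1_ECX_SSSE3 (1u << 9)

/* Pure helpers (hypothetical names) that decode an ECX value from CPUID leaf 1. */
int ecx_has_sse3(unsigned int ecx)  { return (ecx & CPUID1_ECX_SSE3)  != 0; }
int ecx_has_ssse3(unsigned int ecx) { return (ecx & CPUID1_ECX_SSSE3) != 0; }

/* Returns 1 if the running CPU reports SSSE3, 0 if it does not,
 * and -1 if CPUID leaf 1 cannot be queried (or this is not an x86 build). */
int cpu_has_ssse3(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return ecx_has_ssse3(ecx);
#else
    return -1;
#endif
}
```

A CPU that sets only the SSE3 bit passes `ecx_has_sse3` but fails `ecx_has_ssse3`, which is the situation described above: SSE3 in the instruction list, but no pshufb.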
|
#1090
Quote:
Must have a touch of dyslexia, reading SSSE3 as SSE3. Hopefully I'll be getting a new PC at the end of the year. |