#581
I understand your point, though the new plugin is totally rewritten from scratch.
Most of your stability test results for the current version might not apply to what I am running, but it is tremendously faster than the current build here. The scalar unit is 100% rewritten but stable; I think the vector side has one minor bug left, which only shows up in the MusyX audio microcode's pitch-swapping effects in Star Wars and the like. I haven't taken the time to write a file I/O checker against the results of the stable release here to prove that that is IN FACT the only bug left to fix, though I might use that level of proof later. Such an investigation is not huge; I've spent over a full month tackling just one single bug before, so this is nothing. The real reason I haven't released yet is just disinterest in the RSP at the moment: I'm doing something else. That, and I'll have to write quite a large manual reflecting the immense differences in the new public release. All in all, I'm more than confident that there's no reason the release should slip past this month of November.
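If I ever do write that checker, it would be nothing fancy. A purely hypothetical sketch of the idea (the register-file layout here is an assumption, not my actual code): each build appends one binary snapshot of the vector state per RSP task, and the two trace files get byte-compared afterward with fc /b.
Code:
#include <stdio.h>
#include <stdint.h>

/* hypothetical helper:  append one snapshot of the vector register file
 * per RSP task, so traces from two builds can be byte-compared later */
void dump_vr_state(const char *path, const int16_t VR[32][8])
{
    FILE *stream = fopen(path, "ab");

    if (stream == NULL)
        return;
    fwrite(VR, sizeof(VR[0][0]), 32 * 8, stream);
    fclose(stream);
}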
#582
I understand your point too; it becomes tedious working on a single project for a long period of time, assuming you don't discover something new or hit a breakthrough that keeps your interest in the thing you are working on. It's nice to hear, though, that your RSP plugin is almost finished and you have only a minor bug left to fix...
Last edited by Marcelo_20xx; 1st November 2013 at 08:03 AM.
#583
As a matter of fact, it slipped my mind for a while, but there is something new I discovered that excited me. I didn't think to post it to this thread before now because other things made me forget it.
It's no longer about accuracy at this point, though. So far I have tried to make speed changes only in ways that also improved accuracy (using SIMD technology to emulate SGI's vector technology: more speed is one thing, but it's also more accurate). What I am thinking of here is not exactly an accuracy change anymore. It IS, however, a significant speed change that simulates a degree of recompiler theory, integrated into the one section of the interpreter that could not otherwise be optimized.

The more instantaneous 2-D jump table for the scalar execution unit doesn't appeal to me for two, maybe three, reasons (like embedded system control within the main CPU loop getting lost in another function's stack handling). It is possible that I could sacrifice this back into a linear jump look-up within the main CPU loop and instead use a 2-D, or even up to a 4-D, jump table for LWC2 and SWC2. My reasoning for this sacrifice is simple. Whether under interpreter or recompiler theory, it used to be that no matter how much you optimized the recompiler or interpreter operations elsewhere in the program, the stubborn vector operations always slowed it down. (There is a limited use of MMX in Jabo's RSP recompiler plugin, but neither its native algorithms nor its SIMD use are as competent as the major multiplier, clamp, accumulate, etc. code I added.) Nowadays this has become inverted: no matter how much I improve the benchmark-level performance of a vector operation, it becomes less and less significant, BECAUSE of LWC2 and SWC2.

You can't really apply SSE to those, because several games will use unaligned addresses or, at times, even exceed the 0xFFF DMEM boundary, which makes branch weighting and conditional use of SSE become more and more deficient for some games (and not an option with the Microsoft compiler). So the only real way to make LWC2/SWC2 any faster is some type of recompiler theory. There are too many unaligned exception override cases and so-documented "illegal" RSP instruction cases to implement for it to depend on static interpreter SSE theory alone.
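To illustrate the alignment problem (a rough sketch, not my actual code; the DMEM array, the raw-byte register view, and the i ^ 1 byte swap are assumptions for this example): any of a vector load's 16 byte reads may wrap past the end of DMEM back to address 0x000, which no single aligned 128-bit SSE move can express, so the general case stays a byte loop.
Code:
#include <stdint.h>

/* sketch of the always-correct scalar fallback for an LWC2-style load:
 * every byte address is masked, so a wrap past 0xFFF just works */
static void load_vector_scalar(
    const uint8_t DMEM[4096], /* assumed 4 KB RSP data memory */
    uint8_t vt[16],           /* destination vector register, as raw bytes */
    uint32_t addr)            /* SR[base] + offset, possibly unaligned */
{
    int i;

    for (i = 0; i < 16; i++)
        vt[i ^ 1] = DMEM[(addr + i) & 0xFFF]; /* ^1 covers endian inversion */
}
A recompiler, by contrast, could pick the one specialized copy routine for a block once the alignment case is already known, which is the appeal described above.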
#584
And I forgot to say, but there are many micro-optimizations I could have made that would have sacrificed accuracy.
To the best of my memory, I've avoided them all. Any speed changes should vary directly with accuracy, if they vary at all. Fortunately, I don't think I ran into any really noticeable speed-ups that I had to turn down because they also degraded the accuracy of the interpreter. Probably the most noticeable one was not constantly checking the lowest bit of SP_STATUS_REG in the scalar unit's loop to know whether the RSP should be halted and report back to the host CPU, and instead using some stupid global variable to guesstimate it; even that made only a 1 VI/s difference before I did any of the MMX or SSE work on the vector things.
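For context, the accurate way (which I kept) polls the halt bit on every step of the fetch-execute loop. A minimal sketch, assuming the RSP_INFO structure from the zilmar plugin specification is available as a global named RSP and that execute() is the instruction dispatcher:
Code:
#include <stdint.h>
#include "Rsp_#1.1.h" /* assumed:  zilmar-spec header defining RSP_INFO */

#define SP_STATUS_HALT 0x00000001 /* lowest bit of SP_STATUS_REG */

extern RSP_INFO RSP;                /* registers shared with the host CPU */
extern void execute(uint32_t inst); /* hypothetical instruction dispatcher */

void run_until_halt(void)
{
    uint32_t inst;

    /* accurate:  re-read the real status register on every single step,
     * instead of guessing from a cached global variable */
    while ((*RSP.SP_STATUS_REG & SP_STATUS_HALT) == 0)
    {
        inst = *(uint32_t *)(RSP.IMEM + (*RSP.SP_PC_REG & 0xFFC));
        execute(inst);
        *RSP.SP_PC_REG = (*RSP.SP_PC_REG + 4) & 0xFFC;
    }
}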
#585
Good news! GCC compiler devs seem to have stabilized their latest improvements.
For just short of a year or two, I kept myself downgraded to the older 4.7.2 release, because upgrading to the newer 4.8.1 compiler meant worse automatic stack allocation in the ANSI-C-to-SSE4 vectorizer. Now that I have upgraded to 4.8.1-4 (-3 I never got to try; -2 and -1 did not work), MinGW not only maintains the current efficiency of the vectorized output from my ANSI C code but improves on it.

Vector Multiply-Accumulate of Middle Partial Products (VMADN), GCC 4.7.2 x86 output in AT&T syntax:
Code:
_VMADN:
LFB666:
    .cfi_startproc
    pushl %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl %esp, %ebp
    .cfi_def_cfa_register 5
    pushl %esi
    pushl %ebx
    andl $-16, %esp
    subl $144, %esp
    .cfi_offset 6, -12
    .cfi_offset 3, -16
    movzbl _inst+1, %esi
    movl _inst, %ebx
    shrl $6, %ebx
    movl %esi, %eax
    andl $31, %ebx
    shrb $3, %al
    sall $4, %ebx
    movl %eax, %esi
    movzbl _inst+2, %eax
    andl $31, %eax
    sall $4, %eax
    movdqu _VR(%eax), %xmm0
    movzwl _inst+2, %eax
    shrw $5, %ax
    andl $15, %eax
    call *_SSE2_SHUFFLE_16(,%eax,4)
    movl %esi, %edx
    movdqa _VACC+32, %xmm5
    movzbl %dl, %eax
    pxor %xmm1, %xmm1
    sall $4, %eax
    movdqa %xmm5, %xmm6
    movdqu _VR(%eax), %xmm3
    punpcklwd %xmm1, %xmm6
    movdqa %xmm3, %xmm2
    movdqa %xmm3, %xmm7
    punpckhwd %xmm1, %xmm5
    pmullw %xmm0, %xmm2
    movdqa %xmm2, %xmm4
    punpckhwd %xmm1, %xmm7
    punpckhwd %xmm1, %xmm2
    punpcklwd %xmm1, %xmm4
    paddd %xmm2, %xmm5
    movdqa %xmm3, %xmm2
    psrld $16, %xmm5
    paddd %xmm6, %xmm4
    movdqa %xmm1, %xmm6
    punpcklwd %xmm1, %xmm3
    pmullw %xmm0, %xmm2
    psrld $16, %xmm4
    paddw _VACC+32, %xmm2
    pcmpgtw %xmm0, %xmm6
    movaps %xmm2, _VACC+32
    movaps %xmm6, 32(%esp)
    movdqa %xmm0, %xmm6
    punpckhwd 32(%esp), %xmm6
    punpcklwd 32(%esp), %xmm0
    movaps %xmm6, (%esp)
    movdqa (%esp), %xmm6
    pmuludq %xmm7, %xmm6
    movaps %xmm6, 16(%esp)
    psrldq $4, %xmm7
    movdqa (%esp), %xmm6
    psrldq $4, %xmm6
    pmuludq %xmm6, %xmm7
    pshufd $8, 16(%esp), %xmm6
    pshufd $8, %xmm7, %xmm7
    punpckldq %xmm7, %xmm6
    psrad $16, %xmm6
    paddd %xmm5, %xmm6
    movdqa %xmm3, %xmm5
    psrldq $4, %xmm3
    pmuludq %xmm0, %xmm5
    psrldq $4, %xmm0
    pmuludq %xmm3, %xmm0
    pshufd $8, %xmm5, %xmm3
    pshufd $8, %xmm0, %xmm0
    punpckldq %xmm0, %xmm3
    movdqa _VACC+16, %xmm0
    psrad $16, %xmm3
    paddd %xmm4, %xmm3
    movdqa %xmm0, %xmm4
    punpcklwd %xmm1, %xmm0
    punpckhwd %xmm1, %xmm4
    paddd %xmm0, %xmm3
    movdqa %xmm3, %xmm0
    paddd %xmm4, %xmm6
    movdqa %xmm3, %xmm1
    psrld $16, %xmm3
    punpcklwd %xmm6, %xmm0
    punpckhwd %xmm6, %xmm1
    psrld $16, %xmm6
    movdqa %xmm0, %xmm4
    punpcklwd %xmm1, %xmm0
    punpckhwd %xmm1, %xmm4
    movdqa %xmm3, %xmm1
    punpcklwd %xmm6, %xmm3
    punpckhwd %xmm6, %xmm1
    punpcklwd %xmm4, %xmm0
    movdqa %xmm3, %xmm4
    punpcklwd %xmm1, %xmm3
    punpckhwd %xmm1, %xmm4
    movdqa %xmm0, %xmm1
    movaps %xmm0, _VACC+16
    punpcklwd %xmm4, %xmm3
    movdqa %xmm0, %xmm4
    paddw _VACC, %xmm3
    punpcklwd %xmm3, %xmm1
    punpckhwd %xmm3, %xmm4
    movaps %xmm3, _VACC
    packssdw %xmm4, %xmm1
    pcmpeqw %xmm1, %xmm0
    pxor LC1, %xmm1
    pandn LC0, %xmm0
    psubw %xmm2, %xmm1
    pmullw %xmm0, %xmm1
    paddw %xmm2, %xmm1
    movlps %xmm1, _VR(%ebx)
    movhps %xmm1, _VR+8(%ebx)
    leal -8(%ebp), %esp
    popl %ebx
    .cfi_restore 3
    popl %esi
    .cfi_restore 6
    popl %ebp
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc

And the same function as compiled by GCC 4.8.1:
Code:
_VMADN:
LFB666:
    .cfi_startproc
    pushl %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl %esp, %ebp
    .cfi_def_cfa_register 5
    pushl %ebx
    andl $-16, %esp
    addl $-128, %esp
    .cfi_offset 3, -12
    movzbl _inst+1, %eax
    movzwl _inst+2, %edx
    movl _inst, %ebx
    shrb $3, %al
    shrw $5, %dx
    shrl $6, %ebx
    movl %eax, (%esp)
    movzbl _inst+2, %eax
    movl %edx, %ecx
    andl $15, %ecx
    andl $31, %ebx
    sall $4, %ebx
    movb %al, 16(%esp)
    movl 16(%esp), %edx
    andl $31, %edx
    sall $4, %edx
    movdqa _VR(%edx), %xmm0
    call *_SSE2_SHUFFLE_16(,%ecx,4)
    movl (%esp), %eax
    pxor %xmm1, %xmm1
    movdqa _VACC+32, %xmm5
    movzbl %al, %eax
    movdqa %xmm5, %xmm7
    sall $4, %eax
    punpckhwd %xmm1, %xmm5
    movdqu _VR(%eax), %xmm3
    punpcklwd %xmm1, %xmm7
    movdqa %xmm3, %xmm2
    movdqa %xmm7, %xmm6
    pmullw %xmm0, %xmm2
    movdqa %xmm2, %xmm7
    punpckhwd %xmm1, %xmm2
    punpcklwd %xmm1, %xmm7
    paddd %xmm2, %xmm5
    movdqa %xmm3, %xmm2
    psrld $16, %xmm5
    movdqa %xmm7, %xmm4
    pmullw %xmm0, %xmm2
    paddw _VACC+32, %xmm2
    movdqa %xmm3, %xmm7
    punpcklwd %xmm1, %xmm3
    paddd %xmm6, %xmm4
    movaps %xmm4, (%esp)
    movdqa %xmm1, %xmm4
    punpckhwd %xmm1, %xmm7
    movdqa %xmm0, %xmm6
    pcmpgtw %xmm0, %xmm4
    movaps %xmm2, _VACC+32
    punpckhwd %xmm4, %xmm6
    movaps %xmm4, 16(%esp)
    movdqa %xmm7, %xmm4
    psrlq $32, %xmm7
    pmuludq %xmm6, %xmm4
    psrlq $32, %xmm6
    pmuludq %xmm6, %xmm7
    pshufd $8, %xmm4, %xmm6
    pshufd $8, %xmm7, %xmm7
    movdqa %xmm3, %xmm4
    psrlq $32, %xmm3
    punpcklwd 16(%esp), %xmm0
    punpckldq %xmm7, %xmm6
    movdqa (%esp), %xmm7
    pmuludq %xmm0, %xmm4
    psrad $16, %xmm6
    psrlq $32, %xmm0
    paddd %xmm5, %xmm6
    pmuludq %xmm3, %xmm0
    pshufd $8, %xmm4, %xmm5
    pshufd $8, %xmm0, %xmm0
    psrld $16, %xmm7
    punpckldq %xmm0, %xmm5
    movdqa _VACC+16, %xmm0
    movdqa %xmm0, %xmm4
    psrad $16, %xmm5
    paddd %xmm7, %xmm5
    punpcklwd %xmm1, %xmm0
    punpckhwd %xmm1, %xmm4
    paddd %xmm0, %xmm5
    movdqa %xmm5, %xmm0
    paddd %xmm4, %xmm6
    movdqa %xmm5, %xmm1
    psrld $16, %xmm5
    punpcklwd %xmm6, %xmm0
    punpckhwd %xmm6, %xmm1
    psrld $16, %xmm6
    movdqa %xmm0, %xmm3
    punpcklwd %xmm1, %xmm0
    punpckhwd %xmm1, %xmm3
    movdqa %xmm5, %xmm1
    punpcklwd %xmm6, %xmm5
    punpckhwd %xmm6, %xmm1
    punpcklwd %xmm3, %xmm0
    movdqa %xmm5, %xmm3
    punpcklwd %xmm1, %xmm5
    punpckhwd %xmm1, %xmm3
    movdqa %xmm0, %xmm1
    movaps %xmm0, _VACC+16
    punpcklwd %xmm3, %xmm5
    movdqa %xmm0, %xmm3
    paddw _VACC, %xmm5
    punpcklwd %xmm5, %xmm1
    punpckhwd %xmm5, %xmm3
    movaps %xmm5, _VACC
    packssdw %xmm3, %xmm1
    pcmpeqw %xmm1, %xmm0
    pxor LC1, %xmm1
    pandn LC0, %xmm0
    psubw %xmm2, %xmm1
    pmullw %xmm0, %xmm1
    paddw %xmm2, %xmm1
    movups %xmm1, _VR(%ebx)
    movl -4(%ebp), %ebx
    leave
    .cfi_restore 5
    .cfi_restore 3
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc

In fact, I have the overall measurements for the entire assembly output file: RSP.S generated by GCC 4.7.2 is 18114 lines long, while RSP.S generated by GCC 4.8.1 is 17883 lines long. This relatively small delta indicates that the GCC improvement most relevant to my RSP project is the auto-vectorizer, which takes my high-level C loops and converts them into SSE2, SSSE3, SSE4, AVX, or whichever instruction set I tell the build projects to target.
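For an idea of what the vectorizer is chewing on: the C it starts from is just plain 8-lane loops in this style (an illustrative sketch, not the real RSP source). GCC at -O3 with SSE2 enabled can collapse a fixed-trip-count loop like this into a single pmullw plus the loads and stores.
Code:
#include <stdint.h>

/* illustrative only:  an 8-lane, 16-bit multiply in the shape the
 * tree-vectorizer recognizes and turns into one packed SSE2 multiply */
void vector_multiply_low(int16_t vd[8], const int16_t vs[8],
    const int16_t vt[8])
{
    register int i;

    for (i = 0; i < 8; i++)
        vd[i] = (int16_t)(vs[i] * vt[i]);
}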
#586
I also have an update about my pseudo-recompiler idea.
I haven't yet implemented it (distracted by other projects, not to mention that being an online chess addict isn't helping), but what I wrote before was wrong. For LWC2 and SWC2, I don't need a 4-D function pointer table; a 3-D function jump table should suffice. The dimensions of the switch-free procedure indexing are:

* OPCODE (inst.R.rd, the sub-opcode selecting which LWC2/SWC2 vector move is done)
* ELEMENT (the 4-bit element, which enables some of the RSP's so-defined illegal-instruction overrides to get executed in hard-to-optimize edge cases)
* (SR[base] + (signed)(inst.R.offset)) & 0xF (the 4-bit alignment indicator, so that I can move all of the bytes at once without ever having to branch mid-way or XOR every single smaller 8-bit write to cover endianness inversions)

To conduct this vastly faster ?WC2 rewrite, however, I am probably better off using a switch statement on the primary operation code (is it SPECIAL, is it REGIMM, etc.) rather than a function jump table within a function jump table. The switch still has some disadvantages against the newly adopted 2-D jump table proposed by MarathonMan, but it also has some advantages (like localized PC loop increments and less constant checking for system control), and I think the rest of the scalar opcodes outside of SWC2 and LWC2, which MarathonMan's approach improved on greatly, can easily be sacrificed for the better performance of the recompiler-demanding ?WC2 algorithms. A sketch of the table's shape follows below.
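Here is a minimal sketch of that 3-D dispatch shape for SWC2. Everything except the three index terms above is a hypothetical stand-in (the table name, the handler signature, SR as the scalar register file), and the hardware's scaling of the offset field by access size is left out for brevity.
Code:
#include <stdint.h>

typedef void (*mwc2_handler)(unsigned vt, unsigned element, uint32_t addr);

extern uint32_t SR[32];                      /* scalar register file (assumed) */
extern mwc2_handler SWC2_table[32][16][16];  /* [rd][element][addr & 0xF] */

static void SWC2_dispatch(uint32_t inst)
{
    const unsigned base    = (inst >> 21) & 31;
    const unsigned vt      = (inst >> 16) & 31;
    const unsigned rd      = (inst >> 11) & 31;  /* the sub-opcode */
    const unsigned element = (inst >>  7) & 15;  /* the 4-bit element */
    const int32_t offset   = (int32_t)(inst << 25) >> 25; /* sign-extend */
    const uint32_t addr    = SR[base] + offset;  /* scaling omitted here */

    /* one indexed call replaces all mid-copy branching on alignment */
    SWC2_table[rd][element][addr & 0xF](vt, element, addr);
}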
#587
I don't mean to post meaningless things here, but I guess I let the thread die for some weeks during my break.
For anyone who has been successfully compiling my plugin off the Git repository, you might have noticed some offset noise disturbance in the background of MusyX audio microcode (Star Wars, TWINE 007, Gauntlet Legends, and other games using MusyX). That was a problem with zero-extending subset types into their superset data sizes before adding them into the extended-precision result during VADDC. I thought this was happening automatically, but it seems my knowledge of C's type conversion rules was not complete. So the sound should now be perfect.
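To spell out the type-conversion trap for anyone curious (a minimal sketch; the lane loop and the flag array are illustrative, not my exact code): C's integer promotions SIGN-extend int16_t, so the zero-extension has to be forced with an explicit cast through uint16_t before the wide add.
Code:
#include <stdint.h>

/* each lane is zero-extended to 32 bits first, so the carry-out bit of
 * the 17-bit sum lands cleanly in bit 16 of the wide result */
void vaddc_lanes(int16_t vd[8], const int16_t vs[8], const int16_t vt[8],
    uint16_t vco_carry[8])
{
    int i;

    for (i = 0; i < 8; i++)
    {
        const uint32_t sum =
            (uint32_t)(uint16_t)vs[i] + (uint32_t)(uint16_t)vt[i];

        vd[i] = (int16_t)sum;                 /* unsaturated result */
        vco_carry[i] = (uint16_t)(sum >> 16); /* carry-out into $vco */
    }
}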
#588
Unexpectedly I have another update based straight off the above.
It took me until today to notice a zero-extension bug directly analogous to the one in VADDC, but in VSUBC. Both bugs were introduced in the same commit and falsely dictated the carry-out flag settings in the RSP $vco flags. The VADDC version of the bug, fixed in the post above, was very easy to notice thanks to the MusyX audio glitching, but it was very hard to notice that one also existed for VSUBC in World Driver Championship. Here the unsaturated difference from subtracting zero-extended VT from VS (VS - VT, that is) falsely set some carry-out flags, causing jittery, shaky 3-D textures in some small parts of the game. At first I thought it was just some stupid RDP bug, even though it was happening on angrylion's 99%-accurate graphics plugin, but as eye-straining as it was to watch for, I saw that zilmar's less accurate RSP interpreter didn't exhibit that problem...so that should be all of the RSP bugs taken care of.

Neither of these bugs applies to the latest public release posted to this thread; this is only meaningful for anyone who has been following my Git repository uploads. It is extremely difficult to make massive SSE vectorizer template rewrites of an entire code base without any bugs introducing themselves; I'm surprised that was all that was left. I made the emulator faster with LWC2/SWC2, too. That just leaves how to WTFM for the next release and the possibility of one more speed-up before releasing.
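The VSUBC case is the mirror image of the VADDC sketch in the last post (again illustrative, not the literal commit): the lanes get zero-extended before the wide subtract, and the borrow flag is simply whether the difference went negative.
Code:
#include <stdint.h>

/* VS - VT with each lane zero-extended first; without the uint16_t casts,
 * sign-extension makes the borrow test below read the wrong lanes */
void vsubc_lanes(int16_t vd[8], const int16_t vs[8], const int16_t vt[8],
    uint16_t vco_borrow[8])
{
    int i;

    for (i = 0; i < 8; i++)
    {
        const int32_t diff =
            (int32_t)(uint16_t)vs[i] - (int32_t)(uint16_t)vt[i];

        vd[i] = (int16_t)diff;                /* unsaturated difference */
        vco_borrow[i] = (uint16_t)(diff < 0); /* carry/borrow bit in $vco */
    }
}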
#589
That's some great news! I am eagerly awaiting your next release to test the changes.
#590
Updated. Sorry for the big break.
There are no known game fixes in this version, since no more bugs were reported, aside from angrylion's report of a debugging trace message attached to MTC0 in the Funnelcube demo (and some others by marshallh). It is probably a good 12 VI/s or more faster than the previous release in this thread, except in games driven mad by the problem of RDP interpretation, like Conker's Bad Fur Day. It will also work in the 1964 emulator now, because there is no more releasing different DLL versions of the same plugin, thanks to the DllConfig user interface Garteal submitted.