#581 · HatCat · 1st November 2013, 02:52 AM

I understand your point, though the new plugin is totally rewritten from scratch.

Most of your stability test results for the current version might not apply to what I am running...

But it is tremendously faster than the current build here. The scalar unit is 100% rewritten but stable; I think the vector stuff has one minor bug left, which only MusyX audio hits with pitch-swapping effects in Star Wars and such, but I haven't taken the time to write a file I/O checker against the results of the stable release here to prove that that is IN FACT the only bug left to fix. I might use that level of proof later.
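If I ever do write that checker, it could be as simple as dumping the vector state to a file after each task and byte-diffing the dumps from two builds. A minimal sketch, with made-up names (VR, VACC and their layouts here are assumptions, not the plugin's real definitions):
Code:
/* Hypothetical state-dump checker: append the vector register file and the
 * accumulator to a binary log after each RSP task, then byte-compare the
 * logs produced by two plugin builds.  Names and sizes are assumed here. */
#include <stdio.h>
#include <stdint.h>

extern int16_t VR[32][8];   /* 32 vector registers, 8 lanes each (assumed)       */
extern int16_t VACC[3][8];  /* 48-bit accumulator as 3 x 16-bit slices (assumed) */

int dump_vector_state(const char *path)
{
    FILE *out = fopen(path, "ab"); /* one appended snapshot per RSP task */
    if (out == NULL)
        return -1;
    fwrite(VR, sizeof(VR), 1, out);
    fwrite(VACC, sizeof(VACC), 1, out);
    fclose(out);
    return 0;
}
Run the same ROM once on the stable release and once on the rewrite, then compare the two logs with fc /b (or cmp); the first differing offset points at the first diverging task.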

Such an investigation is not huge. I've spent over a full month tackling just one single bug before; this is nothing.
The real reason I haven't released yet is just disinterest in RSP at the moment. I'm doing something else.
That, and I'll have to write quite a large manual reflecting the immense differences of the new public release.

All in all, I'm more than confident that there's no reason it should take any longer than this month of November to release.
#582 · Marcelo_20xx · 1st November 2013, 07:56 AM

I understand your point too; it becomes tedious working on a single project for a long period of time, assuming you don't discover something new or make a breakthrough that keeps your interest in the thing you are working on. It's nice to hear, though, that your RSP plugin is almost finished and you have only a minor bug left to fix...

#583 · HatCat · 1st November 2013, 04:52 PM

As a matter of fact, it slipped my mind for a while, but there is something new I discovered that excited me. I didn't think to post it to this thread because other things made me forget about it.

It's no longer about accuracy at this point, though.
I tried to make speed changes only in ways that also improved accuracy (using SIMD technology to emulate SGI's vector technology: more speed is one thing, but it's also more accurate).
What I am thinking of here, however, is not exactly an accuracy change anymore.
It IS a significant speed change, though, one that simulates a degree of recompiler theory, integrated into the particular section of the interpreter that could not otherwise have been optimized.

The more instantaneous 2-D jump table for the scalar execution unit doesn't appeal to me for a couple, maybe three, reasons (like embedded system control within the main CPU loop getting lost in another function's stack handling). It is possible that I could sacrifice this back to a linear jump look-up within the main CPU loop, and instead use a 2-D, or even up to a 4-D, jump table for LWC2 and SWC2.

My reasoning for this sacrifice is simple.
Whether it was interpreter or recompiler theory, it used to be that no matter how much you optimized the recompiler or interpreter operations elsewhere in the program, the stubborn vector operations always slowed it down. (There is limited use of MMX in Jabo's RSP recompiler plugin, but neither the native algorithms nor the SIMD technology are as competent as the major multiplier, clamp, accumulate, etc. work I added.)

But nowadays, this has become inverted.
No matter how much I improve the benchmark-level performance of a vector operation, it becomes less and less significant because of LWC2 and SWC2.
You can't really apply SSE to those, because several games will use unaligned addresses or, at times, even exceed the 0xFFF DMEM boundary, which makes branch weighting and conditional use of SSE more and more deficient for some games (and not an option with the Microsoft compiler).

So the only real way to make LWC2/SWC2 any faster is some type of recompiler theory. There are too many unaligned exception override cases and so-documented "illegal" RSP instruction cases to implement for it to depend on static interpreter SSE theory alone.
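As a rough picture of why the static SSE route falls apart there, here is what one of these loads has to do when interpreted byte-by-byte (the names are illustrative, not the plugin's actual code; element wrapping details and the little-endian byte-swap fix-ups are left out):
Code:
/* Sketch of an LDV-style doubleword load done byte-wise.  A single SSE load
 * cannot be used blindly because the effective address may be unaligned and
 * may even wrap past the 0xFFF end of DMEM, so every byte gets masked. */
#include <stdint.h>

extern uint8_t DMEM[0x1000];  /* 4 KB RSP data memory (assumed) */
extern int16_t VR[32][8];     /* vector register file (assumed) */

void load_doubleword(unsigned vt, unsigned element, uint32_t addr)
{
    unsigned i;
    for (i = 0; i < 8; i++) {
        uint8_t byte = DMEM[(addr + i) & 0xFFF];       /* wrap at DMEM boundary */
        ((uint8_t *)VR[vt])[(element + i) & 0xF] = byte;
    }
}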
#584 · HatCat · 1st November 2013, 05:02 PM

And I forgot to say, but there are many micro-optimizations I could have made that would have sacrificed accuracy.

To the best of my memory, I've avoided them all.
Any speed changes should vary directly with accuracy, if at all.

Fortunately, I don't think I ran into any really noticeable speed-ups that I had to turn down because they also degraded the accuracy of the interpreter. Probably the most noticeable one was not constantly checking the lowest bit of SP_STATUS_REG in the scalar unit's loop to know whether the RSP should be halted and report back to the host CPU, and instead using some stupid global variable to guesstimate it; that made only a 1 VI/s difference anyway, before I had made any MMX or SSE out of the vector things.
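For reference, the accurate form of that check is just a read of the halt bit through the pointer the emulator hands the plugin; a minimal sketch, assuming the zilmar-spec RSP_INFO interface that exposes SP_STATUS_REG:
Code:
#include <stdint.h>

#define SP_STATUS_HALT 0x00000001

extern uint32_t *SP_STATUS_REG;  /* pointer supplied through RSP_INFO (assumed) */

void run_scalar_unit(void)
{
    while ((*SP_STATUS_REG & SP_STATUS_HALT) == 0) {
        /* fetch, decode and execute one scalar instruction here */
    }
    /* halted:  report back to the host CPU */
}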
#585 · HatCat · 18th November 2013, 01:47 AM

Good news! GCC compiler devs seem to have stabilized their latest improvements.

For the better part of a year or two, I kept myself downgraded to the older 4.7.2 release.
Upgrading to the newer 4.8.1 compiler meant worse automatic stack allocation in the ANSI-C-to-SSE4 vectorizer.

Now that I have upgraded to 4.8.1-4 (I never got to try -3; -2 and -1 did not work), MinGW not only maintains the current efficiency of the vectorization output from my ANSI C code but improves on it.

Vector Multiply-Accumulate of Middle Partial Products
4.7.2 x86 output in AT&T syntax:
Code:
_VMADN:
LFB666:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	pushl	%esi
	pushl	%ebx
	andl	$-16, %esp
	subl	$144, %esp
	.cfi_offset 6, -12
	.cfi_offset 3, -16
	movzbl	_inst+1, %esi
	movl	_inst, %ebx
	shrl	$6, %ebx
	movl	%esi, %eax
	andl	$31, %ebx
	shrb	$3, %al
	sall	$4, %ebx
	movl	%eax, %esi
	movzbl	_inst+2, %eax
	andl	$31, %eax
	sall	$4, %eax
	movdqu	_VR(%eax), %xmm0
	movzwl	_inst+2, %eax
	shrw	$5, %ax
	andl	$15, %eax
	call	*_SSE2_SHUFFLE_16(,%eax,4)
	movl	%esi, %edx
	movdqa	_VACC+32, %xmm5
	movzbl	%dl, %eax
	pxor	%xmm1, %xmm1
	sall	$4, %eax
	movdqa	%xmm5, %xmm6
	movdqu	_VR(%eax), %xmm3
	punpcklwd	%xmm1, %xmm6
	movdqa	%xmm3, %xmm2
	movdqa	%xmm3, %xmm7
	punpckhwd	%xmm1, %xmm5
	pmullw	%xmm0, %xmm2
	movdqa	%xmm2, %xmm4
	punpckhwd	%xmm1, %xmm7
	punpckhwd	%xmm1, %xmm2
	punpcklwd	%xmm1, %xmm4
	paddd	%xmm2, %xmm5
	movdqa	%xmm3, %xmm2
	psrld	$16, %xmm5
	paddd	%xmm6, %xmm4
	movdqa	%xmm1, %xmm6
	punpcklwd	%xmm1, %xmm3
	pmullw	%xmm0, %xmm2
	psrld	$16, %xmm4
	paddw	_VACC+32, %xmm2
	pcmpgtw	%xmm0, %xmm6
	movaps	%xmm2, _VACC+32
	movaps	%xmm6, 32(%esp)
	movdqa	%xmm0, %xmm6
	punpckhwd	32(%esp), %xmm6
	punpcklwd	32(%esp), %xmm0
	movaps	%xmm6, (%esp)
	movdqa	(%esp), %xmm6
	pmuludq	%xmm7, %xmm6
	movaps	%xmm6, 16(%esp)
	psrldq	$4, %xmm7
	movdqa	(%esp), %xmm6
	psrldq	$4, %xmm6
	pmuludq	%xmm6, %xmm7
	pshufd	$8, 16(%esp), %xmm6
	pshufd	$8, %xmm7, %xmm7
	punpckldq	%xmm7, %xmm6
	psrad	$16, %xmm6
	paddd	%xmm5, %xmm6
	movdqa	%xmm3, %xmm5
	psrldq	$4, %xmm3
	pmuludq	%xmm0, %xmm5
	psrldq	$4, %xmm0
	pmuludq	%xmm3, %xmm0
	pshufd	$8, %xmm5, %xmm3
	pshufd	$8, %xmm0, %xmm0
	punpckldq	%xmm0, %xmm3
	movdqa	_VACC+16, %xmm0
	psrad	$16, %xmm3
	paddd	%xmm4, %xmm3
	movdqa	%xmm0, %xmm4
	punpcklwd	%xmm1, %xmm0
	punpckhwd	%xmm1, %xmm4
	paddd	%xmm0, %xmm3
	movdqa	%xmm3, %xmm0
	paddd	%xmm4, %xmm6
	movdqa	%xmm3, %xmm1
	psrld	$16, %xmm3
	punpcklwd	%xmm6, %xmm0
	punpckhwd	%xmm6, %xmm1
	psrld	$16, %xmm6
	movdqa	%xmm0, %xmm4
	punpcklwd	%xmm1, %xmm0
	punpckhwd	%xmm1, %xmm4
	movdqa	%xmm3, %xmm1
	punpcklwd	%xmm6, %xmm3
	punpckhwd	%xmm6, %xmm1
	punpcklwd	%xmm4, %xmm0
	movdqa	%xmm3, %xmm4
	punpcklwd	%xmm1, %xmm3
	punpckhwd	%xmm1, %xmm4
	movdqa	%xmm0, %xmm1
	movaps	%xmm0, _VACC+16
	punpcklwd	%xmm4, %xmm3
	movdqa	%xmm0, %xmm4
	paddw	_VACC, %xmm3
	punpcklwd	%xmm3, %xmm1
	punpckhwd	%xmm3, %xmm4
	movaps	%xmm3, _VACC
	packssdw	%xmm4, %xmm1
	pcmpeqw	%xmm1, %xmm0
	pxor	LC1, %xmm1
	pandn	LC0, %xmm0
	psubw	%xmm2, %xmm1
	pmullw	%xmm0, %xmm1
	paddw	%xmm2, %xmm1
	movlps	%xmm1, _VR(%ebx)
	movhps	%xmm1, _VR+8(%ebx)
	leal	-8(%ebp), %esp
	popl	%ebx
	.cfi_restore 3
	popl	%esi
	.cfi_restore 6
	popl	%ebp
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc
Several of those statements were merged or simplified in the 4.8.1-3 x86 output, in AT&T syntax:
Code:
_VMADN:
LFB666:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	pushl	%ebx
	andl	$-16, %esp
	addl	$-128, %esp
	.cfi_offset 3, -12
	movzbl	_inst+1, %eax
	movzwl	_inst+2, %edx
	movl	_inst, %ebx
	shrb	$3, %al
	shrw	$5, %dx
	shrl	$6, %ebx
	movl	%eax, (%esp)
	movzbl	_inst+2, %eax
	movl	%edx, %ecx
	andl	$15, %ecx
	andl	$31, %ebx
	sall	$4, %ebx
	movb	%al, 16(%esp)
	movl	16(%esp), %edx
	andl	$31, %edx
	sall	$4, %edx
	movdqa	_VR(%edx), %xmm0
	call	*_SSE2_SHUFFLE_16(,%ecx,4)
	movl	(%esp), %eax
	pxor	%xmm1, %xmm1
	movdqa	_VACC+32, %xmm5
	movzbl	%al, %eax
	movdqa	%xmm5, %xmm7
	sall	$4, %eax
	punpckhwd	%xmm1, %xmm5
	movdqu	_VR(%eax), %xmm3
	punpcklwd	%xmm1, %xmm7
	movdqa	%xmm3, %xmm2
	movdqa	%xmm7, %xmm6
	pmullw	%xmm0, %xmm2
	movdqa	%xmm2, %xmm7
	punpckhwd	%xmm1, %xmm2
	punpcklwd	%xmm1, %xmm7
	paddd	%xmm2, %xmm5
	movdqa	%xmm3, %xmm2
	psrld	$16, %xmm5
	movdqa	%xmm7, %xmm4
	pmullw	%xmm0, %xmm2
	paddw	_VACC+32, %xmm2
	movdqa	%xmm3, %xmm7
	punpcklwd	%xmm1, %xmm3
	paddd	%xmm6, %xmm4
	movaps	%xmm4, (%esp)
	movdqa	%xmm1, %xmm4
	punpckhwd	%xmm1, %xmm7
	movdqa	%xmm0, %xmm6
	pcmpgtw	%xmm0, %xmm4
	movaps	%xmm2, _VACC+32
	punpckhwd	%xmm4, %xmm6
	movaps	%xmm4, 16(%esp)
	movdqa	%xmm7, %xmm4
	psrlq	$32, %xmm7
	pmuludq	%xmm6, %xmm4
	psrlq	$32, %xmm6
	pmuludq	%xmm6, %xmm7
	pshufd	$8, %xmm4, %xmm6
	pshufd	$8, %xmm7, %xmm7
	movdqa	%xmm3, %xmm4
	psrlq	$32, %xmm3
	punpcklwd	16(%esp), %xmm0
	punpckldq	%xmm7, %xmm6
	movdqa	(%esp), %xmm7
	pmuludq	%xmm0, %xmm4
	psrad	$16, %xmm6
	psrlq	$32, %xmm0
	paddd	%xmm5, %xmm6
	pmuludq	%xmm3, %xmm0
	pshufd	$8, %xmm4, %xmm5
	pshufd	$8, %xmm0, %xmm0
	psrld	$16, %xmm7
	punpckldq	%xmm0, %xmm5
	movdqa	_VACC+16, %xmm0
	movdqa	%xmm0, %xmm4
	psrad	$16, %xmm5
	paddd	%xmm7, %xmm5
	punpcklwd	%xmm1, %xmm0
	punpckhwd	%xmm1, %xmm4
	paddd	%xmm0, %xmm5
	movdqa	%xmm5, %xmm0
	paddd	%xmm4, %xmm6
	movdqa	%xmm5, %xmm1
	psrld	$16, %xmm5
	punpcklwd	%xmm6, %xmm0
	punpckhwd	%xmm6, %xmm1
	psrld	$16, %xmm6
	movdqa	%xmm0, %xmm3
	punpcklwd	%xmm1, %xmm0
	punpckhwd	%xmm1, %xmm3
	movdqa	%xmm5, %xmm1
	punpcklwd	%xmm6, %xmm5
	punpckhwd	%xmm6, %xmm1
	punpcklwd	%xmm3, %xmm0
	movdqa	%xmm5, %xmm3
	punpcklwd	%xmm1, %xmm5
	punpckhwd	%xmm1, %xmm3
	movdqa	%xmm0, %xmm1
	movaps	%xmm0, _VACC+16
	punpcklwd	%xmm3, %xmm5
	movdqa	%xmm0, %xmm3
	paddw	_VACC, %xmm5
	punpcklwd	%xmm5, %xmm1
	punpckhwd	%xmm5, %xmm3
	movaps	%xmm5, _VACC
	packssdw	%xmm3, %xmm1
	pcmpeqw	%xmm1, %xmm0
	pxor	LC1, %xmm1
	pandn	LC0, %xmm0
	psubw	%xmm2, %xmm1
	pmullw	%xmm0, %xmm1
	paddw	%xmm2, %xmm1
	movups	%xmm1, _VR(%ebx)
	movl	-4(%ebp), %ebx
	leave
	.cfi_restore 5
	.cfi_restore 3
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc
Not an extremely big optimization or anything (134 assembly statements down to 130), but for the most difficult opcodes on the RSP to optimize, this is pretty helpful.

In fact, here are the overall numbers for the entire assembly output file.
RSP.S generated by GCC 4.7.2 is 18114 lines long.
RSP.S generated by GCC 4.8.1 is 17883 lines long.

This relatively small delta indicates that the improvement in GCC most relevant to my RSP project is the auto-vectorizer, which takes my high-level C loops and converts them into SSE2, SSSE3, SSE4, AVX, or whichever instruction set I tell the build it is allowed to generate for me.
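To give an idea of what that vectorizer is doing (this is an illustrative loop, not lifted from the plugin): an eight-lane, 16-bit multiply written as plain ANSI C like the one below can be collapsed by GCC at -O3 with -msse2 into a couple of loads, one pmullw, and a store.
Code:
#include <stdint.h>

/* Illustrative example of an auto-vectorizable lane loop. */
void lane_multiply_low(int16_t result[8],
                       const int16_t vs[8], const int16_t vt[8])
{
    int i;
    for (i = 0; i < 8; i++)          /* eight 16-bit lanes, one RSP vector op */
        result[i] = (int16_t)(vs[i] * vt[i]);
}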
#586 · HatCat · 18th November 2013, 02:02 AM

I also have an update about my pseudo-recompiler idea.

I haven't implemented it yet (distracted by other projects, and being an online chess addict isn't helping), but in hindsight, part of what I wrote before was wrong.


For LWC2 and SWC2, I don't need a 4-D function pointer table.
A 3-D function jump table should suffice.

The dimensions of the switch-free procedure indexing are:
* OPCODE (inst.R.rd, the sub-op-code for which LWC2/SWC2 vector move is done)
* ELEMENT (the 4-bit element field, which lets some of the RSP's so-defined illegal instruction overrides get executed in hard-to-optimize edge cases)
* (SR[base] + (signed)(inst.R.offset)) & 0xF (the 4-bit alignment indicator so that I can move all of the bytes at once without having to branch mid-way any times, or having to XOR every single smaller 8-bit write to cover endianness inversions)

Conducting this much faster ?WC2 rewrite, however, means I am probably better off using a switch statement on the primary operation code (is it SPECIAL, is it REGIMM, etc.) rather than nesting a function jump table within a function jump table. There are still some disadvantages of the switch compared to the newly adopted 2-D jump table proposed by MarathonMan, but there are also some advantages (like localized PC loop increments and less constant checking for system control), and I think the rest of the scalar opcodes outside SWC2 and LWC2, the ones MarathonMan's approach improved on greatly, can easily be sacrificed for the better performance of the recompiler-demanding ?WC2 algorithms.
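For clarity, the dispatch itself would look roughly like this (all names and field layouts below are placeholders, not the code I actually have; a matching SWC2 table would mirror it):
Code:
#include <stdint.h>

typedef void (*wc2_handler)(unsigned vt, uint32_t addr);

/* 32 sub-opcodes x 16 element values x 16 alignment offsets (assumed sizes) */
extern wc2_handler LWC2_table[32][16][16];
extern uint32_t SR[32];            /* scalar register file (assumed name) */

struct wc2_fields {                /* pre-decoded instruction fields */
    unsigned rd;                   /* sub-opcode:  which vector load/store   */
    unsigned element;              /* 4-bit element field                    */
    unsigned base;                 /* scalar base register                   */
    int32_t  offset;               /* sign-extended, already-scaled offset   */
    unsigned vt;                   /* target vector register                 */
};

void dispatch_lwc2(const struct wc2_fields *f)
{
    uint32_t addr = SR[f->base] + (uint32_t)f->offset;

    /* Three-dimensional index:  sub-opcode, element, low 4 address bits.
     * Each handler is specialized, so it never has to branch on alignment
     * or on the "illegal" element cases at run time. */
    LWC2_table[f->rd & 0x1F][f->element & 0xF][addr & 0xF](f->vt, addr);
}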
#587 · HatCat · 20th November 2013, 07:19 PM

I don't mean to post meaningless things here, but I guess I let the thread die for some weeks during my break.

For anyone who has been successfully compiling my plugin off the Git repository, you might have noticed some offset noise disturbance in the background of MusyX audio microcode (Star Wars, TWINE 007, Gauntlet Legends, other games using MusyX). That was a problem with zero-extending the narrower operand types into their wider containers before adding them into the extended-precision result during VADDC: I thought the extension was happening automatically, but it seems my knowledge of C type-conversion rules was not complete.

So the sound should be perfect now.
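In scalar terms, the fixed carry computation amounts to something like this (a minimal sketch with made-up names, not the plugin's vectorized source):
Code:
#include <stdint.h>

/* One lane of VADDC: zero-extend both 16-bit operands into a wider type
 * BEFORE the add, so that bit 16 of the sum is a true unsigned carry-out. */
void vaddc_lane(uint16_t vs, uint16_t vt,
                uint16_t *result, unsigned *carry_out)
{
    uint32_t sum = (uint32_t)vs + (uint32_t)vt;

    *result    = (uint16_t)sum;    /* low 16 bits -> accumulator/destination */
    *carry_out = (sum >> 16) & 1;  /* -> carry half of $vco                  */
}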
#588 · HatCat · 26th November 2013, 12:27 AM

Unexpectedly I have another update based straight off the above.

It took me until today to notice a zero-extension bug in VSUBC directly analogous to the one in VADDC.
Both bugs were introduced in the same commit, and both falsely dictate how the carry-out bits in the RSP's $vco flags get set.

It was very easy to notice the VADDC version of the bug I fixed in the post above, thanks to the MusyX audio glitching, but it was very hard to notice the VSUBC one, which showed up in World Driver Championship. There, the unsaturated difference from subtracting the zero-extended VT from VS (VS - VT, that is) falsely set some carry-out flags, causing jittery, shaky 3-D textures in some small parts of the game. At first I thought it was just some stupid RDP bug, even though it was happening on angrylion's 99%-accurate graphics plugin, but as eye-straining as it was to watch for, I saw that zilmar's less accurate RSP interpreter didn't have the problem...so that should be all of the RSP bugs taken care of.
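The VSUBC side is the mirror image; as a minimal sketch (again with made-up names), the borrow out of the zero-extended, unsaturated difference is what feeds $vco:
Code:
#include <stdint.h>

/* One lane of VSUBC: zero-extend before the unsaturated VS - VT, then take
 * the unsigned borrow for the carry half of $vco and the non-zero test for
 * the not-equal half.  Getting the extension wrong can set these flags for
 * the wrong lanes. */
void vsubc_lane(uint16_t vs, uint16_t vt,
                uint16_t *result, unsigned *carry_out, unsigned *not_equal)
{
    uint32_t diff = (uint32_t)vs - (uint32_t)vt;

    *result    = (uint16_t)diff;
    *carry_out = (vs < vt);        /* unsigned borrow -> $vco carry bit */
    *not_equal = (diff != 0);      /* -> $vco not-equal bit             */
}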

Neither of these bugs applies to the latest public release posted to this thread; this is only meaningful for anyone who has been following my Git repository uploads. It is extremely difficult to make massive SSE vectorizer template rewrites of an entire code base without introducing any bugs; I'm surprised that was all that was left.

I made the emulator faster with LWC2/SWC2, too.
That just leaves how to WTFM for the next release and the possibility of one more speed-up before releasing.
#589 · Marcelo_20xx · 29th November 2013, 04:20 PM

That's some great news. I am eagerly awaiting your next release to test the changes.
#590 · HatCat · 5th December 2013, 01:58 AM

Updated. Sorry for the big break.

There are no known game fixes in this version since no more bugs were reported, besides angrylion's report of a debugging trace message attached to MTC0 on the Funnelcube demo (and some others by marshallh).

It is probably more than a whole 12 VI/s faster than the previous release posted to this thread, except for games bottlenecked by the problem of RDP interpretation, like Conker's Bad Fur Day.

It will also work in the 1964 emulator now, because there is no longer a need to release different DLL versions of the same plugin, thanks to the DllConfig user interface Garteal submitted.