  #511  
Old 19th September 2013, 04:06 AM
HatCat

Oh, and one more thing.

The GCC ANSI C vectorizer bug doesn't exist when shuffling VR[vt].
It only fails to SSE2-ify/MOVDQA the accumulator-lo back to VR[vd] in the functions where e == 0x0 (no shuffling needed).

Glad this isn't as severe as I thought.
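
For reference, a minimal sketch of the kind of write-back loop in question (a sketch only, reusing the global names that show up later in this thread, not the exact plugin source):

Code:
/* With e == 0x0 the source vector needs no shuffle, so the handler ends
 * with nothing but this plain copy of accumulator-lo into VR[vd] -- the
 * loop GCC sometimes refuses to emit as a single MOVDQA store. */
static void write_back_acc_lo(short* VD, const short* VACC_L)
{
    register int i;

    for (i = 0; i < 8; i++)
        VD[i] = VACC_L[i];
}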
  #512  
Old 19th September 2013, 08:40 AM
HatCat

C-c-c-c-COMBO BREAKER!

Hilarious.
The static `SHUFFLE_VECTOR` inline is probably even touchier than you guessed.

So if I use SHUFFLE_VECTOR in all 15 copies of every function, for each vector op, then it generally compiles successfully.

But, for some reason, with ANY of the MULs/MACs, GCC waits until I have installed the SHUFFLE_VECTOR SSE2 inline in all 15 copies of the function before complaining that the second parameter needs to be an 8-bit immediate, like you were mentioning.

If I uninstall it from any one of the 15 functions (like VMULF0q or VMULF7w, or both), then I no longer get the compiler error about the second parameter needing to be an 8-bit immediate.

But for some crazy, overly dynamic reason in the code generation, GCC waits to complain until all 15 of the multiply variants have this macro installed.
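
For anyone following along, here is a hedged sketch of the constraint being tripped (this is an assumed shape for the SHUFFLE_VECTOR helper, not the plugin's actual macro):  the element select for PSHUFLW/PSHUFHW must be an 8-bit immediate, so GCC only accepts it when the constant can be folded at every call site.

Code:
#include <emmintrin.h>

/* Assumed helper shape:  shuffle both 64-bit halves with one immediate. */
#define SHUFFLE_VECTOR(vt, imm) \
    _mm_shufflehi_epi16(_mm_shufflelo_epi16((vt), (imm)), (imm))

static __m128i shuffle_0q(__m128i vt)
{ /* 0q element:  duplicate the even slots; 0xA0 == _MM_SHUFFLE(2, 2, 0, 0). */
    return SHUFFLE_VECTOR(vt, 0xA0);
}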

I do not have this issue with SELECT/LOGICAL/DIVIDE/ADD, just MULTIPLY.
I'll un-fuck it back tomorrow when I wake up.

Crazy mysteries, but this is really shrinking the DLL size by quite a bit.
  #513  
Old 22nd September 2013, 02:01 AM
HatCat

Vector Select Merge (VMRG) is now fully converted into SSE2.

The basic scalar algorithm was:
Code:
VMRG(vd, vs, vt) {
    for (int i = 0; i < 8; i++) {
        if (VCC[i])
            result[i] <-- VR[vs][i];
        else
            result[i] <-- VR[vt][i];
        VACC[i](15..0) <-- result[i];
    }
    VR[vd] <-- result;
}
Writing it as a ternary one-liner wasn't getting GCC to generate SSE2 code for it, either.


I'm not really a fan of this new way of writing it, since it's somewhat less direct/accurate.
However, it seems to be the optimal form for ANSI C in this case:

Code:
INLINE static void do_mrg(short* VD, short* VS, short* VT)
{
    short diff[N];
    register int i;

    for (i = 0; i < N; i++)
        diff[i] = VS[i] - VT[i];
    for (i = 0; i < N; i++)
        VACC_L[i] = VT[i] + comp[i]*diff[i]; /* result = VCC[i] ? VS : VT; */
    for (i = 0; i < N; i++)
        VD[i] = VACC_L[i];
    return;
}
comes out as:

Code:
_VMRG:
LFB914:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	andl	$-16, %esp
	subl	$16, %esp
	movl	_inst, %edx
	shrw	$6, %dx
	movb	_inst+1, %al
	shrb	$3, %al
	movzbl	%al, %eax
	sall	$4, %eax
	movdqu	_VR(%eax), %xmm0
	psubw	_ST, %xmm0
	pmullw	_comp, %xmm0
	paddw	_ST, %xmm0
	movdqa	%xmm0, _VACC+32
	movl	%edx, %eax
	andl	$31, %eax
	sall	$4, %eax
	movdqu	%xmm0, _VR(%eax)
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc
Part of the problem may be that I made comp (the lower byte of the VCC RSP flags register) an array of ints instead of shorts, so 32 bits * 8 elements = 256 bits needs 2 SSE2 registers instead of 1.
I had made it of type `int` just because every element is a Boolean 0 or 1.

Code:
int clip[N]; /* $vcc:  vector compare code register (high byte:  clip only) */
int comp[N]; /* $vcc:  vector compare code register (low byte:  compare) */
I can mess with it later.
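
For comparison, the same branchless select written with explicit SSE2 intrinsics would look something like this (a hedged sketch, assuming the flag array is eventually stored as 16-bit 0/1 elements; this is not code from the plugin):

Code:
#include <emmintrin.h>

extern short comp[8];   /* $vcc low byte, assumed 16-bit 0/1 flags */
extern short VACC_L[8]; /* accumulator bits 15..0 */

static void do_mrg_sse2(short* VD, const short* VS, const short* VT)
{
    __m128i vs   = _mm_loadu_si128((const __m128i *)VS);
    __m128i vt   = _mm_loadu_si128((const __m128i *)VT);
    __m128i mask = _mm_loadu_si128((const __m128i *)comp);

    /* result = VT + comp*(VS - VT)  ==  comp[i] ? VS : VT */
    __m128i res = _mm_add_epi16(vt, _mm_mullo_epi16(mask, _mm_sub_epi16(vs, vt)));

    _mm_storeu_si128((__m128i *)VACC_L, res);
    _mm_storeu_si128((__m128i *)VD, res);
}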

Last edited by HatCat; 22nd September 2013 at 08:51 AM.
  #514  
Old 22nd September 2013, 03:16 AM
HatCat

Damn, there I was thinking VMRG was the fastest/easiest of all the vector select op-codes.

VEQ (and, slightly more so, VNE) is even faster.
Code:
_VNE:
LFB1050:
	.cfi_startproc
	movl	_inst, %edx
	shrw	$6, %dx
	movb	_inst+1, %al
	shrb	$3, %al
	pxor	%xmm1, %xmm1
	movdqa	%xmm1, _clip
	movzbl	%al, %eax
	sall	$4, %eax
	movdqu	_VR(%eax), %xmm2
	movdqa	%xmm2, %xmm0
	pcmpeqw	_ST, %xmm0
	pandn	LC2, %xmm0
	por	_ne, %xmm0
	movdqa	%xmm0, _comp
	movdqa	%xmm2, _VACC+32
	movl	%edx, %eax
	andl	$31, %eax
	sall	$4, %eax
	movdqu	%xmm2, _VR(%eax)
	movdqa	%xmm1, _ne
	movdqa	%xmm1, _co
	ret
	.cfi_endproc
So apparently GCC knows very well how to vectorize comparisons between two vectors (xmm1 == xmm2 or xmm1 != xmm2); it just won't store conditional moves unless I perform that integration myself.
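
For the curious, a hedged sketch of the comparison pattern GCC is mapping onto the PCMPEQW/PANDN/POR sequence above (written in the same style as do_mrg earlier; it is not the plugin's exact VNE source, and it assumes 16-bit flag elements):

Code:
extern short comp[8]; /* $vcc low byte:  compare results */
extern short ne[8];   /* $vco:  not-equal flags */

static void do_ne_compare(const short* VS, const short* VT)
{
    register int i;

    /* comp = (VS != VT) | ne */
    for (i = 0; i < 8; i++)
        comp[i] = (short)((VS[i] != VT[i]) | ne[i]);
}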

This interpreter is already faster than the current release posted in this thread, and at the rate this is going it may catch up to the recompiler RSP plugin for Project64.

I wish I could say it could get nearly as fast as an HLE plugin, but the HLE algorithm would have to be piss-poorly optimized (or actually not HLE at all, but LLE code like suanyuan's plugins all use) for that goal to be realistic.
And I'm sure Azimer etc. all took care to optimize their static C code for the audio ucodes.

[Actually I was able to make some of the HLE code in the Mupen64 RSP plugin even more direct (I didn't look at the JPEG stuff), but nothing that LLE speeds could ever catch up with.]

Supposedly, if an HLE RSP simulator relied heavily on scalar loops to update the 128-bit vectors millions of times in an audio ucode, then LLE could perhaps approach HLE speeds by using SSE2.

Last edited by HatCat; 22nd September 2013 at 08:50 AM.
  #515  
Old 22nd September 2013, 08:57 AM
HatCat

WOOOOOOOOOOOOO!

Quote:
Originally Posted by BatCat View Post
Code:
int clip[N]; /* $vcc:  vector compare code register (high byte:  clip only) */
int comp[N]; /* $vcc:  vector compare code register (low byte:  compare) */
For this reason I had to edit my posts above with the NEW, even smaller code output for VNE/VMRG.

I was under the impression that this was just an insignificant thing, causing a 256-bit read/write split across 2 xmm registers and that was all.
I was horribly, horribly wrong. This removed a shitload of excessive packs/unpacks.
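
In other words, the width change amounts to nothing more than this (a sketch of the new declarations, under the same naming):

Code:
short clip[N]; /* $vcc:  vector compare code register (high byte:  clip only) */
short comp[N]; /* $vcc:  vector compare code register (low byte:  compare) */
/* 16-bit elements:  one 128-bit MOVDQA now covers the whole flags array. */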

And YES, would you believe how small VCH, one of the two most tricky and complex RSP opcodes to emulate, has become?

Code:
_VCH:
LFB691:
	pushl	%ebp
	movl	%esp, %ebp
	andl	$-16, %esp
	addl	$-128, %esp
	movl	_inst, %edx
	shrw	$6, %dx
	movb	_inst+1, %al
	shrb	$3, %al
	movdqa	_ST, %xmm4
	movzbl	%al, %eax
	sall	$4, %eax
	movdqu	_VR(%eax), %xmm2
	movdqa	%xmm2, %xmm0
	pxor	%xmm4, %xmm0
	psrlw	$15, %xmm0
	pxor	%xmm5, %xmm5
	movdqa	%xmm5, %xmm1
	psubw	%xmm0, %xmm1
	pxor	%xmm4, %xmm1
	movdqa	LC2, %xmm3
	movdqa	%xmm2, %xmm6
	pcmpeqw	%xmm1, %xmm6
	pand	%xmm3, %xmm6
	pand	%xmm0, %xmm6
	movdqa	%xmm6, _vce
	paddw	%xmm0, %xmm1
	movdqa	%xmm1, 16(%esp)
	pcmpeqw	%xmm2, %xmm1
	pand	%xmm3, %xmm1
	por	%xmm6, %xmm1
	movdqa	%xmm5, %xmm6
	psubw	%xmm2, %xmm6
	movdqa	%xmm3, %xmm7
	pxor	%xmm0, %xmm7
	movdqa	%xmm7, (%esp)
	movdqa	%xmm5, %xmm7
	psubw	(%esp), %xmm7
	por	%xmm7, %xmm6
	movdqa	%xmm4, %xmm7
	pcmpgtw	%xmm6, %xmm7
	movdqa	%xmm7, %xmm6
	pandn	%xmm3, %xmm6
	psubw	%xmm0, %xmm5
	por	%xmm2, %xmm5
	pcmpgtw	%xmm5, %xmm4
	pandn	%xmm3, %xmm4
	movdqa	%xmm6, %xmm5
	psubw	%xmm4, %xmm5
	pmullw	%xmm0, %xmm5
	paddw	%xmm4, %xmm5
	movdqa	16(%esp), %xmm7
	psubw	%xmm2, %xmm7
	pmullw	%xmm5, %xmm7
	paddw	%xmm2, %xmm7
	movdqa	%xmm7, _VACC+32
	movl	%edx, %eax
	andl	$31, %eax
	sall	$4, %eax
	movdqu	%xmm7, _VR(%eax)
	movdqa	%xmm4, _clip
	movdqa	%xmm6, _comp
	pxor	%xmm3, %xmm1
	movdqa	%xmm1, _ne
	movdqa	%xmm0, _co
	leave
	ret
Only 66 instructions to execute (including the crappy stack management with the IW global union fields).


Beat that with your intrinsics, MarathonMan!
  #516  
Old 22nd September 2013, 09:32 PM
MarathonMan

Quote:
Originally Posted by BatCat View Post
Beat that with your intrinsics, MarathonMan!


Though I have to watch who I "challenge".

I still haven't touched my rdp since telling suanyuan I would a few weeks ago.

... and I've been working overtime.
  #517  
Old 22nd September 2013, 09:57 PM
HatCat

Never mind him.

Your RDP emulator was already unquestionably faster than his, since the beginning.
("His", as in, the RDP algorithms he never upgraded; he only removed the real VI filtering code to make the plugin faster as a whole.)

There are people who optimize things merely by replacing emulation with standard API knowledge.
There are people who just rewrite the damn hardware emulation algorithms to be faster as they should!

Just concentrate on making the most accurate emulator, and it shall later be unrealistically speedy.
Similar to the advice others gave you in your EmuTalk CEN64 thread, you don't want to risk doing speed before accuracy.

I still hardly know jack shit about the RDP, so I'm mostly counting on someone else to pour a similar effort into it as I did here.
  #518  
Old 26th September 2013, 07:54 AM
HatCat

Like my benchmark loop I wrote for the RSP opcodes?

It's simple enough that, MarathonMan, I think you could use a similar template for CEN64 RSP tests.
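
(A minimal sketch of the shape of that timing loop, for the curious; the dispatch table name and the iteration count are placeholders, not the plugin's actual identifiers.)

Code:
#include <time.h>

extern void (*EX_VECTOR[64])(void); /* hypothetical per-opcode dispatch table */

static double bench_op(int op, long iterations)
{
    clock_t t0 = clock();
    long i;

    for (i = 0; i < iterations; i++)
        EX_VECTOR[op](); /* hammer the same opcode over and over */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
The log it writes out: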

Code:
RSP Vector Benchmarks Log

LBV    :  0.331 s
SBV    :  0.265 s
LSV    :  0.307 s
SSV    :  0.301 s
LLV    :  0.356 s
SLV    :  0.378 s
LDV    :  0.390 s
SDV    :  0.883 s
LQV    :  0.416 s
SQV    :  1.699 s
LRV    :  0.256 s
SRV    :  0.250 s
LPV    :  0.360 s
SPV    :  0.554 s
LUV    :  0.462 s
SUV    :  0.512 s
VMULF  :  0.670 s
VMACF  :  1.208 s
VMULU  :  0.675 s
VMACU  :  1.104 s
VMUDL  :  0.582 s
VMADL  :  1.123 s
VMUDM  :  0.679 s
VMADM  :  1.164 s
VMUDN  :  0.667 s
VMADN  :  1.136 s
VMUDH  :  0.656 s
VMADH  :  0.851 s
VADD   :  0.717 s
VSUB   :  0.740 s
VABS   :  0.380 s
VADDC  :  0.454 s
VSUBC  :  0.445 s
VSAW   :  0.232 s
VLT    :  0.395 s
VEQ    :  0.322 s
VNE    :  0.335 s
VGE    :  0.389 s
VCH    :  0.676 s
VCL    :  0.865 s
VCR    :  0.505 s
VMRG   :  0.286 s
VAND   :  0.133 s
VNAND  :  0.222 s
VOR    :  0.144 s
VNOR   :  0.188 s
VXOR   :  0.203 s
VNXOR  :  0.207 s
VRCPL  :  1.041 s
VRSQL  :  1.097 s
VRCPH  :  0.306 s
VRSQH  :  0.220 s
VMOV   :  0.148 s
VNOP   :  0.193 s
Total time spent:  29.078 s
That's on my 1.90 GHz AMD Athlon 3600 X2.
Yes, I know some of the times don't appear to make sense (LBV is slower than LSV, SQV is ridiculously slow, etc.); that's because of a few technical factors with the Scalar Unit header I've not attended to yet.

The one thing I can't figure out is why NOP/VNOP has slower test results than VMOV/VAND/VOR/VNOR, when all VNOP does is execute a ret opcode.

Anyway, this is what it looks like when I disable SSE2 and force -march=mmx:

Code:
RSP Vector Benchmarks Log

LBV    :  0.268 s
SBV    :  0.271 s
LSV    :  0.249 s
SSV    :  0.225 s
LLV    :  0.342 s
SLV    :  0.301 s
LDV    :  0.334 s
SDV    :  0.830 s
LQV    :  0.383 s
SQV    :  1.674 s
LRV    :  0.177 s
SRV    :  0.193 s
LPV    :  0.342 s
SPV    :  0.510 s
LUV    :  0.394 s
SUV    :  0.458 s
VMULF  :  1.086 s
VMACF  :  3.607 s
VMULU  :  1.285 s
VMACU  :  2.395 s
VMUDL  :  0.544 s
VMADL  :  2.595 s
VMUDM  :  0.752 s
VMADM  :  2.947 s
VMUDN  :  0.792 s
VMADN  :  2.610 s
VMUDH  :  2.643 s
VMADH  :  2.711 s
VADD   :  1.252 s
VSUB   :  2.225 s
VABS   :  1.507 s
VADDC  :  0.856 s
VSUBC  :  0.940 s
VSAW   :  0.252 s
VLT    :  1.391 s
VEQ    :  0.553 s
VNE    :  0.646 s
VGE    :  1.575 s
VCH    :  3.256 s
VCL    :  3.733 s
VCR    :  1.783 s
VMRG   :  0.722 s
VAND   :  0.352 s
VNAND  :  0.337 s
VOR    :  0.342 s
VNOR   :  0.325 s
VXOR   :  0.358 s
VNXOR  :  0.329 s
VRCPL  :  1.002 s
VRSQL  :  1.037 s
VRCPH  :  0.254 s
VRSQH  :  0.248 s
VMOV   :  0.178 s
VNOP   :  0.192 s
Total time spent:  56.563 s
  #519  
Old 27th September 2013, 08:34 AM
HatCat

I made some fixes to SDV and SQV now.
They are 3x and 8x as fast as before, back when I was assuming games would use illegal values for the 4-bit element.
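
(For context, a hedged sketch of what the legal-element fast path boils down to, assuming e == 0x0, a 4 KB DMEM, and a 16-byte-aligned address; the real code still has to handle DMEM byte ordering and the unaligned/odd-element fallback.)

Code:
#include <string.h>

static void SQV_fast(unsigned char* DMEM, const short* VR_vt, unsigned addr)
{
    /* Whole-register store up to the next 128-bit boundary. */
    memcpy(DMEM + (addr & 0xFF0), VR_vt, 16);
}
The benchmark log after the fixes: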

Code:
RSP Vector Benchmarks Log

LBV    :  0.225 s
SBV    :  0.206 s
LSV    :  0.210 s
SSV    :  0.221 s
LLV    :  0.292 s
SLV    :  0.276 s
LDV    :  0.297 s
SDV    :  0.279 s
LQV    :  0.342 s
SQV    :  0.277 s
LRV    :  0.166 s
SRV    :  0.193 s
LPV    :  0.317 s
SPV    :  0.449 s
LUV    :  0.369 s
SUV    :  0.433 s
VMULF  :  0.522 s
VMACF  :  1.111 s
VMULU  :  0.576 s
VMACU  :  1.021 s
VMUDL  :  0.510 s
VMADL  :  1.016 s
VMUDM  :  0.543 s
VMADM  :  1.037 s
VMUDN  :  0.539 s
VMADN  :  1.017 s
VMUDH  :  0.534 s
VMADH  :  0.714 s
VADD   :  0.586 s
VSUB   :  0.580 s
VABS   :  0.275 s
VADDC  :  0.338 s
VSUBC  :  0.374 s
VSAW   :  0.138 s
VLT    :  0.298 s
VEQ    :  0.232 s
VNE    :  0.250 s
VGE    :  0.303 s
VCH    :  0.561 s
VCL    :  0.748 s
VCR    :  0.368 s
VMRG   :  0.234 s
VAND   :  0.126 s
VNAND  :  0.133 s
VOR    :  0.130 s
VNOR   :  0.130 s
VXOR   :  0.129 s
VNXOR  :  0.132 s
VRCPL  :  0.939 s
VRSQL  :  0.984 s
VRCPH  :  0.244 s
VRSQH  :  0.238 s
VMOV   :  0.118 s
VNOP   :  0.058 s
Total time spent:  22.338 s
Incidentally, I also fixed SRV, but that's meaningless at the moment because I have never seen a game use that op-code besides the custom audio ucode in Resident Evil 2. Even then, the address was always even and never odd, as far as I have seen.

Anyway, since SDV and SQV are executed at least as often as VMULF and VMACF in audio ucodes (sometimes more often, and if not, then very close), this of course yields a noticeable speedup to the DLL in LLE.

My next focus is probably those damned multiply-accumulate 16-bit segmentations in my SSE2 driver... they still take > 1.000 s each, which is annoying.

There are other op-codes I could optimize, but remember, I already wrote a counter for how frequently each RSP op-code gets executed, so I am using that:
http://forum.pj64-emu.com/showthread.php?t=3398

VMADH and VMADN are executed far more often than anything under ?WC2 in many games.
  #520  
Old 28th September 2013, 02:37 AM
HatCat

I decided to consult a cheat sheet and adopt some of the explicit intrinsics for low clamping (used by VM?DL and VM?DN) from MarathonMan's CEN64 RSP fork.

His original code is almost completely unmodified, except for a couple of things about the convenience of the in-line access (and my choice of 4 spaces of indentation instead of 2).

Code:
static INLINE void _MM_sclampz_lo(short* VD)
{
    __m128i accHi, accMid, accLow;
    __m128i negVal, posVal;
    __m128i negCheck, useValMask;
    __m128i setMask = _mm_cmpeq_epi16(_mm_setzero_si128(), _mm_setzero_si128()); /* all-ones mask */

    accHi  = _mm_load_si128((__m128i *)VACC_H);
    accMid = _mm_load_si128((__m128i *)VACC_M);
    accLow = _mm_load_si128((__m128i *)VACC_L);

    /* Compute some common values ahead of time. */
    negCheck = _mm_cmplt_epi16(accHi, _mm_setzero_si128());

    /* If accumulator < 0, clamp to Val if Val != TMin. */
    useValMask = _mm_and_si128(accHi, _mm_srai_epi16(accMid, 15));
    useValMask = _mm_cmpeq_epi16(useValMask, setMask);
    negVal = _mm_and_si128(useValMask, accLow);

    /* Otherwise, clamp to ~0 if any high bits are set. */
    useValMask = _mm_or_si128(accHi, _mm_srai_epi16(accMid, 15));
    useValMask = _mm_cmpeq_epi16(useValMask, _mm_setzero_si128());
    posVal = _mm_and_si128(useValMask, accLow);

    negVal = _mm_and_si128(negCheck, negVal);
    posVal = _mm_andnot_si128(negCheck, posVal);

    _mm_store_si128((__m128i *)VD, _mm_or_si128(negVal, posVal));
    return;
}
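In practice the helper just replaces the hand-rolled clamp at the tail of the affected handlers, something like this (a hedged usage sketch, not the plugin's actual code):

Code:
    /* ... after accumulating into VACC_H / VACC_M / VACC_L ... */
    _MM_sclampz_lo(VD); /* clamp ACC 15..0 back into VR[vd] */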
This saved 10 Intel/SSE2 instructions from being generated for both VMADL and VMADN.
So both VMADL and VMADN are 10 instructions smaller now and SHOULD be faster.

However, the speed difference is not really noticeable in my benchmark log yet.

Before adopting MarathonMan's intrinsics for signed clamping:
Code:
VMADL  :  0.959 s
VMADN  :  0.985 s
After adopting MarathonMan's intrinsics for signed clamping:
Code:
VMADL  :  0.959 s
VMADN  :  0.986 s
It would seem that any performance improvements as a result of doing this are microscopic.

I kept trying to suspend the other processes I have open on Windows 7 so they wouldn't interfere with the CPU, but there's always some unpredictable behavior causing minuscule CPU spikes at random points during the benchmark test.

Anyway, I don't care.
10 instructions less = probably faster, so for now I'll keep it.