Go Back   Project64 Forums > General Discussion > Open Discussion

#521 - 28th September 2013, 09:10 AM - HatCat (Alpha Tester, Project Supporter, Senior Member; joined Feb 2007; location: In my hat.; 16,256 posts)

Quote:
Originally Posted by BatCat
Code:
RSP Vector Benchmarks Log

...

VMULF  :  0.522 s
VMACF  :  1.111 s
VMULU  :  0.576 s
VMACU  :  1.021 s
VMUDL  :  0.510 s
VMADL  :  1.016 s
VMUDM  :  0.543 s
VMADM  :  1.037 s
VMUDN  :  0.539 s
VMADN  :  1.017 s
VMUDH  :  0.534 s
VMADH  :  0.714 s

...
Total time spent:  22.338 s
Code:
VMULF  :  0.539 s
VMACF  :  0.865 s
VMULU  :  0.558 s
VMACU  :  0.990 s
VMUDL  :  0.479 s
VMADL  :  0.969 s
VMUDM  :  0.539 s
VMADM  :  0.882 s
VMUDN  :  0.523 s
VMADN  :  0.928 s
VMUDH  :  0.334 s
VMADH  :  0.579 s

Total time spent:  21.812 s
macu is only faster incidentally.
I never touched that opcode. I only highlighted the opcodes I revised, which are all faster.
#522 - 29th September 2013, 05:50 AM - HatCat

I'm not following these manipulations on VADD/VSUB of yours, MarathonMan.

You say it like this:

VADD
Code:
...

    /* VD is the signed sum of the two sources and the carry. Since we */
    /* have to saturate the sum of all three, we have to be clever. */
    __vmin = _mm_min_epi16(__vs, __vt);
    __vmax = _mm_max_epi16(__vs, __vt);
    __vmin = _mm_adds_epi16(__vmin, __vco);
    __vd = _mm_adds_epi16(__vmin, __vmax);
(What the hell are _max_ and _min_ for? Just add them directly!)

VSUB
Code:
...

    /* VD is the signed diff of the two sources and the carry. Since we */
    /* have to saturate the diff of all three, we have to be clever. */
    __vtneg = _mm_cmplt_epi16(__vt, _mm_setzero_si128());
    __vtpos = _mm_cmpeq_epi16(__vtneg, _mm_setzero_si128());

    __vd = _mm_subs_epi16(__vs, __vt);
    __vnof = _mm_cmpeq_epi16(__vchk, __vd);
    __vnof = _mm_and_si128(__vtneg, __vnof);

    __vtneg = _mm_and_si128(__vnof, __vco);
    __vtpos = _mm_and_si128(__vtpos, __vco);
    __vco = _mm_or_si128(__vtneg, __vtpos);
    __vd = _mm_subs_epi16(__vd, __vco);
Those appear to be extraneous steps and have nothing to do with the required mode of operation.

Thanks to your references to the SSE2 `adds` and `subs` operations, I was able to look them up in the Intel manual and find out that I was wrong about needing 32-bit storage... this can be done more directly than I thought!

In fact, the clamp to the VR[vd] file can be done even more directly than what you write in that fork of the emulator:

Code:
extern short co[N];

static INLINE void SIGNED_CLAMP_ADD(short* VD, short* VS, short* VT)
{
    __m128i dst, src, vco;

    src = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);

    src = _mm_adds_epi16(src, dst);
    dst = _mm_adds_epi16(src, vco);
    _mm_store_si128((__m128i *)VD, dst);
    return;
}
static INLINE void SIGNED_CLAMP_SUB(short* VD, short* VS, short* VT)
{
    __m128i dst, src, vco;

    src = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);

    src = _mm_subs_epi16(src, dst);
    dst = _mm_subs_epi16(src, vco);
    _mm_store_si128((__m128i *)VD, dst);
    return;
}
It is otherwise similar to your method, but with fewer steps and, IMO, easier to read.


I still can't believe how quickly these optimizations are accelerating now that I am using SSE, after spending an entire year on non-SSE, pure x86 machine optimizations!

I owe this all to you, though I fear that your own efforts to assist me may have consolidated some forces to be weighed against your own work.
#523 - 29th September 2013, 06:01 AM - HatCat

Over 80 lines of crap removed from VSUB:

Code:
_VSUB:
LFB1203:
	.cfi_startproc
	movb	_inst+1, %al
	shrb	$3, %al
	movl	_inst, %edx
	shrw	$6, %dx
	andl	$31, %edx
	sall	$4, %edx
	movzbl	%al, %eax
	sall	$4, %eax
	movdqu	_VR(%eax), %xmm0
	psubw	_ST, %xmm0
	psubw	_co, %xmm0
	movdqa	%xmm0, _VACC+32
	movdqu	_VR(%eax), %xmm0
	psubsw	_ST, %xmm0
	psubsw	_co, %xmm0
	movdqa	%xmm0, _VR(%edx)
	pxor	%xmm0, %xmm0
	movdqa	%xmm0, _ne
	movdqa	%xmm0, _co
	ret
	.cfi_endproc
Over 80 lines of crap removed from VADD:
Code:
_VADD:
LFB632:
	.cfi_startproc
	movb	_inst+1, %al
	shrb	$3, %al
	movl	_inst, %edx
	shrw	$6, %dx
	andl	$31, %edx
	sall	$4, %edx
	movzbl	%al, %eax
	sall	$4, %eax
	movdqu	_VR(%eax), %xmm0
	paddw	_ST, %xmm0
	paddw	_co, %xmm0
	movdqa	%xmm0, _VACC+32
	movdqu	_VR(%eax), %xmm0
	paddsw	_ST, %xmm0
	paddsw	_co, %xmm0
	movdqa	%xmm0, _VR(%edx)
	pxor	%xmm0, %xmm0
	movdqa	%xmm0, _ne
	movdqa	%xmm0, _co
	ret
	.cfi_endproc
Benchmark before:
Code:
VADD   :  0.576 s
VSUB   :  0.579 s
Benchmark after:
Code:
VADD   :  0.225 s
VSUB   :  0.255 s
#524 - 29th September 2013, 06:44 AM - oddMLan (Alpha Tester, Project Supporter, Senior Member; joined Jan 2009; location: Parappa Town; 210 posts)
#525 - 29th September 2013, 10:43 PM - MarathonMan (Alpha Tester, Project Supporter, Senior Member; joined Jan 2013; 454 posts)

Quote:
Originally Posted by BatCat
I'm not following these manipulations on VADD/VSUB of yours, MarathonMan.

You say it like this:

VADD
Code:
...

    /* VD is the signed sum of the two sources and the carry. Since we */
    /* have to saturate the sum of all three, we have to be clever. */
    __vmin = _mm_min_epi16(__vs, __vt);
    __vmax = _mm_max_epi16(__vs, __vt);
    __vmin = _mm_adds_epi16(__vmin, __vco);
    __vd = _mm_adds_epi16(__vmin, __vmax);
(What the hell are _max_ and _min_ for? Just add them directly!)

VSUB
Code:
...

    /* VD is the signed diff of the two sources and the carry. Since we */
    /* have to saturate the diff of all three, we have to be clever. */
    __vtneg = _mm_cmplt_epi16(__vt, _mm_setzero_si128());
    __vtpos = _mm_cmpeq_epi16(__vtneg, _mm_setzero_si128());

    __vd = _mm_subs_epi16(__vs, __vt);
    __vnof = _mm_cmpeq_epi16(__vchk, __vd);
    __vnof = _mm_and_si128(__vtneg, __vnof);

    __vtneg = _mm_and_si128(__vnof, __vco);
    __vtpos = _mm_and_si128(__vtpos, __vco);
    __vco = _mm_or_si128(__vtneg, __vtpos);
    __vd = _mm_subs_epi16(__vd, __vco);
Those appear to be extraneous steps and have nothing to do with the required mode of operation.

...
Be careful, young Skywalker, you must!

VADD and VSUB treat the inputs as signed, correct? The intrinsics, _mm_adds_epi16 and _mm_subs_epi16, also treat inputs as two's complement signed values. Thus it could be the case that we have:

INT16_MIN + INT16_MIN + VCO. Here, the result should be INT16_MIN. If we just do a flat saturating add and then tack on the carry, our result is INT16_MIN + 1, when in reality it should be INT16_MIN. Is this bug substantial enough to cause noticeable problems? Almost certainly not, but I believe it is the wrong behaviour. Similar logic follows for VSUB, although there it is not as easy to maintain the same semantics, since the min and max trick will not work.

I was extremely meticulous when writing these opcodes, so they should be sterling clean (at least, everything other than LWC2/SWC2... but those are technically scalar anyways).

BatCat, day 1:
Quote:
I will never use anything but ANSI C, screw intrinsics.
BatCat, day 50:
Quote:
Okay, these intrinsics are kinda cool...
BatCat, day 100:

Last edited by MarathonMan; 29th September 2013 at 11:00 PM.
#526 - 29th September 2013, 11:12 PM - HatCat

Quote:
Originally Posted by MarathonMan
Be careful, young Skywalker, you must!

VADD and VSUB treat the inputs as signed, correct? The intrinsics, _mm_adds_epi16 and _mm_subs_epi16, also treat inputs as two's complement signed values. Thus it could be the case that we have:

INT16_MIN + INT16_MIN + VCO. Here, the result should be INT16_MIN. If we just do a flat saturating add and then tack on the carry, our result is INT16_MIN + 1, when in reality it should be INT16_MIN.
Well the minimum short is -0x8000.

So adds(adds(-0x8000, -0x8000), 0x0001) is -0x8001,
But adds(adds(-0x8000, 0x0001), -0x8000) is still -0x8000.

So I still don't see really what you're getting at.

If you had mangled the order of operations yourself, then MAX and MIN would be just as irrelevant.
If you did (VS=MIN + VT=MIN) + (VCO is 1), then it would come out as -0x8001 and you would still have the bug.

If nothing else, it's the order of operations that matters.

Quote:
Originally Posted by MarathonMan
BatCat, day 100:
The purpose of the intrinsics is only to handle cases where the compiler has no anticipated chance of recognizing them itself.
Obviously if it was possible for compilers to easily convert the mechanism into a signed clamp intrinsic I would not need to write the intrinsic.

I never really ruled out using intrinsics, just overusing them.
#527 - 29th September 2013, 11:53 PM - MarathonMan

Quote:
Originally Posted by BatCat
Well the minimum short is -0x8000.

So adds(adds(-0x8000, -0x8000), 0x0001) is -0x8001,
But adds(adds(-0x8000, 0x0001), -0x8000) is still -0x8000.

So I still don't see really what you're getting at.
Errr... I'm getting at the fact that the result is different?

Your code:
Code:
    src = _mm_adds_epi16(src, dst);
    dst = _mm_adds_epi16(src, vco);
    _mm_store_si128((__m128i *)VD, dst);
Works fine for 99.99% of test cases. It fails some though:

Per my example:
src=0x8000, dst=0x8000, vco=1

As you say in your own post:
So adds(adds(-0x8000, -0x8000), 0x0001) is -0x8001,

I agree, assuming you mean -0x8001 == INT_MIN + 1 (personally I prefer 0x8001, as we are arguing semantics on two's complement values).

In which case, that is the wrong result. Or at least it is IMO.

The correct result should be, effectively, adds(src, dst, vco), which is 0x8000 or INT_MIN. This is what my source produces, by forcing VCO to be added to the minimum of the input operands and then adding that sum to the other operand.

Last edited by MarathonMan; 29th September 2013 at 11:57 PM.
#528 - 30th September 2013, 12:09 AM - HatCat

That actually is not my code any longer, as the problem was easily amended, and I think maybe you didn't finish reading my post.

"But adds(adds(-0x8000, 0x0001), -0x8000) is still -0x8000."
That's the correct result, not a bug.
adds(VS, adds(VT, co))

Code:
static INLINE void SIGNED_CLAMP_ADD(short* VD, short* VS, short* VT)
{
    __m128i dst, vco, xmm;

    xmm = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);

    dst = _mm_adds_epi16(dst, vco);
    xmm = _mm_adds_epi16(xmm, dst);
    _mm_store_si128((__m128i *)VD, xmm);
    return;
}
So until you can give me a set of 3 numbers to add that breaks the code I just gave you, I'm afraid I really can't see where your max/min argument is going.

The only reason you don't have that bug is because you add vco to VT/VS first, BEFORE doing VS + VT, not because of that redundant-looking MAX/MIN stuff (but forgive me if I am wrong).

I mean, your code looks like,
max(1, 2) + min(1, 2), which is still 1 + 2 ???
I really don't see the purpose of max/min in your file.
You just add the two numbers directly, like I said.
It just has to be in the right order, right?

Quote:
Originally Posted by MarathonMan
Code:
    src = _mm_adds_epi16(src, dst);
    dst = _mm_adds_epi16(src, vco);
    _mm_store_si128((__m128i *)VD, dst);
Works fine for 99.99% of test cases.
I would say much less than 99.99% of the time does that work.

In fact, I think it's more like ~50 % success rate.
I don't know why I didn't get any gfx/sound errors as a result of that change, but like I said, I fixed that with the code I gave you. :P

(-0x8000 + -0xFFFF) + true : -0x8001 (wrong, should be -0x8000)
(-0x8000 + -0xFFFE) + true : -0x8001 (wrong, should be -0x8000)
...
(-0x8000 + -0x8000) + true : -0x8001 (wrong, should be -0x8000)

So any time adds(VS, VT) needs to be clamped for underflow (which, like I said, happens about 50% of the time), it would be a bug on my part.

And as I said you are only immune to this bug because you add each VCO flag first, before adding VS+VT.
It has NOTHING to do with the max/min in your program, just the correct order of operations you chose.

If your code looked like this:
Code:
...
    /* VD is the signed sum of the two sources and the carry. Since we */
    /* have to saturate the sum of all three, we have to be clever. */
    __vmin = _mm_min_epi16(__vs, __vt);
    __vmax = _mm_max_epi16(__vs, __vt);
    __vd = _mm_adds_epi16(__vmin, __vmax);
    __vmin = _mm_adds_epi16(__vd, __vco);
instead of this:
Code:
...
    /* VD is the signed sum of the two sources and the carry. Since we */
    /* have to saturate the sum of all three, we have to be clever. */
    __vmin = _mm_min_epi16(__vs, __vt);
    __vmax = _mm_max_epi16(__vs, __vt);
    __vmin = _mm_adds_epi16(__vmin, __vco);
    __vd = _mm_adds_epi16(__vmin, __vmax);
Then you would have exactly the same bug as I just got done fixing , regardless of max/min.

EDIT: Actually that's wrong; it should be a 75% rate of success, not 50.
Not only must (VS + VT < -32768) hold, but the corresponding flag must also be set.
So we have a 1/2 chance of VCO being set and a 1/2 chance of underflow after adding VS + VT,
so 1/2 * 1/2 = 1/4, and 4/4 - 1/4 = 3/4 = a 75% chance of success, 25% chance of failure.

So mysteriously I never once had the bug in mario/quest64 after I used that old buggy code version.
But ideally, there should have been problems 25% of the time.

Last edited by HatCat; 30th September 2013 at 12:31 AM.
#529 - 30th September 2013, 01:23 AM - MarathonMan

Quote:
Originally Posted by BatCat
That actually is not my code any longer, as the problem was easily amended, and I think maybe you didn't finish reading my post.
To be fair, you just posted a different block of code than the "fix" in your OP. Each has a different test case it fails on.

OP:
Code:
static INLINE void SIGNED_CLAMP_ADD(short* VD, short* VS, short* VT)
{
    __m128i dst, src, vco;

    src = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);

    src = _mm_adds_epi16(src, dst);
    dst = _mm_adds_epi16(src, vco);
    _mm_store_si128((__m128i *)VD, dst);
    return;
}
aka
tmp = adds(vs, vt)
dst = adds(tmp, vco)

Most recent post:
Code:
static INLINE void SIGNED_CLAMP_ADD(short* VD, short* VS, short* VT)
{
    __m128i dst, vco, xmm;

    xmm = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);

    dst = _mm_adds_epi16(dst, vco);
    xmm = _mm_adds_epi16(xmm, dst);
    _mm_store_si128((__m128i *)VD, xmm);
    return;
}
aka
tmp = adds(vt, vco)
dst = adds(vs, tmp)

Quote:
Originally Posted by BatCat
So until you can give me a pair of 3 numbers to add that breaks that code I just gave you I'm afraid I really can't understand where your max/min argument is going.
Your most recent "fix" breaks with:
VT=INT16_MAX, VCO=1, VS=-1

Your most recent solution would produce:
adds(INT16_MAX, 1) = INT16_MAX
adds(INT16_MAX, -1) = INT16_MAX - 1
even though

adds(INT16_MAX, -1, 1) = INT16_MAX.

So winrar is I.

I'm ill so I'm tl;dr-ing over the rest of this post because I'm certain that I'm right.

Last edited by MarathonMan; 30th September 2013 at 01:56 AM.
#530 - 30th September 2013, 02:24 AM - HatCat

First off, there is no such intrinsic as "adds(a, b, c)" because you can only have two parameters in this intrinsic, correct?

So why do you keep writing it that way, when the real issue I am trying to address is your use of the max/min SSE2 intrinsics *before* adding, as in "adds(b, c), adds(a, b)", and not the more obvious, should-be-concluded-already issue of adding vco first rather than last?

Because I never questioned you about both of those things, only one of those things.

So why keep arguing about both of them? As I said, the latter issue was already fixed.

Quote:
Originally Posted by MarathonMan
To be fair, you just posted a different block of code than "fix" in your OP.
I thought this would be obvious. :/

What I posted is what my code is, not what my code "used to be".

I fail to see how quoting what "used to be" the case is relevant to your argument against what I am actually challenging.

Quote:
Originally Posted by MarathonMan
Your most recent "fix" breaks with:
VT=INT16_MAX, VCO=1, VS=-1

Your most recent solution would produce:
adds(INT16_MAX, 1) = INT16_MAX
adds(INT16_MAX, -1) = INT16_MAX -1
even though

adds(INT16_MAX, -1, 1) = INT16_MAX.
Thanks; this is what I needed to know.

So then you just need to add VT+(VS+vco) if VT > VS.
Code:
(VS, VT, co) = (-1, +32767, true)

VS + (VT + co):  [-1 + (+32767 + 1)] is wrong.
vs.
VT + (VS + co):  [+32767 + (-1 + 1)] is right.
Conversely, if VS > VT, the order must be inverted because:
Code:
(VS, VT, co) = (+32767, -1, true)

VS + (VT + co):  [+32767 + (-1 + 1)] is right.
vs.
VT + (VS + co):  [-1 + (+32767 + 1)] is wrong.
So that finally explains the max/min line of thinking in that source file.
Code:
(VS, VT, co) = (-0x8000, -0x8000, true)

VS + (VT + co):  -32768 + (-32768 + 1)
VT + (VS + co):  -32768 + (-32768 + 1)
Thanks for the uh, "explanation", or lack thereof.
It was fun figuring it out by myself since you wouldn't just come out and tell me that was the issue.
Honestly I guess I just need to get used to this SSE line of thinking a lot more when I do this kind of crap.

I'll see if I reach the same conclusion for your VSUB solution shortly.