#521
Quote:
Code:
VMULF : 0.539 s    VMACF : 0.865 s
VMULU : 0.558 s    VMACU : 0.990 s
VMUDL : 0.479 s    VMADL : 0.969 s
VMUDM : 0.539 s    VMADM : 0.882 s
VMUDN : 0.523 s    VMADN : 0.928 s
VMUDH : 0.334 s    VMADH : 0.579 s
Total time spent: 21.812 s

I never touched that opcode. I only highlighted the opcodes I revised, which are all faster.
__________________
http://theoatmeal.com/comics/cat_vs_internet
#522
I'm not following these manipulations on VADD/VSUB of yours, MarathonMan.
You say it like this:

VADD Code:
...
/* VD is the signed sum of the two sources and the carry. Since we */
/* have to saturate the sum of all three, we have to be clever.    */
__vmin = _mm_min_epi16(__vs, __vt);
__vmax = _mm_max_epi16(__vs, __vt);
__vmin = _mm_adds_epi16(__vmin, __vco);
__vd = _mm_adds_epi16(__vmin, __vmax);

VSUB Code:
...
/* VD is the signed diff of the two sources and the carry. Since we */
/* have to saturate the diff of all three, we have to be clever.    */
__vtneg = _mm_cmplt_epi16(__vt, _mm_setzero_si128());
__vtpos = _mm_cmpeq_epi16(__vtneg, _mm_setzero_si128());
__vd = _mm_subs_epi16(__vs, __vt);
__vnof = _mm_cmpeq_epi16(__vchk, __vd);
__vnof = _mm_and_si128(__vtneg, __vnof);
__vtneg = _mm_and_si128(__vnof, __vco);
__vtpos = _mm_and_si128(__vtpos, __vco);
__vco = _mm_or_si128(__vtneg, __vtpos);
__vd = _mm_subs_epi16(__vd, __vco);

Thanks to the references you made to SSE2 `adds` and `subs`, I was able to look them up in the Intel manual and find out that I was wrong that you needed 32-bit storage... this can go more directly than I thought! In fact, the clamp to the VR[vd] file can be done even more directly than what you write in that fork of the emulator: Code:
extern short co[N];

static INLINE void SIGNED_CLAMP_ADD(short* VD, short* VS, short* VT)
{
    __m128i dst, src, vco;

    src = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);
    src = _mm_adds_epi16(src, dst);
    dst = _mm_adds_epi16(src, vco);
    _mm_store_si128((__m128i *)VD, dst);
    return;
}

static INLINE void SIGNED_CLAMP_SUB(short* VD, short* VS, short* VT)
{
    __m128i dst, src, vco;

    src = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);
    src = _mm_subs_epi16(src, dst);
    dst = _mm_subs_epi16(src, vco);
    _mm_store_si128((__m128i *)VD, dst);
    return;
}

I still can't believe how quickly these optimizations are accelerating now that I am using SSE, after spending an entire year on non-SSE, pure x86 machine optimizations! I owe this all to you.
#523
Over 80 lines of crap removed from VSUB:
Code:
_VSUB:
LFB1203:
	.cfi_startproc
	movb	_inst+1, %al
	shrb	$3, %al
	movl	_inst, %edx
	shrw	$6, %dx
	andl	$31, %edx
	sall	$4, %edx
	movzbl	%al, %eax
	sall	$4, %eax
	movdqu	_VR(%eax), %xmm0
	psubw	_ST, %xmm0
	psubw	_co, %xmm0
	movdqa	%xmm0, _VACC+32
	movdqu	_VR(%eax), %xmm0
	psubsw	_ST, %xmm0
	psubsw	_co, %xmm0
	movdqa	%xmm0, _VR(%edx)
	pxor	%xmm0, %xmm0
	movdqa	%xmm0, _ne
	movdqa	%xmm0, _co
	ret
	.cfi_endproc

Code:
_VADD:
LFB632:
	.cfi_startproc
	movb	_inst+1, %al
	shrb	$3, %al
	movl	_inst, %edx
	shrw	$6, %dx
	andl	$31, %edx
	sall	$4, %edx
	movzbl	%al, %eax
	sall	$4, %eax
	movdqu	_VR(%eax), %xmm0
	paddw	_ST, %xmm0
	paddw	_co, %xmm0
	movdqa	%xmm0, _VACC+32
	movdqu	_VR(%eax), %xmm0
	paddsw	_ST, %xmm0
	paddsw	_co, %xmm0
	movdqa	%xmm0, _VR(%edx)
	pxor	%xmm0, %xmm0
	movdqa	%xmm0, _ne
	movdqa	%xmm0, _co
	ret
	.cfi_endproc

Code:
VADD : 0.576 s
VSUB : 0.579 s

Code:
VADD : 0.225 s
VSUB : 0.255 s
#524
#525
Quote:
VADD and VSUB treat the inputs as signed, correct? The intrinsics, _mm_adds_epi16 and _mm_subs_epi16, also treat inputs as two's complement signed values. Thus it could be the case that we have: INT16_MIN + INT16_MIN + VCO. Here, the result should be INT16_MIN. If we just do a flat saturating add and then tack on the carry, our result is INT16_MIN + 1, when in reality it should be INT16_MIN.

Is this bug substantial enough to cause noticeable problems? Almost certainly not, but I believe it is the wrong behaviour. Similar logic follows for VSUB, albeit it's not as easy to correctly maintain the same semantics there, as the min and max trick will not work.

I was extremely meticulous when writing these opcodes, so they should be sterling clean (at least, everything other than LWC2/SWC2... but those are technically scalar anyways).

BatCat, day 1: Quote:
Last edited by MarathonMan; 29th September 2013 at 11:00 PM.
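The INT16_MIN + INT16_MIN + VCO case is easy to check with a scalar model of one `_mm_adds_epi16` lane. The `adds16` and `flat_clamp_add` helpers below are hypothetical illustrations, not code from either emulator:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of _mm_adds_epi16: 16-bit signed
 * saturating add. Hypothetical helper, not code from either project. */
static int16_t adds16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Flat order: saturate VS + VT first, then tack on the carry.
 * For (-32768, -32768, 1) this yields -32767, even though the exact
 * three-operand sum -32768 + -32768 + 1 = -65535 should clamp to -32768. */
static int16_t flat_clamp_add(int16_t vs, int16_t vt, int16_t co)
{
    return adds16(adds16(vs, vt), co);
}
```

So `flat_clamp_add(-32768, -32768, 1)` comes out as -32767, off by one from the true clamp, which is exactly the failure described above.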
#526
Quote:
So adds(adds(-0x8000, -0x8000), 0x0001) is -0x8001, but adds(adds(-0x8000, 0x0001), -0x8000) is still -0x8000. So I still don't really see what you're getting at. If you mangled the order of operations yourself, then MAX and MIN would be just as irrelevant factors. If you did (VS=MIN + VT=MIN) + (VCO is 1), then it would come out as -0x8001 and you would still have the bug.

If nothing else, it's the order of operations that matters. The purpose of the intrinsics is only to handle cases where the compiler has no anticipated chance of recognizing them itself. Obviously, if it were possible for compilers to easily convert the mechanism into a signed clamp intrinsic, I would not need to write the intrinsic. I never really ruled out using intrinsics, just overexerting them.
#527
Quote:
Your code: Code:
src = _mm_adds_epi16(src, dst);
dst = _mm_adds_epi16(src, vco);
_mm_store_si128((__m128i *)VD, dst);

Per my example: src = 0x8000, dst = 0x8000, vco = 1.

As you say in your own post: "So adds(adds(-0x8000, -0x8000), 0x0001) is -0x8001." I agree, assuming you mean -0x8001 == INT_MIN + 1 (personally I prefer 0x8001, as we are arguing semantics on two's complement values). In which case, that is the wrong result. Or at least it is IMO. The correct result should be, effectively, adds(src, dst, vco), which is 0x8000, or INT_MIN. This is what my source produces by forcing VCO to be added to the minimum of the input operands, and then adding that sum to the other operand.

Last edited by MarathonMan; 29th September 2013 at 11:57 PM.
#528
That actually is not my code any longer, as the problem was easily amended, and I think maybe you didn't finish reading my post.
![]() "But adds(adds(-0x8000, 0x0001), -0x8000) is still -0x8000." That's the correct result, not a bug. adds(VS, adds(VT, co)) Code:
static INLINE void SIGNED_CLAMP_ADD(short* VD, short* VS, short* VT)
{
    __m128i dst, vco, xmm;

    xmm = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);
    dst = _mm_adds_epi16(dst, vco);
    xmm = _mm_adds_epi16(xmm, dst);
    _mm_store_si128((__m128i *)VD, xmm);
    return;
}

The only reason you don't have that bug is because you add vco to VT/VS first, BEFORE doing VS + VT, not because of that redundant-looking MAX/MIN stuff (but forgive me if I am wrong). I mean, your code looks like max(1, 2) + min(1, 2), which is still 1 + 2??? I really don't see the purpose of max/min in your file. You just add the two numbers directly, like I said. It just has to be in the right order, right?

Quote:
In fact, I think it's more like a ~50% success rate. I don't know why I didn't get any gfx/sound errors as a result of that change, but like I said, I fixed that with the code I gave you. :P

(-0x8000 + -0xFFFF) + true : -0x8001 (wrong, should be -0x8000)
(-0x8000 + -0xFFFE) + true : -0x8001 (wrong, should be -0x8000)
...
(-0x8000 + -0x8000) + true : -0x8001 (wrong, should be -0x8000)

So any time adds(VS, VT) needs to be clamped for underflow (which has, like I said, about a 50% probability), it would be a bug on my part. And as I said, you are only immune to this bug because you add each VCO flag first, before adding VS + VT. It has NOTHING to do with the max/min in your program, just the correct order of operations you chose. If your code looked like this: Code:
...
/* VD is the signed sum of the two sources and the carry. Since we */
/* have to saturate the sum of all three, we have to be clever.    */
__vmin = _mm_min_epi16(__vs, __vt);
__vmax = _mm_max_epi16(__vs, __vt);
__vd = _mm_adds_epi16(__vmin, __vmax);
__vmin = _mm_adds_epi16(__vd, __vco);

then it would have the bug. But your actual code reads: Code:
...
/* VD is the signed sum of the two sources and the carry. Since we */
/* have to saturate the sum of all three, we have to be clever.    */
__vmin = _mm_min_epi16(__vs, __vt);
__vmax = _mm_max_epi16(__vs, __vt);
__vmin = _mm_adds_epi16(__vmin, __vco);
__vd = _mm_adds_epi16(__vmin, __vmax);

EDIT: Actually that's wrong; it should be a 75% rate of success, not 50. Not only must (VS + VT < -32768) occur, but of course the corresponding flag must also be set. So we have a 1/2 chance of VCO being set and a 1/2 chance of underflow after adding VS + VT, so 1/2 * 1/2 = 1/4, and 4/4 - 1/4 = 3/4 = 75% chance of success, 25% chance of failure. So mysteriously I never once had the bug in mario/quest64 after I used that old buggy code version. But ideally, there should have been problems 25% of the time.
Last edited by HatCat; 30th September 2013 at 12:31 AM.
#529
Quote:
OP: Code:
static INLINE void SIGNED_CLAMP_ADD(short* VD, short* VS, short* VT)
{
    __m128i dst, src, vco;

    src = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);
    src = _mm_adds_epi16(src, dst);
    dst = _mm_adds_epi16(src, vco);
    _mm_store_si128((__m128i *)VD, dst);
    return;
}

tmp = adds(vs, vt)
dst = adds(tmp, vco)

Most recent post: Code:
static INLINE void SIGNED_CLAMP_ADD(short* VD, short* VS, short* VT)
{
    __m128i dst, vco, xmm;

    xmm = _mm_load_si128((__m128i *)VS);
    dst = _mm_load_si128((__m128i *)VT);
    vco = _mm_load_si128((__m128i *)co);
    dst = _mm_adds_epi16(dst, vco);
    xmm = _mm_adds_epi16(xmm, dst);
    _mm_store_si128((__m128i *)VD, xmm);
    return;
}

tmp = adds(vt, vco)
dst = adds(vs, tmp)

Quote:
VT = INT16_MAX, VCO = 1, VS = -1

Your most recent solution would produce:

adds(INT16_MAX, 1) = INT16_MAX
adds(INT16_MAX, -1) = INT16_MAX - 1

even though adds(INT16_MAX, -1, 1) = INT16_MAX. So winrar is I.

I'm ill, so I'm tl;dr-ing over the rest of this post, because I'm certain that I'm right.

Last edited by MarathonMan; 30th September 2013 at 01:56 AM.
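This counterexample can be checked with a scalar model of one lane. The `adds16`, `carry_into_vt`, and `carry_into_min` helpers are hypothetical stand-ins for the intrinsics, not code from either project; they also show why routing the carry into the *minimum* operand avoids both failure modes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one lane of _mm_adds_epi16. */
static int16_t adds16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* "Carry into VT first", as in the quoted SIGNED_CLAMP_ADD revision. */
static int16_t carry_into_vt(int16_t vs, int16_t vt, int16_t co)
{
    return adds16(vs, adds16(vt, co));
}

/* Carry into the smaller operand: a 0-or-1 carry added to the minimum
 * cannot saturate high prematurely, so only the final add can clamp. */
static int16_t carry_into_min(int16_t vs, int16_t vt, int16_t co)
{
    int16_t lo = vs < vt ? vs : vt;
    int16_t hi = vs < vt ? vt : vs;
    return adds16(hi, adds16(lo, co));
}
```

For (VS, VT, VCO) = (-1, INT16_MAX, 1), `carry_into_vt` returns 32766 because adds(INT16_MAX, 1) saturates too early, while `carry_into_min` returns the exact clamped sum, 32767.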
#530
First off, there is no such intrinsic as "adds(a, b, c)" because you can only have two parameters in this intrinsic, correct?
So why do you keep writing it that way, when the real issue I am trying to address is your use of the max/min SSE2 intrinsics *before* doing "adds(b, c), adds(a, b)", and not the more obvious, should-be-concluded-already issue of adding vco first rather than last? I never questioned you about both of those things, only one of them, so why keep arguing about both? As I said, the latter issue was already fixed.

Quote:
What I posted is what my code is, not what my code "used to be". I fail to see how quoting what "used to be" the case is relevant to your argument against what I am actually challenging. Quote:
So then you just need to add VT+(VS+vco) if VT > VS. Code:
(VS, VT, co) = (-1, +32767, true)
VS + (VT + co): [-1 + (+32767 + 1)] is wrong.
vs.
VT + (VS + co): [+32767 + (-1 + 1)] is right.

Code:
(VS, VT, co) = (+32767, -1, true)
VS + (VT + co): [+32767 + (-1 + 1)] is right.
vs.
VT + (VS + co): [-1 + (+32767 + 1)] is wrong.

Code:
(VS, VT, co) = (-0x8000, -0x8000, true)
VS + (VT + co): -32768 + (-32768 + 1)
VT + (VS + co): -32768 + (-32768 + 1)

It was fun figuring it out by myself, since you wouldn't just come out and tell me that was the issue. Honestly, I guess I just need to get used to this SSE line of thinking a lot more when I do this kind of crap. I'll see if I reach the same conclusion for your VSUB solution shortly.
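For VSUB the analogous hazard shows up on the overflow side. A scalar sketch (the `subs16`, `flat_clamp_sub`, and `ref_clamp_sub` helpers are hypothetical illustrations, not either emulator's code) exhibits a case where a flat saturating subtract-then-carry goes wrong:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one lane of _mm_subs_epi16. */
static int16_t subs16(int16_t a, int16_t b)
{
    int32_t diff = (int32_t)a - (int32_t)b;
    if (diff > INT16_MAX) return INT16_MAX;
    if (diff < INT16_MIN) return INT16_MIN;
    return (int16_t)diff;
}

/* Flat order: saturate VS - VT first, then subtract the carry. */
static int16_t flat_clamp_sub(int16_t vs, int16_t vt, int16_t co)
{
    return subs16(subs16(vs, vt), co);
}

/* Reference: exact 32-bit difference VS - VT - co, clamped once. */
static int16_t ref_clamp_sub(int16_t vs, int16_t vt, int16_t co)
{
    int32_t diff = (int32_t)vs - (int32_t)vt - (int32_t)co;
    if (diff > INT16_MAX) return INT16_MAX;
    if (diff < INT16_MIN) return INT16_MIN;
    return (int16_t)diff;
}
```

For (VS, VT, co) = (+32767, -32767, true), the exact result 65533 clamps to +32767, but the flat order saturates to +32767 first and then subtracts the carry, giving 32766 — the overflow analogue of the VADD bug, and one reason the min/max trick doesn't carry over to VSUB unchanged.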