#21
26th March 2013, 07:27 PM
HatCat

vBulletin is powered in part by:

(jk it's not an ad )

from accnotes.txt:

Code:
Which method of sign-clamping is faster?

a) The current method (strict binary masking)
Code:
; #define MD 01 // middle slice of acc
    for (i = 0; i < 8; i++) /* Sign-clamp bits 31..16 of ACC to dest. VR. */
        if (VACC[i].DW & 0x800000000000) /* acc < 0 */
# old code:if (~VACC[i].DW & ~0x00007FFFFFFF) { // security risk
            if ((VACC[i].DW & 0xFFFF80000000) != 0xFFFF80000000)
                VR[vd][i] = 0x8000; /* slice underflow */
            else
                VR[vd][i] = VACC[i].s[MD];
        else
            if ((VACC[i].DW & 0xFFFF80000000) != 0x000000000000)
                VR[vd][i] = 0x7FFF; /* slice overflow */
            else
                VR[vd][i] = VACC[i].s[MD];
b) The arithmetic method (relative blt, bgt, w/ simpler branch tree)
Code:
; #define HI 02 // high slice of acc
    for (i = 0; i < 8; i++) /* One- or zero-extend 48-bit elements to 64b. */
        VACC[i].HW[03] = VACC[i].s[HI] >> 15;
    for (i = 0; i < 8; i++) /* Sign-clamp bits 31..16 of ACC to dest. VR. */
        if (VACC[i].DW < 0xFFFFFFFF80000000)
            VR[vd][i] = 0x8000; /* slice underflow */
        else if (VACC[i].DW > 0x00007FFFFFFF)
            VR[vd][i] = 0x7FFF; /* slice overflow */
        else
            VR[vd][i] = VACC[i].s[MD];
Also, since the vast majority of vector operations never read past the low 16 bits of each 48-bit accumulator (ACC[i] & FFFF), is it possible that destroying the union indexing service (redefining as `signed long long VACC[i]` instead of using a hybrid union type) could be of any speed benefit? (I'm guessing not.) Do questions like these apply only to interpreters, or also to recompilers?

Another question. Should the speed and size be exactly the same whether I say:
Code:
        if (VACC[i].DW & 0x800000000000) /* acc < 0 */
            if ((VACC[i].DW & 0xFFFF80000000) != 0xFFFF80000000)
                VR[vd][i] = 0x8000; /* slice underflow */
            else
                VR[vd][i] = VACC[i].s[MD];
        else
            if ((VACC[i].DW & 0xFFFF80000000) != 0x000000000000)
                VR[vd][i] = 0x7FFF; /* slice overflow */
            else
                VR[vd][i] = VACC[i].s[MD];
or:
Code:
        if (VACC[i].DW & 0x800000000000) /* acc < 0 */
            if ((VACC[i].DW & 0x7FFF80000000) != 0x7FFF80000000)
                VR[vd][i] = 0x8000; /* slice underflow */
            else
                VR[vd][i] = VACC[i].s[MD];
        else
            if ((VACC[i].DW & 0x7FFF80000000) != 0x000000000000)
                VR[vd][i] = 0x7FFF; /* slice overflow */
            else
                VR[vd][i] = VACC[i].s[MD];
? Again, most likely I already know, but I don't always get the chance to ask. :D

Last question, I think: for VMUDL we directly assign the accumulator, not add to it.
vmudl :: acc[i] = x * y  # not acc[i] += x * y
We therefore have the option of checking:
Code:
        if (VACC[i].DW < 0)
instead of:
Code:
        if (VACC[i].DW & 0x800000000000) /* acc < 0 */
I compiled both methods, and checking (__int64 < 0) comes out a few instructions larger in several cases than (__int64 & 0x800000000000), with some added branch labels to jump to. Is an arithmetic comparison against 0 necessarily supposed to be slower than masking out a single bit on a 32-bit build? eof
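For reference, since the notes only show the member accesses: this is roughly the accumulator type those snippets assume on a little-endian host (a sketch only; the names just mirror the usage above and may not match anyone's actual source).
Code:
/* Sketch of the assumed accumulator type (little-endian host); illustrative only. */
typedef union {
    signed long long DW;    /* whole 64-bit view; the low 48 bits hold the acc */
    signed short s[4];      /* s[0] = bits 15..0, s[1] = MD (31..16), s[2] = HI (47..32) */
    unsigned short HW[4];   /* HW[3] = bits 63..48, written by the extend pass in (b) */
} ACC;

#define MD 01   /* middle slice */
#define HI 02   /* high slice   */

static ACC VACC[8];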
Had to have all my questions pretyped to give me more time.

Some of them are duh-ish but I never get to talk to other experienced programmers regarding such levels and I remain without Internet

#22
26th March 2013, 07:41 PM
HatCat

Quote:
Originally Posted by MarathonMan
Ahhh, I didn't know that you were trying to stick to ANSI; I thought you were just using MSVCxx. C99 has inline. I was just trying to say that you should break it up somehow so it's more legible.
One of the issues is that MSVC requires `__inline` (or whatever) for C (using `inline` without the leading underscores requires C++). While I would love to defy that rule, use `inline`, and break MSVC so it compiles only on GCC, I would also love to keep things portable, and inline functions or inline assembly injections are not my favorite solutions in that regard.

Most likely, even if I don't declare it as inline, the optimization settings I use in MSVC will inline it for me anyway, but again this is not entirely portable, and it gives me a top-level (lowest-priority) warning message for each case of this (I like eliminating even the ANSI-compliance warnings PEW PEW! *super man laser vision* POWWWWW)
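For what it's worth, the usual portability shim looks something like this (just a sketch; the INLINE name is made up, not what I actually use), though as I said it's not my favorite kind of solution:
Code:
/* Sketch of a keyword shim; INLINE is an example name. */
#if defined(_MSC_VER) && !defined(__cplusplus)
#define INLINE __inline     /* MSVC's C mode only accepts the underscore spelling */
#elif defined(__GNUC__) || (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L)
#define INLINE inline       /* GCC and C99 have the real keyword */
#else
#define INLINE              /* strict C89: drop the hint entirely */
#endif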

Quote:
Originally Posted by MarathonMan
Basically, the aliasing rule says that you can't arbitrarily cast from one type to another and then go and reference it later. char * and void * are the exceptions to that rule.

Are you sure you're not getting the endian-ness mixed up somewhere along the way?
Actually as it turns out I was not sure.

I was wrong. Your method was exactly correct, except I needed to swap the byte address by XORing with 1, which modified the method you gave me so that it succeeds in all cases:
Code:
// #define VR_B(v, e) (((unsigned char *)VR[v])[e ^ 0x1])
// #define VR_B(v, e) ((((unsigned char *)(VR + v))[e ^ 0x1]))
#define VR_B(v, e) (*(unsigned char *)(((unsigned char *)(VR + v)) + (e ^ 0x1)))
(The commented-out lines are just different ways of saying the same thing, obviously; I prefer to define it with pointers to signify that it is a bypass of the original two-dimensional array data type. Not really important.)

Similarly, I was also wrong about my solution to the short * bypass.

Code:
#define VR_S(v, e) (*(short *)((unsigned char *)(*(VR + v)) + e + (e & 01)))
That was the correction I put in to fix the issue.

It was correct in all the even-element cases.
I had to make that modification for Conker's BFD and Zelda MM, because my old way without the + (e & 1) broke MFC2 (or MTC2, I forget which) and caused missing triangles.
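To convince myself the XOR trick does what I think, here's a tiny standalone test (a sketch only: it assumes VR is declared as short VR[32][8], which may not match the real plugin, and a little-endian host):
Code:
/* Sketch: why the (e ^ 1) is needed on a little-endian host.
 * Assumes short VR[32][8]; the macros are copied from above. */
#include <stdio.h>

static short VR[32][8];

#define VR_B(v, e) (*(unsigned char *)(((unsigned char *)(VR + (v))) + ((e) ^ 0x1)))
#define VR_S(v, e) (*(short *)((unsigned char *)(*(VR + (v))) + (e) + ((e) & 01)))

int main(void)
{
    VR[0][0] = 0x1234;  /* element 0 */
    VR[0][1] = 0x5678;  /* element 1 */

    /* Left-to-right byte order, as the RSP sees it:  12 34 56 78 */
    printf("%02X %02X %02X %02X\n", VR_B(0, 0), VR_B(0, 1), VR_B(0, 2), VR_B(0, 3));

    /* VR_S rounds an odd byte offset up to the next element boundary. */
    printf("%04X %04X\n", VR_S(0, 0) & 0xFFFF, VR_S(0, 1) & 0xFFFF);
    return 0;
}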

But !!@

I don't understand!
I thought the endian swapping issues would be fixed by using arrays instead of unions.

If you haven't noticed, bpoint's Project Unreality, zilmar's RSP emulator, and the MAME RSP emulator all use the union dynamic data types for the RSP vector registers. zilmar sort of fixed it because he XOR'd by 07 to make it big-endian (element 0 is the left-most; element 7 is the right-most).

I changed it from little- to big-endian quickly by just using arrays. Before, Ville Linde / MAME XOR'd the byte address by 1 in LBV/SBV because the union broke the endianness ... I thought I had inverted it so I would no longer have to do this.

I'm confused XD

there must be some way I can directly, instantly fetch the left-to-right byte address into a 16-byte VR >.< I'll keep searching for that while hopefully you can answer some of my other questions I just pasted off my flash drive

Also, unsigned clamping is so easy that I'm ripping out all the if-else-if branches; it's becoming purely static (like that vmulf sign-clamp I showed you).
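For example, the general branch-free trick looks something like this (just a sketch of the idea, not the exact vmulf clamp I showed you): compute all-ones/all-zeros masks from the comparisons and select with bitwise ops instead of branching.
Code:
/* Sketch: branch-free saturation of a 32-bit value to signed 16 bits. */
static short sat16(signed long x)
{
    signed long over  = -(x > +0x7FFF);   /* all ones if too big,   else 0 */
    signed long under = -(x < -0x8000);   /* all ones if too small, else 0 */

    x = (x & ~over)  | (+0x7FFF & over);
    x = (x & ~under) | (-0x8000 & under);
    return (short)x;
}
Whether that actually beats the compiler's own branch conversion is, of course, another thing I'd have to profile.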

#23
28th March 2013, 02:35 AM
MarathonMan

Quote:
Originally Posted by FatCat
Which method of sign-clamping is faster? [...] is it possible that destroying the union indexing service (redefining as `signed long long VACC[i]` instead of using a hybrid union type) could be of any speed benefit? [...] Should the speed and size be exactly the same whether I say (VACC[i].DW & 0xFFFF80000000) or (VACC[i].DW & 0x7FFF80000000)? [...] Is an arithmetic comparison against 0 necessarily supposed to be slower than masking out a single bit on a 32-bit build?
Whoops, got buried and didn't think to check.

a) You'd have to profile, but I'd guess (?) that the first one is faster by a slight margin. Though, on the other hand, if the second one hits the first case most of the time, it could be faster. Hard to say.

With the union, I don't see how it would help you any, either. Not enough to notice, anyways. This kind of stuff is just as applicable to recompilers.

b) I think you'd also need to profile this one, as it depends on which case is taken most often. Mind your braces, though... the indenting and the lack of braces suggest two different, conflicting control paths.

c) The arithmetic comparison would only be a cmp and a conditional jump, so it should be faster ... ? Fewer # of instructions doesn't always equate to higher performance. x86 breaks some instructions into micro-ops (simple instructions) before they get executed, in order to increase instruction-level parallelism. A sequence of instructions that needs to be broken down into micro-ops could overload the ucode decoders and result in many more micro-ops than if you were to use a few more, simpler, RISC-y instructions.

Last edited by MarathonMan; 28th March 2013 at 02:45 AM.

#24
28th March 2013, 02:39 AM
MarathonMan

Quote:
Originally Posted by FatCat
I'm confused XD

there must be some way I can directly, instantly fetch the left-to-right byte address into a 16-byte VR >.< I'll keep searching for that while hopefully you can answer some of my other questions I just pasted off my flash drive

Also, unsigned clamping is so easy I'm ripping out all the if-else-if branches; it's becoming purely static (like that vmulf sign-clamp) I showed you
Endian-ness can be a royal pain when you're working on writing an emulator for a different target. I remember spending like, an hour, staring at a fragment of code once where I was storing a vector as slices of shorts. The machine was actually flipping the bytes when I didn't expect it to, so I was actually compensating for endian-ness when I shouldn't have been.

Honestly, I ended up grabbing paper and pencil for this kinda stuff and tracing through it all to make sure that what I was thinking was reality. And using a debugger.

#25
4th April 2013, 06:49 AM
HatCat

Son of a bitch, so many updates.
I hate not being up-to-date on things.

Alas, isolating myself from technology was worth celebrating the week of all creation.

Quote:
Originally Posted by MarathonMan
Whoops, got buried and didn't think to check.
There was another thread in this forum where I asked something that you may have missed, but I ask too many questions.

Quote:
Originally Posted by MarathonMan
c) The arithmetic comparison would only be a cmp and a conditional jump, so it should be faster ... ? Fewer # of instructions doesn't always equate to higher performance. x86 breaks some instructions into micro-ops (simple instructions) before they get executed, in order to increase instruction-level parallelism. A sequence of instructions that needs to be broken down into micro-ops could overload the ucode decoders and result in many more micro-ops than if you were to use a few more, simpler, RISC-y instructions.
Mmmmm although I am familiar with this already, let me show exactly what's produced:

Code:
; Function compile flags: /Ogtpy
; File f:\rsp\vu\vmudh.h
_vd$ = 8						; size = 4
_vs$ = 12						; size = 4
_vt$ = 16						; size = 4
_e$ = 20						; size = 4
_VMUDH	PROC

; 5    :     register int i;
; 6    : 
; 7    :     for (i = 0; i < 8; i++)
; 8    :     {
; 9    :         VACC[i].DW = VR[vs][i] * VR[vt][ei[e][i]];

	mov	ecx, DWORD PTR _e$[esp-4]
	push	ebx
	push	esi
	mov	esi, DWORD PTR _vs$[esp+4]
	push	edi
	mov	edi, DWORD PTR _vt$[esp+8]
	shl	ecx, 5
	mov	eax, DWORD PTR _ei[ecx]
	add	edi, edi
	shl	esi, 4
	movsx	edx, WORD PTR _VR[esi]
	add	edi, edi
	add	edi, edi
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	imul	eax, edx
	cdq

; 10   :         VACC[i].DW <<= 16;

	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC, eax
	mov	eax, DWORD PTR _ei[ecx+4]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+4, edx
	movsx	edx, WORD PTR _VR[esi+2]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+8, eax
	mov	eax, DWORD PTR _ei[ecx+8]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+12, edx
	movsx	edx, WORD PTR _VR[esi+4]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+16, eax
	mov	eax, DWORD PTR _ei[ecx+12]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+20, edx
	movsx	edx, WORD PTR _VR[esi+6]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+24, eax
	mov	eax, DWORD PTR _ei[ecx+16]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+28, edx
	movsx	edx, WORD PTR _VR[esi+8]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+32, eax
	mov	eax, DWORD PTR _ei[ecx+20]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+36, edx
	movsx	edx, WORD PTR _VR[esi+10]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+40, eax
	mov	eax, DWORD PTR _ei[ecx+24]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+44, edx
	movsx	edx, WORD PTR _VR[esi+12]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+48, eax
	mov	DWORD PTR _VACC+52, edx

; 5    :     register int i;
; 6    : 
; 7    :     for (i = 0; i < 8; i++)
; 8    :     {
; 9    :         VACC[i].DW = VR[vs][i] * VR[vt][ei[e][i]];

	mov	eax, DWORD PTR _ei[ecx+28]
	movsx	ecx, WORD PTR _VR[esi+14]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	imul	eax, ecx
	mov	ecx, DWORD PTR _vd$[esp+8]
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	shl	ecx, 4
	mov	DWORD PTR _VACC+56, eax
	mov	DWORD PTR _VACC+60, edx
	mov	ebx, OFFSET _VACC
	add	ecx, OFFSET _VR
$LL9@VMUDH:
	mov	eax, DWORD PTR [ebx+4]
	mov	edx, DWORD PTR [ebx]
	mov	edi, eax
	and	edi, 32768				; 00008000H
	xor	esi, esi
	and	edx, -2147483648			; 80000000H
	and	eax, 65535				; 0000ffffH
	or	esi, edi
	je	SHORT $LN6@VMUDH
	cmp	edx, -2147483648			; 80000000H
	jne	SHORT $LN28@VMUDH
	cmp	eax, 65535				; 0000ffffH
	je	SHORT $LN2@VMUDH
$LN28@VMUDH:
	mov	edx, -32768				; ffff8000H
	mov	WORD PTR [ecx], dx
	jmp	SHORT $LN8@VMUDH
$LN6@VMUDH:
	or	edx, eax
	je	SHORT $LN2@VMUDH
	mov	edx, 32767				; 00007fffH
	mov	WORD PTR [ecx], dx
	jmp	SHORT $LN8@VMUDH
$LN2@VMUDH:
	mov	ax, WORD PTR [ebx+2]
	mov	WORD PTR [ecx], ax
$LN8@VMUDH:
	add	ebx, 8
	add	ecx, 2
	cmp	ebx, OFFSET _VACC+64
	jl	SHORT $LL9@VMUDH
	pop	edi
	pop	esi
	pop	ebx
	ret	0
_VMUDH	ENDP
That was the output when I check:
if (acc & 0x800000000000)

If I instead have the loop check (__int64 < 0), it looks like this:

Code:
; Function compile flags: /Ogtpy
; File f:\rsp\vu\vmudh.h
_vd$ = 8						; size = 4
_vs$ = 12						; size = 4
_vt$ = 16						; size = 4
_e$ = 20						; size = 4
_VMUDH	PROC

; 5    :     register int i;
; 6    : 
; 7    :     for (i = 0; i < 8; i++)
; 8    :     {
; 9    :         VACC[i].DW = VR[vs][i] * VR[vt][ei[e][i]];

	mov	ecx, DWORD PTR _e$[esp-4]
	push	esi
	mov	esi, DWORD PTR _vs$[esp]
	push	edi
	mov	edi, DWORD PTR _vt$[esp+4]
	shl	ecx, 5
	mov	eax, DWORD PTR _ei[ecx]
	add	edi, edi
	shl	esi, 4
	movsx	edx, WORD PTR _VR[esi]
	add	edi, edi
	add	edi, edi
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	imul	eax, edx
	cdq

; 10   :         VACC[i].DW <<= 16;

	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC, eax
	mov	eax, DWORD PTR _ei[ecx+4]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+4, edx
	movsx	edx, WORD PTR _VR[esi+2]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+8, eax
	mov	eax, DWORD PTR _ei[ecx+8]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+12, edx
	movsx	edx, WORD PTR _VR[esi+4]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+16, eax
	mov	eax, DWORD PTR _ei[ecx+12]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+20, edx
	movsx	edx, WORD PTR _VR[esi+6]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+24, eax
	mov	eax, DWORD PTR _ei[ecx+16]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+28, edx
	movsx	edx, WORD PTR _VR[esi+8]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+32, eax
	mov	eax, DWORD PTR _ei[ecx+20]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+36, edx
	movsx	edx, WORD PTR _VR[esi+10]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+40, eax
	mov	eax, DWORD PTR _ei[ecx+24]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	mov	DWORD PTR _VACC+44, edx
	movsx	edx, WORD PTR _VR[esi+12]
	imul	eax, edx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+48, eax
	mov	eax, DWORD PTR _ei[ecx+28]
	mov	DWORD PTR _VACC+52, edx

; 5    :     register int i;
; 6    : 
; 7    :     for (i = 0; i < 8; i++)
; 8    :     {
; 9    :         VACC[i].DW = VR[vs][i] * VR[vt][ei[e][i]];

	movsx	ecx, WORD PTR _VR[esi+14]
	add	eax, edi
	movsx	eax, WORD PTR _VR[eax*2]
	imul	eax, ecx
	cdq
	shld	edx, eax, 16
	shl	eax, 16					; 00000010H
	mov	DWORD PTR _VACC+56, eax
	mov	eax, DWORD PTR _vd$[esp+4]
	shl	eax, 4
	mov	DWORD PTR _VACC+60, edx
	mov	esi, OFFSET _VACC
	add	eax, OFFSET _VR
$LL9@VMUDH:
	mov	ecx, DWORD PTR [esi+4]
	mov	edx, DWORD PTR [esi]
	test	ecx, ecx
	jg	SHORT $LN6@VMUDH
	jl	SHORT $LN28@VMUDH
	test	edx, edx
	jae	SHORT $LN6@VMUDH
$LN28@VMUDH:
	and	edx, -2147483648			; 80000000H
	and	ecx, 65535				; 0000ffffH
	cmp	edx, -2147483648			; 80000000H
	jne	SHORT $LN29@VMUDH
	cmp	ecx, 65535				; 0000ffffH
	je	SHORT $LN2@VMUDH
$LN29@VMUDH:
	mov	edx, -32768				; ffff8000H
	mov	WORD PTR [eax], dx
	jmp	SHORT $LN8@VMUDH
$LN6@VMUDH:
	and	edx, -2147483648			; 80000000H
	and	ecx, 65535				; 0000ffffH
	or	edx, ecx
	je	SHORT $LN2@VMUDH
	mov	edx, 32767				; 00007fffH
	mov	WORD PTR [eax], dx
	jmp	SHORT $LN8@VMUDH
$LN2@VMUDH:
	mov	cx, WORD PTR [esi+2]
	mov	WORD PTR [eax], cx
$LN8@VMUDH:
	add	esi, 8
	add	eax, 2
	cmp	esi, OFFSET _VACC+64
	jl	SHORT $LL9@VMUDH
	pop	edi
	pop	esi
	ret	0
_VMUDH	ENDP
Does that help you point out what I may be missing in this comparison?

Actually, that was the MSVC output, where the total instruction count is almost exactly the same; the only big difference is that checking if (acc < 0) added more branch labels to jump to in the listing, whereas in the GCC output, checking acc & 0x800000000000 turned out much smaller.

My guess is that the latter gets optimized into reading a 16-bit portion of the accumulator union that was already loaded for use elsewhere, and turning that into a less-than-zero check, whereas checking whether an entire 64-bit value is less than zero is stricter and harder to optimize.
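Which suggests one more experiment (a sketch only, using the union layout from my notes, with HI == 2 on a little-endian host): test just the HI slice, so a 32-bit build needs only one 16-bit load for the sign check.
Code:
/* Sketch: bit 47 of the acc is the sign bit of the HI slice, so this is
 * equivalent to (VACC[i].DW & 0x800000000000) but touches only 16 bits. */
static int acc_is_negative(int i)
{
    return (VACC[i].s[HI] < 0);
}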

#26
4th April 2013, 06:54 AM
Mdkcheatz

You know this reminds me of a time....

#27
4th April 2013, 11:29 AM
HatCat

Quote:
Originally Posted by MarathonMan
Endian-ness can be a royal pain when you're working on writing an emulator for a different target. I remember spending like, an hour, staring at a fragment of code once where I was storing a vector as slices of shorts. The machine was actually flipping the bytes when I didn't expect it to, so I was actually compensating for endian-ness when I shouldn't have been.

Honestly, I ended up grabbing paper and pencil for this kinda stuff and tracing through it all to make sure that what I was thinking was reality. And using a debugger.
Heh heh heh, I totally understand and have done this on several occasions.

Sometimes being forced to do physics/math on paper is applied fun.
Other times I hate having to do hard work, and I try to train my instincts to avoid this kind of stuff.

But the issue becomes totally moot once I correct the DMEM endianness in an emulator free of the Windows plugin specifications.
Then it's just memcpy, SSE, and tricks like that.
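(For example, the sort of thing I mean; a sketch with made-up names, assuming DMEM is already stored in corrected byte order:)
Code:
/* Sketch (made-up names): with DMEM in corrected byte order, loading a vector
 * register is a plain fixed-size memcpy -- no pointer-cast aliasing games, and
 * the compiler can turn it into a pair of moves or a single SSE load. */
#include <string.h>

static void load_vr(short *vr, const unsigned char *dmem, unsigned int addr)
{
    memcpy(vr, dmem + addr, 8 * sizeof(short));
}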

I've made tons of commits every few weeks, btw; it still isn't the structure you suggested (globalizing operand data using pointers instead of function call stacks, etc.), at least not at the C level.

#28
4th April 2013, 11:34 AM
HatCat

Quote:
Originally Posted by MarathonMan
Mind your braces, though... the indenting and the lack of braces suggest two different, conflicting control paths.
I guess I didn't understand this one as well as the rest; yes, it's true that the braces help keep the syntax readable and maintainable, but brace placement is just another syntactic controversy I didn't feel I needed to settle for such simple loops.

Because the clamping behavior is indeed split:
for signed clamps,

if acc < 0 then:
a. if acc < -32768 then clamp res. to -32768
b. else then write res & FFFF
--
c. [else] if acc > +32767 then clamp to +32767
d. else then write res & FFFF

The exception is VMACQ, which I have just now added on GitHub; I delayed implementing that op since it seems no games ever use it.
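In C, that split is basically the following (a sketch using the same masks as in my notes, with the union layout assumed as before):
Code:
/* Sketch of the signed-clamp split described above. */
static short clamp_signed(int i)
{
    if (VACC[i].DW & 0x800000000000) {                       /* acc < 0 */
        if ((VACC[i].DW & 0xFFFF80000000) != 0xFFFF80000000)
            return -32768;                                   /* a. underflow        */
        return VACC[i].s[MD];                                /* b. write res & FFFF */
    }
    if (VACC[i].DW & 0xFFFF80000000)
        return +32767;                                       /* c. overflow         */
    return VACC[i].s[MD];                                    /* d. write res & FFFF */
}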

#29
4th April 2013, 12:06 PM
zilmar

Did you download the source of all my changes to the RSP since the original source release?

#30
5th April 2013, 01:39 PM
HatCat

I'm still trying to.

As I explained earlier, this shitty hotel Internet makes me reload a lot of pages in between viewing the forum index <---> reading/replying to a thread.

It's so intermittent that I need to learn some more about Git and find a quick command that will copy everything over so I can look at it better, which I am still in the process of figuring out.


I always felt that you were calculating the approach of open-sourcing Project64.
I just believed you would never do it; personally, I don't know that I'd have reacted the same way in your position. But at least it makes sense: after employing various individuals, you seem to find it better to trust humanity as a whole rather than personal relations and understandings, which were not enough to prevent the original PJ64 team from leaving; they hadn't contributed so actively until spite and rebellion against you inspired them to suddenly start working on the source again.

And you've freed yourself further from the shadowed legacy of NEMU, which only went open-source for its Direct3D plugin.