Project64 Forums "Static" interpreter?

#21
26th March 2013, 07:27 PM
 HatCat Alpha Tester Project Supporter Senior Member Join Date: Feb 2007 Location: In my hat. Posts: 16,260

vBulletin is powered in part by:

(jk it's not an ad )

from accnotes.txt:

Code:
```Which method of sign-clamping is faster?

a) The current method (strict binary masking)

Code:
; #define MD 01 // middle slice of acc
for (i = 0; i < 8; i++) /* Sign-clamp bits 31..16 of ACC to dest. VR. */
    if (VACC[i].DW & 0x800000000000) /* acc < 0 */
        /* old code: if (~VACC[i].DW & ~0x00007FFFFFFF) { -- security risk */
        if ((VACC[i].DW & 0xFFFF80000000) != 0xFFFF80000000)
            VR[vd][i] = 0x8000; /* slice underflow */
        else
            VR[vd][i] = VACC[i].s[MD];
    else
        if ((VACC[i].DW & 0xFFFF80000000) != 0x000000000000)
            VR[vd][i] = 0x7FFF; /* slice overflow */
        else
            VR[vd][i] = VACC[i].s[MD];
b) The arithmetic method (relative blt, bgt, w/ simpler branch tree)

Code:
; #define HI 02 // high slice of acc
for (i = 0; i < 8; i++) /* One- or zero-extend 48-bit elements to 64b. */
    VACC[i].HW[03] = VACC[i].s[HI] >> 15;
for (i = 0; i < 8; i++) /* Sign-clamp bits 31..16 of ACC to dest. VR. */
    if (VACC[i].DW < -0x80000000LL) /* signed constant; 0xFFFFFFFF80000000 is unsigned and forces an unsigned compare */
        VR[vd][i] = 0x8000; /* slice underflow */
    else if (VACC[i].DW > 0x00007FFFFFFF)
        VR[vd][i] = 0x7FFF; /* slice overflow */
    else
        VR[vd][i] = VACC[i].s[MD];
Also, since the vast majority of vector operations never read past the low 16
bits of each 48-bit accumulator (ACC[i] & FFFF), is it possible that destroying
the union indexing service (redefining as `signed long long VACC[i]` instead of
using a hybrid union type) could be of any speed benefit?  (I'm guessing not.)

Do questions like these apply only to interpreters or also to recompilers?

Another question.
Should the speed and size be exactly the same whether I say:

Code:
if (VACC[i].DW & 0x800000000000) /* acc < 0 */
    if ((VACC[i].DW & 0xFFFF80000000) != 0xFFFF80000000)
        VR[vd][i] = 0x8000; /* slice underflow */
    else
        VR[vd][i] = VACC[i].s[MD];
else
    if ((VACC[i].DW & 0xFFFF80000000) != 0x000000000000)
        VR[vd][i] = 0x7FFF; /* slice overflow */
    else
        VR[vd][i] = VACC[i].s[MD];
or:

Code:
if (VACC[i].DW & 0x800000000000) /* acc < 0 */
    if ((VACC[i].DW & 0x7FFF80000000) != 0x7FFF80000000)
        VR[vd][i] = 0x8000; /* slice underflow */
    else
        VR[vd][i] = VACC[i].s[MD];
else
    if ((VACC[i].DW & 0x7FFF80000000) != 0x000000000000)
        VR[vd][i] = 0x7FFF; /* slice overflow */
    else
        VR[vd][i] = VACC[i].s[MD];
?

Again, most likely I already know, but I don't always get the chance to ask. :D

Last question I think,

for VMUDL we directly assign the accumulator, not add to it.
vmudl :: acc[i] == x * y # not acc[i] + x * y

We therefore have the option of checking:

Code:
if (VACC[i].DW < 0)
instead of:

Code:
if (VACC[i].DW & 0x800000000000) /* acc < 0 */
I compiled both methods, and checking (__int64 < 0) seems a few instructions
larger in several cases than (__int64 & 0x800000000000), with some added
branch frames to jump to.  Is the arithmetic inequality comparison to 0
supposed to be slower than masking out a single bit necessarily on 32-bit dev?

eof```
Had to have all my questions pretyped to give me more time.

Some of them are duh-ish, but I rarely get to talk with other experienced programmers at this level, and I'm still without Internet.
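
For anyone who wants to check the variants from accnotes.txt against each other before profiling them, here is a minimal sketch. It is not the plugin's code: it works on a plain 64-bit integer rather than the `VACC` union, and the names `sext48`, `clamp_mask` and `clamp_arith` are made up for illustration. `clamp_mask` takes the upper mask as a parameter so the `0xFFFF80000000` and `0x7FFF80000000` spellings can be compared directly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sign-extend a 48-bit accumulator (bits 47..0) to 64 bits. */
static int64_t sext48(uint64_t raw)
{
    raw &= 0xFFFFFFFFFFFFULL;
    /* Flip bit 47, then subtract 2^47: portable sign extension. */
    return (int64_t)(raw ^ 0x800000000000ULL) - 0x800000000000LL;
}

/* Method (a): strict binary masking on the raw 48-bit value.
 * upper_mask is 0xFFFF80000000 or 0x7FFF80000000; bit 47 is already
 * known inside each branch, so both masks select the same cases. */
static uint16_t clamp_mask(uint64_t raw, uint64_t upper_mask)
{
    /* The mask must at least cover bits 46..31. */
    assert((upper_mask & 0x7FFF80000000ULL) == 0x7FFF80000000ULL);
    raw &= 0xFFFFFFFFFFFFULL;
    if (raw & 0x800000000000ULL) {               /* acc < 0 */
        if ((raw & upper_mask) != upper_mask)
            return 0x8000;                       /* slice underflow */
        return (uint16_t)(raw >> 16);            /* middle slice */
    }
    if ((raw & upper_mask) != 0)
        return 0x7FFF;                           /* slice overflow */
    return (uint16_t)(raw >> 16);                /* middle slice */
}

/* Method (b): arithmetic comparison on the sign-extended value.
 * Both bounds are signed constants, so the compares stay signed. */
static uint16_t clamp_arith(uint64_t raw)
{
    int64_t acc = sext48(raw);
    if (acc < -0x80000000LL)
        return 0x8000;                           /* slice underflow */
    if (acc > 0x7FFFFFFFLL)
        return 0x7FFF;                           /* slice overflow */
    return (uint16_t)((uint64_t)acc >> 16);      /* middle slice */
}
```

All three spellings return the same 16-bit lane for every 48-bit input, so the choice between them is only about what the compiler emits, which still has to be measured.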
#22
26th March 2013, 07:41 PM
 HatCat Alpha Tester Project Supporter Senior Member Join Date: Feb 2007 Location: In my hat. Posts: 16,260

Quote:
 Originally Posted by MarathonMan Ahhh, I didn't know that you were trying to stick to ANSI; I thought you were just using MSVCxx. C99 has inline. I was just trying to say that you should break it up somehow so it's more legible.
One of the issues is that MSVC requires `__inline` (or w/e) for C; using `inline` without the leading `__` requires C++. I would love to defy that rule, use `inline`, and let it break on MSVC so that it compiles only on GCC, but I would rather keep things portable, and inline functions or inline assembly code injections are not my favorite solutions in that regard.

Most likely, even if I don't declare as inline, the settings I set up in MSVC will inline-optimize it for me, but again, this is not entirely portable, and gives me top-level (lowest priority) warning messages per each case of this (I like eliminating even the ANSI compliance warnings PEW PEW! *super man laser vision* POWWWWW)

Quote:
 Originally Posted by MarathonMan Basically, the aliasing rule says that you can't arbitrary cast from one type to another and then go and reference it later. char * and void * are the exceptions to that rule. Are you sure you're not getting the endian-ness mixed up somewhere along the way?
Actually as it turns out I was not sure.

I was wrong. Your method was exactly correct except I needed to swap the byte address by XOR 1, which modified the method you gave me to succeed in all cases:
Code:
```// #define VR_B(v, e) (((unsigned char *)VR[v])[e ^ 0x1])
// #define VR_B(v, e) ((((unsigned char *)(VR + v))[e ^ 0x1]))
#define VR_B(v, e) (*(unsigned char *)(((unsigned char *)(VR + v)) + (e ^ 0x1)))```
(just different ways of saying the same thing obviously, I prefer to define it using pointers to signify that it is a bypass to the two-dimensional array original data type, not really important)
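
A minimal sketch of why the XOR 1 is needed, assuming the registers are stored as arrays of native-order 16-bit lanes on a little-endian host (the `VR` layout here and the helpers `vr_byte_portable` and `host_is_little_endian` are illustrative, not taken from the plugin). The shift-based helper gives the left-to-right byte on any host; the pointer macro only matches it when the host swaps the two bytes inside each lane.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 32 vector registers, each 8 lanes of native-order 16 bits. */
static int16_t VR[32][8];

/* Pointer-based access: byte e (0..15), counted left to right as on the
 * big-endian RSP. Only valid as written on a little-endian host. */
#define VR_B(v, e) (((unsigned char *)VR[(v)])[(e) ^ 0x1])

/* Endian-neutral reference: pick the lane, then its high or low byte. */
static unsigned char vr_byte_portable(int v, int e)
{
    uint16_t lane;

    assert(v >= 0 && v < 32 && e >= 0 && e < 16);
    lane = (uint16_t)VR[v][e >> 1];
    return (unsigned char)((e & 1) ? (lane & 0xFF) : (lane >> 8));
}

/* Run-time probe for the host byte order. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}
```

Each 16-bit lane keeps its own byte order, so a little-endian host stores the high byte of lane 0 at memory offset 1 and the low byte at offset 0. Switching from a union to an array changes the element order, not the byte order inside an element, which would explain why the XOR 1 is still required.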

Similarly, I was also wrong about my solution to the short * bypass.

Code:
`#define VR_S(v, e) (*(short *)((unsigned char *)(*(VR + v)) + e + (e & 01)))`
Was the correction I put in to fix the issue.

It was correct in all the even element cases.
I had to make that modification for conker's bfd and zelda mm cause my old way without the + (e&1) broke mfc2 (or mtc2, I forget which) and caused missing triangles.

But !!@

I don't understand!
I thought the endian swapping issues would be fixed by using arrays instead of unions.

If you haven't noticed, bpoint's Project Unreality, zilmar's RSP emulator, and the MAME RSP emulator all use the union dynamic data types for the RSP vector registers. zilmar sort of fixed it because he XOR'd by 07 to make it big-endian (element 0 is the left-most; element 7 is the right-most).

I changed it from little- to big-endian quickly by just using arrays. Before, Ville Linde / MAME XOR'd the byte address by 1 in LBV or SBV cause the union broke the endian ... I thought I had inverted it so I would no longer have to do this

I'm confused XD

there must be some way I can directly, instantly fetch the left-to-right byte address into a 16-byte VR >.< I'll keep searching for that while hopefully you can answer some of my other questions I just pasted off my flash drive

Also, unsigned clamping is so easy I'm ripping out all the if-else-if branches; it's becoming purely static (like that vmulf sign-clamp I showed you)
#23
28th March 2013, 02:35 AM
 MarathonMan Alpha Tester Project Supporter Senior Member Join Date: Jan 2013 Posts: 454

Whoops, got buried and didn't think to check.

a) You'd have to profile, but I'd guess (?) the first one is faster by a slight margin. Though, on the other hand, if the second one hits the first case most of the time, it could be faster. Hard to say.

With the union, I don't see how it would help you any, either. Not enough to notice, anyways. This kind of stuff is just as applicable to reinterpreters.

b) I think you'd also need to profile this one, as it depends on which case is taken most often. Mind your braces, though... the indenting and the lack of braces suggest two different, conflicting control paths.

c) The arithmetic comparison would only be a cmp and a conditional jump, so it should be faster ... ? Fewer # of instructions doesn't always equate to higher performance. x86 breaks some instructions into micro-ops (simple instructions) before they get executed in order to increase instruction-level parallelism. A sequence of instructions that need to be broken down into micro-ops could overload the ucode decoders and result in many more instructions produced than if you were to use a few more, simpler, RISC-y instructions.

Last edited by MarathonMan; 28th March 2013 at 02:45 AM.
#24
28th March 2013, 02:39 AM
 MarathonMan Alpha Tester Project Supporter Senior Member Join Date: Jan 2013 Posts: 454

Quote:
 Originally Posted by FatCat I'm confused XD there must be some way I can directly, instantly fetch the left-to-right byte address into a 16-byte VR >.< I'll keep searching for that while hopefully you can answer some of my other questions I just pasted off my flash drive Also, unsigned clamping is so easy I'm ripping out all the if-else-if branches; it's becoming purely static (like that vmulf sign-clamp) I showed you
Endian-ness can be a royal pain when you're working on writing an emulator for a different target. I remember spending like, an hour, staring at a fragment of code once where I was storing a vector as slices of shorts. The machine was actually flipping the bytes when I didn't expect it to, so I was actually compensating for endian-ness when I shouldn't have been.

Honestly, I ended up grabbing paper and pencil for this kinda stuff and tracing through it all to make sure that what I was thinking was reality. And using a debugger.
#25
4th April 2013, 06:49 AM
 HatCat Alpha Tester Project Supporter Senior Member Join Date: Feb 2007 Location: In my hat. Posts: 16,260

Son of a bitch, so many updates.
I hate not being up-to-date on things.

Alas, isolating myself from technology was worth celebrating the week of all creation.

Quote:
 Originally Posted by MarathonMan Whoops, got buried and didn't think to check.
There was another thread in this forum where I asked something that you may have missed, but I ask too many questions.

Quote:
 Originally Posted by MarathonMan c) The arithmetic comparison would only be a cmp and a conditional jump, so it should be faster ... ? Fewer # of instructions doesn't always equate to higher performance. x86 breaks some instructions into micro-ops (simple instructions) before they get executed in order to increase instruction-level parallelism. A sequence of instructions that need to be broken down into micro-ops could overload the ucode decoders and result in many more instructions produced than if you were to use a few more, simpler, RISC-y instructions.
Mmmmm although I am familiar with this already, let me show exactly what's produced:

Code:
```; Function compile flags: /Ogtpy
; File f:\rsp\vu\vmudh.h
_vd$ = 8						; size = 4
_vs$ = 12						; size = 4
_vt$ = 16						; size = 4
_e$ = 20						; size = 4
_VMUDH	PROC

; 5    :     register int i;
; 6    :
; 7    :     for (i = 0; i < 8; i++)
; 8    :     {
; 9    :         VACC[i].DW = VR[vs][i] * VR[vt][ei[e][i]];

mov	ecx, DWORD PTR _e$[esp-4]
push	ebx
push	esi
mov	esi, DWORD PTR _vs$[esp+4]
push	edi
mov	edi, DWORD PTR _vt$[esp+8]
shl	ecx, 5
mov	eax, DWORD PTR _ei[ecx]
shl	esi, 4
movsx	edx, WORD PTR _VR[esi]
movsx	eax, WORD PTR _VR[eax*2]
imul	eax, edx
cdq

; 10   :         VACC[i].DW <<= 16;

shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC, eax
mov	eax, DWORD PTR _ei[ecx+4]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+4, edx
movsx	edx, WORD PTR _VR[esi+2]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+8, eax
mov	eax, DWORD PTR _ei[ecx+8]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+12, edx
movsx	edx, WORD PTR _VR[esi+4]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+16, eax
mov	eax, DWORD PTR _ei[ecx+12]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+20, edx
movsx	edx, WORD PTR _VR[esi+6]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+24, eax
mov	eax, DWORD PTR _ei[ecx+16]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+28, edx
movsx	edx, WORD PTR _VR[esi+8]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+32, eax
mov	eax, DWORD PTR _ei[ecx+20]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+36, edx
movsx	edx, WORD PTR _VR[esi+10]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+40, eax
mov	eax, DWORD PTR _ei[ecx+24]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+44, edx
movsx	edx, WORD PTR _VR[esi+12]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+48, eax
mov	DWORD PTR _VACC+52, edx

; 5    :     register int i;
; 6    :
; 7    :     for (i = 0; i < 8; i++)
; 8    :     {
; 9    :         VACC[i].DW = VR[vs][i] * VR[vt][ei[e][i]];

mov	eax, DWORD PTR _ei[ecx+28]
movsx	ecx, WORD PTR _VR[esi+14]
movsx	eax, WORD PTR _VR[eax*2]
imul	eax, ecx
mov	ecx, DWORD PTR _vd$[esp+8]
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
shl	ecx, 4
mov	DWORD PTR _VACC+56, eax
mov	DWORD PTR _VACC+60, edx
mov	ebx, OFFSET _VACC
$LL9@VMUDH:
mov	eax, DWORD PTR [ebx+4]
mov	edx, DWORD PTR [ebx]
mov	edi, eax
and	edi, 32768				; 00008000H
xor	esi, esi
and	edx, -2147483648			; 80000000H
and	eax, 65535				; 0000ffffH
or	esi, edi
je	SHORT $LN6@VMUDH
cmp	edx, -2147483648			; 80000000H
jne	SHORT $LN28@VMUDH
cmp	eax, 65535				; 0000ffffH
je	SHORT $LN2@VMUDH
$LN28@VMUDH:
mov	edx, -32768				; ffff8000H
mov	WORD PTR [ecx], dx
jmp	SHORT $LN8@VMUDH
$LN6@VMUDH:
or	edx, eax
je	SHORT $LN2@VMUDH
mov	edx, 32767				; 00007fffH
mov	WORD PTR [ecx], dx
jmp	SHORT $LN8@VMUDH
$LN2@VMUDH:
mov	ax, WORD PTR [ebx+2]
mov	WORD PTR [ecx], ax
$LN8@VMUDH:
cmp	ebx, OFFSET _VACC+64
jl	SHORT $LL9@VMUDH
pop	edi
pop	esi
pop	ebx
ret	0
_VMUDH	ENDP```
That was the output when I check:
if (acc & 0x800000000000)

If I instead have the loop check if (__int64 < 0), it looks like this:

Code:
```; Function compile flags: /Ogtpy
; File f:\rsp\vu\vmudh.h
_vd$ = 8						; size = 4
_vs$ = 12						; size = 4
_vt$ = 16						; size = 4
_e$ = 20						; size = 4
_VMUDH	PROC

; 5    :     register int i;
; 6    :
; 7    :     for (i = 0; i < 8; i++)
; 8    :     {
; 9    :         VACC[i].DW = VR[vs][i] * VR[vt][ei[e][i]];

mov	ecx, DWORD PTR _e$[esp-4]
push	esi
mov	esi, DWORD PTR _vs$[esp]
push	edi
mov	edi, DWORD PTR _vt$[esp+4]
shl	ecx, 5
mov	eax, DWORD PTR _ei[ecx]
shl	esi, 4
movsx	edx, WORD PTR _VR[esi]
movsx	eax, WORD PTR _VR[eax*2]
imul	eax, edx
cdq

; 10   :         VACC[i].DW <<= 16;

shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC, eax
mov	eax, DWORD PTR _ei[ecx+4]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+4, edx
movsx	edx, WORD PTR _VR[esi+2]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+8, eax
mov	eax, DWORD PTR _ei[ecx+8]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+12, edx
movsx	edx, WORD PTR _VR[esi+4]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+16, eax
mov	eax, DWORD PTR _ei[ecx+12]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+20, edx
movsx	edx, WORD PTR _VR[esi+6]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+24, eax
mov	eax, DWORD PTR _ei[ecx+16]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+28, edx
movsx	edx, WORD PTR _VR[esi+8]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+32, eax
mov	eax, DWORD PTR _ei[ecx+20]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+36, edx
movsx	edx, WORD PTR _VR[esi+10]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+40, eax
mov	eax, DWORD PTR _ei[ecx+24]
movsx	eax, WORD PTR _VR[eax*2]
mov	DWORD PTR _VACC+44, edx
movsx	edx, WORD PTR _VR[esi+12]
imul	eax, edx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+48, eax
mov	eax, DWORD PTR _ei[ecx+28]
mov	DWORD PTR _VACC+52, edx

; 5    :     register int i;
; 6    :
; 7    :     for (i = 0; i < 8; i++)
; 8    :     {
; 9    :         VACC[i].DW = VR[vs][i] * VR[vt][ei[e][i]];

movsx	ecx, WORD PTR _VR[esi+14]
movsx	eax, WORD PTR _VR[eax*2]
imul	eax, ecx
cdq
shld	edx, eax, 16
shl	eax, 16					; 00000010H
mov	DWORD PTR _VACC+56, eax
mov	eax, DWORD PTR _vd$[esp+4]
shl	eax, 4
mov	DWORD PTR _VACC+60, edx
mov	esi, OFFSET _VACC
$LL9@VMUDH:
mov	ecx, DWORD PTR [esi+4]
mov	edx, DWORD PTR [esi]
test	ecx, ecx
jg	SHORT $LN6@VMUDH
jl	SHORT $LN28@VMUDH
test	edx, edx
jae	SHORT $LN6@VMUDH
$LN28@VMUDH:
and	edx, -2147483648			; 80000000H
and	ecx, 65535				; 0000ffffH
cmp	edx, -2147483648			; 80000000H
jne	SHORT $LN29@VMUDH
cmp	ecx, 65535				; 0000ffffH
je	SHORT $LN2@VMUDH
$LN29@VMUDH:
mov	edx, -32768				; ffff8000H
mov	WORD PTR [eax], dx
jmp	SHORT $LN8@VMUDH
$LN6@VMUDH:
and	edx, -2147483648			; 80000000H
and	ecx, 65535				; 0000ffffH
or	edx, ecx
je	SHORT $LN2@VMUDH
mov	edx, 32767				; 00007fffH
mov	WORD PTR [eax], dx
jmp	SHORT $LN8@VMUDH
$LN2@VMUDH:
mov	cx, WORD PTR [esi+2]
mov	WORD PTR [eax], cx
$LN8@VMUDH:
cmp	esi, OFFSET _VACC+64
jl	SHORT $LL9@VMUDH
pop	edi
pop	esi
ret	0
_VMUDH	ENDP```
Does that help you point out what I may be missing in this comparison?

Actually, that was the MSVC output, where the total number of instruction lines is almost exactly the same; the only big difference is that checking if (acc < 0) added more branch labels to jump to in the listing. On GCC, the output for checking acc & 0x800000000000 turned out much smaller.

My guess is that the latter got optimized into reading a 16-bit portion that had already been popped off of the current accumulator union for service elsewhere, and then checking whether that portion is less than 0, whereas checking whether an entire 64-bit segment is less than 0 was stricter and harder to optimize.
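
To make the comparison concrete, here is a small sketch of three ways to spell the sign test, assuming the accumulator is kept sign-extended from 48 to 64 bits so that bit 47 and bit 63 always agree. The function names are invented for this example, and nothing here says which form a given compiler turns into better code; that still needs the listing or a profiler.

```c
#include <assert.h>
#include <stdint.h>

/* 1. Arithmetic comparison on the whole 64-bit value. */
static int neg_by_compare(int64_t acc)
{
    return acc < 0;
}

/* 2. Mask out bit 47, the sign bit of the 48-bit accumulator. */
static int neg_by_mask47(int64_t acc)
{
    return ((uint64_t)acc & 0x800000000000ULL) != 0;
}

/* 3. Look only at the upper 32-bit half, which is all a 32-bit
 *    target needs to load in order to know the sign. */
static int neg_by_high_word(int64_t acc)
{
    uint32_t hi = (uint32_t)((uint64_t)acc >> 32);
    return (hi >> 31) != 0;
}

/* Self-check: for a sign-extended 48-bit value all three must agree. */
static void check_sign_tests(int64_t acc)
{
    assert(neg_by_compare(acc) == neg_by_mask47(acc));
    assert(neg_by_compare(acc) == neg_by_high_word(acc));
}
```

On a 32-bit build the sign of a 64-bit integer is decided by the upper dword alone, so forms 1 and 3 ask the same question; any extra branches in the generated code come from how the compiler lowers the 64-bit compare rather than from the test itself.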
#26
4th April 2013, 06:54 AM
 Mdkcheatz Alpha Tester Project Supporter Mr. Syrup Join Date: Apr 2007 Location: the Milky Way, I think... Posts: 762

You know this reminds me of a time....
__________________
Also, on top of what I said above, you listen here okay?! Look closely at this fist, if you so much as smell it wrong it'll impose itself onto your face okay?! You think you tough? I bet behind that PC you're just a timid old clown capable of nothing but wackin the sackin. You smell me tigga?!
#27
4th April 2013, 11:29 AM
 HatCat Alpha Tester Project Supporter Senior Member Join Date: Feb 2007 Location: In my hat. Posts: 16,260

Quote:
 Originally Posted by MarathonMan Endian-ness can be a royal pain when you're working on writing an emulator for a different target. I remember spending like, an hour, staring at a fragment of code once where I was storing a vector as slices of shorts. The machine was actually flipping the bytes when I didn't expect it to, so I was actually compensating for endian-ness when I shouldn't have been. Honestly, I ended up grabbing paper and pencil for this kinda stuff and tracing through it all to make sure that what I was thinking was reality. And using a debugger.
Heh heh heh, I totally understand and have done this on several occasions.

Sometimes being forced to do physics/math on paper is applied fun.
Other times I hate having to do hard work, and I try to train my instincts to avoid this kind of stuff.

But, the issue is totally void once I correct the DMEM endian in an emulator free of the Windows plugin specifications.
Then it's just memcpy, SSE and tricks like that

I've made tons of commits every few weeks btw, it still isn't the correct structure you suggested (globalizing operand data using pointers instead of function call stacks etc.), at least not at the C level
#28
4th April 2013, 11:34 AM
 HatCat Alpha Tester Project Supporter Senior Member Join Date: Feb 2007 Location: In my hat. Posts: 16,260

Quote:
 Originally Posted by MarathonMan Mind your braces, though... the indenting and the lack of braces suggest two different, conflicting control paths.
I guess I didn't understand this part as well as the rest. Yes, it's true that bracing helps explicitly secure readable and maintainable syntax, though brace indentation is just another syntactical controversy I didn't need to settle for such simple loops.

Because the clamping behavior is indeed split:
for signed clamps,

if acc < 0 then:
a. if acc < -32768 then clamp res. to -32768
b. else then write res & FFFF
--
c. [else] if acc > +32767 then clamp to +32767
d. else then write res & FFFF

The exception is VMACQ, which I have just now added on GitHub; I delayed implementing that op since no games ever seem to use it.
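
The split signed-clamp behavior (cases a to d) can be sketched with explicit braces as below. This assumes the value being clamped is bits 47..16 of the accumulator taken as a signed 32-bit integer; `sign_clamp_tree`, `sign_clamp_flat` and `check_clamps_agree` are hypothetical names, and VMACQ is not covered.

```c
#include <assert.h>
#include <stdint.h>

/* Cases a..d written out with braces so each else has one obvious owner. */
static int16_t sign_clamp_tree(int32_t hi32)
{
    if (hi32 < 0) {
        if (hi32 < -32768) {
            return -32768;           /* a. clamp to -32768 */
        } else {
            return (int16_t)hi32;    /* b. in range, write low 16 bits */
        }
    } else {
        if (hi32 > 32767) {
            return 32767;            /* c. clamp to +32767 */
        } else {
            return (int16_t)hi32;    /* d. in range, write low 16 bits */
        }
    }
}

/* The same decision without the outer sign test. */
static int16_t sign_clamp_flat(int32_t hi32)
{
    if (hi32 < -32768) {
        return -32768;
    }
    if (hi32 > 32767) {
        return 32767;
    }
    return (int16_t)hi32;
}

/* Self-check: both layouts must return the same lane. */
static void check_clamps_agree(int32_t hi32)
{
    assert(sign_clamp_tree(hi32) == sign_clamp_flat(hi32));
}
```

The braced tree and the flat form give the same result for every input; the outer `acc < 0` test only decides which of the two bounds has to be checked.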
#29
4th April 2013, 12:06 PM
 zilmar Core Team Alpha Tester Project Supporter Administrator Join Date: Jun 2005 Posts: 988

did you download the source of all my changes in the rsp since the original source release?
#30
5th April 2013, 01:39 PM
 HatCat Alpha Tester Project Supporter Senior Member Join Date: Feb 2007 Location: In my hat. Posts: 16,260

I'm still trying to.

As I explained earlier this shitty hotel Internet makes me reload a lot of pages in-between Viewing Forum Index <---> Reading/Replying to a Thread.

It's so intermittent I need to learn some more about Git and get a quick command that will copy everything over so I can look at it better, which I am still in the process of figuring out.

I always felt that you were calculating the approach of open-sourcing Project64.
I just believed you would never do it; personally I don't know that I'd have reacted the same way in your position. But, at least it makes sense. After employing various individuals, you seem to find it better to trust humanity as a whole rather than personal relations and understandings, which were not enough to prevent the original PJ64 team from leaving, who hadn't contributed so actively until their inspirations of spite and rebellion against you and suddenly working on the source again.

And, you've freed up further from the shadowed legacy of NEMU, which only went open-source in the Direct3D plugin.
