Project64 Forums > General Discussion > Open Discussion

  #561
4th October 2013, 01:26 AM
MarathonMan
Alpha Tester
Project Supporter
Senior Member
Join Date: Jan 2013
Posts: 454

Quote:
Originally Posted by BatCat
Yeah, yeah, I know.

I tl;dr'd the byte shift amount from 7 to 15.

Doesn't matter; it's tested and stable now so I could commit it to Git.

AT&T listing before:
Code:
_get_VCO:
LFB605:
	.cfi_startproc
	movswl	_ne+14, %edx
	sall	$15, %edx
	movswl	_ne+12, %eax
	sall	$14, %eax
	orl	%edx, %eax
	orl	_co, %eax
	movswl	_ne+10, %edx
	sall	$13, %edx
	orl	%edx, %eax
	movswl	_ne+8, %edx
	sall	$12, %edx
	orl	%edx, %eax
	movswl	_ne+6, %edx
	sall	$11, %edx
	orl	%edx, %eax
	movswl	_ne+4, %edx
	sall	$10, %edx
	orl	%edx, %eax
	movswl	_ne+2, %edx
	sall	$9, %edx
	orl	%edx, %eax
	movswl	_ne, %edx
	sall	$8, %edx
	orl	%edx, %eax
	movswl	_co+14, %edx
	sall	$7, %edx
	orl	%edx, %eax
	movswl	_co+12, %edx
	sall	$6, %edx
	orl	%edx, %eax
	movswl	_co+10, %edx
	sall	$5, %edx
	orl	%edx, %eax
	movswl	_co+8, %edx
	sall	$4, %edx
	orl	%edx, %eax
	movswl	_co+6, %edx
	sall	$3, %edx
	orl	%edx, %eax
	movswl	_co+4, %edx
	sall	$2, %edx
	orl	%edx, %eax
	movswl	_co+2, %edx
	sall	%edx
	orl	%edx, %eax
	ret
	.cfi_endproc
AT&T output after:
Code:
_get_VCO:
LFB605:
	.cfi_startproc
	movdqa	_ne, %xmm0
	psllw	$15, %xmm0
	movdqa	_co, %xmm1
	psllw	$15, %xmm1
	packsswb	%xmm0, %xmm1
	pmovmskb	%xmm1, %eax
	ret
	.cfi_endproc
Btw, I'm guessing there isn't a way to do CTC2 moves (from a scalar GPR in the MIPS ISA, into an RSP flags register) using SSE2 equivalents?

There is movemask but I guess that's just one direction.

Oh well, no big deal.
In fact no deal at all!
I have never seen a game use CTC2 before, at least not according to the opcode frequency logger I wrote.
There are LUTs...

But nothing with SSE that will help, that I can think of anyways.
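
For what it's worth, the reverse direction (scattering a GPR's flag bits back out into 16-bit lanes) can at least be faked in plain SSE2, roughly like this (just a sketch, not code from either plugin):
Code:
#include <emmintrin.h>

/* Expand the low 8 bits of "flags" into eight 0xFFFF/0x0000 lanes.
 * A 16-bit flags register would take two calls (flags and flags >> 8). */
static __m128i expand_flags_sse2(int flags)
{
    const __m128i bits = _mm_setr_epi16(
        0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080);
    __m128i v = _mm_set1_epi16((short)flags); /* broadcast the GPR value */

    v = _mm_and_si128(v, bits);               /* isolate one bit per lane */
    return _mm_cmpeq_epi16(v, bits);          /* 0xFFFF where the bit was set */
}
It's still a broadcast, an AND, and a compare per eight lanes, though, so whether it would actually beat a small LUT is anyone's guess.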

  #562
4th October 2013, 02:11 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

At some point long ago I contemplated using LUTs for the conversion in CFC2/CTC2 (which one was it? had to have been CTC2, I guess), but the hit counts were just so damn small that I didn't feel like complicating the C source, or the cache usage, with LUTs that spend more RAM on instructions that almost never get used anyway.

I'll be damned if I ever see a game write to OR read from $vce, either, since absolutely no instructions use it except VCL. Anybody who learned RSP programming "by the book" would probably never have thought to touch it in the first place.

Now that I have the SSE2 path set up, though, I could use LUTs to define which pointer to read the XMM registers out from... but I'm feeling lazy, and I'll probably get around to that after I spend some more time playing, in full screen, the worst RPG ever to see the ass of this earth on the N64.

  #563
5th October 2013, 03:21 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Well, I did a quick copy-pasta backport of CEN64's SSE2 approach to emulating VMADN.

Code:
INLINE static void do_madn(short* VD, short* VS, short* VT)
{
    __m128i __vacc, __vacclo, __vaccmid, __vacchi, __prodlo, __prodhi;
    __m128i __vvslo, __vvshi, __vvtlo, __vvthi, __vdlo, __vdhi;
    __m128i __vs, __vt, __vd;


    __vacclo = _mm_load_si128((__m128i*) VACC_L);
    __vaccmid = _mm_load_si128((__m128i*) VACC_M);
    __vs = _mm_load_si128((__m128i*) VS);
    __vt = _mm_load_si128((__m128i*) VT);

    /* Unpack to obtain 32-bit precision. */
    RSPZeroExtend16to32(__vs, &__vvslo, &__vvshi);
    RSPSignExtend16to32(__vt, &__vvtlo, &__vvthi);
    RSPZeroExtend16to32(__vacclo, &__vacclo, &__vacchi);


    /* Begin accumulating the products. */
    __prodlo = _mm_mullo_epi32(__vvslo, __vvtlo);
    __prodhi = _mm_mullo_epi32(__vvshi, __vvthi);


    __vdlo = _mm_srli_epi32(__prodlo, 16);
    __vdhi = _mm_srli_epi32(__prodhi, 16);
    __vdlo = _mm_slli_epi32(__vdlo, 16);
    __vdhi = _mm_slli_epi32(__vdhi, 16);
    __vdlo = _mm_xor_si128(__vdlo, __prodlo);
    __vdhi = _mm_xor_si128(__vdhi, __prodhi);


    __vacclo = _mm_add_epi32(__vacclo, __vdlo);
    __vacchi = _mm_add_epi32(__vacchi, __vdhi);

    __vd = RSPPackLo32to16(__vacclo, __vacchi);
    _mm_store_si128((__m128i*)VACC_L, __vd);

    /* Multiply the MSB of sources, accumulate the product. */
    __vacc = _mm_load_si128((__m128i*)VACC_H);
    __vdlo = _mm_unpacklo_epi16(__vaccmid, __vacc);
    __vdhi = _mm_unpackhi_epi16(__vaccmid, __vacc);

    __prodlo = _mm_srai_epi32(__prodlo, 16);
    __prodhi = _mm_srai_epi32(__prodhi, 16);
    __vacclo = _mm_srai_epi32(__vacclo, 16);
    __vacchi = _mm_srai_epi32(__vacchi, 16);

    __vacclo = _mm_add_epi32(__prodlo, __vacclo);
    __vacchi = _mm_add_epi32(__prodhi, __vacchi);
    __vacclo = _mm_add_epi32(__vdlo, __vacclo);
    __vacchi = _mm_add_epi32(__vdhi, __vacchi);

    /* Clamp the accumulator and write it all out. */
    __vaccmid = RSPPackLo32to16(__vacclo, __vacchi);
    __vacchi = RSPPackHi32to16(__vacclo, __vacchi);
    __vd = RSPClampLowToVal(__vd, __vaccmid, __vacchi);

    _mm_store_si128((__m128i*)VACC_M, __vaccmid);
    _mm_store_si128((__m128i*)VACC_H, __vacchi);
    _mm_store_si128((__m128i*)VD, __vd);

    return;
}
The result of this is unfortunately 14 Intel instructions more than what happens if I let the compiler convert the ANSI C to SSE2 for me.

Code:
INLINE static void do_madn(short* VD, short* VS, short* VT)
{
    unsigned long addend[N];
    register int i;

    for (i = 0; i < N; i++)
        addend[i] = (unsigned short)(VACC_L[i]) + (unsigned short)(VS[i]*VT[i]);
    for (i = 0; i < N; i++)
        VACC_L[i] += (short)(VS[i] * VT[i]);
    for (i = 0; i < N; i++)
        addend[i] = (addend[i] >> 16) + ((unsigned short)(VS[i])*VT[i] >> 16);
    for (i = 0; i < N; i++)
        addend[i] = (unsigned short)(VACC_M[i]) + addend[i];
    for (i = 0; i < N; i++)
        VACC_M[i] = (short)addend[i];
    for (i = 0; i < N; i++)
        VACC_H[i] += addend[i] >> 16;
    SIGNED_CLAMP_AL(VD);
    return;
}
So I think that, for the painful and annoying multiply-accumulate operations (but maybe not the regular multiplies), I'm better off with the ANSI version. I hate all these stupid packs and unpacks. But when I compile with -msse4 it gets cut down by a further 9 instructions, so I can have people compile it on SSE4 machines for better performance than I can get myself.
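
Presumably a good part of that SSE4 gain is the 32-bit multiply itself: _mm_mullo_epi32 (PMULLD) only exists from SSE4.1 onward, so a plain SSE2 build has to fake a 32-bit low multiply along these lines (shown purely for illustration; this isn't code from either plugin):
Code:
#include <emmintrin.h>

/* SSE2 stand-in for _mm_mullo_epi32:  multiply the even and odd 32-bit
 * lanes separately with PMULUDQ, then interleave the low halves back. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                     /* products of lanes 0 and 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));     /* products of lanes 1 and 3 */

    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
}
SSE4.1 also adds PMOVZXWD/PMOVSXWD for the 16-to-32 unpacking, which the listing below leans on heavily.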

SSE4 version converted from ANSI C:
Code:
_VMADN:
LFB870:
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	andl	$-16, %esp
	subl	$80, %esp
	movl	_inst, %edx
	shrw	$6, %dx
	movb	_inst+1, %al
	shrb	$3, %al
	movzbl	%al, %eax
	sall	$4, %eax
	movdqa	_VACC+32, %xmm3
	movdqu	_VR(%eax), %xmm0
	movdqa	_ST, %xmm1
	pmullw	%xmm0, %xmm1
	pmovzxwd	%xmm3, %xmm2
	pmovzxwd	%xmm1, %xmm4
	paddd	%xmm4, %xmm2
	psrldq	$8, %xmm3
	pmovzxwd	%xmm3, %xmm3
	psrldq	$8, %xmm1
	pmovzxwd	%xmm1, %xmm1
	paddd	%xmm1, %xmm3
	movdqa	_ST, %xmm4
	pmullw	%xmm0, %xmm4
	paddw	_VACC+32, %xmm4
	movdqa	%xmm4, _VACC+32
	movdqa	_ST, %xmm5
	movdqa	%xmm0, %xmm1
	psrldq	$8, %xmm1
	pmovzxwd	%xmm1, %xmm1
	movdqa	%xmm5, %xmm6
	psrldq	$8, %xmm6
	pmovsxwd	%xmm6, %xmm6
	pmulld	%xmm6, %xmm1
	psrad	$16, %xmm1
	psrld	$16, %xmm3
	paddd	%xmm3, %xmm1
	pmovzxwd	%xmm0, %xmm0
	pmovsxwd	%xmm5, %xmm5
	pmulld	%xmm5, %xmm0
	psrad	$16, %xmm0
	psrld	$16, %xmm2
	paddd	%xmm2, %xmm0
	movdqa	_VACC+16, %xmm3
	movdqa	%xmm3, %xmm2
	psrldq	$8, %xmm2
	pmovzxwd	%xmm2, %xmm2
	paddd	%xmm1, %xmm2
	pmovzxwd	%xmm3, %xmm1
	paddd	%xmm0, %xmm1
	movdqa	LC2, %xmm5
	movdqa	%xmm1, %xmm6
	pshufb	%xmm5, %xmm6
	movdqa	LC3, %xmm3
	movdqa	%xmm2, %xmm0
	pshufb	%xmm3, %xmm0
	por	%xmm6, %xmm0
	movdqa	%xmm0, _VACC+16
	psrld	$16, %xmm1
	psrld	$16, %xmm2
	pshufb	%xmm5, %xmm1
	pshufb	%xmm3, %xmm2
	por	%xmm2, %xmm1
	paddw	_VACC, %xmm1
	movdqa	%xmm1, _VACC
	movdqa	%xmm0, %xmm2
	punpcklwd	%xmm1, %xmm2
	movdqa	%xmm0, %xmm3
	punpckhwd	%xmm1, %xmm3
	packssdw	%xmm3, %xmm2
	movdqa	%xmm2, %xmm1
	pcmpeqw	%xmm0, %xmm1
	pand	LC1, %xmm1
	pxor	LC6, %xmm2
	movdqa	%xmm4, %xmm0
	psubw	%xmm2, %xmm0
	pmullw	%xmm1, %xmm0
	paddw	%xmm2, %xmm0
	movl	%edx, %eax
	andl	$31, %eax
	sall	$4, %eax
	movdqu	%xmm0, _VR(%eax)
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc

  #564
7th October 2013, 08:04 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Time for more bench!

Old multiply log times:
Code:

VMULF  :  0.553 s
VMACF  :  0.950 s
VMULU  :  0.605 s
VMACU  :  1.048 s
VMUDL  :  0.540 s
VMADL  :  0.993 s
VMUDM  :  0.609 s
VMADM  :  0.880 s
VMUDN  :  0.584 s
VMADN  :  1.004 s
VMUDH  :  0.391 s
VMADH  :  0.578 s
New multiply log times:
Code:

VMULF  :  0.537 s
VMACF  :  0.802 s
VMULU  :  0.612 s
VMACU  :  0.849 s
VMUDL  :  0.486 s
VMADL  :  0.905 s
VMUDM  :  0.535 s
VMADM  :  0.819 s
VMUDN  :  0.505 s
VMADN  :  0.951 s
VMUDH  :  0.325 s
VMADH  :  0.525 s
[Note that I had to fix an audio bug in VMULF (and VMULU) exploited by the MusyX microcode, so those opcodes are not actually faster; the roughly 5-centisecond shift in single-threaded wall-clock time, which also affects the unchanged functions, just makes them deceptively appear faster.]

I used two basic methods to speed up VMACF (and VMULF, after slowing it down with the bug fix) and VMACU (and VMULU).

First, for VM??F, I invented a new macro that I call the "semi-fraction" accumulator offset/result.
Code:
#define SEMIFRAC    (VS[i]*VT[i]*2/2 + 0x8000/2)
The real result of course is [(VS*VT << 1) + 0x8000], but basic high-school algebra is enough to see that, divided by 2, my macro is correct.

So instead of computing the full 2*VS*VT + 0x8000 and storing its low 16 bits and mid 16 bits with shifts of 0 and 16, I store from the halved value using shifts of -1 (i.e., a left shift by 1) and 15, which avoids redundantly shifting the full fractional result.
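
In loop form the stores come out roughly like this (a sketch of the idea only; the loop shapes are assumed rather than copied from the plugin):
Code:
for (i = 0; i < N; i++)
    VACC_L[i] = (short)(SEMIFRAC << 1);   /* low 16 bits of 2*VS*VT + 0x8000 */
for (i = 0; i < N; i++)
    VACC_M[i] = (short)(SEMIFRAC >> 15);  /* mid 16 bits of 2*VS*VT + 0x8000 */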

Second, the other speed-up was in the UNSIGNED_CLAMP method for VMACU (and VMULU).

Before, the unsigned clamp used MSB comparisons (equal, or sign bit set), combining them all with masks to check whether [ACC]47..32 or [ACC]31..16 indicated overflow, underflow, or neither.

Now it has been rewritten to be based entirely on the SIGNED_CLAMP method, since there is no SSE2 (or even SSE4, I think) way to unsigned-clamp 32-bit doublewords (they can only be sign-clamped), and since the tricky thing about the RSP is that overflow is actually sign-clamped, even though the official term for the algorithm used in VMULU/VMACU is "unsigned clamp". Underflow is still clamped unsigned.

Code:
static INLINE void SIGNED_CLAMP_AM(short* VD)
{ /* typical sign-clamp of accumulator-mid (bits 31:16) */
    __m128i dst, src;
    __m128i pvd, pvs;

    pvs = _mm_load_si128((__m128i *)VACC_H);
    pvd = _mm_load_si128((__m128i *)VACC_M);
    dst = _mm_unpacklo_epi16(pvd, pvs);
    src = _mm_unpackhi_epi16(pvd, pvs);

    dst = _mm_packs_epi32(dst, src);
    _mm_store_si128((__m128i *)VD, dst);
    return;
}

[...]

static INLINE void UNSIGNED_CLAMP(short* VD)
{ /* sign-zero hybrid clamp of accumulator-mid (bits 31:16) */
    short temp[N];
    register int i;

    SIGNED_CLAMP_AM(temp); /* no direct map in SSE, but closely based on this */
    for (i = 0; i < N; i++)
        VD[i] = temp[i] & ~(temp[i] >> 15); /* Only this clamp is unsigned. */
    return;
}
There were also some other changes but I forgot them.
Try not to spam this thread over all the other ones on the forum too much.

  #565
7th October 2013, 11:03 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

(Trying*)

Sorry, I mis-conveyed that.
It's not exactly that UNSIGNED_CLAMP on the RSP uses a signed clamp for overflow.
It's more like it detects overflow at >= +32768 (the signed-clamping range) instead of at >= +65536; however, the mask written for the limit is still 0xFFFF == +65535, which is the unsigned clamp mask.
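
Per element, that behaviour boils down to something like this (scalar illustration only; the real code stays vectorized):
Code:
/* Illustration only:  the "unsigned" clamp of VMULU/VMACU detects
 * overflow at the signed boundary but saturates to the unsigned limit. */
static short hybrid_clamp(long acc_47_16)  /* accumulator bits 47..16, signed */
{
    if (acc_47_16 < 0)
        return 0x0000;                     /* underflow still clamps unsigned */
    if (acc_47_16 >= +32768)
        return (short)0xFFFF;              /* overflow writes the 0xFFFF mask */
    return (short)acc_47_16;
}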

Guess I got carried away from not getting many hours of sleep.

I made a slight change to the function, and it now comes out to exactly the same number of op-codes (but a different algorithm) as before I started optimizing it.

It's still faster though!! Apart from the clamping speed-ups, I made a couple of micro-optimizations to how MULF, MACF, MULU, and MACU virtualize the 48-bit accumulator as 16-bit segments and do their adds. I just didn't post those here, because that part was just trivial stuff about how the compiler interprets the code.

I also just made a change that cuts over 10 instructions from VMULU, so that should be even faster now.

  #566
10th October 2013, 07:37 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

I've decided to give the vector unit a rest. I think I've done everything I can with the multiply operations *at least, for the VU op-codes that need optimizing the most*. Even when I paste in MarathonMan's SSE intrinsics for something like VMADN I can't seem to compact it any further than it already is without assuming SSE4.

So I took a troll back with the Scalar Unit definitions header.
I looked back at my pseudo-instruction tree for SW/SH/SB, remembering how I'd thought I couldn't create any useful pseudo-op-code optimization for those primary op-codes.

Well as it turns out!!
I did more tests.
And SW (Store Word) has the highest incidence of RS==0 (addr = offset + 0), and a significantly lesser (in a few ROMs, only slightly lesser) use of RT==0.

So rs == 0 is most common with SW.
Do I have time to check every single game, AND with both rs AND rt, for SH/SB/LW/LH/etc.? No.
But let's take extra looks at the speed side of things:

Code:
RS==0:  SW     :  0.119 s
RT==0:  SW     :  0.138 s
plain:  SW     :  0.155 s
According to the average results of at least 8 different test sessions,
SW with rs=0 is the fastest.
SW with SR[rt]=0 is almost as fast, but not quite as commonly done by games anyway.

So I think I know what pseudo-instructions I am going to be implementing tonight.
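
For reference, the rs == 0 case would look something along these lines (a sketch only; the helper and field names are assumptions, not the plugin's actual symbols):
Code:
/* SW with a base register of $zero:  the effective address is just the
 * sign-extended offset, so the SR[base] read and the add can be skipped. */
static void SW_BASE_ZERO(void)
{
    const unsigned addr = (signed short)(inst.I.imm) & 0x00000FFF; /* offset + 0, wrapped to DMEM */

    store_word_to_dmem(addr, SR[inst.I.rt]); /* hypothetical DMEM store helper */
}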

  #567
11th October 2013, 07:23 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Quote:
Originally Posted by BatCat
So I took a stroll back with the Scalar Unit definitions header.
lol, sorry, guess staying up all night makes you forget the extra S's.


Anyway, all the opcode algorithms themselves may be 100% optimized at the core C level, but that doesn't mean I can't go back to the SU/VU split threading and find global optimizations in the main interpreter CPU core....

For example, at the risk of the DLL size going up, I experimented with moving the SHUFFLE_VECTOR call out of the vector-instruction scheduler in the interpreter and into each of the vector functions. (My idea was to cut out the "ST" global pointer indexing in every instruction by doing the shuffle in-line with local variables instead.)
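
Before and after, the restructuring looks roughly like this (shapes and signatures are assumed for illustration, not the plugin's exact code):
Code:
/* Before:  the interpreter pre-shuffles VT into the global ST buffer and
 * every vector handler reads that global back. */
static void VMULF_old(int vd, int vs)
{
    do_mulf(VR[vd], VR[vs], ST);        /* ST was filled by the scheduler */
}

/* After:  the shuffle moves into the handler itself, onto a local buffer,
 * so the per-instruction global pointer indexing goes away. */
static void VMULF_new(int vd, int vs, int vt, int e)
{
    short st[N];

    SHUFFLE_VECTOR(st, VR[vt], e);
    do_mulf(VR[vd], VR[vs], st);
}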

The results were MOST successful. VI/s went up by something like 4-16 just about everywhere, depending on which part of the game, and the file size went up by only about 16 KB (maybe less?).

Also, I discovered that GCC seems to decode this rather inefficiently:
Code:
inst.R.sa
into this:
Code:
(inst.I.imm >> 7) & 31
By forcing this to be a 32-bit operation, exactly one instruction was removed from every single op-code function, scalar and vector alike, that uses the SA shift-amount bitmask.
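
For illustration, the forced 32-bit form is essentially the same extraction done on the whole opcode word (a sketch; the exact macro is an assumption):
Code:
/* Sketch only (bit position mirrors the compiler's own >> 7 mask above):
 * read the shift amount out of the full 32-bit opcode word, so nothing has
 * to be narrowed to the 16-bit immediate field first. */
#define GET_SA(word)    (((unsigned long)(word) >> 7) & 31)

/* e.g.  SR[rd] = SR[rt] << GET_SA(inst.W);  instead of  ... << inst.R.sa; */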

Overall this dropped the DLL size by 2 KB.
And with MarathonMan's SSSE3 shuffling method enabled, the size increase will be a lot less than 16 KB....

Last edited by HatCat; 11th October 2013 at 07:27 AM.

  #568
12th October 2013, 12:49 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

Desperate to find more global ideas in the main interpreter loop to free up the reserves, I ran a little experiment: cutting down the 2-D jump table for scalar instructions.

Code:
static void (*EX_SCALAR[64][64])(void) = {
    { /* SPECIAL */
        SLL    ,res_S  ,SRL    ,SRA    ,SLLV   ,res_S  ,SRLV   ,SRAV   ,
        JR     ,JALR   ,res_S  ,res_S  ,res_S  ,BREAK  ,res_S  ,res_S  ,
        res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,
        res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,
        ADD    ,ADDU   ,SUB    ,SUBU   ,AND    ,OR     ,XOR    ,NOR    ,
        res_S  ,res_S  ,SLT    ,SLTU   ,res_S  ,res_S  ,res_S  ,res_S  ,
        res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,
        res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S
    },
 /* REGIMM, */
 /* 64 pointers to J, */
 /* 64 pointers to JAL, */
 /* 64 pointers to BEQ, etc. */
My DLL size was 102 KB before starting.

It annoyed me greatly that the SPECIAL primary opcode was the one and only reason I could not make this a 64x32 array of function pointers instead of the 64x64 array I'd been using.

So I decided to converge the two 32-entry halves of SPECIAL's 64-element sub-op pointer row:

Code:
static void (*EX_SCALAR[64][32])(void) = {
    { /* SPECIAL (custom convergence of both halves to fit table) */
        ADD    ,ADDU   ,SUB    ,SUBU   ,AND    ,OR     ,XOR    ,NOR    ,
        JR     ,JALR   ,SLT    ,SLTU   ,res_S  ,BREAK  ,res_S  ,res_S  ,
        res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,
        res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,res_S  ,
    },
My choices for the convergence overrides (namely, using all 8 of the arithmetic/logical register ops in place of the 6 shift ops) are mostly founded upon dynamic branch prediction.
More particularly, when the emulator does ((inst % 64) % 32) and jumps to AND when it should really be SLLV, my modified AND function first tests bit (inst.W & 0x00000020) to see whether it should cancel and do SLLV instead.
This branch ordering is better because my long-running analysis of opcode frequencies shows that SLLV is very rarely encountered compared to AND.

The same can be said of SLL/NOP being replaced by ADD, except that ADD first checks whether RD==0 so it can exit everything immediately (NOP), and then checks whether it's really a non-NOP SLL that should cancel out the ADD procedure.
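
As a sketch, the converged AND slot described above ends up shaped something like this (field and register names are assumptions; the real functions differ):
Code:
/* AND (function 0x24) and SLLV (function 0x04) land on the same slot once
 * the function field is taken modulo 32, so the common case (AND) gets the
 * table entry and falls back to the rare shift when bit 5 is clear. */
static void AND(void)
{
    if (!(inst.W & 0x00000020))
    {   /* function code < 32:  this is really SLLV */
        SLLV();
        return;
    }
    SR[inst.R.rd] = SR[inst.R.rs] & SR[inst.R.rt];
    SR[0] = 0x00000000; /* keep $zero hard-wired (assumed convention) */
}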

Is it perfect/completely agreeable? Not in every case.

But the results are rather inconclusive.
I notice neither a noticeable slowdown in emulation speed nor a speed-up.
I genuinely cannot decide, but if I had to take a wild guess, I'd say this change may be < 0.25% slower.


Either way!!
This massive halving of the jump table has just saved me 12 KB of DLL size (and cache data).
And I was arbitrarily hoping I could get the plugin size down from 100 KB to 80 or 64 KB anyway.

So it's 90 KB now. (It's 55 KB if you rip out the entire SU. Most of the problem with SU is all the switch tables for ?WC2 addr alignment.)

  #569
26th October 2013, 07:07 AM
thefowles1
Junior Member
Join Date: Mar 2010
Posts: 2

BatCat - what video plugin do you use for Gauntlet Legends? Ziggy's z64gl doesn't show up in my list of GFX plugins in PJ64 (despite it being in /plugin/GFX), Jabo's D3D8 still gives me missing/black textures out the wazoo, and Glide64 shows environments but is missing the HUD. :C
Glide's is the closest to playable, but the missing HUD makes it not so. Otherwise, it simply briefly flashes black every few seconds, which I could look past if I could see my HUD.

I'm so close to getting it playable, I can taste it...

  #570
26th October 2013, 07:18 AM
HatCat
Alpha Tester
Project Supporter
Senior Member
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236

The only way to fix all of the bugs is to use an LLE graphics plugin.
Incidentally, for this game, only the pixel-accurate plugin by angrylion fixes everything.

You can nab mudlord's build of it from here off the link in his post:
http://forum.pj64-emu.com/showthread...0466#post50466

Also, you wouldn't want to use ziggy's z64gl for Gauntlet Legends, anyway. The issues are the same as with Jabo's LLE. (The original MAME rasterizer SDL backend ported by ziggy does do a pretty good job for a non-pixel-accurate plugin, though....)

By the way, pixel-accurate plugins are slow. Fair warning.
Make sure you use the `rsp_pj64.dll` file in this thread, and only with Project64 2.x, or the game won't work.