  #311  
Old 17th June 2013, 03:52 AM
angrylion angrylion is offline
Member
 
Join Date: Oct 2008
Location: Moscow, Russia
Posts: 36
Default

Thx, I've found the .c file where you've read this info. Yeah, I meant the MIPS CPU.
  #312  
Old 17th June 2013, 04:01 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Yeah, I really wish I could test it out for myself, as the only 100% way to know for sure is hardware tests.

I'm just grateful you were able to figure out what I was referring to, without me having to go into so much detail.

Some of the bits were harder to write if() statements to debug than others (like whether CMD_END is also written to) so I just assumed doc was right and games would bitch if I was wrong.

I have no idea what the MIPS main CPU host does, however. Since it is the master processor (not the RSP slave), it cannot be exception-free like the RCP, so I imagine you're right about not forcing alignment.
  #313  
Old 18th June 2013, 03:11 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

To keep the RDP questions from getting disorganized, I'm linking to my question here:
http://forum.pj64-emu.com/showthread...7443#post47443

There are too many updates to the incomplete HLE SoftGraphic attempt in that thread, and they divert attention from my LLE RDP questions.
I might have to make a new thread eventually.
  #314  
Old 1st July 2013, 02:44 PM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

I have a small to-do list for the next version of this plugin.
Every time I do one of these releases I always think it's the last, then somebody proves me wrong.

I'm having trouble keeping notes, so I'm taking them here.

First, the build removes the error angrylion warned about, with MTC0 commands starting DMAs in fcube.

Everything else is speed-related.

I was just thinking about how Intel architecture only reads in the lo 5 bits of requested shift amounts.
If so, then SLLV/SRLV/SRAV on the RSP is emulated redundantly.
`SR[rd] = SR[rt] << (SR[rs] & 31);`

It is only a micro speed-up, but it is possible that `& 31` is unnecessary here, if the H/W already masks the shift count down to legal values for us.
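A minimal sketch of the masked form, using hypothetical helper names (`sllv` and `srlv` are not the plugin's functions). One caveat worth stating: in ISO C, shifting a 32-bit value by 32 or more is undefined behavior, so the explicit `& 31` is what keeps the source portable. On x86, an optimizing compiler will typically fold the mask into the shift instruction anyway, but that is an assumption to confirm by reading the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* SLLV: shift left by the low 5 bits of rs. */
static uint32_t sllv(uint32_t rt, uint32_t rs)
{
    return rt << (rs & 31);
}

/* SRLV: logical shift right by the low 5 bits of rs. */
static uint32_t srlv(uint32_t rt, uint32_t rs)
{
    return rt >> (rs & 31);
}
```

With the mask in place, a count of 33 behaves like a count of 1, which matches the "lo 5 bits" behavior described above without relying on what the host CPU does.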

Thirdly,
Project64 1.7+ sometimes crashes when I close the emulator while using my RSP plugin. I probably forgot to initiate some zilmar-spec-related memory buffer routine, so PJ64 complains.

Lastly, there is too much if-else with signed-clamping.
MarathonMan's arguments over LUTs have rather inspired me to look more into staticizing signed-clamping code for Vector Multiply-Accumulate operations. The performance differences concerned with this, if successful, should be noticeable just by looking at the VI/s, without a profiler, as VMADN/VMADH are the most frequently executed.
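As a rough illustration of what removing the if-else from signed clamping can look like, here is a sketch that swaps in a mask-select technique rather than a lookup table. The function names and the 32-bit-to-16-bit clamp range are assumptions for the example, not the plugin's actual accumulator code, and whether it beats the branchy version depends on the compiler (which may already emit conditional moves).

```c
#include <stdint.h>

/* Reference version: the if-else chain. */
static int16_t clamp_branchy(int32_t acc)
{
    if (acc > 32767)
        return 32767;
    if (acc < -32768)
        return -32768;
    return (int16_t)acc;
}

/* Mask-select version: no branches in the source. */
static int16_t clamp_branchless(int32_t acc)
{
    int32_t over  = -(int32_t)(acc > 32767);   /* all ones if too large, else 0 */
    int32_t under = -(int32_t)(acc < -32768);  /* all ones if too small, else 0 */
    int32_t r = (acc & ~(over | under))        /* keep acc when in range */
              | (32767 & over)                 /* else pick the upper bound */
              | (-32768 & under);              /* or the lower bound */
    return (int16_t)r;
}
```

Both functions should agree for every input, so the branchy one can stay around as a checker while timing the other.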

EDIT, okay so maybe there is a fifth thing.
Unaligned DMEM addresses for semi-instantaneous, parallel byte writes for SW RSP CPU op.
This is hard to test since so few games do this I think.
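A sketch of the byte-wise approach for an unaligned SW, under loud assumptions: DMEM is modeled here as a plain 4 KiB big-endian byte array with no host byte swapping (the zilmar-spec layout may differ), addresses are assumed to wrap within DMEM, and `rsp_sw` is a hypothetical name.

```c
#include <stdint.h>

#define DMEM_SIZE 0x1000u

/* Assumption: plain big-endian byte array, not the plugin's real buffer. */
static uint8_t dmem[DMEM_SIZE];

/* Store a 32-bit word at any byte address, one byte at a time. */
static void rsp_sw(uint32_t addr, uint32_t value)
{
    unsigned i;

    for (i = 0; i < 4; i++) {
        /* Most significant byte first; wrap each byte address to 12 bits. */
        dmem[(addr + i) & (DMEM_SIZE - 1)] = (uint8_t)(value >> (24 - 8 * i));
    }
}
```

Writing per byte is slower than a single aligned 32-bit store, so an emulator would probably only fall back to this path when the low two address bits are nonzero.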
  #315  
Old 7th August 2013, 11:21 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by FatCat View Post
I was just thinking about how Intel architecture only reads in the lo 5 bits of requested shift amounts.
If so, then SLLV/SRLV/SRAV on the RSP is emulated redundantly.
`SR[rd] = SR[rt] << (SR[rs] & 31);`
World Driver Championship does this at start-up before drawing anything.

I added this debug code to SRLV:
Code:
case 006: /* SRLV */
    SR[rd] = (unsigned)(SR[rt]) >> (SR[rs] & 31);
    /* Debug check: report if the unmasked shift ever gives a different result. */
    if (SR[rs] > 31)
        if (((unsigned)(SR[rt]) >> (SR[rs] & 31)) != ((unsigned)(SR[rt]) >> SR[rs]))
        {
            char text[8192];

            sprintf(text,
                "RS:  %08X\n"
                "RT:  %08X\n"
                "RD1:  %08X\n"
                "RD2:  %08X",
                SR[rs], SR[rt], (unsigned)(SR[rt]) >> (SR[rs] & 31), (unsigned)(SR[rt]) >> SR[rs]);
            message(text, 3);
        }
    continue;
The results check out.

On Intel, saying (RT >> (RS & 31)), is the same thing as just saying (RT >> RS).

Similarly, for SLL, SRL, and SRA,
Code:
RD = RT << (sa & 31); /* or >>, depending on the op */
RD = RT << sa;        /* or >>, depending on the op */
Because the Intel CPU already does the AND-mask with 0b11111 for us, ignoring all upper bits.

This should primarily yield a small speed-up to scalar NOP since it's used so much.
I updated the repo with a commit on this.
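For the immediate forms there is also a reason that does not depend on the host CPU: in the MIPS R-type encoding, `sa` is a 5-bit field (bits 10..6 of the instruction word), so once it has been decoded with a mask it can never exceed 31. A small sketch, with `decode_sa` and `sll` as hypothetical names rather than the plugin's own:

```c
#include <stdint.h>

/* Extract the 5-bit shift amount from bits 10..6 of an R-type instruction. */
static unsigned decode_sa(uint32_t inst)
{
    return (inst >> 6) & 31;
}

/* SLL with the shift amount taken straight from the instruction word. */
static uint32_t sll(uint32_t rt, uint32_t inst)
{
    /* No second mask needed: decode_sa() already guarantees 0..31. */
    return rt << decode_sa(inst);
}
```

The all-zero instruction word (the scalar NOP) decodes to a shift amount of 0, so it falls out of the same path.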
  #316  
Old 15th August 2013, 04:39 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

I figured I'd better update this in case people think it's dead.
Just lots of new things, none of which were as fundamentally interesting as the above research post.
But probably faster than the research used in the above post.

Since there is currently a mysterious similarity of emulation speed benchmarks between my [pretty much] SSE-free plugin and MarathonMan's SSE2+ enhancements to it, I am losing interest in finding imperfections in this plugin as it is already as fast as it can get and as fast as MarathonMan's version currently anyway.

The good news is, incidentally, I think I may have very well run out of nit-pick imperfections to optimise. I think everything about the scalar unit interpreter is impossible to make any faster, but I am disappointed I could never make the 90% goal of ratio to the speed of recompiler. Any new work on the RSP emulators will not be normal interpreters...it needs to be something way majorly different than this code base, like a static-indexed interpreter or a dynarec.

So unless people nag more errors now, I don't see why I should anticipate another release after this next one.
All that's really left for me to do at this point is finish removing the USER32/SHELL32 winapi dependencies, because those show up really erroneous in Dependency Walker ever since, like, XP.

It's all being re-done in MSVCRT since MinGW forces that as a dependency of my plugin anyway, so no point in evading the use of CRT functions for a module inevitably linking it.

This requires the new GUI for Configure RSP Plugin (which is being introduced in the next version) to be separate...I was going to use ShellExecuteEx for this but that didn't give me enough control for async thread waiting, so instead I'm using system().

I don't know anyone freely adept/open to making a Windows GUI for my config interface, but dsx! thinks he can have one done in a week or two, so go nag him about it!
The major purposes of this are a) to help MarathonMan not have to work with multiple RSP DLL plugin files, b) the user needs to install one DLL, not 7, and c) 1964 will now be able to load the plugin.
  #317  
Old 15th August 2013, 05:46 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Just to clarify, I'm not sure how much more work I'll put into vectorizing that plugin.

The main reason for porting it to PJ64 was to reduce the entropy of all the problems I was having with CEN64. Now that CEN64 is able to execute non-trivial RSP/RDP tasks and most of the RSP functions are vectorized, I'm more interested in porting the changes to my own emulator and working on them there.

I'll probably still do all my tests in this plugin and update it, but I'm not trying to fork it or anything. Don't expect releases often anyways.
  #318  
Old 15th August 2013, 06:51 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

PS, you should be able to speed up integer execution a good deal if you use function pointers instead of the giant switch statement.

With a switch statement, the compiler is almost certainly going to generate a bunch of "checks" to see if there is a case which matches the variable, if the variable is in range of any of the cases, etc.

i.e.:
Code:
switch(x) {
   case 1:
   ...
   case 3:
   ...
   case 5:
   ...
}
compiler will generate:

Code:
if (x > 5) goto default
if (x == 2) goto default
if (x == 0) goto default
if (x == 4) goto default
jumptable[x]();
a really smart one would do:

Code:
if (x > 5 || (x & 1) == 0) goto default
jumptable[x]();
In my experiences, gcc will never make a full-blown jump table out of a switch statement. You could save yourself the work of computing all the conditional branches prior to the switch and jump straight into the jumptable. You would just have to have a bunch of the jumptable entries map to a function which handles "invalid" or the default case.

I bet this would equate to another 5-10% increase in performance.
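A minimal sketch of that function-pointer layout, using made-up handlers (`op_one`, `op_three`, `op_five`, `op_invalid`) for the case 1/3/5 example above. The table is padded to a power of two so a single AND replaces every range check. Note that masking means an out-of-range `x` aliases into the table instead of reaching the default, which is only appropriate when `x` comes from a fixed-width opcode field.

```c
typedef int (*op_handler)(int operand);

/* Hypothetical handlers; the return values are only for demonstration. */
static int op_invalid(int operand) { (void)operand; return -1; }
static int op_one(int operand)     { return operand + 1; }
static int op_three(int operand)   { return operand + 3; }
static int op_five(int operand)    { return operand + 5; }

/* 8 entries so (x & 7) is always a valid index; holes map to op_invalid. */
static const op_handler jumptable[8] = {
    op_invalid, op_one,  op_invalid, op_three,
    op_invalid, op_five, op_invalid, op_invalid
};

/* No comparisons before the indirect call, just a mask. */
static int dispatch(unsigned x, int operand)
{
    return jumptable[x & 7](operand);
}
```

The trade-off is that each handler is now reached through an indirect call rather than a jump within one function, so the gain is worth measuring rather than assuming.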
  #319  
Old 16th August 2013, 02:23 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by MarathonMan View Post
Just to clarify, I'm not sure how much more work I'll put into vectorizing that plugin.

The main reason for porting it to PJ64 was to reduce the entropy of all the problems I was having with CEN64. Now that CEN64 is able to execute non-trivial RSP/RDP tasks and most of the RSP functions are vectorized, I'm more interested in porting the changes to my own emulator and working on them there.

I'll probably still do all my tests in this plugin and update it, but I'm not trying to fork it or anything. Don't expect releases often anyways.
True.
It's a double-edged sword I guess.

On one hand, you could use the version of your RSP emulator ported to zilmar's spec to promote your CEN64 RSP emulator, which will be even faster and free of the endian issues of zilmar's plugin spec (but also cycle-accurate then again). That's partly why I suggested calling this plugin "CEN64 Prototype", so people would see you modified my plugin to include just some of your code from CEN64, which would be the complete version and interest more people into checking it out.

On the other hand, this wouldn't be an effective method of advertisement to everyone, because some people would be like, oh cool I have the Windows version of the CEN64 goodies, now I'm no longer forced to install a decent OS and run the native CEN64 emulator.
[Nah, to hell with those assholes lol.]

So there are different ways of viewing it.

But yes, the biggest thing is having a stable base where you can plug it into all the N64 games and easily find bugs in your native CEN64 version.
That was the very first reason I would think you'd ever consider doing a version in zilmar plugin spec.

Quote:
Originally Posted by MarathonMan View Post
PS, you should be able to speed up integer execution a good deal if you use function pointers instead of the giant switch statement.

With a switch statement, the compiler is almost certainly going to generate a bunch of "checks" to see if there is a case which matches the variable, if the variable is in range of any of the cases, etc.

In my experiences, gcc will never make a full-blown jump table out of a switch statement. You could save yourself the work of computing all the conditional branches prior to the switch and jump straight into the jumptable. You would just have to have a bunch of the jumptable entries map to a function which handles "invalid" or the default case.

I bet this would equate to another 5-10% increase in performance.
I forgot that I was going to re-investigate this issue one last time before the last release.

Actually, I used to have everything in function pointer tables, then I rewrote it as a big chain of switch() statements for all the opcodes, and at first there was no really noticeable speed difference. In fact, using all the switch() jumps made the compiler take way, way longer to generate the code, probably because it was busy converting them into something that avoids so many damn annoying if's.

Generally, MSVS almost always converts switch() to an optimized jump.

The older versions of MinGW were unlikely to do this for me, but these days I think they've vastly improved on that.

GCC output for doing a switch on a smaller range of values in MFC0:

Code:
void MFC0(int rt, int rd)
{
    switch (rd)
    {
        case 0x0:
            SR[rt] = *RSP.SP_MEM_ADDR_REG;
            return;
        case 0x1: ...
Code:
_MFC0:
LFB21:
	.cfi_startproc
	pushl	%ebx
	.cfi_def_cfa_offset 8
	.cfi_offset 3, -8
	subl	$24, %esp
	.cfi_def_cfa_offset 32
	movl	32(%esp), %ebx
	movl	36(%esp), %eax
	cmpl	$15, %eax
	ja	L701
	jmp	*L719(,%eax,4)
	.section .rdata,"dr"
	.align 4
L719:
...
It does similar output for the scalar unit execution loop.
If I'm not mistaken, procedure CALL's are slower than plain branches, and this switch code seems to not do a function call.

The only real reason I kept vector unit operations (including stuff from COP2, LWC2 and SWC2) as function pointer tables and not switch statements was because they were the most complex and used the most memory code space in the cache, so I allocated them to separated function space.

Still, you're right that I should definitely check it one last time and write down the VI/s results for a change this major and see if it matters, or at least makes the plugin size smaller.

Last edited by HatCat; 16th August 2013 at 02:27 AM.
  #320  
Old 16th August 2013, 02:45 AM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
On the other hand, this wouldn't be an effective method of advertisement to everyone, because some people would be like, oh cool I have the Windows version of the CEN64 goodies, now I'm no longer forced to install a decent OS and run the native CEN64 emulator.
I'm like the moon from Majora's Mask. Ain't nobody stopping me.

Code:
void MFC0(int rt, int rd)
{
    switch (rd)
    {
        case 0x0:
            SR[rt] = *RSP.SP_MEM_ADDR_REG;
            return;
        case 0x1: ...
Code:
_MFC0:
LFB21:
	.cfi_startproc
	pushl	%ebx
	.cfi_def_cfa_offset 8
	.cfi_offset 3, -8
	subl	$24, %esp
	.cfi_def_cfa_offset 32
	movl	32(%esp), %ebx
	movl	36(%esp), %eax
	cmpl	$15, %eax
	ja	L701
	jmp	*L719(,%eax,4)
	.section .rdata,"dr"
	.align 4
L719:
...
Notice the "ja L701"; a jump table wouldn't require that, and it would be replaced with a simple ANDL imm, reg. And if there are holes in the switch statement, things get uglier (usually?). Maybe they are there and MSVC is just better at generating jump tables.

Yeah, CALLs will be slower. CALL is a macro-op that boils down to a push and a jump, whereas JMP is just a uop.

All times are GMT. The time now is 03:30 AM.

