#311
Thx, I've found the .c file where you've read this info. Yeah, I meant the MIPS CPU.
#312
Yeah, I really wish I could test it out for myself, as the only 100% way to know for sure is hardware tests.
I'm just grateful you were able to figure out what I was referring to without me having to go into so much detail. Some of the bits were harder to debug with if() statements than others (like whether CMD_END is also written to), so I just assumed the doc was right and that games would bitch if I was wrong. I have no idea what the MIPS main CPU host does, however; since it is the master processor (not the RSP slave), it can't be exception-free like the RCP, so I imagine you're right about it not forcing alignment.
#313
To keep the RDP questions from getting disorganized, I'm linking to my question here:
http://forum.pj64-emu.com/showthread...7443#post47443 Too many updates to my incomplete HLE SoftGraphic attempt are diverting my LLE RDP questions. I might have to make a new thread eventually.
#314
I have a small to-do list for the next version of this plugin.
Every time I do one of these releases I always think it's the last, then somebody proves me wrong. I'm having trouble keeping notes, so I'm taking them here.
First, the build removes the error angrylion warned about: MTC0 command-start DMAs with fcube. Everything else is speed-related.
Second, I was just thinking about how the Intel architecture only reads the low 5 bits of a requested shift amount. If so, then SLLV/SRLV/SRAV on the RSP is emulated redundantly: `SR[rd] = SR[rt] << (SR[rs] & 31);` It is a micro-speed-up, but it's possible that the & 31 is unnecessary here, if the hardware takes care of masking the shift amount to a legal value for us.
Third, Project64 1.7+ sometimes crashes when I close the emulator while using my RSP plugin. I probably forgot to initiate some zilmar-spec-related memory buffer routine, so PJ64 complains.
Lastly, there is too much if-else in the signed-clamping code. MarathonMan's arguments over LUTs have rather inspired me to look more into staticizing the signed-clamping code for the Vector Multiply-Accumulate operations. The performance difference here, if successful, should be noticeable just by looking at the VI/s, without a profiler, as VMADN/VMADH are the most frequently executed.
EDIT: Okay, so maybe there is a fifth thing: unaligned DMEM addresses for semi-instantaneous, parallel byte writes for the SW RSP CPU op. This is hard to test, since so few games do this I think.
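For the clamping item, just as a rough sketch of the shape I have in mind (the function name and the exact accumulator slice being clamped are placeholders, not the plugin's actual code, and whether a given op clamps signed or unsigned varies): Code:
#include <stdint.h>

/*
 * Sketch only:  saturate a 32-bit intermediate down to the signed 16-bit
 * element range, the sort of clamp the multiply-accumulate results need.
 * A single range test replaces a chain of per-case if/else blocks.
 */
static int16_t clamp_s16(int32_t x)
{
    if (x != (int16_t)x)            /* outside the signed 16-bit range? */
        x = (x >> 31) ^ 0x7FFF;     /* negative -> 0x8000, positive -> 0x7FFF */
    return (int16_t)x;
}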
#315
I added this debug code to SRLV: Code:
case 006: /* SRLV */
    SR[rd] = (unsigned)(SR[rt]) >> (SR[rs] & 31);
    if (SR[rs] > 31)
        if (((unsigned)(SR[rt]) >> (SR[rs] & 31)) != ((unsigned)(SR[rt]) >> SR[rs]))
        {
            char text[8192];

            sprintf(text,
                "RS:  %08X\n"
                "RT:  %08X\n"
                "RD1: %08X\n"
                "RD2: %08X",
                SR[rs], SR[rt],
                (unsigned)(SR[rt]) >> (SR[rs] & 31),
                (unsigned)(SR[rt]) >> SR[rs]);
            message(text, 3);
        }
    continue;
On Intel, saying (RT >> (RS & 31)) is the same thing as just saying (RT >> RS). Similarly, for SLL, SRL, and SRA, Code:
RD = RT <<>> (sa & 31);
RD = RT <<>> sa;
This should primarily yield a small speed-up to the scalar NOP since it's used so much. I updated the repo with a commit on this.
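So the simplified handlers could just drop the mask; a sketch of what that would look like, with the case values per the usual MIPS funct encoding (keeping in mind that a shift count of 32 or more is undefined behavior in ISO C, so this only holds if the compiler emits the raw x86 SHL/SHR/SAR): Code:
case 004: /* SLLV */ SR[rd] = SR[rt] << SR[rs];             continue;
case 006: /* SRLV */ SR[rd] = (unsigned)(SR[rt]) >> SR[rs]; continue;
case 007: /* SRAV */ SR[rd] = (signed)(SR[rt])   >> SR[rs]; continue;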
#316
I figured I'd better update this in case people think it's dead.
Just lots of new things, none of which were as fundamentally interesting as the above research post, but probably faster than the research used in the above post. Since there is currently a mysterious similarity in emulation speed benchmarks between my [pretty much] SSE-free plugin and MarathonMan's SSE2+ enhancements to it, I am losing interest in finding imperfections in this plugin, as it is already about as fast as it can get, and as fast as MarathonMan's version currently, anyway. The good news is, incidentally, I think I may very well have run out of nit-pick imperfections to optimise. I think everything about the scalar unit interpreter is impossible to make any faster, but I am disappointed I could never reach the 90% goal of the ratio to the speed of the recompiler. Any new work on the RSP emulators will not be normal interpreters... it needs to be something majorly different from this code base, like a static-indexed interpreter or a dynarec. So unless people nag me with more errors now, I don't see why I should anticipate another release after this next one.

All that's really left for me to do at this point is finish removing the USER32/SHELL32 winapi dependencies, because those show up as really erroneous in Dependency Walker ever since, like, XP. It's all being re-done in MSVCRT, since MinGW forces that as a dependency of my plugin anyway, so there's no point in evading the use of CRT functions in a module inevitably linking it. This requires the new GUI for Configure RSP Plugin (which is being introduced in the next version) to be separate... I was going to use ShellExecuteEx for this, but that didn't give me enough control for async thread waiting, so instead I'm using system(). I don't know anyone freely adept/open to making a Windows GUI for my config interface, but dsx! thinks he can have one done in a week or two, so go nag him about it!

The major purposes of this are a) helping MarathonMan not have to work with multiple RSP DLL plugin files, b) the user only needs to install one DLL, not 7, and c) 1964 will now be able to load the plugin.
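Just to illustrate the separate-process launch (the executable name below is a placeholder, not the actual file that will ship, and the real entry point will follow the zilmar spec), the idea boils down to: Code:
#include <stdlib.h>

/*
 * Sketch only:  spawn the external config GUI and block until it exits.
 * "rsp_config.exe" is a placeholder name.  system() waits for the child
 * process to finish, which is the control that ShellExecuteEx made
 * awkward to get without extra handle/thread juggling.
 */
void config_dialog(void)
{
    system(".\\rsp_config.exe");    /* blocks until the config GUI closes */
}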
#317
Just to clarify, I'm not sure how much more work I'll put into vectorizing that plugin.
The main reason for porting it to PJ64 was to reduce the entropy of all the problems I was having with CEN64. Now that CEN64 is able to execute non-trivial RSP/RDP tasks and most of the RSP functions are vectorized, I'm more interested in porting the changes to my own emulator and working on them there. I'll probably still do all my tests in this plugin and update it, but I'm not trying to fork it or anything. Don't expect releases often anyways.
#318
PS, you should be able to speed up integer execution a good deal if you use function pointers instead of the giant switch statement.
With a switch statement, the compiler is almost certainly going to generate a bunch of "checks" to see if there is a case which matches the variable, if the variable is in range of any of the cases, etc. i.e.: Code:
switch (x)
{
    case 1: ...
    case 3: ...
    case 5: ...
}
i.e., it ends up emitting checks like: Code:
if (x > 5)  goto default;
if (x == 2) goto default;
if (x == 0) goto default;
if (x == 4) goto default;
jumptable[x]();
or, at best, something like: Code:
if (x > 5 || (x & 1) == 0) goto default;
jumptable[x]();
Either way it still pays for those checks on every dispatch; with a straight table of function pointers you just index and call. I bet this would equate to another 5-10% increase in performance.
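A minimal sketch of what the function-pointer dispatch could look like (handler names, table layout, and the init routine are made up for illustration, not taken from the plugin): Code:
#include <stdint.h>

typedef void (*op_handler)(uint32_t inst);

static void res_SPECIAL(uint32_t inst) { /* decode the funct field... */ }
static void res_REGIMM (uint32_t inst) { /* ... */ }
static void res_INVALID(uint32_t inst) { /* reserved-instruction fallback */ }

static op_handler primary_table[64];

static void init_table(void)
{
    int i;

    for (i = 0; i < 64; i++)
        primary_table[i] = res_INVALID;   /* safe default for every slot */
    primary_table[000] = res_SPECIAL;
    primary_table[001] = res_REGIMM;
    /* ...fill in the rest of the implemented opcodes... */
}

static void dispatch(uint32_t inst)
{   /* the 6-bit opcode field is always 0..63, so no range check is needed */
    primary_table[inst >> 26](inst);
}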
#319
It's a double-edged sword, I guess. On one hand, you could use the version of your RSP emulator ported to zilmar's spec to promote your CEN64 RSP emulator, which will be even faster and free of the endian issues of zilmar's plugin spec (though also cycle-accurate, then again). That's partly why I suggested calling this plugin "CEN64 Prototype": so people would see you modified my plugin to include just some of your code from CEN64, which would be the complete version, and it would interest more people in checking it out. On the other hand, this wouldn't be an effective method of advertisement to everyone, because some people would be like, "oh cool, I have the Windows version of the CEN64 goodies, now I'm no longer forced to install a decent OS and run the native CEN64 emulator." [Nah, to hell with those assholes lol.]

So there are different ways of viewing it. But yes, the biggest thing is having a stable base where you can plug it into all the N64 games and easily find bugs in your native CEN64 version. That was the very first reason I would think you'd ever consider doing a version in the zilmar plugin spec.
Actually, I used to have everything in function pointer tables, then I rewrote it as a big chain of switch() statements for all the opcodes, and at first there was no really noticeable speed difference. In fact, using all the switch() jumps made the compiler take way, way longer to generate the code, probably working out a conversion so it doesn't do so many damn annoying if's. Generally, MSVS almost always converts switch() to an optimized jump. The older versions of MinGW were unlikely to do this for me, but these days I think they've vastly improved on that. For example, MFC0 does a switch over a smaller range of values: Code:
void MFC0(int rt, int rd)
{
    switch (rd)
    {
        case 0x0:
            SR[rt] = *RSP.SP_MEM_ADDR_REG;
            return;
        case 0x1:
            ...
And the GCC output for it: Code:
_MFC0:
LFB21:
    .cfi_startproc
    pushl   %ebx
    .cfi_def_cfa_offset 8
    .cfi_offset 3, -8
    subl    $24, %esp
    .cfi_def_cfa_offset 32
    movl    32(%esp), %ebx
    movl    36(%esp), %eax
    cmpl    $15, %eax
    ja      L701
    jmp     *L719(,%eax,4)
    .section .rdata,"dr"
    .align 4
L719:
    ...
If I'm not mistaken, procedure CALLs are slower than plain branches, and this switch code does not do a function call. The only real reason I kept the vector unit operations (including the stuff from COP2, LWC2 and SWC2) as function pointer tables and not switch statements was that they were the most complex and used the most code space in the cache, so I allocated them to separate function space. Still, you're right that I should definitely check it one last time and write down the VI/s results for a change this major, to see if it matters or at least makes the plugin size smaller.
#320
Quote:
Code:
void MFC0(int rt, int rd)
{
    switch (rd)
    {
        case 0x0:
            SR[rt] = *RSP.SP_MEM_ADDR_REG;
            return;
        case 0x1:
            ...
Code:
_MFC0:
LFB21:
    .cfi_startproc
    pushl   %ebx
    .cfi_def_cfa_offset 8
    .cfi_offset 3, -8
    subl    $24, %esp
    .cfi_def_cfa_offset 32
    movl    32(%esp), %ebx
    movl    36(%esp), %eax
    cmpl    $15, %eax
    ja      L701
    jmp     *L719(,%eax,4)
    .section .rdata,"dr"
    .align 4
L719:
    ...
Yeah, CALLs will be slower. CALL is a macro-op that boils down to a push and a jump, whereas JMP is just a uop.