18th September 2013, 04:31 AM
HatCat

Originally Posted by MarathonMan
If I get my hands on something, I'm usually more than willing to support it for a small minority (so long as it doesn't involve pushing the majority off the ship).

No, there is no 32-bit signed multiply support prior to SSE4.1. My thoughts for making SSSE3 a minimum for CEN64 are primarily based on the fact that CEN64 requires a processor recent enough to get respectable performance anyways. Why go out of my way to support SSE2 when those processors are only ever going to see around 10-20 VI/s? Is there a benefit to SSE2 over ANSI C? Absolutely, and I'm glad to see you're taking advantage of it. But for a cycle-accurate simulator, I don't see a place for it between ANSI C and SSSE3, given its lack of flexible shuffles and extra fancy things like _mm_sign_epiX.

I actually thought about seeing if I could leverage TSX somehow, even though I don't have it available on my own machine at home (as it just arrived in Haswell).
Whether to raise requirements and add new things like this depends, for me, on a few different factors:
  • What part of the RSP emulator does it affect? Does it improve areas where a relatively large amount of time is wasted using inferior and outdated instructions, or does it just give a little boost by replacing code that already accurately and/or readably reflects the RSP's basic behavior to begin with?
  • Is the speed difference from implementing the change big enough to have an impact on really RSP-intensive spots that are slow for most people, or does it only matter for games that already run at full speed, at the risk of raising system requirements just to speed up spots that were already full-speed?
  • How specialized is the upgraded SSE intrinsic? Is it something very, very difficult to remodel using ANSI C for past, current, or future compiler interpretations, or is it very similar to a loop in ANSI C and easily enough left to a few possibly more deficient SSE alternatives, until they can improve later?
  • When do the features of this upgraded intrinsic get used? Is it an essential part of the RSP's basic algorithm used for several vector op-codes, or is it an edge trimmer of sorts with just a few op-codes of no particular category, with no incidental resemblance to the realistic math the RSP behavior intended to use?
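To make the third question concrete, here is a rough sketch using the _mm_sign_epi16 case from the quote. It assumes a 16-bit short and a GCC/Clang-style __SSE2__ macro; the names sign_c and sign_sse2 are made up for illustration and are not taken from any plugin source. The SSSE3 intrinsic turns out to be a short loop in ANSI C, and SSE2 can stand in for it with a compare, an XOR, and a subtract.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Plain C model of what _mm_sign_epi16(a, b) computes per 16-bit lane:
 * negate a where b < 0, zero it where b == 0, keep it where b > 0. */
static void sign_c(short out[8], const short a[8], const short b[8])
{
    int i;

    for (i = 0; i < 8; i++)
        out[i] = (short)(b[i] < 0 ? -a[i] : (b[i] == 0 ? 0 : a[i]));
}

/* SSE2-only stand-in for the SSSE3 intrinsic (hypothetical helper). */
static void sign_sse2(short out[8], const short a[8], const short b[8])
{
#if defined(__SSE2__)
    __m128i va   = _mm_loadu_si128((const __m128i *)a);
    __m128i vb   = _mm_loadu_si128((const __m128i *)b);
    __m128i zero = _mm_setzero_si128();
    __m128i neg  = _mm_cmplt_epi16(vb, zero);  /* 0xFFFF where b < 0  */
    __m128i isz  = _mm_cmpeq_epi16(vb, zero);  /* 0xFFFF where b == 0 */
    /* (a ^ mask) - mask is -a where mask is all ones, a where it is 0. */
    __m128i r    = _mm_sub_epi16(_mm_xor_si128(va, neg), neg);

    r = _mm_andnot_si128(isz, r);              /* clear lanes where b == 0 */
    _mm_storeu_si128((__m128i *)out, r);
#else
    sign_c(out, a, b);                         /* no SSE2: fall back to the loop */
#endif
}
```

Both functions should produce the same eight results for the same inputs, so the SSE2 path costs a few extra instructions over SSSE3 rather than a different algorithm.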

The only thing I know right now is that all the stuff under SSE2 falls under the positives for all of those questions.
It might be that the multiply operations are all significantly slower because SSE2 is missing the 32-bit multiply storage or unpacking that SSE4 may use.

But I'm inclined to focus on initiate.
SGI had this vector unit prototyped since before 1996.
And SSE2 doesn't come out until late 2000 (plain SSE was 1999)?
That's at least 4 years' difference, plenty of time to contemplate some competitive ideas for vector extensions on most PCs.

But should I acknowledge that it took 11 years and use SSSE3?
Not without some sort of shame, especially if less than 25% of all that SSSE3 entails is used by the RCP, yet the RCP emulator in SSE2 yields at least 95% of the speed obtained by upgrading to SSSE3. Don't like them ratios.

There are also several things going both ways, between the preference of SSE intrinsics over ANSI C, and the preference of ANSI C over SSE intrinsics.
Some compilers optimize the SSE to become better SSE, for example by taking SSE2 intrinsics and emitting newer instructions when the build targets SSE4.
Other algorithms might give the compiler an even better clue if they were done in ANSI C.
Like I said, I think it ultimately just depends on the algorithm in question.

So yeah, I could finish writing a huge essay over how much I wanna be a little bitch about it XD.

I must say though, it's starting to feel pretty fun to explicitly write out the SSE2 for some of this stuff (shuffling, and clamping I need to do too, I think).
The more I keep having to do that though, the more I have to do that basic #ifdef USE_SSE style of thing.
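A minimal sketch of that #ifdef USE_SSE pattern, using clamping as the example. It assumes a 32-bit int and a 16-bit short; clamp_s16 is a hypothetical name, and the real RSP clamps from a wider accumulator, so treat this as the shape of the idea rather than the plugin's code.

```c
#ifdef USE_SSE
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Clamp eight 32-bit values to the signed 16-bit range [-32768, 32767]. */
static void clamp_s16(short out[8], const int acc[8])
{
#ifdef USE_SSE
    /* _mm_packs_epi32 narrows with signed saturation: the four lanes of
     * lo land in out[0..3] and the four lanes of hi in out[4..7]. */
    __m128i lo = _mm_loadu_si128((const __m128i *)&acc[0]);
    __m128i hi = _mm_loadu_si128((const __m128i *)&acc[4]);

    _mm_storeu_si128((__m128i *)out, _mm_packs_epi32(lo, hi));
#else
    int i;

    for (i = 0; i < 8; i++) {
        int v = acc[i];

        if (v < -32768) v = -32768;
        if (v >  32767) v =  32767;
        out[i] = (short)v;
    }
#endif
}
```

Building with -DUSE_SSE picks the intrinsic path; leaving it out keeps the plain loop, which is the version the compiler is free to vectorize however it likes.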

In the RTFM for this baby I'll be sure to advise everyone to re-build the plugin using -mssse3, -msse4, -mavx, or whatever is the highest their system supports. I could maybe release such builds myself, but they would go stale quickly, as new upgrades to the GCC compiler will always keep getting posted afterwards.
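One way to see what a rebuild with those flags actually enabled is to check the macros GCC and Clang predefine for each instruction set. This is only a sketch: simd_level is a hypothetical helper, and other compilers may not define these macros at all.

```c
/* Report the highest instruction set this translation unit was built for,
 * based on macros that GCC and Clang predefine from -msse2, -mssse3,
 * -msse4 (which turns on SSE4.1 and SSE4.2), and -mavx. */
static const char *simd_level(void)
{
#if defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "ANSI C";
#endif
}
```

The same macros can gate alternative code paths, so a single source file can use an SSSE3 or SSE4.1 intrinsic when the builder asked for it and the SSE2 version otherwise.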

Originally Posted by MarathonMan
EDIT: As you probably (certainly) know, many of the RSP multiply functions (VMA**/VMU**/etc.) use 32-bit multiplication.
The phrase keeps losing me, though I'm pretty sure you're referring to 32-bit storage = 16-bit vector slice * 16-bit vector slice.
Remember that from an accuracy point of view, there is no 32-bit VU math on the RSP.
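Read that way, the full 32-bit products can be built in SSE2 without any 32-bit multiply instruction. The sketch below assumes a 32-bit int, a 16-bit short, and a little-endian x86 target; mul16x16_to_32 is a made-up name, and this is not a model of any particular RSP op-code.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* prod[i] = a[i] * b[i], signed 16-bit inputs, full 32-bit result. */
static void mul16x16_to_32(int prod[8], const short a[8], const short b[8])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i lo = _mm_mullo_epi16(va, vb);  /* low 16 bits of each product   */
    __m128i hi = _mm_mulhi_epi16(va, vb);  /* high 16 bits, signed multiply */

    /* Interleave low and high halves into 32-bit lanes: elements 0..3
     * come from unpacklo, elements 4..7 from unpackhi. */
    _mm_storeu_si128((__m128i *)&prod[0], _mm_unpacklo_epi16(lo, hi));
    _mm_storeu_si128((__m128i *)&prod[4], _mm_unpackhi_epi16(lo, hi));
#else
    int i;

    for (i = 0; i < 8; i++)
        prod[i] = (int)a[i] * (int)b[i];
#endif
}
```

The multiplies themselves stay 16-bit; only the storage of the result is 32 bits wide, and the two unpacks are the extra cost SSE2 pays for that.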

That could be why the VMUDH output was so damn large compared to yours.

Still, it's nothing compared to how tremendously big it was before I put any SSE in there at all, and that's enough to satisfy me just for now.

Last edited by HatCat; 18th September 2013 at 04:34 AM.