Old 18th September 2013, 04:42 AM
MarathonMan
Alpha Tester
Project Supporter
Senior Member
Join Date: Jan 2013
Posts: 454

Originally Posted by BatCat
Whether to raise requirements and add new things like this depends, for me, on a few different factors:
  • What part of the RSP emulator does it affect? Does it improve areas where a relatively large amount of time is wasted using inferior and outdated instructions, or does it just give a little boost by replacing code that already accurately and/or readably reflects the RSP's basic behavior to begin with?
  • Is the speed difference from implementing the change big enough to have an impact on really RSP-intensive spots that are slow for most people, or does it only matter for games that already run at full speed, so that raising the system requirements only buys speed-ups in spots that were already full-speed?
  • How specialized is the upgraded SSE intrinsic? Is it something very, very difficult to remodel using ANSI C for past, current, or future compiler interpretations, or is it very similar to a loop in ANSI C and easily enough left to a few possibly more deficient SSE alternatives, until they can improve later?
  • When do the features of this upgraded intrinsic get used? Is it an essential part of the RSP's basic algorithm used for several vector op-codes, or is it an edge trimmer of sorts with just a few op-codes of no particular category, with no incidental resemblance to the realistic math the RSP behavior intended to use?

The only thing I know right now is that all the stuff under SSE2 falls under the positives for all of those questions.
It might be that the multiply operations are all significantly slower under SSE2, because it lacks the 32-bit multiply storage or unpacking that SSE4 may use.

But I'm inclined to focus on initiate.
SGI had this vector unit prototyped since before 1996.
And SSE2 doesn't come out until 1999?
That's at least 3 years of time to contemplate some competitive ideas for vector extensions on most PCs.

But should I acknowledge that it took 11 years and use SSSE3?
Not without some sort of shame, especially if less than 25% of all that SSSE3 entails is used by the RCP, yet the RCP emulator in SSE2 yields at least 95% of the speed obtained by upgrading to SSSE3. Don't like them ratios.

There are also several things going both ways, between the preference of SSE intrinsics over ANSI C, and the preference of ANSI C over SSE intrinsics.
Some compilers optimize the SSE into better SSE, for instance when compiling SSE2 code in an SSE4 dev environment.
Other algorithms might give the compiler an even better clue if they were done in ANSI C.
Like I said, I think it ultimately just depends on the algorithm in question.

So yeah, I could finish writing a huge essay over how much I wanna be a little bitch about it XD.

I must say though, it's starting to feel pretty fun to explicitly write out the SSE2 for some of this stuff (shuffling, clamping I need to do too I think).
The more I keep having to do that, though, the more I have to do that basic #ifdef USE_SSE style of thing.
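To make the #ifdef USE_SSE pattern concrete, here is a minimal sketch of a clamp with an SSE2 path and an ANSI C fallback. The function name clamp_s32_to_s16 and the eight-lane layout are assumptions for illustration, not code from the plugin; USE_SSE is the macro named in the post, and _mm_packs_epi32 is the standard SSE2 signed-saturating pack.

```c
#include <stdint.h>
#ifdef USE_SSE
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: clamp eight 32-bit values to the signed 16-bit range. */
void clamp_s32_to_s16(int16_t dst[8], const int32_t src[8])
{
#ifdef USE_SSE
    /* Load the eight inputs as two vectors of four 32-bit lanes each. */
    __m128i lo = _mm_loadu_si128((const __m128i *)&src[0]);
    __m128i hi = _mm_loadu_si128((const __m128i *)&src[4]);
    /* PACKSSDW narrows every lane to 16 bits with signed saturation. */
    _mm_storeu_si128((__m128i *)dst, _mm_packs_epi32(lo, hi));
#else
    /* Plain ANSI C fallback: same result, one lane at a time. */
    int i;
    for (i = 0; i < 8; i++) {
        int32_t v = src[i];
        if (v > 32767)
            v = 32767;
        if (v < -32768)
            v = -32768;
        dst[i] = (int16_t)v;
    }
#endif
}
```

Building with something like -DUSE_SSE -msse2 would select the intrinsic path; leaving USE_SSE undefined keeps the portable loop, and both paths should produce identical output.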

In the RTFM for this baby I'll be sure to advise everyone to re-build the plugin using -mssse3, -msse4, -mavx, or whatever is the highest their system supports. I could maybe release such builds myself, but that would be awfully deficient as new upgrades to the GCC compiler will always still get posted afterwards.

The phrase keeps losing me, though I'm pretty sure you're referring to 32-bit storage = 16-bit vector slice * 16-bit vector slice.
Remember that from an accuracy point of view, there is no 32-bit VU math on the RSP.
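As a rough illustration of keeping a 16-bit * 16-bit product without ever using a 32-bit lane, the sketch below stores the low and high halves of each signed product in separate 16-bit lanes. The helper name mul16_split and the split-halves layout are assumptions for illustration, not the plugin's actual VMUDH code; _mm_mullo_epi16 and _mm_mulhi_epi16 are standard SSE2 intrinsics.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: signed 16x16 multiply on eight lanes.
 * The 32-bit product of each lane is returned as two 16-bit bit patterns:
 * lo[i] holds bits 0..15, hi[i] holds bits 16..31. */
void mul16_split(uint16_t lo[8], uint16_t hi[8],
                 const int16_t a[8], const int16_t b[8])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PMULLW keeps the low 16 bits, PMULHW the high 16 bits (signed). */
    _mm_storeu_si128((__m128i *)lo, _mm_mullo_epi16(va, vb));
    _mm_storeu_si128((__m128i *)hi, _mm_mulhi_epi16(va, vb));
#else
    /* ANSI C fallback: form the product, then split its bit pattern. */
    int i;
    for (i = 0; i < 8; i++) {
        uint32_t p = (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
        lo[i] = (uint16_t)(p & 0xFFFFu);
        hi[i] = (uint16_t)(p >> 16);
    }
#endif
}
```

Every value stays in a 16-bit slice, which is closer in spirit to the "no 32-bit VU math" point than widening to 32-bit lanes would be.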

That could be why the VMUDH output was so damn large compared to yours.

Still, it's nothing compared to how tremendously big it was before I put any SSE in there at all, and that's enough to satisfy me just for now.
Sorry for the late ninja.

You and I just differ on SSE viewpoints. You'd rather support just about everyone, whereas I'd rather set the minimum requirements where I see fit and force everyone who doesn't meet them to use ANSI C. No harm in that. Programming is all about trade-offs.

IMO, vectorization is always ignored by the big guns to some extent. Core 2 was a 64-bit processor but only had 128-bit vector registers? Seriously? Does it really take us another 2-3 generations of processors until we can get AVX, which actually provides us with vector registers that are 4x the width of the native register size?

It wasn't until AVX2 that Intel even started giving us some basic gather operations (and AVX2 still has no scatter). IMO, vectorization still isn't that useful for most applications, sans the embarrassingly parallel, vectorizable simulators and graphics/audio/your-multimedia-task-here operations. It's kind of a shame, in a way, but it very well may be due to hardware limitations.

Don't advise -myourISAhere, by the way... let GCC do it. -march=native was designed specifically for this purpose. You can even do -march=core2 or some other micro-architecture to enable the latest-and-greatest for that uarch.
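One way to see what -march=native (or -march=core2) actually turned on is to check the feature macros GCC predefines for each enabled extension. The sketch below is only an illustration; the function name compiled_vector_isa is made up, while the macros __SSE2__, __SSSE3__, __SSE4_1__, __AVX__ and __AVX2__ are the ones GCC and Clang define when the corresponding instruction set is enabled.

```c
/* Hypothetical helper: name the highest x86 vector extension that the
 * compiler was allowed to target for this translation unit. */
const char *compiled_vector_isa(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none (plain ANSI C path)";
#endif
}
```

The same macros can gate the intrinsic paths in the source, so a single code base picks up whatever the -march setting allows without anyone hand-picking -mssse3 or -msse4.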