  #471  
Old 14th September 2013, 09:01 PM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

And I'd say you're certainly right.
Explicitly laying out the intrinsics gives all that we'd need now, not later.


The only definite thing I hate about it is that people can't compile it for older computers.
I want anybody to be able to compile the source, but even with all the headers, they'd have to massively rewrite my entire plugin if I wrote the SSE explicitly and their machine didn't have it.

That's probably the only real reason I'm averse to writing it in.

So some SSE stuff will be hardcoded, others will be forwards-extensible but rely on oddball compiler interpretations.

Case in point though, I was really hoping that at least someday an SSE instruction would exist that moves the upper 16 bits of each of the 8 32-bit elements into the 8 16-bit elements of a destination XMM register or something.

If that ever became possible, then VMUDL would be greatly simplified.
Because ACC47..32 is always 0, ACC31..16 is always 0, and ACC15..0 is always the result described above.

Ideally, I thought it should be VERY short, but it's not.
I really had no idea there could be so many struggles for compiler authors to vectorize things that, at least by my intentions, should be really straightforward, and really simple.

Last edited by HatCat; 14th September 2013 at 09:04 PM.
  #472  
Old 14th September 2013, 09:13 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by BatCat View Post
The only definite thing I hate about it is that people can't compile it for older computers.
I want anybody to be able to compile the source, but even with all the headers, they'd have to massively rewrite my entire plugin if I wrote the SSE explicitly and their machine didn't have it.
Hence my egregious use of #ifdef USE_SSE ... #endif.

Ugly, certainly. However, it'll give me the ability to switch between openness and performance later on.
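
To make the #ifdef USE_SSE idea concrete, here is a minimal sketch, not code from either plugin. The helper name vec_add_sat16 and its array-based signature are assumptions made up for illustration. It shows one function with an SSE2 path and a plain C fallback that should give the same answer on machines without SSE:

```c
#include <stdint.h>
#ifdef USE_SSE
#include <emmintrin.h>  /* SSE2 intrinsics, only pulled in when requested */
#endif

/* Hypothetical helper: saturating signed add of two 8-lane 16-bit vectors. */
static void vec_add_sat16(int16_t dst[8], const int16_t a[8], const int16_t b[8])
{
#ifdef USE_SSE
    /* One saturating add over all eight lanes. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epi16(va, vb));
#else
    /* Portable fallback: widen, add, clamp, narrow. */
    int i;
    for (i = 0; i < 8; i++) {
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        dst[i] = (int16_t)sum;
    }
#endif
}
```

Building with -DUSE_SSE picks the intrinsic path; leaving it off keeps the source compilable on anything with a C compiler, which is the trade-off being discussed here.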

Quote:
Originally Posted by BatCat View Post
Case in point though, I was really hoping that at least someday an SSE instruction would exist that moves the upper 16 bits of each of the 8 32-bit elements into the 8 16-bit elements of a destination XMM register or something.
Heh, you and I think alike on that one. A lot of what I wrote uses inline functions for operations that I could see getting a dedicated instruction in the future, to facilitate updating to newer versions of SSE:

Code:
#include <emmintrin.h>  /* SSE2: _mm_srai_epi32, _mm_packs_epi32 */

/* ============================================================================
 *  RSPPackHi32to16: Pack MSBs of 32-bit vectors to 16-bits without saturation.
 * ========================================================================= */
static __m128i
RSPPackHi32to16(__m128i vectorLow, __m128i vectorHigh) {
  /* The arithmetic shift leaves every lane within the int16 range, so the
   * signed-saturating pack below never actually clamps. */
  vectorLow = _mm_srai_epi32(vectorLow, 16);
  vectorHigh = _mm_srai_epi32(vectorHigh, 16);
  return _mm_packs_epi32(vectorLow, vectorHigh);
}
There's _mm_packus_epi32 (SSE4.1), but it treats the input as 32-bit signed and clamps to 16-bit unsigned. Chad Warden don't want that...
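
For illustration, a minimal sketch of how that helper could be exercised on plain arrays. The wrapper name pack_hi_words is hypothetical and not part of CEN64; it only loads two groups of four 32-bit values, runs them through RSPPackHi32to16 as quoted above, and stores the eight 16-bit results. The point it tries to show is that a lane like 0xFFFE0001 comes out as 0xFFFE, whereas an unsigned-saturating pack would see the shifted value as negative and clamp it to 0:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Same helper as quoted in the post above. */
static __m128i
RSPPackHi32to16(__m128i vectorLow, __m128i vectorHigh) {
  vectorLow = _mm_srai_epi32(vectorLow, 16);
  vectorHigh = _mm_srai_epi32(vectorHigh, 16);
  return _mm_packs_epi32(vectorLow, vectorHigh);
}

/* Hypothetical array wrapper: out[i] receives the upper 16 bits of in[i]. */
static void pack_hi_words(const uint32_t in[8], uint16_t out[8])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)&in[0]);  /* lanes 0..3 */
    __m128i hi = _mm_loadu_si128((const __m128i *)&in[4]);  /* lanes 4..7 */
    _mm_storeu_si128((__m128i *)out, RSPPackHi32to16(lo, hi));
}
```

Because the shift is arithmetic, each lane is already a valid signed 16-bit value before the pack, so the bit pattern of the upper half survives unchanged.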

Quote:
Originally Posted by BatCat View Post
Ideally, I thought it should be VERY short, but it's not.
I really had no idea there could be so many struggles for compiler authors to vectorize things that, at least by my intentions, should be really straightforward, and really simple.
Life inside a compiler is hard, man.

EDIT: Oh yeah, and there exist SSE instructions which give only the 16 most significant bits of the result of the multiplication and such, so I use those a lot too.
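
One such instruction is the SSE2 intrinsic _mm_mulhi_epu16, which returns the upper 16 bits of each unsigned 16x16 product. As a rough sketch, assuming the VMUDL behaviour described earlier in the thread (low accumulator slice = upper half of the unsigned product, the other two slices zero), the low slice could be computed like this. The function name vmudl_acc_low and the array layout are hypothetical:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: acc_lo[i] = (vs[i] * vt[i]) >> 16, unsigned. */
static void vmudl_acc_low(const uint16_t vs[8], const uint16_t vt[8],
                          uint16_t acc_lo[8])
{
    __m128i a = _mm_loadu_si128((const __m128i *)vs);
    __m128i b = _mm_loadu_si128((const __m128i *)vt);
    /* Upper 16 bits of each unsigned 32-bit product, all 8 lanes at once. */
    _mm_storeu_si128((__m128i *)acc_lo, _mm_mulhi_epu16(a, b));
}
```

This sidesteps the 32-bit intermediate products entirely, so no separate pack step is needed for this particular case.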

Last edited by MarathonMan; 14th September 2013 at 09:17 PM.
  #473  
Old 14th September 2013, 09:39 PM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

lol, and I thought having 8 different DLL plugins was bad.

I could end up having to keep 4 different source archives, to cover each possible generation of SSE.

Or #ifdef USE_SSE works like you said. When I have to do things like that to every single file, I just get the distinct feeling that I can't be doing everything the right way. Normally I would stick to a bunch of inline functions shared between all the RSP vector ops and merge all the `#ifdef USE_SSE`'s into there, but sometimes the differences are so extreme (like directly emulating the $vco control register using a uint16_t VCO versus using 2 arrays of 8 Boolean ints to vectorize VCO) that this isn't always feasible.
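
To show how far apart those two VCO representations are, here is a small sketch that converts between them. Everything in it is an assumption for illustration: the names vco_pack and vco_unpack are made up, and the bit layout (bits 0..7 as the per-lane carry flags, bits 8..15 as the per-lane "not equal" flags) is the layout assumed here, not something stated in this thread:

```c
#include <stdint.h>

/* Assumed layout: bit i = carry flag of lane i, bit i+8 = not-equal flag. */
static uint16_t vco_pack(const int carry[8], const int not_equal[8])
{
    uint16_t vco = 0;
    int i;
    for (i = 0; i < 8; i++) {
        vco |= (uint16_t)((carry[i] & 1) << i);
        vco |= (uint16_t)((not_equal[i] & 1) << (i + 8));
    }
    return vco;
}

/* Inverse: spread the 16-bit register back into two arrays of 8 flags. */
static void vco_unpack(uint16_t vco, int carry[8], int not_equal[8])
{
    int i;
    for (i = 0; i < 8; i++) {
        carry[i] = (vco >> i) & 1;
        not_equal[i] = (vco >> (i + 8)) & 1;
    }
}
```

The array form lets a loop over the 8 lanes touch one flag per lane with no bit twiddling, which is friendlier to vectorization; the packed form matches the register as software reads it. Code written against one does not drop into the other without conversions like these.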

So many damn things.
But, never something I would rule out permanently. For now I'm too impatient to learn how SSE2 functions work, but I'm sure I'll have to do it ultimately, like for shuffling 0q-, 1q-, and (n)h-encoded vector coefficients.

Quote:
Originally Posted by MarathonMan View Post
Life inside a compiler is hard, man.

EDIT: Oh yeah, and there exist SSE instructions which give only the 16 most significant bits of the result of the multiplication and such, so I use those a lot too.
Then it should have generated something like that for VMUDH.

I mean, I'm guessing it's hard to vectorize everything to simplest form, because of the huge variety of choices to use for SSE methods?
But that's only an impulse guess.

Ideally the array of 32-bit products should be unsigned, since (unsigned short)(0xFFFF) * (unsigned short)(0xFFFF) would set the sign bit of the corresponding result element.
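
A quick worked case of that product, as a sketch with a hypothetical helper name mul_u16. One C detail worth spelling out: with a 32-bit int, both unsigned short operands get promoted to signed int before the multiply, so 0xFFFF * 0xFFFF overflows int. Casting to uint32_t first keeps the arithmetic unsigned:

```c
#include <stdint.h>

/* Hypothetical helper: full 32-bit unsigned product of two 16-bit values. */
static uint32_t mul_u16(uint16_t a, uint16_t b)
{
    /* Cast before multiplying so the product is computed as unsigned. */
    return (uint32_t)a * (uint32_t)b;
}
```

mul_u16(0xFFFF, 0xFFFF) is 0xFFFE0001, which has bit 31 set. Stored in a signed 32-bit element that reads as a negative number, which is why the products array wants to be unsigned.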
  #474  
Old 16th September 2013, 03:21 PM
uyjulian uyjulian is offline
Junior Member
 
Join Date: Sep 2013
Posts: 29
Default

modern computers nowadays have SSE2, why drop support for it? (dolphin requires sse2)
also... are you compiling this on windows? try cross-compiling on linux. (I do that all the time, simply because the compiler on linux makes the program faster.)
  #475  
Old 16th September 2013, 05:19 PM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by uyjulian View Post
modern computers nowadays have SSE2, why drop support for it? (dolphin requires sse2)
Where do you get these hallucinations?

I said I'll never accept going any further *past* SSE2, not "SSE2 is unacceptable. I'm changing the maximum limit to SSE1."

It still is SSE2.
zilmar's RSP emulator also uses SSE2, but it affects less than 1% of the vector opcodes so is completely insignificant, due to his choice of Microsoft's optimizing compiler.

Quote:
Originally Posted by uyjulian View Post
also... are you compiling this on windows? try cross-compiling on linux. (I do that all the time, simply because the compiler on linux makes the program faster.)
I use MinGW, a native Windows port of the Linux compiler.
Are you sure they're very different?
I know there are a few things MinGW authors didn't port over from native GCC, but I've never had to use those.

Agreed, the best thing to do would be to use the latest GCC on Linux officially, but then how is this portable over to Windows or people who don't use Linux?
That's why I set up the makefile and related things for the more limited MinGW setup.

There is already a Mupen64Plus port of this plugin, so they maintain the native GCC compiler/linker upgrades to this plugin for Linux only.
  #476  
Old 17th September 2013, 06:46 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Not great.

I just upgraded finally from GCC 4.7.2 to 4.8.1, and the vectorization output is even worse than before.

VADD is 10 more lines of SSE2 plus extra branches that didn't need to be added.
Same thing for VMUDL, VXOR, and the inline SIGNED_CLAMP method.

The only opcode I checked that isn't worse than before (it's the same) is VSAW (VSAWM/VSAWH/VSAWL), and only because I changed the array of accumulator sections from plain (short) to (unsigned short) type, so that GCC keeps using SSE2 to move them instead of normal 32-bit Intel mov's.

I really hope that this doesn't continue to go downhill with future GCC releases.
  #477  
Old 17th September 2013, 04:48 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by BatCat View Post
I use MinGW, a native Windows port of the Linux compiler.
Are you sure they're very different?
They're not. At all. Whatsoever. They'll generate the same code sans edge cases where calling conventions between platforms differ and whatnot.

Quote:
Originally Posted by BatCat View Post
Not great.

I just upgraded finally from GCC 4.7.2 to 4.8.1, and the vectorization output is even worse than before.
...

Quote:
Originally Posted by MarathonMan View Post
I guess I should be careful what I say and elaborate. Let me rephrase to: trust compilers in the absence of vectorization. Compilers are still awful at vectorizing things;
Quote:
Originally Posted by MarathonMan View Post
Also, CEN64: I need performance now, not in ten years, so I wrote out algorithms using intrinsics by hand. I agree that they are less portable, uglier, and in general a nuisance in the future, but I don't really have any other options given the extent of my target goal.
  #478  
Old 17th September 2013, 05:36 PM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Or I can just roll GCC back to 4.7.2 like I was using earlier and keep playing with that until newer versions start to improve it again.
  #479  
Old 17th September 2013, 06:04 PM
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by BatCat View Post
Or I can just roll GCC back to 4.7.2 like I was using earlier and keep playing with that until newer versions start to improve it again.
Not trying to trollolol, just sayin' vectorization is a world of hurt sometimes when relying on the compiler.
  #480  
Old 17th September 2013, 06:38 PM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Such a talented opera singer that man is.
Not quite as ABAP/Ballin' as some others, but still pretty nifty.


Anyway, no matter what I want this code to be as portable as possible.
So what I'll do here is temporarily downgrade just the CC unit back to 4.7.2, while keeping the newer version of everything else.

I'll also see about writing some SSE2 intrinsics explicitly for handling the vector register shuffling. If entirely successful, there should be no more need to have the vector execute table be 2-D, which should free up some more cache space.
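
As a rough idea of what that shuffling could look like with SSE2 only, here is a sketch using _mm_shufflelo_epi16 and _mm_shufflehi_epi16. The function names are hypothetical, and it assumes lane i of the XMM register corresponds to element i of the vector register (index i of an array of 8 shorts), which may not match how a given plugin orders its elements:

```c
#include <emmintrin.h>  /* SSE2; also provides _MM_SHUFFLE */
#include <stdint.h>

/* 0q: lanes {0,0,2,2,4,4,6,6} */
static __m128i shuffle_0q(__m128i v)
{
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 2, 0, 0));  /* low four lanes  */
    return _mm_shufflehi_epi16(v, _MM_SHUFFLE(2, 2, 0, 0)); /* high four lanes */
}

/* 1q: lanes {1,1,3,3,5,5,7,7} */
static __m128i shuffle_1q(__m128i v)
{
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(3, 3, 1, 1));
    return _mm_shufflehi_epi16(v, _MM_SHUFFLE(3, 3, 1, 1));
}

/* 0h: lanes {0,0,0,0,4,4,4,4}; 1h..3h would use 1..3 in place of 0. */
static __m128i shuffle_0h(__m128i v)
{
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 0, 0, 0));
    return _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 0, 0, 0));
}
```

The shuffle immediates have to be compile-time constants, so each encoding ends up as its own small function (or a switch over them) rather than one function taking the element specifier at run time.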
All times are GMT. The time now is 04:46 PM.

