#471
And I'd say you're certainly right.
Explicitly laying out the intrinsics gives us everything we'd need now, not later. The only thing I definitely hate about it is that people can't compile it for older computers. I want anybody to be able to compile the source, but even with all the headers, they'd have to massively rewrite my entire plugin if their machine didn't have SSE and I had written it in explicitly. That's probably the only real reason I'm averse to writing it in. So some SSE stuff will be hardcoded, while other parts will be forwards-extensible but rely on oddball compiler interpretations.
Case in point, though: I was really hoping that someday an SSE instruction would exist that moves the upper 16 bits of each of the 8 32-bit elements into the 8 16-bit elements of a destination XMM register or something. If that ever became possible, VMUDL would be greatly simplified, because ACC47..32 is always 0, ACC31..16 is always 0, and ACC15..0 is always that result. Ideally, I thought it should be VERY short, but it's not.
I really had no idea there could be so many struggles for compiler authors to vectorize things that, at least by my intentions, should be really straightforward and really simple.
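For readers following along, the VMUDL behavior described in this post can be sketched for a single lane in plain C. This is only an illustration under the post's own description (upper and middle accumulator slices zero, low slice holding the upper 16 bits of the product); the names `acc_lane_t` and `vmudl_lane` are made up here and are not taken from the plugin source.

```c
#include <stdint.h>

/* Hypothetical per-lane accumulator, split into three 16-bit slices. */
typedef struct {
    uint16_t hi; /* upper slice  */
    uint16_t md; /* middle slice */
    uint16_t lo; /* low slice    */
} acc_lane_t;

/* Scalar sketch of VMUDL for one lane (assumed semantics, per the post):
 * unsigned 16x16 multiply, keep only the upper 16 bits of the product. */
static acc_lane_t vmudl_lane(uint16_t vs, uint16_t vt)
{
    acc_lane_t acc;
    uint32_t product = (uint32_t)vs * (uint32_t)vt;

    acc.hi = 0;
    acc.md = 0;
    acc.lo = (uint16_t)(product >> 16);
    return acc;
}
```

Doing this for all 8 lanes at once is where the "move the upper 16 bits of every 32-bit element" wish comes from.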
__________________
http://theoatmeal.com/comics/cat_vs_internet
Last edited by HatCat; 14th September 2013 at 09:04 PM.
#472
Ugly, certainly? However, it'll give me the ability to switch between openness and performance later on.
Code:
#include <emmintrin.h> /* SSE2 intrinsics */

/* ============================================================================
 * RSPPackHi32to16: Pack MSBs of 32-bit vectors to 16-bits without saturation.
 * ========================================================================= */
static __m128i RSPPackHi32to16(__m128i vectorLow, __m128i vectorHigh) {
  /* After the arithmetic shift every lane already fits in a signed 16-bit
   * value, so the saturating pack below never actually saturates. */
  vectorLow = _mm_srai_epi32(vectorLow, 16);
  vectorHigh = _mm_srai_epi32(vectorHigh, 16);
  return _mm_packs_epi32(vectorLow, vectorHigh);
}
EDIT: Oh yeah, and there exist SSE instructions which give only the 16 most significant bits of the result of the multiplication and such, so I use those a lot too.
Last edited by MarathonMan; 14th September 2013 at 09:17 PM.
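The post does not name the instructions it means. One SSE2 intrinsic that fits the description is `_mm_mulhi_epu16`, which returns the upper 16 bits of each unsigned 16x16 product. The sketch below is a guess at how that could look for 8 lanes, with a scalar fallback when the compiler does not define `__SSE2__`; the function name `mulhi_u16x8` is hypothetical.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Upper 16 bits of each unsigned 16x16 product, for all 8 lanes.
 * Hypothetical helper, not taken from either emulator. */
static void mulhi_u16x8(const uint16_t vs[8], const uint16_t vt[8],
                        uint16_t out[8])
{
#if defined(__SSE2__)
    __m128i a = _mm_loadu_si128((const __m128i *)vs);
    __m128i b = _mm_loadu_si128((const __m128i *)vt);

    _mm_storeu_si128((__m128i *)out, _mm_mulhi_epu16(a, b));
#else
    /* Scalar fallback for machines or compilers without SSE2. */
    int i;

    for (i = 0; i < 8; i++)
        out[i] = (uint16_t)(((uint32_t)vs[i] * (uint32_t)vt[i]) >> 16);
#endif
}
```

Keeping both paths behind one function is also one way to confine the SSE dependency to a single place.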
#473
lol, and I thought having 8 different DLL plugins was bad.
I could end up having 4 different source archives, to cover each possible generation of SSE. Or #ifdef USE_SSE works like you said. When I have to do things like that to every single file, I just get the distinct feeling that I can't be doing everything the right way. Normally I would stick to a bunch of inline functions shared between all the RSP vector ops and merge all the `#ifdef USE_SSE`'s into there, but sometimes the differences are so extreme (like directly emulating the $vco control register using a uint_16 VCO, versus using 2 arrays of 8 Boolean ints to vectorize VCO) that this isn't always feasible. So many damn things. But it's never something I would rule out permanently. For now I'm too impatient to learn how SSE2 functions work, but I'm sure I'll have to do it ultimately, like for shuffling 0q-, 1q-, and (n)h-encoded vector coefficients.
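To make the two VCO representations mentioned above concrete, here is a small conversion sketch between a packed 16-bit register and two arrays of 8 ints. The bit layout (low byte = carry flags, high byte = not-equal flags) and all names are assumptions for illustration, not taken from the plugin.

```c
#include <stdint.h>

/* Hypothetical split form: one int per lane for each flag group. */
typedef struct {
    int carry[8];
    int not_equal[8];
} vco_split_t;

/* Assumed layout: bits 0..7 = carry flags, bits 8..15 = not-equal flags. */
static vco_split_t vco_unpack(uint16_t vco)
{
    vco_split_t s;
    int i;

    for (i = 0; i < 8; i++) {
        s.carry[i] = (vco >> i) & 1;
        s.not_equal[i] = (vco >> (i + 8)) & 1;
    }
    return s;
}

/* Inverse of vco_unpack: fold the per-lane flags back into 16 bits. */
static uint16_t vco_pack(const vco_split_t *s)
{
    uint16_t vco = 0;
    int i;

    for (i = 0; i < 8; i++) {
        vco |= (uint16_t)((s->carry[i] & 1) << i);
        vco |= (uint16_t)((s->not_equal[i] & 1) << (i + 8));
    }
    return vco;
}
```

The split form is friendlier to per-lane loops, while the packed form matches what the emulated register looks like; converting only at the points where the register is read or written is one possible compromise.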
I mean, I'm guessing it's hard to vectorize everything to its simplest form because of the huge variety of SSE methods to choose from? But that's only an impulse guess. Ideally the array of 32-bit products should be unsigned, since (unsigned short)(0xFFFF) * (unsigned short)(0xFFFF) would set the sign bit of the corresponding result element.
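The sign-bit point can be shown in a few lines of C. On platforms where `int` is 32 bits, two `unsigned short` operands are promoted to signed `int` before multiplying, and 0xFFFF * 0xFFFF = 0xFFFE0001 does not fit in a signed 32-bit value. Widening to an unsigned 32-bit type first avoids that; the helper name below is made up for the example.

```c
#include <stdint.h>

/* Widen both operands to uint32_t before multiplying so the full
 * 32-bit product (up to 0xFFFE0001) is computed without signed overflow. */
static uint32_t mul_u16_wide(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}
```

With the product held as unsigned, a plain right shift by 16 yields the upper half without any sign extension.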
#474
Modern computers nowadays have SSE2, so why drop support for it? (Dolphin requires SSE2.)
Also... are you compiling this on Windows? Try cross-compiling on Linux. (I do that all the time, simply because the compiler on Linux makes the program faster.)
#475
I said I'll never accept going any further *past* SSE2, not "SSE2 is unacceptable. I'm changing the maximum limit to SSE1." It still is SSE2. zilmar's RSP emulator also uses SSE2, but it affects less than 1% of the vector opcodes, so it is completely insignificant, due to his choice of Microsoft's optimizing compiler.
Are you sure they're very different? I know there are a few things the MinGW authors didn't port over from native GCC, but I've never had to use those. Agreed, the best thing to do would be to use the latest GCC on Linux officially, but then how is this portable over to Windows or to people who don't use Linux? That's why I set up the makefile and such for the more deficient MinGW setup. There is already a Mupen64Plus port of this plugin, so they maintain the native GCC compiler/linker upgrades to this plugin for Linux only.
#476
Not great.
I finally upgraded from GCC 4.7.2 to 4.8.1, and the vectorization output is even worse than before. VADD is 10 more lines of SSE2, plus extra branches that didn't need to be added. Same thing for VMUDL, VXOR, and the inline SIGNED_CLAMP method. The only opcode I checked that isn't worse than before (it's the same) is VSAWM/VSAWH/VSAWL, and even there I had to change the array of accumulator sections to (unsigned short) type, as opposed to plain (short), so that it would keep using SSE2 to move them instead of normal Intel 32-bit mov's. I really hope this doesn't continue to go downhill with future GCC releases.
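For context, this is roughly the kind of plain-C loop whose auto-vectorized output is being compared between compiler versions. It is a simplified sketch, not the plugin's code: the carry-in that a real VADD would add is deliberately left out, and `signed_clamp` and `vadd_sketch` are illustrative names.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
static int16_t signed_clamp(int32_t x)
{
    if (x > 32767)
        return 32767;
    if (x < -32768)
        return -32768;
    return (int16_t)x;
}

/* Simplified saturating add over 8 lanes (carry-in omitted on purpose).
 * Whether the compiler turns this into SSE2 depends on version and flags. */
static void vadd_sketch(int16_t vd[8], const int16_t vs[8],
                        const int16_t vt[8])
{
    int i;

    for (i = 0; i < 8; i++)
        vd[i] = signed_clamp((int32_t)vs[i] + (int32_t)vt[i]);
}
```

Compiling something like this with `gcc -O3 -msse2 -S` under each GCC version and diffing the assembly is one way to check how the vectorizer's output changed.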
#477
#478
Or I can just roll GCC back to 4.7.2 like I was using earlier and keep playing with that until newer versions start to improve it again.
#479
#480
Such a talented opera singer that man is.
Not quite as ABAP/Ballin' as some others, but still pretty nifty. Anyway, no matter what, I want this code to be as portable as possible. So what I'll do here is temporarily downgrade just the CC unit back to 4.7.2, while keeping the newer version of everything else. I'll also see about writing some SSE2 intrinsics explicitly for handling the vector register shuffling. If entirely successful, there should be no more need for the vector execute table to be 2-D, which should free up some more cache space.
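As a rough idea of what explicit SSE2 shuffling could look like, here is a sketch for one pattern. It assumes the "0q" encoding selects lanes 0,0,2,2,4,4,6,6 and that array index equals lane number; both are assumptions made for the example, and the real plugin may order elements differently. The function name `shuffle_0q` is hypothetical, and a scalar fallback is included for builds without `__SSE2__`.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Assumed "0q" pattern: output lanes take input lanes 0,0,2,2,4,4,6,6. */
static void shuffle_0q(const int16_t in[8], int16_t out[8])
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Low four words become 0,0,2,2; high four words become 4,4,6,6. */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 2, 0, 0));
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(2, 2, 0, 0));
    _mm_storeu_si128((__m128i *)out, v);
#else
    /* Scalar fallback: clear the low index bit to pick the even lane. */
    int i;

    for (i = 0; i < 8; i++)
        out[i] = in[i & ~1];
#endif
}
```

Under the same assumptions, a "1q" variant would only differ in the shuffle immediates, e.g. `_MM_SHUFFLE(3, 3, 1, 1)` for both halves.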