#501
Quote:
The only thing I know right now is that all the stuff under SSE2 falls under the positives for all of those questions. It might be that the multiply operations are all significantly slower because of the missing 32-bit multiply storage or unpacking that SSE4 may offer, but I'm inclined to focus on the initiative here. SGI had this vector unit prototyped since before 1996, and SSE2 doesn't come out until around 2000? That's not too bad...at least 3 years' worth of time to contemplate some competitive ideas for vector extensions on most PCs. But should I acknowledge that it took 11 years and use SSSE3? Not without some sort of shame, especially if less than 25% of all that SSSE3 entails is used by the RCP, yet the RCP emulator in SSE2 yields at least 95% of the speed obtained by upgrading to SSSE3. Don't like them ratios.

There are also several things going both ways between preferring SSE intrinsics over ANSI C and preferring ANSI C over SSE intrinsics. Some compilers will optimize the SSE into better SSE, looking at SSE2 code in an SSE4 dev environment. Other algorithms might give the compiler an even better clue if they were left in ANSI C. Like I said, I think it ultimately just depends on the algorithm in question. So yeah, I could finish writing a huge essay over how much I wanna be a little bitch about it XD.

I must say though, it's starting to feel pretty fun to explicitly write out the SSE2 for some of this stuff (shuffling, and clamping which I still need to do, I think). The more I keep having to do that, though, the more I have to lean on that basic #ifdef USE_SSE style of things (a rough sketch of that pattern is at the end of this post). In the RTFM for this baby I'll be sure to advise everyone to re-build the plugin using -mssse3, -msse4, -mavx, or whatever is the highest their system supports. I could maybe release such builds myself, but that would be awfully deficient, as new upgrades to the GCC compiler will always keep getting posted afterwards.

Quote:
Remember that from an accuracy point of view, there is no 32-bit VU math on the RSP.

That could be why the VMUDH output was so damn large compared to yours. Still, it's nothing compared to how tremendously big it was before I put any SSE in there at all, and that's enough to satisfy me just for now.
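A minimal sketch of that #ifdef USE_SSE pattern, using a made-up saturated-add helper purely for illustration; the function name, the exact USE_SSE macro spelling, and N here are assumptions, not the plugin's actual source:

Code:
#include <stdint.h>
#ifdef USE_SSE
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

#define N 8 /* shorts per RSP vector register */

/* hypothetical op:  vd = clamp_signed_16(vs + vt) */
static void vector_adds(int16_t vd[N], const int16_t vs[N], const int16_t vt[N])
{
#ifdef USE_SSE
    __m128i xmm = _mm_adds_epi16(
        _mm_loadu_si128((const __m128i *)vs),
        _mm_loadu_si128((const __m128i *)vt));
    _mm_storeu_si128((__m128i *)vd, xmm);
#else
    int i;

    for (i = 0; i < N; i++)
    { /* ANSI C fallback:  widen, add, clamp to [-32768, +32767] */
        const int32_t sum = (int32_t)vs[i] + (int32_t)vt[i];
        vd[i] = (sum > +32767) ? +32767 : (sum < -32768) ? -32768 : (int16_t)sum;
    }
#endif
}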
__________________
http://theoatmeal.com/comics/cat_vs_internet

Last edited by HatCat; 18th September 2013 at 04:34 AM.
#502
Stupid typos.
Go edit yourselves, fuckers!

And as for why I even bother to optimize for PCs that won't get past 10-20% full speed anyway: the important thing to me is that it's still the fastest emulator. They might download the emulator/plugin and find that it won't even work on their system. They might download it and find it's even slower than all the other plugins they can choose from. Best case: it's still only 20% speed, but it's faster than all the other plugins they have access to.

That's no reason for me to deliberately break the ability to use it on older hardware, just because it's not full speed. They'll complain that it's slow, but they can't complain that it isn't the best they've got, because they will pursue that in every way until they learn to finally do it the right way and get some dough and buy some real hardware, dammit!
__________________
http://theoatmeal.com/comics/cat_vs_internet
#503
Quote:
You and I just differ on SSE viewpoints. You'd rather support just about everyone, whereas I'd rather force everyone that doesn't have the minimum requirements I see fit to use ANSI C. No harm in that.

IMO, vectorization is always ignored by the big guns to some extent. Core 2 was a 64-bit processor but only had 128-bit vector registers? Seriously? Does it really take us another 2-3 generations of processors until we can get AVX, which actually provides us with vector registers that are 4x the width of the native register size? It wasn't until AVX2 that Intel even started giving us some basic gather operations. IMO, vectorization still isn't that useful for most applications, sans the embarrassingly parallel, vectorizable simulators and graphics/audio/your-multimedia-task-here operations. It's kind of a shame, in a way, but it very well may be due to hardware limitations.

Don't advise -myourISAhere, by the way... let GCC do it. -march=native was designed specifically for this purpose. You can even do -march=core2 or some other micro-architecture to enable the latest-and-greatest for that uarch.
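A quick example of the difference (the source file name here is invented; the flags themselves are stock GCC options):

Code:
# pin one ISA level by hand:
gcc -O2 -msse2 -c rsp_vector.c

# or let GCC enable everything the build machine actually supports:
gcc -O2 -march=native -c rsp_vector.c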
#504
Quote:
Chad Warden want some of that SSSE triple.
#505
^ That was before SSE4!
"Chad Warden? Wipes his ASS, with SSE2! I'm talkin' bout that: SSSE-TRIPLE!" After SSE4: "Come on, man...SSSE3? What kind of poor, bitch-ass, cardboard box are you LIVIN' out of, baby?!" Yeah that's true it's actually supposed to be -march=native, not -msse9001. The only reason I say -msse2 in my builds script instead of -march=native (besides easily being able to quickly change the "2" to something else) is because it's shorter text and maintains the command being all in one line of text, without going past the 80-characters-per-line tradition I semi-foolishly abide by. And yeah, it's retarded how freaking late some of these enhancements arrive. I don't know much about AVX (never even heard of AVX2 yet...), but Nintendo is kinda world-famous! They should have taken a more open-source attitude with hardware ISA improvements, and this kind of stuff would have all been done sooner and better. But it looks like SSE2 came out in 2001-2003 range, which is the range of years where the N64 officially came to its gaming halt, so, I view at least that as a remotely acceptable initiative for stating as the requirement for my emulator.
__________________
http://theoatmeal.com/comics/cat_vs_internet
#506
Now that's funny.
Opcodes like VNAND0q only have 2 extra instructions more than VNAND_v. In other words, if (e == 0x0) and you don't need to shuffle anything, the function is only 2 x86/SSE instructions smaller than the one where (e == 0x2).

Code:
pshufhw $160, _VR(%ecx), %xmm0
pshuflw $160, %xmm0, %xmm1

I kid you not. And you wanted to convince me to upgrade to at least SSSE3 so I can use pshufb for a single shuffle instruction and shit?

I have a feeling the multiplies are going to be my biggest hurdle with SSE2, and the current clamping loop sucks, too.
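For reference, the SSE2 intrinsics that compile down to exactly that pair; the helper name is made up, and _MM_SHUFFLE(2, 2, 0, 0) is the same 160 (0xA0) immediate:

Code:
#include <emmintrin.h> /* SSE2 */

/* broadcast the even word lanes (0,0,2,2,4,4,6,6), which is what the
 * pshufhw/pshuflw pair above does */
static __m128i shuffle_0q(__m128i vt)
{
    vt = _mm_shufflehi_epi16(vt, _MM_SHUFFLE(2, 2, 0, 0)); /* pshufhw $0xA0 */
    vt = _mm_shufflelo_epi16(vt, _MM_SHUFFLE(2, 2, 0, 0)); /* pshuflw $0xA0 */
    return vt;
}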
__________________
http://theoatmeal.com/comics/cat_vs_internet
#507
Man, with shuffling that damned instantaneous (only 2 added opcodes), I'm really starting to hate myself for using this split vector function tree (15 copies of every RSP VU instruction function, one for each legal element-encoding integer).
Then again, you did say it's not possible to shuffle the target vector in SSE2 when the element specifier isn't statically known. Though from that horsing around a few pages back, I still have a partial hope you could be wrong about it.

The bigger trick is that only 0-1q and 0-3h use those exact functions you wrote. If it's 0-7w then it uses some other function to shuffle instead, which makes it somewhat trickier. What an exciting turn of events it will be if and when I can determine how to do the shuffle outside the RSP vector opcodes, in the main RSP execute loop under the VU block, so that I can merge these redundant-ass functions back into one and save all that space!
__________________
http://theoatmeal.com/comics/cat_vs_internet
#508
You can't. You just simply cannot do this without using either SSSE3 or jump tables to direct to the specific kind of shuffle. `pshufXw` takes an imm8 as its operand that instructs it how to shuffle. The only possible way that you could inline it is by doing dynamic recompilation and generating the opcode at runtime, along with the desired RSP function, and calling that. Short of that, however, there's no way to control the immediate value of the SSE2 shuffle intrinsic without using some form of indirection in a static binary.

This is why I require SSSE3. If this were not the case, I'd happily support SSE2 instead of SSSE3.
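A sketch of the contrast, assuming a hypothetical key table laid out for the same 0,0,2,2,4,4,6,6 word order as the pshufhw/pshuflw pair a couple of posts back:

Code:
#include <tmmintrin.h> /* SSSE3 */

/* SSSE3: the shuffle control is an ordinary register operand, so it can
 * be loaded from a table indexed by the element specifier at runtime. */
static const unsigned char key_0q[16] = {
    0, 1, 0, 1, 4, 5, 4, 5, 8, 9, 8, 9, 12, 13, 12, 13
};

static __m128i shuffle_runtime(__m128i vt, const unsigned char key[16])
{
    return _mm_shuffle_epi8(vt, _mm_loadu_si128((const __m128i *)key)); /* pshufb */
}

/* SSE2: _mm_shufflelo_epi16/_mm_shufflehi_epi16 insist on a compile-time
 * constant immediate, so the specifier cannot be a plain runtime variable. */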
Last edited by MarathonMan; 18th September 2013 at 12:04 PM.
#509
Hummmz.
Seems legit.

Oh well, at least I've ruled the possibility entirely out of my mind, rather than remaining confused about how their function works. So what I'll do is just have a macro alternative version of the VU jump table that does a one-dimensional jump to all the _v functions, only after slower dynamic shuffling, which can easily be rewritten by somebody with SSSE3 hardware in the future. Otherwise it's not too bad having split functions: it avoids doing any shuffling at all if it's one of those typical pure-vector operands.
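A rough shape of that fallback, with every name hypothetical: a 16-entry table of shuffle helpers (each carrying its own hard-coded immediate, like the pair shown earlier), then the single one-dimensional jump into the merged _v handler. Only the trivial entry is written out below.

Code:
#include <emmintrin.h> /* SSE2 */

typedef __m128i (*shuffle_fn)(__m128i vt);

static __m128i sh_vector(__m128i vt)
{ /* pure-vector operand (e = 0): no shuffling at all */
    return vt;
}

/* ...one helper per remaining legal element specifier... */

static const shuffle_fn shuffle_for_e[16] = {
    sh_vector, sh_vector /* , sh_0q, sh_1q, sh_0h, ... */
};

The dispatch in the VU block would then boil down to something like VU_v[op](vd, vs, shuffle_for_e[e](vt)): one indirect call for the shuffle, then a single jump table for the opcodes themselves.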
__________________
http://theoatmeal.com/comics/cat_vs_internet
#510
Quote:
VSAW-middle:

Code:
static void VSAWM(void)
{
    const int vd = inst.R.sa;

    memcpy(VR[vd], VACC_M, N*sizeof(short));
    return;
}

Code:
_VSAWM:
LFB1159:
	.cfi_startproc
	movzwl	_inst, %eax
	movl	_VACC+16, %ecx
	shrw	$6, %ax
	andl	$31, %eax
	sall	$4, %eax
	leal	_VR(%eax), %edx
	movl	%ecx, _VR(%eax)
	movl	_VACC+20, %eax
	movl	%eax, 4(%edx)
	movl	_VACC+24, %eax
	movl	%eax, 8(%edx)
	movl	_VACC+28, %eax
	movl	%eax, 12(%edx)
	ret
	.cfi_endproc

So it should hardly be a "world of hurt" for the latest GCC to do it, just some temporary bug that hopefully goes away in later versions. I don't know why 4.8.2 has the bug. I've tried everything to get it to MOVDQA the ACC_M over to VR[vd], and the only solution that works, besides downgrading to 4.7.2, is changing the definition of the accumulator array from (short) to (unsigned short), for some stupid-ass reason.
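For what it's worth, the copy can also be forced explicit with SSE2 intrinsics instead of waiting on the vectorizer. A sketch that reuses the same inst/VR/VACC_M globals as the function above and assumes both buffers are 16-byte aligned (if they are not, _mm_loadu_si128/_mm_storeu_si128 would be the safe spelling):

Code:
#include <emmintrin.h> /* SSE2 */

/* same 16-byte copy as the memcpy() version, but spelled out so it maps
 * to an aligned 128-bit load/store (movdqa) pair */
static void VSAWM_sse2(void)
{
    const int vd = inst.R.sa;

    _mm_store_si128((__m128i *)VR[vd],
                    _mm_load_si128((const __m128i *)VACC_M));
    return;
}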
__________________
http://theoatmeal.com/comics/cat_vs_internet