  #501  
Old 18th September 2013, 04:31 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by MarathonMan View Post
If I get my hands on something, I'm usually more than willing to support it for a small minority (so long as it doesn't involve pushing the majority off the ship).

No, there is no 32-bit signed multiply support prior to SSE4.1. My thoughts on making SSSE3 the minimum for CEN64 are primarily based on the fact that CEN64 requires a processor recent enough to get respectable performance anyway. Why go out of my way to support SSE2 when those processors are only ever going to see around 10-20 VI/s? Is there a benefit to SSE2 over ANSI C? Absolutely, and I'm glad to see you're taking advantage of it. But for a cycle-accurate simulator, I don't see a place for it between ANSI C and SSSE3, given its lack of flexible shuffles and extra fancy things like _mm_sign_epiX.

I actually thought about seeing if I could leverage TSX somehow, even though I don't have it available on my own machine at home (as it just arrived in Haswell).
To raise requirements and add new things like this, I'd say it depends on a few different factors:
  • What part of the RSP emulator does it affect? Does it improve areas where a relatively large amount of time is wasted on inferior, outdated instructions, or does it just give a little boost by replacing code that already reflects the RSP's basic behavior accurately and/or readably to begin with?
  • Is the speed difference big enough to have an impact on really RSP-intensive spots that are slow for most people, or does it only matter for games that already run at full speed, with the risk of raising system requirements just to speed up spots that were already full speed?
  • How specialized is the upgraded SSE intrinsic? Is it something very difficult to remodel in ANSI C for past, current, or future compilers, or is it close enough to a plain ANSI C loop that it can be left to a few possibly weaker SSE2 alternatives until they can be improved later?
  • When do the features of this upgraded intrinsic get used? Is it an essential part of the RSP's basic algorithm, shared by several vector op-codes, or is it an edge trimmer of sorts for just a few op-codes of no particular category, with no real resemblance to the math the RSP behavior intended?

The only thing I know right now is that everything up through SSE2 lands on the positive side of all of those questions.
It might be that the multiply operations all end up significantly slower, because SSE2 is missing the 32-bit multiply storage or unpacking that SSE4 can use.
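Just to picture what that costs, here's a rough sketch (my own illustration, not code from the plugin) of getting 32-bit products out of 16-bit slices with nothing but SSE2: you piece them together from mullo/mulhi plus an unpack, whereas SSE4.1 could sign-extend with _mm_cvtepi16_epi32 and multiply with _mm_mullo_epi32 directly.
Code:
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper, just to illustrate the point: the 32-bit products of
 * the low four signed 16-bit slices, built from SSE2's mullo/mulhi pair. */
static __m128i mul_s16_to_s32_lo(__m128i vs, __m128i vt)
{
    __m128i lo = _mm_mullo_epi16(vs, vt);   /* low 16 bits of each product  */
    __m128i hi = _mm_mulhi_epi16(vs, vt);   /* high 16 bits, signed         */
    return _mm_unpacklo_epi16(lo, hi);      /* interleave into 32-bit lanes */
}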

But I'm inclined to focus on when all of this got initiated.
SGI had this vector unit prototyped since before 1996.
And SSE2 didn't come out until the Pentium 4, around 2000-2001?
That's not too bad...at least four years of time to contemplate some competitive ideas for vector extensions on most PCs.

But should I acknowledge that it took 11 years and use SSSE3?
Not without some sort of shame, especially if less than 25% of all that SSSE3 entails is used by the RCP, yet the RCP emulator in SSE2 yields at least 95% of the speed obtained by upgrading to SSSE3. Don't like them ratios.

There are also several things going both ways between preferring SSE intrinsics over ANSI C and preferring ANSI C over SSE intrinsics.
Some compilers will optimize the SSE into better SSE, e.g. by re-compiling SSE2 code in an SSE4 dev environment.
Other algorithms might give the compiler an even better clue if they were written in ANSI C.
Like I said, I think it ultimately just depends on the algorithm in question.

So yeah, I could finish writing a huge essay over how much I wanna be a little bitch about it XD.

I must say though, it's starting to feel pretty fun to explicitly write out the SSE2 for some of this stuff (shuffling, and I think clamping needs it too).
The more I keep having to do that, though, the more I have to lean on the basic #ifdef USE_SSE style of thing, roughly like the sketch below.
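Roughly this kind of split, I mean (the op and function shape here are just illustrative; only the USE_SSE name is the real macro):
Code:
#include <emmintrin.h>

/* Illustrative only: one VU-style op with an SSE2 path and an ANSI C fallback. */
#ifdef USE_SSE
static void vu_and(short *vd, const short *vs, const short *vt)
{
    __m128i xmm_s = _mm_loadu_si128((const __m128i *)vs);
    __m128i xmm_t = _mm_loadu_si128((const __m128i *)vt);
    _mm_storeu_si128((__m128i *)vd, _mm_and_si128(xmm_s, xmm_t));
}
#else
static void vu_and(short *vd, const short *vs, const short *vt)
{
    int i;
    for (i = 0; i < 8; i++)     /* 8 slices of 16 bits per RSP vector */
        vd[i] = vs[i] & vt[i];
}
#endif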

In the RTFM for this baby I'll be sure to advise everyone to re-build the plugin with -mssse3, -msse4, -mavx, or whatever the highest is that their system supports. I could maybe release such builds myself, but that would be awfully deficient, since new upgrades to the GCC compiler will keep getting posted afterwards.

Quote:
Originally Posted by MarathonMan View Post
EDIT: As you probably (certainly) know, many of the RSP multiply functions (VMA**/VMU**/etc.) use 32-bit multiplication.
The phrase keeps losing me, though I'm pretty sure you're referring to 32-bit storage = 16-bit vector slice * 16-bit vector slice.
Remember that, from an accuracy point of view, there is no 32-bit VU math on the RSP.

That could be why the VMUDH output was so damn large compared to yours.

Still, it's nothing compared to how tremendously big it was before I put any SSE in there at all, and that's enough to satisfy me just for now.

Last edited by HatCat; 18th September 2013 at 04:34 AM.
  #502  
Old 18th September 2013, 04:39 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Stupid typos.

Go edit yourselves, fuckers!

And as for why bother optimizing for PCs that won't get past 10-20% of full speed anyway:

The important thing to me is that it's still the fastest emulator.

Worst cases: they download the emulator/plugin and find it won't even work on their system, or they find it's even slower than all the other plugins they could choose from.

Best case: It's still only 20% speed, but it's faster than all the other plugins they have access to.

It's no reason for me to deliberately break the ability to use it on older hardware, just because it's not full speed.
They'll complain that it's slow, but they can't complain that it isn't the best they've got; and they'll pursue that in every way until they finally learn to do it the right way, get some dough, and buy some real hardware, dammit!
  #503  
Old 18th September 2013, 04:42 AM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by HatCat View Post
To raise requirements and add new things like this, I'd say it depends on a few different factors: [...]
Sorry for the late ninja.

You and I just differ on SSE viewpoints. You'd rather support just about everyone, whereas I'd rather force everyone who doesn't meet the minimum requirements I see fit down to ANSI C. No harm in that. Programming is all about trade-offs.

IMO, vectorization is always ignored by the big guns to some extent. Core 2 was a 64-bit processor but only had 128-bit vector registers? Seriously? Does it really take us another 2-3 generations of processors until we can get AVX, which actually provides us with vector registers that are 4x the width of the native register size?

It wasn't until AVX2 that Intel even started giving us basic gather operations. IMO, vectorization still isn't that useful for most applications, sans the embarrassingly parallel, vectorizable simulators and graphics/audio/your-multimedia-task-here operations. It's kind of a shame, in a way, but it very well may be due to hardware limitations.

Don't advise -myourISAhere, by the way... let GCC do it. -march=native was designed specifically for this purpose. You can even do -march=core2 or some other micro-architecture to enable the latest-and-greatest for that uarch.
  #504  
Old 18th September 2013, 04:53 AM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by HatCat View Post
They'll complain that it's slow, but they can't complain that it isn't the best they've got; and they'll pursue that in every way until they finally learn to do it the right way, get some dough, and buy some real hardware, dammit!
Com'on... Com'on now...that SSE2?

Chad Warden want some of that SSSE triple.
  #505  
Old 18th September 2013, 05:00 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

^ That was before SSE4!
"Chad Warden? Wipes his ASS, with SSE2! I'm talkin' bout that: SSSE-TRIPLE!"
After SSE4:
"Come on, man...SSSE3? What kind of poor, bitch-ass, cardboard box are you LIVIN' out of, baby?!"

Yeah, that's true, it's actually supposed to be -march=native, not -msse9001.
The only reason I say -msse2 in my build script instead of -march=native (besides being able to quickly change the "2" to something else) is that it's shorter and keeps the whole command on one line, without going past the 80-characters-per-line tradition I semi-foolishly abide by.

And yeah, it's retarded how freaking late some of these enhancements arrive. I don't know much about AVX (never even heard of AVX2 yet...), but Nintendo is kinda world-famous! They should have taken a more open-source attitude with hardware ISA improvements, and this kind of stuff would have all been done sooner and better.

But it looks like SSE2 became commonplace in the 2001-2003 range, which is about when the N64 officially came to its gaming halt, so I view that as at least a remotely acceptable starting point to state as the requirement for my emulator.
  #506  
Old 18th September 2013, 07:56 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Now that's funny.

Opcodes like VNAND0q have only 2 more instructions than VNAND_v.

In other words, when (e == 0x0) and you don't need to shuffle anything,
the function is only 2 x86/SSE instructions smaller than the one for (e == 0x2).

Code:
	pshufhw	$160, _VR(%ecx), %xmm0	# $160 = 0xA0: high quad becomes w4,w4,w6,w6
	pshuflw	$160, %xmm0, %xmm1	# low quad becomes w0,w0,w2,w2 (the 0q element)
That seriously is all it adds.
I kid you not.


And you wanted to convince me to upgrade to at least SSSE3 so I can use pshufb for a single shuffle instruction and shit?
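In intrinsic form that comparison looks about like this (function names made up; the 0xA0 immediate is the same $160 from the assembly above, assuming element i sits in word lane i like it does there):
Code:
#include <emmintrin.h>      /* SSE2 */
#ifdef __SSSE3__
#include <tmmintrin.h>      /* SSSE3: pshufb */
#endif

/* 0q element: each pair takes its even slice -> (0,0,2,2,4,4,6,6). */
static __m128i shuffle_0q_sse2(__m128i vt)
{
    vt = _mm_shufflehi_epi16(vt, 0xA0);   /* high quad: w4,w4,w6,w6 */
    return _mm_shufflelo_epi16(vt, 0xA0); /* low quad:  w0,w0,w2,w2 */
}

#ifdef __SSSE3__
static __m128i shuffle_0q_ssse3(__m128i vt)
{
    /* Same permutation, one pshufb with a byte-index mask. */
    const __m128i mask = _mm_setr_epi8(
        0, 1, 0, 1, 4, 5, 4, 5, 8, 9, 8, 9, 12, 13, 12, 13);
    return _mm_shuffle_epi8(vt, mask);
}
#endif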

I have a feeling the multiplies are going to be my biggest hurdle with SSE2, and the current clamping loop sucks, too.
  #507  
Old 18th September 2013, 08:04 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Man, with shuffling that damn cheap (only 2 added op-codes), I'm really starting to hate myself for using this split vector-function tree (15 copies of every RSP VU instruction function, one for each legal element encoder integer).

Then again, you did say it's not possible to statically shuffle the target vector in SSE2.
Though from all that horsing around a few pages back, I still hold out partial hope that you could be wrong about it.

The bigger trick is that only 0q-1q and 0h-3h use those exact functions you wrote.
If it's 0w-7w, then it uses some other function to shuffle instead, which makes it somewhat trickier.
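For the record, the whole-register elements are still doable in SSE2, they just take a different pair of instructions. A sketch for 0w (the function name is just illustrative):
Code:
#include <emmintrin.h>

/* Sketch: element 0w broadcasts slice 0 across the whole vector. */
static __m128i shuffle_0w_sse2(__m128i vt)
{
    vt = _mm_shufflelo_epi16(vt, 0x00);  /* low quad -> w0,w0,w0,w0    */
    return _mm_shuffle_epi32(vt, 0x00);  /* copy that dword everywhere */
}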

What an exciting turn of events it will be if and when I can determine how to do the shuffle outside the RSP vector opcodes in the main RSP execute loop under the VU block, so that I can merge these redundant-ass functions back into one and save all that space!
  #508  
Old 18th September 2013, 12:00 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by HatCat View Post
What an exciting turn of events it will be if and when I can determine how to do the shuffle outside the RSP vector opcodes in the main RSP execute loop under the VU block, so that I can merge these redundant-ass functions back into one and save all that space!
You can't. You simply cannot do this without using either SSSE3 or jump tables to direct to the specific kind of shuffle. `pshufXw` takes an imm8 as its operand that instructs it how to shuffle. The only possible way you could inline it is by doing dynamic recompilation and generating the opcode at runtime, along with the desired RSP function, and calling that. Short of that, however, there's no way to control the immediate value of the SSE2 shuffle intrinsic without using some form of indirection in a static binary. This is why I require SSSE3. If this were not the case, I'd happily support SSE2 instead of SSSE3.
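Roughly speaking, the constraint looks like this (names and table layout are illustrative only, not CEN64's actual code):
Code:
#include <emmintrin.h>

typedef __m128i (*shuf_fn)(__m128i);

/* SSE2: every element specifier needs its own function, because the
 * shuffle immediates below have to be compile-time constants... */
static __m128i shuf_vector(__m128i vt) { return vt; }
static __m128i shuf_0q(__m128i vt)
{
    vt = _mm_shufflehi_epi16(vt, 0xA0);
    return _mm_shufflelo_epi16(vt, 0xA0);
}
/* ...so picking the element at runtime has to go through indirection: */
static const shuf_fn shuf_table[16] = {
    shuf_vector, shuf_vector, shuf_0q /* ...the other 13 specifiers... */
};

#ifdef __SSSE3__
#include <tmmintrin.h>
/* SSSE3: pshufb reads its control mask from a register, so one function
 * can cover all 16 specifiers by indexing a table of masks instead. */
static __m128i shuf_any(__m128i vt, const __m128i masks[16], unsigned e)
{
    return _mm_shuffle_epi8(vt, masks[e]);
}
#endif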

Last edited by MarathonMan; 18th September 2013 at 12:04 PM.
  #509  
Old 18th September 2013, 06:35 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Hummmz.


Seems legit.
Oh well, at least I've ruled the possibility out of my mind entirely, rather than remaining confused about how their function works.

So what I'll do is just have a macro alternative version of the VU jump table that does a one-dimensional jump to all the _v functions after a slower dynamic shuffle, which somebody with SSSE3 hardware can easily rewrite in the future.

Otherwise it's not too bad having split functions: it avoids doing any shuffling at all if it's one of those typical pure-vector operands.
  #510  
Old 19th September 2013, 04:00 AM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,236
Default

Quote:
Originally Posted by MarathonMan View Post
Not trying to trollolol, just sayin' vectorization is a world of hurt sometimes when relying on the compiler.
Actually, it looks like the only vectorization that got hurt in GCC 4.8.2 (it was working fine in 4.7.2) was the basic accumulator write-back.

VSAW-middle:
Code:
static void VSAWM(void)
{
    const int vd = inst.R.sa;

    memcpy(VR[vd], VACC_M, N*sizeof(short));
    return;
}
Code:
_VSAWM:
LFB1159:
	.cfi_startproc
	movzwl	_inst, %eax
	movl	_VACC+16, %ecx
	shrw	$6, %ax
	andl	$31, %eax
	sall	$4, %eax
	leal	_VR(%eax), %edx
	movl	%ecx, _VR(%eax)
	movl	_VACC+20, %eax
	movl	%eax, 4(%edx)
	movl	_VACC+24, %eax
	movl	%eax, 8(%edx)
	movl	_VACC+28, %eax
	movl	%eax, 12(%edx)
	ret
	.cfi_endproc
Even Microsoft Visual Studio is intelligent enough to vectorize this simple memcpy as SSE2.

So it should hardly be a "world of hurt" for the latest GCC to do it, just some temporary bug that hopefully goes away in later versions.

I don't know why 4.8.2 has the bug.
I've tried everything to get it to MOVDQA the ACC_M over to VR[vd], and the only solution that works, besides downgrading to 4.7.2, is changing the definition of the accumulator array from (short) to (unsigned short), for some stupid-ass reason.
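For what it's worth, the write-back can also just be spelled out by hand so the compiler has nothing left to guess about. A sketch reusing the names from the snippet above, and assuming VR and VACC_M stay 16-byte aligned (otherwise swap in the unaligned load/store):
Code:
#include <emmintrin.h>

static void VSAWM(void)
{
    const int vd = inst.R.sa;

    /* MOVDQA of VACC_M into VR[vd], regardless of what GCC feels like doing. */
    _mm_store_si128(
        (__m128i *)VR[vd],
        _mm_load_si128((const __m128i *)VACC_M));
}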