#11  
Old 25th January 2013, 06:18 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
Yes MarathonMan thanks for letting me know; I have been following your work.

I have never looked into a cycle-accurate emulator before, so I'm sure I will be analyzing what you have done for help as well to give me some clues.
It would also be nice to start programming small RSP ucodes for testing on a 64Drive / NeoMyth, even if I have to send it to someone else who has one of those.

And I'm sure I probably don't need to explain this to you, by the way, since you now have access to the OS documentation too, but my cycle count analysis was based on the optimization technique of interleaving SU and VU instructions side by side.

If for one SU instruction there is a VU instruction in the delay slot, both instructions are executed in a single cycle (hence the purpose of VNOP instead of just using NOP all the time, I believe), and vice versa.

But I have not extracted much information besides that as of yet, so with the opportunity for reversing I'm sure I'll pick up from your progress as well.
Yeah, this is my first time crafting a cycle-accurate simulator from scratch as well.

The issue logic is actually more complex than that, but that's something along the lines of what it does (I'm assuming you were generalizing?). I appreciate those statistics, by the way; I am using SSE2/3 intrinsics for the heavy-called opcodes to wiggle out extra performance.

To those who 'complain' about cycle-accuracy: I come from a very academic background. Zilmar and others have nearly perfected emulation already; I'm doing this because I enjoy it. I don't preach cycle accuracy, but I do believe it is one means of solving the few, minor bugs that still exist in emulators today.
  #12  
Old 25th January 2013, 09:43 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Yeah, I was generalizing. I have paid more attention to other things than the cycles and issue logic, since my experience and interest lie more in those other areas, but that is very important.

The original RSP assembler source, by the way, permits $cn (system control CP0 registers) for 0 <= n <= 31, implying the assembler allows MFC0 and MTC0 on 32 CP0 registers, yet the programmer reference and the simulator source code constrain support to only the standard 16 CP0 registers we all know of. That's one of the oddities I haven't had a chance to figure out yet.



I read your comment the day before about the lack of attention to SP memory byte order in existing RSP source.
As I remember, zilmar/Jabo originally questioned how the RSP plugin specs should be integrated, but nobody had the initiative to correct the endianness of e.g. DMEM reads and writes in their emulator, since it would not work with every RSP plugin. The system had its limitations. The Windows plugin specs give so much control to the emulator that we cannot control the byte order that the Project64 Win32 EXE sends to my RSP module, though we can still swap the SP memory byte order on reads and writes within DMA transfers.

In the meantime, since statistics show that over 90% of the time the address for HW, W, DW etc. transfers is aligned, we can do a single access in the common case and avoid split, non-parallel transfers.

As an example, LW:
Code:
void LW(int rs, int rt, short imm)
{
    register unsigned int addr;

    if (rt == 0)
    {
        message("LW\nTried to overwrite $zero.", 1);
        return;
    }
    addr  = SR[rs] + (signed short)imm;
    addr &= 0xFFF;
    if (addr & 0x003) /* less than 10% of the time this will be a concern */
    {
        register int word;

        word   = RSP.DMEM[addr ^ 03];
        word <<= 8;
        ++addr;
        addr &= 0x00000FFF;
        word  |= RSP.DMEM[addr ^ 03];
        word <<= 8;
        ++addr;
        addr &= 0x00000FFF;
        word  |= RSP.DMEM[addr ^ 03];
        word <<= 8;
        ++addr;
        addr &= 0x00000FFF;
        word  |= RSP.DMEM[addr ^ 03];
        SR[rt] = word;
        return;
    }
    SR[rt] = *(int *)(RSP.DMEM + addr);
    return;
}
This was never implemented by any other zilmar plugin system RSP author that I can see, but it sidesteps the endianness concern in most cases.
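To see why the `addr ^ 03` indexing recovers big-endian order, a standalone sketch (illustrative only, not the actual plugin source; it assumes a little-endian host with DMEM words stored in host order, and the helper name is mine):

```c
#include <stdint.h>

/* Illustrative sketch (not plugin source): if DMEM stores each 32-bit
 * word in little-endian host order, XORing a byte index with 3 maps it
 * back to big-endian byte order within that word. Assumes a
 * little-endian host. */
static uint8_t read_be_byte(const uint8_t *dmem, uint32_t addr)
{
    return dmem[addr ^ 3];  /* flip the index within the 4-byte word */
}
```

Reading indices 0..3 of a word stored as 0x11223344 then yields 11 22 33 44 in order, matching the N64's big-endian view.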

Quote:
Originally Posted by MarathonMan
I appreciate those statistics, by the way; I am using SSE2/3 intrinsics for the heavy-called opcodes to wiggle out extra performance.
The interesting thing I find is that in most cases (but not all; the usual exceptions are the vector computational divides, and VNOP of course), when I have done all the loop expansion and inline rewrites myself, SSE2 or SSE1 opcodes were never used by Microsoft's CL.EXE in emulating those ops (and I have no option for SSE3). With GNU GCC or MinGW, however, I get a so-so speed boost over jump tables, and possibly from SSE3 as well, significant enough to keep reminding myself to ditch the Microsoft suite and get back to developing on Linux sometime. =P

Last edited by HatCat; 25th January 2013 at 10:07 PM. Reason: forgot a space :D
  #13  
Old 25th January 2013, 09:51 PM
mudlord_ mudlord_ is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Dec 2012
Posts: 383
Default

Quote:
Originally Posted by FatCat View Post
You already had your chat with Hacktarux about cycle accuracy, back when you used to argue against it on logical grounds, but nowadays it seems the name itself is hateful to you.
that is true.
  #14  
Old 25th January 2013, 09:55 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Well I don't know very much about byuu.

I have probably seen his forums once or twice.

I know that ignorance is one of the ways to annoy me, but I do genuinely care about challenging the balance of accuracy and speed. I don't want to inconvenience everyone by focusing on only one or the other, as that would detract from both the learning experience and the directness and algorithmic beauty of the solution.
  #15  
Old 25th January 2013, 10:25 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
Yeah, I was generalizing. [...]
Well, at least we won't be stepping on each other. I'm taking the byuu accuracy-at-any-cost approach while trying to squeeze out any performance I can. I've done everything I know, and so far the RSP seems like it's going to hold its own; time will tell...

Anyways, interesting tidbit on the plugin spec and endianness. Sucks that you have to emit a mask + conditional branch to skirt around the issue, but clever. Haven't seen anyone else do that.

My two cents on the 'LW' operation:
1) Use branch weights. I don't know if MSVC supports them, but they gave me a nice little speed boost for things I was doing every cycle (especially in the instruction fetch stage, IIRC).
2) Don't do the byteswap by hand! If you use builtins, the compiler will emit a bswap on x86 instead of a lot of rotates and masks. Again, maybe MSVC is smart enough to figure it out, but IIRC gcc was not.
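For instance, a byteswap written in the canonical shift-and-mask form that the compiler can recognize (the helper name is illustrative, not from any plugin source; on MSVC you'd call the _byteswap_ulong intrinsic instead):

```c
#include <stdint.h>

/* 32-bit byteswap in the canonical shift/mask form; GCC and Clang
 * pattern-match this into a single x86 bswap instruction at -O2.
 * The name swap32 is illustrative, not from any plugin source. */
static uint32_t swap32(uint32_t x)
{
    return  (x >> 24)
         | ((x >>  8) & 0x0000FF00u)
         | ((x <<  8) & 0x00FF0000u)
         |  (x << 24);
}
```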

And lastly, on your last comment: compilers are horrible at vectorizing things. I used the Intel intrinsic functions to point the compiler in the right direction, i.e. ...

Code:
/* ==========================================================================
 *  Loads a big-endian 128-bit vector from DMEM.
 * ======================================================================= */
static inline void ShuffleVector(const uint16_t* src, uint16_t *dest) {
#ifdef USE_SSE
  __m128i mask, vector, temp;

  static const uint8_t swapmask[] = {
    0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08,
    0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00
  };

  mask = _mm_load_si128((__m128i*) swapmask);
  vector = _mm_load_si128((__m128i*) src);
  temp = _mm_shuffle_epi8(vector, mask);
  _mm_store_si128((__m128i*) dest, temp);

#else
#ifdef LITTLE_ENDIAN
  dest[0] = ByteSwap16(src[7]);
  dest[1] = ByteSwap16(src[6]);
  dest[2] = ByteSwap16(src[5]);
  dest[3] = ByteSwap16(src[4]);
  dest[4] = ByteSwap16(src[3]);
  dest[5] = ByteSwap16(src[2]);
  dest[6] = ByteSwap16(src[1]);
  dest[7] = ByteSwap16(src[0]);
#else
  memcpy(dest, src, 8 * sizeof(uint16_t));
#endif
#endif
}
I take it you're cxd4 on github?
  #16  
Old 25th January 2013, 10:53 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Quote:
Originally Posted by MarathonMan View Post
Well, at least we won't be stepping on each other. I'm taking the byuu accuracy-at-any-cost approach while trying to squeeze out any performance I can. I've done everything I know, and so far the RSP seems like it's going to hold its own; time will tell...

Anyways, interesting tidbit on the plugin spec and endianness. Sucks that you have to emit a mask + conditional branch to skirt around the issue, but clever. Haven't seen anyone else do that.

My two cents on the 'LW' operation:
1) Use branch weights. I don't know if MSVC supports them, but they gave me a nice little speed boost for things I was doing every cycle (especially in the instruction fetch stage, IIRC).
2) Don't do the byteswap by hand! If you use builtins, the compiler will emit a bswap on x86 instead of a lot of rotates and masks. Again, maybe MSVC is smart enough to figure it out, but IIRC gcc was not.
Although I don't know what a branch weight is (literally?), I do find a lot of the time that when I'm trying to even out the cycle times in the emulator, MSVC at least half of the time does not arrange the branch the way I'm hoping.

For example, in that exemplary LW I posted, I know that the larger, more complex algorithm is inside the if(), rather than outside. I did that for cleanliness/organization, hoping that the main algorithm (the address being 32-bit aligned) would remain the clear focus.
In reality, though, I believe the branch should jump *out* of the normal code path when the address is unaligned, so that the common aligned case falls straight through to the instant 32-bit read and only the rare unaligned case pays for a taken branch.
Not sure if that is what branch weights are about, though that has been one of my goals.

Basically, in the absence of hacks favoring what complex N64 ROMs statistically do "most often", I try to even out the execution times for more predictable results.

And no you're right, MS suite does not figure it out for me.
I know that my straight-C byteswap is not optimal compared to dedicated byte-exchange operations, but I don't know how to access them; I am not a very good API programmer, or a very wise programmer in any language for that matter; I'm more talented at algorithms and direct mathematics. If there is a way to better signal the exchange and other advanced bit-manipulation operations, though, I would like to look into that.

Quote:
Originally Posted by MarathonMan View Post
And lastly, on your last comment: compilers are horrible at vectorizing things. I used the Intel intrinsic functions to point the compiler in the right direction, i.e. ...

[...]
I take it you're cxd4 on github?
I made my GitHub account as a new learning experience. I found out about GitHub through a chess website's project manager, though I don't know very many high-level languages.

I haven't publicly linked anyone to it yet, just because, well, I haven't gotten around to it. I am somewhat shy about publicizing all my work and have no plans for publications and such; I mostly just like to talk about progress. Not linking publicly to the account doesn't change my continuum of open-sourcing everything I have done so far.

Code:
  mask = _mm_load_si128((__m128i*) swapmask);
  vector = _mm_load_si128((__m128i*) src);
  temp = _mm_shuffle_epi8(vector, mask);
  _mm_store_si128((__m128i*) dest, temp);
That kind of thing I know nothing about. I take it those are some of the "Intel intrinsic functions" you were referring to. I was always hoping for a way to do 128-bit transfers in C, though I assumed it was impossible given the environment architecture. I have a tendency to default to plain, API-free straight C, but I would love to look into this more if it means being able to manage 128-bit segments.
  #17  
Old 25th January 2013, 11:14 PM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
Although I don't know what a branch weight is (literally?), I do find a lot of the time that when I'm trying to even out the cycle times in the emulator, MSVC at least half of the time does not arrange the branch the way I'm hoping.

[...]
The branch weights are exactly what you're looking for. They basically serve as a hint to the compiler that something is unlikely. A good compiler will emit an instruction that hints to the processor: "hey, speculate that this branch isn't going to be taken!" Unfortunately, it seems that MSVC lacks these hints.
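On GCC/Clang this hint is exposed as __builtin_expect; a minimal sketch (the macro and function names are mine, purely illustrative):

```c
#include <stdint.h>

/* Sketch of a branch weight under GCC/Clang: __builtin_expect hints that
 * the condition is usually false, so the compiler lays out the aligned
 * case as the straight-line fall-through. Macro/function names are
 * illustrative; MSVC has no direct equivalent of this hint. */
#define unlikely(x) __builtin_expect(!!(x), 0)

static int takes_slow_path(uint32_t addr)
{
    if (unlikely(addr & 3))
        return 1;  /* cold: unaligned access */
    return 0;      /* hot: aligned access, falls through */
}
```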

Yes, those were the intrinsic functions. You should really look into using them. My VAND instruction is basically a 16-byte aligned load, one x86 vector operation, and a 16-byte aligned store. Same with all the other bitwise operators.

As for 128-bit transfers, just use memcpy(...). The compiler will emit very efficient code when you give it a constant size to copy (possibly using those SSE* instructions).
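For example (the vr128 type and names here are illustrative, not from either of our sources):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the constant-size memcpy point: with a compile-time length
 * of 16 bytes, optimizing compilers inline this copy as one or two wide
 * (often SSE) moves rather than calling into the C runtime. The vr128
 * type is illustrative. */
typedef struct { uint16_t e[8]; } vr128;

static void copy_vr(vr128 *dst, const vr128 *src)
{
    memcpy(dst, src, sizeof(*dst));  /* sizeof(*dst) == 16, a constant */
}
```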

EDIT: I had links to everything, but I don't have enough posts on the forums to be able to use links in my responses yet :$.

EDIT 2: For the byteswap functions, Google:
Code:
_byteswap_uint64
_byteswap_ulong
_byteswap_ushort
they are self-explanatory.

Last edited by MarathonMan; 25th January 2013 at 11:28 PM.
  #18  
Old 25th January 2013, 11:28 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

Meh, forgot about trying memcpy() sometimes.

So far my Windows DLL is 100% CRT-free, except that I used strcpy to copy the plugin name, and as you would expect the compiler optimizes that with SSE/SSE2 instead of creating a dependency on the Microsoft C runtime.

But memcpy is another one of those calls that gets inline-optimized (at least, any compiler that honors the "inline" keyword should support it); I guess I'll get to using that more.

Oh yeah, and since you mentioned V[N]AND and the other logical ops: that was one of the opcodes where I decided to violate accuracy for a moment when element == 0:

Code:
void VAND(int vd, int vs, int vt, int element)
{
    switch (element)
    {
        case 0x0: /* none: { 00, 01, 02, 03, 04, 05, 06, 07 } */
        case 0x1:
            VR[vd].d[00] = VR[vs].d[00] & VR[vt].d[00];
            VR[vd].d[01] = VR[vs].d[01] & VR[vt].d[01];
            VACC[00].w[01] = VR[vd].s[00];
            VACC[01].w[01] = VR[vd].s[01];
            VACC[02].w[01] = VR[vd].s[02];
            VACC[03].w[01] = VR[vd].s[03];
            VACC[04].w[01] = VR[vd].s[04];
            VACC[05].w[01] = VR[vd].s[05];
            VACC[06].w[01] = VR[vd].s[06];
            VACC[07].w[01] = VR[vd].s[07];
            return;
...by doing two 64-bit AND writes, and then loading the accumulator from the result. From the accuracy standpoint, though, we should always manage the accumulator first and then update the destination vector register afterwards. (I prefer to do this by sourcing from the accumulator itself, so that we don't risk premature element overwrites from updating elements mid-way.)

What you are describing, though, sounds much faster. The best way to emulate 8 parallel data transactions is to use one transaction, i.e. an x86 vector instruction? If I'm understanding what you wrote correctly, yes, I would love to look into that.

Quote:
Originally Posted by MarathonMan View Post
EDIT: I had links to everything, but I don't have enough posts on the forums to be able to use links in my responses yet :$.
haha, vBulletin is an overpriced forum coded in PHP.

One way you can get around that no-links script is to uncheck "Automatically parse links in text" under "Miscellaneous Options" in the posting interface.

(or use BB tags in between to make the URLs, or HTML-escape entities)

[EDIT] Got your edit. In the absence of anything else, I'll do some looking up, copy this information off to my flash drive, and take it home to my internet-free PC.
  #19  
Old 25th January 2013, 11:50 PM
HatCat's Avatar
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,256
Default

So, thanks to that info about the stdlib byteswap intrinsics,

I think I can do better, like this:

Code:
#include <stdlib.h>

void LW(int rs, int rt, short imm)
{
    register unsigned int addr;

    if (rt == 0)
    {
        message("LW\nTried to overwrite $zero.", 1);
        return;
    }
    addr  = SR[rs] + (signed short)imm;
    addr &= 0xFFF;
    SR[rt] = *(int *)(RSP.DMEM + addr);
    if ((addr & 0x003) == 0x000) return;
    SR[rt] = _byteswap_ulong(SR[rt]);
    return;
}
Something like that.

Not just faster; a bit easier on the eyes. =D
  #20  
Old 26th January 2013, 12:00 AM
MarathonMan's Avatar
MarathonMan MarathonMan is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jan 2013
Posts: 454
Default

Quote:
Originally Posted by FatCat View Post
Meh, forgot about trying memcpy() sometimes.

[...]
Yeah... look at SSE ops. Something along these lines (a sketch of the idea):

Code:
/* ==========================================================================
 *  Bitwise AND of two vectors.
 * ======================================================================= */
static inline void AndVector(const uint16_t *src1, const uint16_t *src2, uint16_t *dest) {
#ifdef USE_SSE
  __m128i vsrc1, vsrc2, vdest;

  vsrc1 = _mm_load_si128((__m128i*) src1);
  vsrc2 = _mm_load_si128((__m128i*) src2);
  vdest = _mm_and_si128(vsrc1, vsrc2);  /* integer AND (not _mm_and_ps) */
  _mm_store_si128((__m128i*) dest, vdest);
#endif
}
Thanks for the 'auto parse links in text' tip!