  #1  
Old 6th February 2013, 04:59 AM
HatCat HatCat is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Feb 2007
Location: In my hat.
Posts: 16,260
(Test Sample) Vector Multiply Fraction

By far the most-executed RSP instruction (under any sub-op-code matrix) for audio tasks is VMULF, which makes it a rudimentary baseline for the basic template I am still working on getting the other op tables to comply with:

Code:
void VMULF(int vd, int vs, int vt, int element)
{
    register int i;

    if (element == 0x0) /* if (element >> 1 == 00) */
    {
        for (i = 0; i < 8; i++)
        {
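            /* widened: -32768 * -32768 doubled is 0x80000000, which overflows a 32-bit int */
            long long product = (long long)VR[vs].s[i] * VR[vt].s[i];

            product *= 2; /* shift of partial product (left by 1) */
            VACC[i].q = product + 0x8000; /* fraction rounding */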
        }
    }
    else if ((element & 0xE) == 02) /* scalar quarter */
    {
        /* removed for shortness, see (element >> 1 == 0x0) for basic alg. */
    }
    else if ((element & 0xC) == 04) /* scalar half */
    {
        /* [...] */
    }
    else /* (element & 0x8) == 0x8: scalar whole */
    {
        /* [...] */
    }

    for (i = 0; i < 8; i++)
    { /* Sign-clamp bits 31..16 of ACC file to destination VR file. */
        if (VACC[i].q & 0x800000000000) /* acc < 0 */
        {
            if (~VACC[i].q & ~0x00007FFFFFFF) /* short underflow */
                VR[vd].s[i] = 0x8000;
            else
                VR[vd].s[i] = (short)(VACC[i].q >> 16);
        }
        else
        {
            if (VACC[i].q & ~0x00007FFFFFFF) /* short overflow */
                VR[vd].s[i] = 0x7FFF;
            else
                VR[vd].s[i] = (short)(VACC[i].q >> 16);
        }
    }

    for (i = 0; i < 8; i++) /* 48 bits left by 16 to use high DW sign bit */
        VACC[i].q <<= 16;
    /* disabled for now: reverse zilmar's VACC sign-extension hack
    for (i = 0; i < 8; i++)
        VACC[i].q >>= 16;
    */
    return;
}
At the moment I'm in the process of rewriting all of the VU ops (currently the multiplies; the adds are all finished) with three goals in mind: smaller function blocks (no switch on element with 16 case values, just a natural if-else chain over the 4 basic element codes); updated accuracy (before, these were my revisions of the MAME source; now they are complete rewrites based on the standard, suggested algorithm for each vector op in the Ultra64 documentation); and code that may look slower on paper but is shorter and more likely to let the compiler unroll each categorized loop, so it should incidentally end up faster and more accurate, not slower!

Using VMULF as an example interpreter, the basic emulation structure for each VU op (multiply or not) breaks down into four steps:
  1. Use an if-else chain to solve for the element encoding type. element is either == 0, between 2 and 3 (quarter intervals), between 4 and 7 (half overlays), or greater than 7 (single-element "broadcast mode" as defined in other, public domain VU manuals by non-SGI vendors).
  2. Do a `for (i = 0; i < N_elements_in_SIMD; i += 1)` loop over each source element, with the result loaded first into the vector accumulator file of acc. elements. If it is a vector multiply instruction `(opcode < 16)`, decode the opcode bits to determine whether to round and mov (multiply fraction), or += the accumulator (multiply-accumulate, VMAC*) with no rounding.
  3. You typically need to find whether saturated arithmetic (where applied) is conducted on bits 0..47, 16..47, or 32..47 of each accumulator element, in order to signed- or unsigned-"clamp" the final result over to the destination VR file. In some operations, the outputs written to both register files may be safely assumed equal.
  4. Last, update the accumulator elements file. You either need to store the 48-bit accumulators as the LO subset of a 64-bit register in C, or use zilmar's technique: shift bits 0..47 left by 16 over to 16..63 and let the C register initializations handle sign-extension for you. The former method is more accurate (not to mention faster, I believe).
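
Steps 1 and 3 above can be sketched roughly as follows. This is only an illustration, not the code from the rewrite: `vt_lane` and `clamp_acc_mid` are hypothetical helper names, the lane mapping for the quarter/half/whole modes is an assumption based on commonly circulated RSP notes, and the clamp assumes the accumulator is kept as a sign-extended 48-bit value inside an `int64_t`.

```c
#include <stdint.h>

/* Hypothetical helper (not from the post): which lane of VT feeds
 * destination lane i (0..7) for a given 4-bit element code.
 * The mapping itself is an assumption, see the note above. */
static int vt_lane(int element, int i)
{
    if ((element >> 1) == 0)          /* 0..1: plain vector, lane i */
        return i;
    if ((element & 0xE) == 0x2)       /* 2..3: scalar quarter */
        return (i & ~1) | (element & 1);
    if ((element & 0xC) == 0x4)       /* 4..7: scalar half */
        return (i & ~3) | (element & 3);
    return element & 7;               /* 8..15: scalar whole (broadcast) */
}

/* Hypothetical helper: signed clamp of accumulator bits 47..16 down to
 * a 16-bit short. acc is assumed to be a sign-extended 48-bit value. */
static int16_t clamp_acc_mid(int64_t acc)
{
    int64_t hi = acc >> 16;           /* arithmetic right shift assumed */

    if (hi < -32768)
        return -32768;                /* short underflow */
    if (hi > 32767)
        return 32767;                 /* short overflow */
    return (int16_t)hi;
}
```

With helpers like these, the four element branches collapse into one loop body that reads `VR[vt].s[vt_lane(element, i)]`. Whether that is faster than four separately unrolled loops is something to measure rather than assume.
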
When I finish templating all the other VU ops to follow the behavior in step 4, I will have a much easier time taking out the accumulator hack that shifts them all left by 16 without breaking a crap load of things.
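
For the first storage option in step 4 (48 bits kept as the low part of a 64-bit value), a minimal sketch could look like this. `sext48` and `acc_add` are hypothetical names introduced only for illustration; the only assumption is that each accumulator element lives in an `int64_t`.

```c
#include <stdint.h>

/* Hypothetical helper: keep bits 0..47 and sign-extend from bit 47,
 * so the 48-bit accumulator sits in the low part of an int64_t. */
static int64_t sext48(int64_t acc)
{
    const uint64_t sign = 0x800000000000ULL;               /* bit 47 */
    uint64_t low48 = (uint64_t)acc & 0xFFFFFFFFFFFFULL;    /* drop bits 48..63 */

    /* Flip bit 47, then subtract it again: both operands fit in an
     * int64_t, so the subtraction is well defined. */
    return (int64_t)(low48 ^ sign) - (int64_t)sign;
}

/* Hypothetical accumulate step in the VMAC* style: add a 32-bit
 * product and wrap the sum back to 48 bits. */
static int64_t acc_add(int64_t acc, int32_t product)
{
    return sext48(acc + (int64_t)product);
}
```

The shifted layout gets the same wrap-around for free from 64-bit overflow, but every reader of the accumulator then has to remember to shift back down by 16.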

Last edited by HatCat; 6th February 2013 at 05:06 AM.
  #2  
Old 6th February 2013, 03:59 PM
HatCat HatCat is offline

None of that is really copyrighted material, btw.

The op-codes for vector multiplies (VMUL* and VMAC*) are public domain information discussed in non-SGI vector unit manuals and patents. It is traditional to use the basic operation schematic discussed above.

In many other vector systems, the * in VMUL* or VMAC* is the "condition" sub-op-code ("F" meaning "fraction" or "false", for example).

What seems to be unique to SGI are VMUD* and VMAD*. In particular, VMAD* is totally undiscussed in other vector unit references, except for mentions of "multiply-add", which is inaccurate (we use that term under "accumulation"). "VMUDz" is usually described as "multiply double", which is slightly accurate, but in this case it is the multiplication that is double-precision, not the operand quantities.
  #3  
Old 7th February 2013, 05:00 PM
HatCat HatCat is offline

God the appendix is so full of bugs.

It keeps saying things like "clamp the least-significant accumulator element" while defining clamp masks for 32/48 of the accumulator bits (making it impossible to clamp accurately). It just keeps finding ways to contradict itself. It's incredible how disorganized it is....

One example of that is VMUDL: since we have an unsigned 32-bit product shifted right by one 16-bit halfword, clamping by element is applied in a situation where there is absolutely no chance it can affect the arithmetic result, so emulating that phase is wasteful.
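
The VMUDL point can be checked with a little arithmetic. The sketch below uses a hypothetical `vmudl_lane` helper that models only the per-lane core as described here (unsigned 16x16 product, shifted right by 16); it is not the full instruction.

```c
#include <stdint.h>

/* Hypothetical per-lane core of VMUDL as described in the post:
 * unsigned 16x16 -> 32-bit product, shifted right by one halfword. */
static uint32_t vmudl_lane(uint16_t vs, uint16_t vt)
{
    uint32_t product = (uint32_t)vs * (uint32_t)vt;

    return product >> 16;
}
```

The largest possible input pair, 0xFFFF * 0xFFFF, gives 0xFFFE0001, which shifts down to 0xFFFE. The result always fits in 16 bits, so the upper accumulator bits stay zero and a clamp has nothing to catch.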

And if you use 32-bit clamp masks for the accumulator, then why detect clamping by comparing less-than zero (negative) when you only sign-extend a 16-bit short by another 16 bits (this is described in the appendix, but not in the tests for the standard simulator)? If the accumulator is 48 bits, then it always skips that condition blissfully!

This thing is full of shit, but I'll try to adhere to it as much as possible regardless for readability and accuracy.