Saturday, 20 December 2014

Introduction

This article shows how to copy memory very quickly and how to call assembly code from C# and .NET. I use this technique in an application that creates video from images, video and sound.
If you have an assembly routine that you need to call from C#, the article also shows a quick and simple way to do it.

Background

To follow along, it helps to know assembly language, memory alignment, and some advanced C#, Windows and .NET techniques.
To copy data really fast, the memory addresses need to be 16-byte aligned; otherwise the fast method runs at almost the same speed as the standard one (in my case about 1.02 times faster).

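The code uses SSE instructions, which are supported by Intel processors from the Pentium III onward (KNI/MMX2) and by AMD processors from the Athlon XP onward (the original Athlon's extended MMX, AMD EMMX, covers prefetchnta and sfence but not movaps, movups or movntps).
I have tested it on my Pentium Dual-Core E5800 3.2GHz with 4GB RAM in dual-channel mode.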
For me, the fast copy method is 1.5 times faster than the standard one when the memory is 16-byte aligned, and
almost the same (1.02 times faster) with non-aligned memory addresses.
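As a minimal sketch of what "16-byte aligned" means in practice: an address is aligned when its low four bits are zero. The class and method names here (`AlignmentSketch`, `IsAligned16`, `AlignUp16`) are my own illustration and are not part of the article's code.

```csharp
using System;

public static class AlignmentSketch
{
    // True when the low four bits of the address are zero, i.e. address % 16 == 0.
    public static bool IsAligned16(IntPtr address)
    {
        return (address.ToInt64() & 0xF) == 0;
    }

    // Rounds an address up to the next multiple of 16 (unchanged if already aligned).
    public static IntPtr AlignUp16(IntPtr address)
    {
        long value = address.ToInt64();
        return new IntPtr((value + 15) & ~15L);
    }
}
```

A check like `AlignmentSketch.IsAligned16(ptr)` before benchmarking would tell you which of the two speed figures above to expect for a given buffer.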

Using the code

This is a complete performance test that shows the measurements and how to use everything.
The FastMemCopy class contains all of the fast memory copy logic.
First, create a default Windows Forms application project and put two buttons and a PictureBox control on the form, since we will test with images.
Let's declare some fields:
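The field listing itself is missing from this copy of the article. Judging only from how the names are used in the methods that follow, the declarations would presumably look something like this; the types are inferred from usage and `Form1` is an assumed name for the default form class.

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.Windows.Forms;

public partial class Form1 : Form
{
    // Path of the image chosen in the OpenFileDialog (assigned from ofd.FileName).
    string bitmapPath;

    // Source image and the destination image that receives the copied pixels.
    Bitmap bmp, bmp2;

    // Lock information returned by LockBits for each image (gives access to Scan0).
    BitmapData bmpd, bmpd2;

    // Managed byte array that backs the pixels of bmp2.
    byte[] buffer;
}
```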
Now we will create two methods to handle the click events of our buttons.
For the standard method:
private void btnStandard_Click(object sender, EventArgs e)
{
    using (OpenFileDialog ofd = new OpenFileDialog())
    {
        if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)
            return;

        bitmapPath = ofd.FileName;
    }

    //open the selected image and create an empty image of the same size
    OpenImage();

    //make the pixel data of both images accessible for reading and writing
    //(note: UnlockBitmap calls LockBits)
    UnlockBitmap();

    //copy data from one image to the other with the standard method
    CopyImage();

    //release the pixel data so the images can be displayed
    //(note: LockBitmap calls UnlockBits)
    LockBitmap();

    //let's see what we have
    pictureBox1.Image = bmp2;
}
And for the fast method:
private void btnFast_Click(object sender, EventArgs e)
{
    using (OpenFileDialog ofd = new OpenFileDialog())
    {
        if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)
            return;

        bitmapPath = ofd.FileName;
    }

    //open the selected image and create an empty image of the same size
    OpenImage();

    //make the pixel data of both images accessible for reading and writing
    //(note: UnlockBitmap calls LockBits)
    UnlockBitmap();

    //copy data from one image to the other with our fast method
    FastCopyImage();

    //release the pixel data so the images can be displayed
    //(note: LockBitmap calls UnlockBits)
    LockBitmap();

    //let's see what we have
    pictureBox1.Image = bmp2;
}
OK, now we have the buttons and event handlers, so let's implement the methods that open the images, lock and unlock them, and perform the standard copy.
Open an image:
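//keeps the buffer pinned so the garbage collector cannot move it while bmp2 uses its address
GCHandle bufferHandle;

void OpenImage()
{
    pictureBox1.Image = null;
    if (bmp != null)
    {
        bmp.Dispose();
        bmp = null;
    }
    if (bmp2 != null)
    {
        bmp2.Dispose();
        bmp2 = null;
    }
    //release the previous buffer only after bmp2, which points into it, is disposed
    if (bufferHandle.IsAllocated)
        bufferHandle.Free();
    buffer = null;
    GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);

    bmp = (Bitmap)Bitmap.FromFile(bitmapPath);

    buffer = new byte[bmp.Width * 4 * bmp.Height];
    //UnsafeAddrOfPinnedArrayElement does not pin the array by itself, so pin it explicitly
    bufferHandle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
    bmp2 = new Bitmap(bmp.Width, bmp.Height, bmp.Width * 4, PixelFormat.Format32bppArgb,
        bufferHandle.AddrOfPinnedObject());
}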
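Lock and unlock the bitmaps:
//despite its name, this method calls LockBits: it locks the pixel data in memory
//so that it can be read and written directly through Scan0
void UnlockBitmap()
{
    bmpd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,
        PixelFormat.Format32bppArgb);
    bmpd2 = bmp2.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,
        PixelFormat.Format32bppArgb);
}

//despite its name, this method calls UnlockBits: it releases the pixel data
//so that the images can be drawn again
void LockBitmap()
{
    bmp.UnlockBits(bmpd);
    bmp2.UnlockBits(bmpd2);
}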
And copy the data from one image to the other, showing the measured time:
void CopyImage()
{
 //start stopwatch
 Stopwatch sw = new Stopwatch();
 sw.Start();

 //copy the data 10 times
 for (int i = 0; i < 10; i++)
 {
  System.Runtime.InteropServices.Marshal.Copy(bmpd.Scan0, buffer, 0, buffer.Length);
 }

 //stop stopwatch
 sw.Stop();

 //show measured time
 MessageBox.Show(sw.ElapsedTicks.ToString());
}
That's it for the standard copy method. There is nothing complex here; we use the well-known System.Runtime.InteropServices.Marshal.Copy method.
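Raw ElapsedTicks values only make sense relative to Stopwatch.Frequency, and the first call includes JIT and page-fault costs. As a hedged sketch of a slightly more robust measurement (the `CopyBenchmarkSketch` class and `MeasureMilliseconds` helper are hypothetical, and Buffer.BlockCopy over plain arrays stands in for the image copy), one could warm up once and report milliseconds per copy:

```csharp
using System;
using System.Diagnostics;

public static class CopyBenchmarkSketch
{
    // Runs copyAction once as a warm-up, then 'iterations' more times,
    // and returns the average time per run in milliseconds.
    public static double MeasureMilliseconds(Action copyAction, int iterations)
    {
        copyAction(); // warm-up: JIT compilation and first-touch page faults

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            copyAction();
        sw.Stop();

        // ElapsedTicks is counted in units of 1/Stopwatch.Frequency seconds,
        // so convert explicitly instead of comparing raw tick counts.
        return sw.ElapsedTicks * 1000.0 / Stopwatch.Frequency / iterations;
    }

    public static void Main()
    {
        byte[] source = new byte[16 * 1024 * 1024];
        byte[] target = new byte[source.Length];
        new Random(1).NextBytes(source);

        double ms = MeasureMilliseconds(
            () => Buffer.BlockCopy(source, 0, target, 0, source.Length), 10);

        Console.WriteLine("Buffer.BlockCopy: {0:F3} ms per copy", ms);
    }
}
```

Passing both copy methods through the same helper keeps the comparison fair, since each one gets the same warm-up and iteration count.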
And one more intermediate method for the fast copy logic:
void FastCopyImage()
{
 FastMemCopy.FastMemoryCopy(bmpd.Scan0, bmpd2.Scan0, buffer.Length);
}
Now let's implement the FastMemCopy class. Here is the declaration of the class and some types we will use inside it:
internal static class FastMemCopy
{
 [Flags]
 private enum AllocationTypes : uint
 {
  Commit = 0x1000, Reserve = 0x2000,
  Reset = 0x80000, LargePages = 0x20000000,
  Physical = 0x400000, TopDown = 0x100000,
  WriteWatch = 0x200000
 }

 [Flags]
 private enum MemoryProtections : uint
 {
  Execute = 0x10,   ExecuteRead = 0x20,
  ExecuteReadWrite = 0x40, ExecuteWriteCopy = 0x80,
  NoAccess = 0x01,  ReadOnly = 0x02,
  ReadWrite = 0x04,  WriteCopy = 0x08,
  GuardModifierflag = 0x100, NoCacheModifierflag = 0x200,
  WriteCombineModifierflag = 0x400
 }

 [Flags]
 private enum FreeTypes : uint
 {
  Decommit = 0x4000, Release = 0x8000
 }

 [UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
 private unsafe delegate void FastMemCopyDelegate();

 private static class NativeMethods
 {
  [DllImport("kernel32.dll", SetLastError = true)]
  internal static extern IntPtr VirtualAlloc(
   IntPtr lpAddress,
   UIntPtr dwSize,
   AllocationTypes flAllocationType,
   MemoryProtections flProtect);

  [DllImport("kernel32")]
  [return: MarshalAs(UnmanagedType.Bool)]
  internal static extern bool VirtualFree(
   IntPtr lpAddress,
   uint dwSize,
   FreeTypes flFreeType);
 }
Now let's declare the method itself:
public static unsafe void FastMemoryCopy(IntPtr src, IntPtr dst, int nBytes)
{
    if (IntPtr.Size == 4)
    {
        //we are in 32 bit mode

        //allocate executable memory for our asm method
        IntPtr codeBase = NativeMethods.VirtualAlloc(
            IntPtr.Zero,
            new UIntPtr((uint)x86_FastMemCopy_New.Length),
            AllocationTypes.Commit | AllocationTypes.Reserve,
            MemoryProtections.ExecuteReadWrite);

        //VirtualAlloc returns NULL on failure
        if (codeBase == IntPtr.Zero)
            throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());

        try
        {
            //copy our method bytes to allocated memory
            Marshal.Copy(x86_FastMemCopy_New, 0, codeBase, x86_FastMemCopy_New.Length);

            //make a delegate to our method
            FastMemCopyDelegate _fastmemcopy =
                (FastMemCopyDelegate)Marshal.GetDelegateForFunctionPointer(codeBase,
                    typeof(FastMemCopyDelegate));

            //the parameters live in three 8-byte slots at the end of the method block;
            //only the low 4 bytes of each slot are used in 32 bit mode
            int end = x86_FastMemCopy_New.Length;

            //store length param
            Marshal.WriteInt32(codeBase, end - 8, nBytes);

            //store destination address param
            Marshal.WriteInt32(codeBase, end - 16, dst.ToInt32());

            //store source address param
            Marshal.WriteInt32(codeBase, end - 24, src.ToInt32());

            //start stopwatch
            Stopwatch sw = new Stopwatch();
            sw.Start();

            //copy all data 10 times
            for (int i = 0; i < 10; i++)
                _fastmemcopy();

            //stop stopwatch
            sw.Stop();

            //show message with measured time
            System.Windows.Forms.MessageBox.Show(sw.ElapsedTicks.ToString());
        }
        catch (Exception ex)
        {
            //if any exception
            System.Windows.Forms.MessageBox.Show(ex.Message);
        }
        finally
        {
            //free allocated memory: pass the base address returned by VirtualAlloc
            //and a size of 0, as required when releasing with FreeTypes.Release
            NativeMethods.VirtualFree(codeBase, 0, FreeTypes.Release);
            GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        }
    }
    else if (IntPtr.Size == 8)
    {
        throw new ApplicationException("x64 is not supported yet!");
    }
}
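The method passes its arguments by writing them into the last 24 bytes of the code block rather than pushing them on the stack. As an illustration only, here is that layout reproduced on a plain managed byte array; the `ParameterPatchSketch` class and its `Patch` method are hypothetical helpers, not part of FastMemCopy.

```csharp
using System;

public static class ParameterPatchSketch
{
    // Three 8-byte slots at the end of the block: source, destination, length.
    const int ParameterAreaSize = 24;

    // Writes a 32-bit value in little-endian order at the given offset.
    static void WriteInt32(byte[] block, int offset, int value)
    {
        byte[] bytes = BitConverter.GetBytes(value);
        if (!BitConverter.IsLittleEndian)
            Array.Reverse(bytes);
        Buffer.BlockCopy(bytes, 0, block, offset, 4);
    }

    // Fills the low 4 bytes of each slot, leaving the upper 4 bytes untouched.
    public static void Patch(byte[] block, int src, int dst, int nBytes)
    {
        if (block == null || block.Length < ParameterAreaSize)
            throw new ArgumentException("block is too small to hold the parameters");

        int tail = block.Length - ParameterAreaSize;
        WriteInt32(block, tail, src);         // source address slot
        WriteInt32(block, tail + 8, dst);     // destination address slot
        WriteInt32(block, tail + 16, nBytes); // byte count slot
    }
}
```

Because the parameters are stored inside the code block itself, two threads sharing one block would overwrite each other's arguments, so this scheme is only safe when each call has its own block or calls are serialized.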
And here is the assembly code, represented as an array of bytes with an explanation of each instruction:
private static byte[] x86_FastMemCopy_New = new byte[]
{
 0x90, //nop ; do nothing
 0x60, //pushad ; save all general-purpose registers on the stack
 0x95, //xchg ebp,eax ; the code assumes eax contains the memory address of our method
 0x8B, 0xB5, 0x5A, 0x01, 0x00, 0x00, //mov esi,[ebp][00000015A] ; get source buffer address
 0x89, 0xF0, //mov eax,esi
 0x83, 0xE0, 0x0F, //and eax,00F ; keep the low 4 bits to check if it is 16-byte aligned
 0x8B, 0xBD, 0x62, 0x01, 0x00, 0x00, //mov edi,[ebp][000000162] ; get destination address
 0x89, 0xFB, //mov ebx,edi
 0x83, 0xE3, 0x0F, //and ebx,00F ; keep the low 4 bits to check if it is 16-byte aligned
 0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, //mov ecx,[ebp][00000016A] ; get number of bytes to copy
 0xC1, 0xE9, 0x07, //shr ecx,7 ; divide length by 128
 0x85, 0xC9, //test ecx,ecx ; check if zero
 0x0F, 0x84, 0x1C, 0x01, 0x00, 0x00, //jz 000000146 ↓ ; copy the rest
 0x0F, 0x18, 0x06, //prefetchnta [esi] ; pre-fetch non-temporal source data for reading
 0x85, 0xC0, //test eax,eax ; check if source address is 16-byte aligned
 0x0F, 0x84, 0x8B, 0x00, 0x00, 0x00, //jz 0000000C0 ↓ ; go to copy if aligned
 0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, //prefetchnta [esi][000000280] ; pre-fetch more source data
 0x0F, 0x10, 0x06, //movups xmm0,[esi] ; copy 16 bytes of source data
 0x0F, 0x10, 0x4E, 0x10, //movups xmm1,[esi][010] ; copy 16 more bytes
 0x0F, 0x10, 0x56, 0x20, //movups xmm2,[esi][020] ; copy more
 0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, //prefetchnta [esi][0000002C0] ; pre-fetch more
 0x0F, 0x10, 0x5E, 0x30, //movups xmm3,[esi][030]
 0x0F, 0x10, 0x66, 0x40, //movups xmm4,[esi][040]
 0x0F, 0x10, 0x6E, 0x50, //movups xmm5,[esi][050]
 0x0F, 0x10, 0x76, 0x60, //movups xmm6,[esi][060]
 0x0F, 0x10, 0x7E, 0x70, //movups xmm7,[esi][070] ; we've copied 128 bytes of source data
 0x85, 0xDB, //test ebx,ebx ; check if destination address is 16-byte aligned
 0x74, 0x21, //jz 000000087 ↓ ; go to paste if aligned
 0x0F, 0x11, 0x07, //movups [edi],xmm0 ; paste first 16 bytes to non-aligned destination address
 0x0F, 0x11, 0x4F, 0x10, //movups [edi][010],xmm1 ; paste more
 0x0F, 0x11, 0x57, 0x20, //movups [edi][020],xmm2
 0x0F, 0x11, 0x5F, 0x30, //movups [edi][030],xmm3
 0x0F, 0x11, 0x67, 0x40, //movups [edi][040],xmm4
 0x0F, 0x11, 0x6F, 0x50, //movups [edi][050],xmm5
 0x0F, 0x11, 0x77, 0x60, //movups [edi][060],xmm6
 0x0F, 0x11, 0x7F, 0x70, //movups [edi][070],xmm7 ; we've pasted 128 bytes of source data
 0xEB, 0x1F, //jmps 0000000A6 ↓ ; continue
 0x0F, 0x2B, 0x07, //movntps [edi],xmm0 ; paste first 16 bytes to aligned destination address
 0x0F, 0x2B, 0x4F, 0x10, //movntps [edi][010],xmm1 ; paste more
 0x0F, 0x2B, 0x57, 0x20, //movntps [edi][020],xmm2
 0x0F, 0x2B, 0x5F, 0x30, //movntps [edi][030],xmm3
 0x0F, 0x2B, 0x67, 0x40, //movntps [edi][040],xmm4
 0x0F, 0x2B, 0x6F, 0x50, //movntps [edi][050],xmm5
 0x0F, 0x2B, 0x77, 0x60, //movntps [edi][060],xmm6
 0x0F, 0x2B, 0x7F, 0x70, //movntps [edi][070],xmm7 ; we've pasted 128 bytes of source data
 0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, //add esi,000000080 ; increment source address by 128
 0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, //add edi,000000080 ; increment destination address by 128
 0x83, 0xE9, 0x01, //sub ecx,1 ; decrement counter
 0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, //jnz 000000035 ↑ ; continue if not zero
 0xE9, 0x86, 0x00, 0x00, 0x00, //jmp 000000146 ↓ ; go to copy the rest of the data

 0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, //prefetchnta [esi][000000280] ; pre-fetch source data
 0x0F, 0x28, 0x06, //movaps xmm0,[esi] ; copy 128 bytes from aligned source address
 0x0F, 0x28, 0x4E, 0x10, //movaps xmm1,[esi][010] ; copy more
 0x0F, 0x28, 0x56, 0x20, //movaps xmm2,[esi][020]
 0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, //prefetchnta [esi][0000002C0] ; pre-fetch more data
 0x0F, 0x28, 0x5E, 0x30, //movaps xmm3,[esi][030]
 0x0F, 0x28, 0x66, 0x40, //movaps xmm4,[esi][040]
 0x0F, 0x28, 0x6E, 0x50, //movaps xmm5,[esi][050]
 0x0F, 0x28, 0x76, 0x60, //movaps xmm6,[esi][060]
 0x0F, 0x28, 0x7E, 0x70, //movaps xmm7,[esi][070] ; we've copied 128 bytes of source data
 0x85, 0xDB, //test ebx,ebx ; check if destination address is 16-byte aligned
 0x74, 0x21, //jz 000000112 ↓ ; go to paste if aligned
 0x0F, 0x11, 0x07, //movups [edi],xmm0 ; paste 16 bytes to non-aligned destination address
 0x0F, 0x11, 0x4F, 0x10, //movups [edi][010],xmm1 ; paste more
 0x0F, 0x11, 0x57, 0x20, //movups [edi][020],xmm2
 0x0F, 0x11, 0x5F, 0x30, //movups [edi][030],xmm3
 0x0F, 0x11, 0x67, 0x40, //movups [edi][040],xmm4
 0x0F, 0x11, 0x6F, 0x50, //movups [edi][050],xmm5
 0x0F, 0x11, 0x77, 0x60, //movups [edi][060],xmm6
 0x0F, 0x11, 0x7F, 0x70, //movups [edi][070],xmm7 ; we've pasted 128 bytes of data
 0xEB, 0x1F, //jmps 000000131 ↓ ; continue copy-paste
 0x0F, 0x2B, 0x07, //movntps [edi],xmm0 ; paste 16 bytes to aligned destination address
 0x0F, 0x2B, 0x4F, 0x10, //movntps [edi][010],xmm1 ; paste more
 0x0F, 0x2B, 0x57, 0x20, //movntps [edi][020],xmm2
 0x0F, 0x2B, 0x5F, 0x30, //movntps [edi][030],xmm3
 0x0F, 0x2B, 0x67, 0x40, //movntps [edi][040],xmm4
 0x0F, 0x2B, 0x6F, 0x50, //movntps [edi][050],xmm5
 0x0F, 0x2B, 0x77, 0x60, //movntps [edi][060],xmm6
 0x0F, 0x2B, 0x7F, 0x70, //movntps [edi][070],xmm7 ; we've pasted 128 bytes of data
 0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, //add esi,000000080 ; increment source address by 128
 0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, //add edi,000000080 ; increment destination address by 128
 0x83, 0xE9, 0x01, //sub ecx,1 ; decrement counter
 0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, //jnz 0000000C0 ↑ ; continue copy-paste if non-zero
 0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, //mov ecx,[ebp][00000016A] ; get number of bytes to copy
 0x83, 0xE1, 0x7F, //and ecx,07F ; get the number of remaining bytes
 0x85, 0xC9, //test ecx,ecx ; check if there are bytes left
 0x74, 0x02, //jz 000000155 ↓ ; exit if there are no more bytes
 0xF3, 0xA4, //rep movsb ; copy the rest of the bytes
 0x0F, 0xAE, 0xF8, //sfence ; performs a serializing operation on all store-to-memory instructions
 0x61, //popad ; restore all general-purpose registers
 0xC3, //retn ; return from our method to C#

 0x00, 0x00, 0x00, 0x00, //source buffer address
 0x00, 0x00, 0x00, 0x00,

 0x00, 0x00, 0x00, 0x00, //destination buffer address
 0x00, 0x00, 0x00, 0x00,

 0x00, 0x00, 0x00, 0x00, //number of bytes to copy
 0x00, 0x00, 0x00, 0x00
};
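To make the control flow of the assembly easier to follow, here is a rough managed analogue of its structure: copy whole 128-byte blocks in a loop, then copy whatever is left. It only sketches the block and remainder arithmetic; `BlockCopySketch` is a hypothetical name, and Buffer.BlockCopy stands in for the SSE loads and stores, so it says nothing about the speed of the real routine.

```csharp
using System;

public static class BlockCopySketch
{
    // One loop iteration moves 128 bytes: eight XMM registers of 16 bytes each.
    const int BlockSize = 128;

    public static void Copy(byte[] src, byte[] dst, int nBytes)
    {
        int blocks = nBytes >> 7;       // same as "shr ecx,7": number of full 128-byte blocks
        int remainder = nBytes & 0x7F;  // same as "and ecx,07F": bytes left after the blocks
        int offset = 0;

        // Main loop: corresponds to the movups/movaps loads and movups/movntps stores.
        for (int i = 0; i < blocks; i++)
        {
            Buffer.BlockCopy(src, offset, dst, offset, BlockSize);
            offset += BlockSize;
        }

        // Tail: corresponds to "rep movsb" for the remaining 0..127 bytes.
        if (remainder != 0)
            Buffer.BlockCopy(src, offset, dst, offset, remainder);
    }
}
```

For example, a length of 300 bytes gives two full blocks (256 bytes) and a remainder of 44 bytes.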
We call this assembly method via the delegate we created earlier.
The method works only in 32-bit mode for now; I will implement the 64-bit mode later.
I will add the source code if anyone is interested (almost all of the code is already in the article).

Points of Interest

While implementing and testing this method I found that the prefetchnta instruction is not described very clearly, even in the Intel documentation, so I tried to figure it out myself and with the help of Google.
Also, pay attention to the movntps and movaps instructions, as they work only with 16-byte aligned memory addresses.
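One common way to obtain a 16-byte aligned unmanaged buffer is to over-allocate by 15 bytes and round the pointer up. The following is a minimal sketch under that assumption; the `AlignedBuffer` class is hypothetical and not part of the article's code.

```csharp
using System;
using System.Runtime.InteropServices;

public sealed class AlignedBuffer : IDisposable
{
    // Pointer returned by AllocHGlobal; this is the one that must be freed.
    IntPtr raw;

    // First 16-byte aligned address inside the allocation.
    public IntPtr Aligned { get; private set; }

    // Usable size in bytes starting at Aligned.
    public int Size { get; private set; }

    public AlignedBuffer(int size)
    {
        // Over-allocate so that an aligned region of 'size' bytes always fits.
        raw = Marshal.AllocHGlobal(size + 15);
        long address = raw.ToInt64();
        Aligned = new IntPtr((address + 15) & ~15L);
        Size = size;
    }

    public void Dispose()
    {
        if (raw != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(raw);
            raw = IntPtr.Zero;
            Aligned = IntPtr.Zero;
        }
    }
}
```

A buffer created this way can be wrapped in a `using` block, and its `Aligned` pointer satisfies the alignment that movaps and movntps require.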
