Performance of compiled-to-delegate Expression

I'm generating an expression tree that maps properties from a source object to a destination object. The tree is then compiled to a Func delegate and executed.

This is the debug view of the resulting LambdaExpression:

.Lambda #Lambda1(
    MemberMapper.Benchmarks.Program+ComplexSourceType $right,
    MemberMapper.Benchmarks.Program+ComplexDestinationType $left) {
    .Block(
        MemberMapper.Benchmarks.Program+NestedSourceType $Complex$955332131,
        MemberMapper.Benchmarks.Program+NestedDestinationType $Complex$2105709326) {
        $left.ID = $right.ID;
        $Complex$955332131 = $right.Complex;
        $Complex$2105709326 = .New MemberMapper.Benchmarks.Program+NestedDestinationType();
        $Complex$2105709326.ID = $Complex$955332131.ID;
        $Complex$2105709326.Name = $Complex$955332131.Name;
        $left.Complex = $Complex$2105709326;
        $left
    }
}

Cleaned up it would be:

(right, left) =>
{
    left.ID = right.ID;
    var complexSource = right.Complex;
    var complexDestination = new NestedDestinationType();
    complexDestination.ID = complexSource.ID;
    complexDestination.Name = complexSource.Name;
    left.Complex = complexDestination;
    return left;
}

That's the code that maps the properties on these types:

public class NestedSourceType
{
  public int ID { get; set; }
  public string Name { get; set; }
}

public class ComplexSourceType
{
  public int ID { get; set; }
  public NestedSourceType Complex { get; set; }
}

public class NestedDestinationType
{
  public int ID { get; set; }
  public string Name { get; set; }
}

public class ComplexDestinationType
{
  public int ID { get; set; }
  public NestedDestinationType Complex { get; set; }
}
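For reference, an expression tree with the shape shown in the debug view could be built roughly as follows. This is only a minimal sketch and not the actual MemberMapper code: `MappingExpressionSketch` and `Build` are hypothetical names, and the types are repeated so the snippet stands on its own.

```csharp
using System;
using System.Linq.Expressions;

public class NestedSourceType { public int ID { get; set; } public string Name { get; set; } }
public class ComplexSourceType { public int ID { get; set; } public NestedSourceType Complex { get; set; } }
public class NestedDestinationType { public int ID { get; set; } public string Name { get; set; } }
public class ComplexDestinationType { public int ID { get; set; } public NestedDestinationType Complex { get; set; } }

// Hypothetical helper, for illustration only.
public static class MappingExpressionSketch
{
    public static Func<ComplexSourceType, ComplexDestinationType, ComplexDestinationType> Build()
    {
        // Parameters in the same order as the debug view: source ($right), then destination ($left).
        var right = Expression.Parameter(typeof(ComplexSourceType), "right");
        var left = Expression.Parameter(typeof(ComplexDestinationType), "left");

        // Block-scoped locals for the nested objects.
        var complexSource = Expression.Variable(typeof(NestedSourceType), "complexSource");
        var complexDestination = Expression.Variable(typeof(NestedDestinationType), "complexDestination");

        var body = Expression.Block(
            new[] { complexSource, complexDestination },
            Expression.Assign(Expression.Property(left, "ID"), Expression.Property(right, "ID")),
            Expression.Assign(complexSource, Expression.Property(right, "Complex")),
            Expression.Assign(complexDestination, Expression.New(typeof(NestedDestinationType))),
            Expression.Assign(Expression.Property(complexDestination, "ID"), Expression.Property(complexSource, "ID")),
            Expression.Assign(Expression.Property(complexDestination, "Name"), Expression.Property(complexSource, "Name")),
            Expression.Assign(Expression.Property(left, "Complex"), complexDestination),
            left); // the last expression is the value of the block

        var lambda = Expression.Lambda<Func<ComplexSourceType, ComplexDestinationType, ComplexDestinationType>>(
            body, right, left);

        return lambda.Compile();
    }
}
```

Calling `MappingExpressionSketch.Build()` once and reusing the returned delegate keeps the compilation cost out of the mapping itself.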

The manual code to do this is:

var destination = new ComplexDestinationType
{
  ID = source.ID,
  Complex = new NestedDestinationType
  {
    ID = source.Complex.ID,
    Name = source.Complex.Name
  }
};

The problem is that when I compile the LambdaExpression and benchmark the resulting delegate, it is about 10x slower than the manual version, and I have no idea why. The whole point of this is maximum performance without the tedium of manual mapping.

When I take code by Bart de Smet from his blog post on this topic and benchmark the manual version of calculating prime numbers versus the compiled expression tree, they are completely identical in performance.

What can cause this huge difference when the debug view of the LambdaExpression looks like what you would expect?

EDIT

As requested I added the benchmark I used:

public static ComplexDestinationType Foo;

static void Benchmark()
{

  var mapper = new DefaultMemberMapper();

  var map = mapper.CreateMap(typeof(ComplexSourceType),
                             typeof(ComplexDestinationType)).FinalizeMap();

  var source = new ComplexSourceType
  {
    ID = 5,
    Complex = new NestedSourceType
    {
      ID = 10,
      Name = "test"
    }
  };

  var sw = Stopwatch.StartNew();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = new ComplexDestinationType
    {
      ID = source.ID + i,
      Complex = new NestedDestinationType
      {
        ID = source.Complex.ID + i,
        Name = source.Complex.Name
      }
    };
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);

  sw.Restart();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = mapper.Map(source);
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);

  var func = (Func<ComplexSourceType, ComplexDestinationType, ComplexDestinationType>)
             map.MappingFunction;

  var destination = new ComplexDestinationType();

  sw.Restart();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = func(source, new ComplexDestinationType());
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);
}

The second one is understandably slower than the manual version, since it involves a dictionary lookup and a few object instantiations. The third one, however, should be just as fast: it invokes the raw delegate, and the cast from Delegate to Func happens outside the loop.

I tried wrapping the manual code in a function as well, but I recall that it didn't make a noticeable difference. Either way, a function call shouldn't add an order of magnitude of overhead.

I also run the benchmark twice to make sure JIT compilation isn't skewing the results.
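One way to keep JIT compilation out of the measurement is to make an untimed warm-up call before starting the stopwatch. The helper below is only a sketch of that idea: `BenchmarkSketch` and `Measure` are hypothetical names that are not part of the benchmark above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, for illustration only.
public static class BenchmarkSketch
{
    public static TimeSpan Measure(int iterations, Action action)
    {
        // Untimed warm-up call, so that one-time JIT compilation is not measured.
        action();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();

        return sw.Elapsed;
    }
}
```

Passing each variant in as an `Action` adds the same delegate-invocation overhead to every measurement, so the variants stay comparable to each other even though the absolute numbers go up slightly.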

EDIT

You can get the code for this project here:

https://github.com/JulianR/MemberMapper/

I used the Son of Strike (SOS) debugger extension, as described in that blog post by Bart de Smet, to dump the generated IL of the dynamic method:

IL_0000: ldarg.2 
IL_0001: ldarg.1 
IL_0002: callvirt 6000003 ComplexSourceType.get_ID()
IL_0007: callvirt 6000004 ComplexDestinationType.set_ID(Int32)
IL_000c: ldarg.1 
IL_000d: callvirt 6000005 ComplexSourceType.get_Complex()
IL_0012: brfalse IL_0043
IL_0017: ldarg.1 
IL_0018: callvirt 6000006 ComplexSourceType.get_Complex()
IL_001d: stloc.0 
IL_001e: newobj 6000007 NestedDestinationType..ctor()
IL_0023: stloc.1 
IL_0024: ldloc.1 
IL_0025: ldloc.0 
IL_0026: callvirt 6000008 NestedSourceType.get_ID()
IL_002b: callvirt 6000009 NestedDestinationType.set_ID(Int32)
IL_0030: ldloc.1 
IL_0031: ldloc.0 
IL_0032: callvirt 600000a NestedSourceType.get_Name()
IL_0037: callvirt 600000b NestedDestinationType.set_Name(System.String)
IL_003c: ldarg.2 
IL_003d: ldloc.1 
IL_003e: callvirt 600000c ComplexDestinationType.set_Complex(NestedDestinationType)
IL_0043: ldarg.2 
IL_0044: ret 

I'm no expert at IL, but this seems pretty straightforward and exactly what you would expect, no? Then why is it so slow? No weird boxing operations, no hidden instantiations, nothing. It's not exactly the same as the expression tree above, as there's now also a null check on right.Complex.

This is the code for the manual version (obtained through Reflector):

L_0000: ldarg.1 
L_0001: ldarg.0 
L_0002: callvirt instance int32 ComplexSourceType::get_ID()
L_0007: callvirt instance void ComplexDestinationType::set_ID(int32)
L_000c: ldarg.0 
L_000d: callvirt instance class NestedSourceType ComplexSourceType::get_Complex()
L_0012: brfalse.s L_0040
L_0014: ldarg.0 
L_0015: callvirt instance class NestedSourceType ComplexSourceType::get_Complex()
L_001a: stloc.0 
L_001b: newobj instance void NestedDestinationType::.ctor()
L_0020: stloc.1 
L_0021: ldloc.1 
L_0022: ldloc.0 
L_0023: callvirt instance int32 NestedSourceType::get_ID()
L_0028: callvirt instance void NestedDestinationType::set_ID(int32)
L_002d: ldloc.1 
L_002e: ldloc.0 
L_002f: callvirt instance string NestedSourceType::get_Name()
L_0034: callvirt instance void NestedDestinationType::set_Name(string)
L_0039: ldarg.1 
L_003a: ldloc.1 
L_003b: callvirt instance void ComplexDestinationType::set_Complex(class NestedDestinationType)
L_0040: ldarg.1 
L_0041: ret 

Looks identical to me.
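Read back into C#, the manual IL above corresponds roughly to the static method below. This is a reconstruction from the IL listing and not the original source: `ManualMapper`, `Map` and the local variable names are assumptions, and the types are repeated so the snippet stands on its own.

```csharp
using System;

public class NestedSourceType { public int ID { get; set; } public string Name { get; set; } }
public class ComplexSourceType { public int ID { get; set; } public NestedSourceType Complex { get; set; } }
public class NestedDestinationType { public int ID { get; set; } public string Name { get; set; } }
public class ComplexDestinationType { public int ID { get; set; } public NestedDestinationType Complex { get; set; } }

// Hypothetical name; reconstructed from the IL, for illustration only.
public static class ManualMapper
{
    public static ComplexDestinationType Map(ComplexSourceType source, ComplexDestinationType destination)
    {
        destination.ID = source.ID;

        // Corresponds to the brfalse.s branch: skip the nested mapping when Complex is null.
        if (source.Complex != null)
        {
            var complexSource = source.Complex;                   // stloc.0
            var complexDestination = new NestedDestinationType(); // stloc.1
            complexDestination.ID = complexSource.ID;
            complexDestination.Name = complexSource.Name;
            destination.Complex = complexDestination;
        }

        return destination;
    }
}
```

The only visible differences from the dynamic method's IL are the argument indices (the static method has no hidden first argument) and the short branch form.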

EDIT

I followed the link in Michael B's answer about this topic, tried implementing the trick from the accepted answer there, and it worked! In summary, the trick creates a dynamic assembly and compiles the expression tree into a static method in that assembly, and for some reason that's 10x faster. A downside is that my benchmark classes were internal (actually, public classes nested in an internal one), and an exception was thrown when the generated code tried to access them because they weren't accessible. There doesn't seem to be a workaround for that, but I can simply detect whether the referenced types are internal and choose the compilation approach accordingly.
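The detection step could look roughly like the sketch below. It relies on `Type.IsVisible`, which is false for a public class nested in an internal one. All the names here (`VisibilitySketch`, `AllVisible`, `Outer`, `PublicNestedInInternal`, `PublicTopLevel`) are hypothetical and not part of MemberMapper.

```csharp
using System;
using System.Linq;

internal class Outer
{
    // Declared public, but not reachable from other assemblies because Outer is internal.
    public class PublicNestedInInternal { }
}

public class PublicTopLevel { }

// Hypothetical helper, for illustration only.
public static class VisibilitySketch
{
    // True only if every given type can be referenced from another assembly.
    public static bool AllVisible(params Type[] types)
    {
        return types.All(t => t.IsVisible);
    }
}
```

The idea would be to take the dynamic-assembly path only when `AllVisible` returns true for every type the expression tree touches (including the types of mapped properties), and to fall back to the regular `Compile()` otherwise.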

What still bugs me though is why that prime numbers method is identical in performance to the compiled expression tree.

And again, I welcome anyone to run the code at that GitHub repository to confirm my measurements and to make sure I'm not crazy :)

Asked by JulianR on 1 March 2011 at 23:44