Understanding Dart-generated assembly for performance analysis

I’m trying to study Dart performance in more depth, and I think the next step is to do what @mraleph does in blog posts like this one and talks like this one: look at the assembly code generated by dart compile.

Here’s my initial experimentation. I decided to compare iteration over a normal Dart List<int> with iteration over a typed-data Int64List. I assumed that the typed-data list would be as fast or faster than the generic Dart list. To my surprise, the generic Dart list seems to be faster. (Remember, this is a toy microbenchmark, and i_have_no_idea_what_im_doing.png, so this is likely wrong, and even if it isn’t, it shouldn’t inform any real-world engineering decisions.)

Here’s the code:

import 'dart:io';
import 'dart:typed_data';

void main(List<String> args) {
  final n = int.parse(args[0]);

  const variant = String.fromEnvironment('variant');
  switch (variant) {
    case 'typed':
      runTyped(n);
    case 'basic':
      runBasic(n);
    default:
      stderr.writeln("Must provide variant to run during compilation.");
      exit(1);
  }
}

/// A trick from C++ benchmark libraries to prevent the compiler from
/// optimizing away an unused value.
@pragma('vm:never-inline')
void doNotOptimizeAway(int value) {
  // This will never be true but the compiler can't prove it.
  if (DateTime.timestamp().millisecondsSinceEpoch == 0) {
    print(value);
  }
}

void runBasic(int n) {
  final collection = List<int>.generate(n, (index) => index);

  final stopwatch = Stopwatch()..start();
  _forLoopBasic(collection);
  stopwatch.stop();

  print('List.generate(): ${stopwatch.elapsedMicroseconds}us');
}

void runTyped(int n) {
  final collection = Int64List(n);
  for (int i = 0; i < n; i++) {
    collection[i] = i;
  }

  final stopwatch = Stopwatch()..start();
  _forLoopTyped(collection);
  stopwatch.stop();

  print('TypedData: ${stopwatch.elapsedMicroseconds}us');
}

@pragma('vm:never-inline')
void _forLoopBasic(List<int> collection) {
  var result = 0;
  for (int i = 0; i < collection.length; i++) {
    result += collection[i];
  }
  doNotOptimizeAway(result);
}

@pragma('vm:never-inline')
void _forLoopTyped(Int64List collection) {
  var result = 0;
  for (int i = 0; i < collection.length; i++) {
    result += collection[i];
  }
  doNotOptimizeAway(result);
}

Now, compiling an exe and running it with n=10,000,000 gives me about 6700–7600us for basic, and 8847–13500us for typed.

I compile with the following command (taken from Slava’s talk linked above):

dart compile exe --define=variant=basic \
  --extra-gen-snapshot-options --disassemble-optimized \
  --extra-gen-snapshot-options --code-comments \
  --extra-gen-snapshot-options --print-flow-graph-filter=forLoopBasic \
  -v bin/test_collection.dart

This gives me very similar output for both variants.

basic
Specializing Platform getters for target OS macos.
Generating AOT kernel dill.
Compiling /Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart to /Users/filiph/dev/benchmarkhor/example/bin/test_collection.exe using format Kind.exe:
Generating AOT snapshot. /Users/filiph/fvm/versions/stable/bin/cache/dart-sdk/bin/utils/gen_snapshot [--disassemble-optimized, --code-comments, --print-flow-graph-filter=forLoopBasic]
Code for optimized function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' (RegularFunction) {
        ;; B0
        ;; B1
        ;; Enter frame
0x1053b2b90    a9bf79fd               stp fp, lr, [sp, #-16]!
0x1053b2b94    aa0f03fd               mov fp, sp
        ;; CheckStackOverflow:8(stack=0, loop=0)
0x1053b2b98    f9401f50               ldr tmp, [thr, #56]
0x1053b2b9c    eb1001ff               cmp sp, tmp
0x1053b2ba0    54000369               bls 0x1053b2c0c
        ;; v21 <- LoadField:32(v2 T{_GrowableList} . GrowableObjectArray.length) [0, 576460752303423487] T{_Smi}
0x1053b2ba4    f840f020               ldr r0, [r1, #15]
        ;; v29 <- UnboxInt64:32([non-speculative], v21 T{_Smi}) [0, 576460752303423487] int64
0x1053b2ba8    9341fc02               asr r2, r0, #1
        ;; v15 <- LoadField:32(v2 . GrowableObjectArray.data) T{_List}
0x1053b2bac    f8417020               ldr r0, [r1, #23]
        ;; ParallelMove r3 <- C, r1 <- C goto:32 B5
0x1053b2bb0    d2800003               movz r3, #0x0
0x1053b2bb4    d2800001               movz r1, #0x0
        ;; B5
        ;;   Loop 0
        ;;   Loop Header
        ;; CheckStackOverflow:36(stack=0, loop=1)
0x1053b2bb8    f9401f50               ldr tmp, [thr, #56]
0x1053b2bbc    eb1001ff               cmp sp, tmp
0x1053b2bc0    540002a9               bls 0x1053b2c14
        ;; Branch if RelationalOp(<, v5, v29 T{_Smi}) T{bool} goto (3, 4)
0x1053b2bc4    eb02003f               cmp r1, r2
0x1053b2bc8    5400016a               bge 0x1053b2bf4
        ;; B3
        ;;   Loop 0
        ;; v34 <- LoadIndexed:26([_List] v15, v5 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
0x1053b2bcc    8b010c10               add tmp, r0, r1 lsl #3
0x1053b2bd0    f8417204               ldr r4, [tmp, #23]
        ;; v32 <- UnboxInt64([non-speculative], v34 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x1053b2bd4    9341fc85               asr r5, r4, #1
0x1053b2bd8    36000044               tbzw r4, #0, 0x1053b2be0
0x1053b2bdc    f8407085               ldr r5, [r4, #7]
        ;; v10 <- BinaryInt64Op(+ [tr], v4, v32 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x1053b2be0    8b050064               add r4, r3, r5
        ;; v12 <- BinaryInt64Op(+ [tr], v5 T{int}, v35 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
0x1053b2be4    91000425               add r5, r1, #0x1
        ;; ParallelMove r3 <- r4, r1 <- r5 goto:34 B5
0x1053b2be8    aa0403e3               mov r3, r4
0x1053b2bec    aa0503e1               mov r1, r5
0x1053b2bf0    17fffff2               b 0x1053b2bb8
        ;; B4
        ;; ParallelMove r1 <- r3
0x1053b2bf4    aa0303e1               mov r1, r3
        ;; StaticCall:38( doNotOptimizeAway<0> v4)
0x1053b2bf8    94000000               bl 0x1053b2bf8
        ;; ParallelMove r0 <- C
0x1053b2bfc    aa1603e0               mov r0, nr
        ;; DartReturn:40(v0)
0x1053b2c00    aa1d03ef               mov sp, fp
0x1053b2c04    a8c179fd               ldp fp, lr, [sp], #16 !
0x1053b2c08    d65f03c0               ret
        ;; CheckStackOverflowSlowPath
0x1053b2c0c    94000000               bl 0x1053b2c0c
0x1053b2c10    17ffffe5               b 0x1053b2ba4
        ;; CheckStackOverflowSlowPath
0x1053b2c14    94000000               bl 0x1053b2c14
0x1053b2c18    17ffffeb               b 0x1053b2bc4
}
Source positions for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' {
1053b2b90-1053b2bfb: Function '_forLoopBasic@13220660': static.@58
1053b2bfc-1053b2c0f: Function '_forLoopBasic@13220660': static.@52
1053b2c10-1053b2c17: Function '_forLoopBasic@13220660': static.@55
}
StackMaps for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' {
0x1053b2bfc: 
0x1053b2c10: 000000000000000000010
0x1053b2c18: 000000000000000000001
}
Catch entry moves for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' {
}
Entry points for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' {
  [code+0x07] 1053b2b90 kNormal
  [code+0x0f] 1053b2b90 kMonomorphic
  [code+0x17] 1053b2b90 kUnchecked
  [code+0x1f] 1053b2b90 kMonomorphicUnchecked
}
Static call target functions {
  0x1053b2bfc: file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::_doNotOptimizeAway, (pc-relative-call)
  0x1053b2c10: [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub, (pc-relative-call)
  0x1053b2c18: [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub, (pc-relative-call)
}
Generating executable.
Marking binary executable.
Generated: /Users/filiph/dev/benchmarkhor/example/bin/test_collection.exe
typed
Specializing Platform getters for target OS macos.
Generating AOT kernel dill.
Compiling /Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart to /Users/filiph/dev/benchmarkhor/example/bin/test_collection.exe using format Kind.exe:
Generating AOT snapshot. /Users/filiph/fvm/versions/stable/bin/cache/dart-sdk/bin/utils/gen_snapshot [--disassemble-optimized, --code-comments, --print-flow-graph-filter=forLoopTyped]
Code for optimized function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' (RegularFunction) {
        ;; B0
        ;; B1
        ;; Enter frame
0x105c32b40    a9bf79fd               stp fp, lr, [sp, #-16]!
0x105c32b44    aa0f03fd               mov fp, sp
        ;; CheckStackOverflow:8(stack=0, loop=0)
0x105c32b48    f9401f50               ldr tmp, [thr, #56]
0x105c32b4c    eb1001ff               cmp sp, tmp
0x105c32b50    540002e9               bls 0x105c32bac
        ;; v20 <- LoadField:32(v2 T{_Int64List} . TypedDataBase.length {final}) [0, 4611686018427387903] T{_Smi}
0x105c32b54    f840f020               ldr r0, [r1, #15]
        ;; v28 <- UnboxInt64:32([non-speculative], v20 T{_Smi}) [0, 4611686018427387903] int64
0x105c32b58    9341fc02               asr r2, r0, #1
        ;; ParallelMove r3 <- C, r0 <- C goto:32 B5
0x105c32b5c    d2800003               movz r3, #0x0
0x105c32b60    d2800000               movz r0, #0x0
        ;; B5
        ;;   Loop 0
        ;;   Loop Header
        ;; CheckStackOverflow:36(stack=0, loop=1)
0x105c32b64    f9401f50               ldr tmp, [thr, #56]
0x105c32b68    eb1001ff               cmp sp, tmp
0x105c32b6c    54000249               bls 0x105c32bb4
        ;; Branch if RelationalOp(<, v5, v28 T{_Smi}) T{bool} goto (3, 4)
0x105c32b70    eb02001f               cmp r0, r2
0x105c32b74    5400010a               bge 0x105c32b94
        ;; B3
        ;;   Loop 0
        ;; v32 <- LoadIndexed:26([_Int64List] v2, v5 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x105c32b78    8b000c30               add tmp, r1, r0 lsl #3
0x105c32b7c    f8417204               ldr r4, [tmp, #23]
        ;; v10 <- BinaryInt64Op(+ [tr], v4, v32 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x105c32b80    8b040065               add r5, r3, r4
        ;; v12 <- BinaryInt64Op(+ [tr], v5 T{int}, v33 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
0x105c32b84    91000404               add r4, r0, #0x1
        ;; ParallelMove r3 <- r5, r0 <- r4 goto:34 B5
0x105c32b88    aa0503e3               mov r3, r5
0x105c32b8c    aa0403e0               mov r0, r4
0x105c32b90    17fffff5               b 0x105c32b64
        ;; B4
        ;; ParallelMove r1 <- r3
0x105c32b94    aa0303e1               mov r1, r3
        ;; StaticCall:38( doNotOptimizeAway<0> v4)
0x105c32b98    94000000               bl 0x105c32b98
        ;; ParallelMove r0 <- C
0x105c32b9c    aa1603e0               mov r0, nr
        ;; DartReturn:40(v0)
0x105c32ba0    aa1d03ef               mov sp, fp
0x105c32ba4    a8c179fd               ldp fp, lr, [sp], #16 !
0x105c32ba8    d65f03c0               ret
        ;; CheckStackOverflowSlowPath
0x105c32bac    94000000               bl 0x105c32bac
0x105c32bb0    17ffffe9               b 0x105c32b54
        ;; CheckStackOverflowSlowPath
0x105c32bb4    94000000               bl 0x105c32bb4
0x105c32bb8    17ffffee               b 0x105c32b70
}
Source positions for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' {
105c32b40-105c32b9b: Function '_forLoopTyped@13220660': static.@67
105c32b9c-105c32baf: Function '_forLoopTyped@13220660': static.@61
105c32bb0-105c32bb7: Function '_forLoopTyped@13220660': static.@64
}
StackMaps for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' {
0x105c32b9c: 
0x105c32bb0: 000000000000000000010
0x105c32bb8: 000000000000000000010
}
Catch entry moves for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' {
}
Entry points for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' {
  [code+0x07] 105c32b40 kNormal
  [code+0x0f] 105c32b40 kMonomorphic
  [code+0x17] 105c32b40 kUnchecked
  [code+0x1f] 105c32b40 kMonomorphicUnchecked
}
Static call target functions {
  0x105c32b9c: file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::_doNotOptimizeAway, (pc-relative-call)
  0x105c32bb0: [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub, (pc-relative-call)
  0x105c32bb8: [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub, (pc-relative-call)
}
Generating executable.
Marking binary executable.
Generated: /Users/filiph/dev/benchmarkhor/example/bin/test_collection.exe

I used an LLM (ChatGPT 4o) to help me compare the two outputs side by side. I haven’t fully audited the LLM’s output yet, so take it with a grain of salt, but it seems correct.

Output 1 (basic)                                                    Output 2 (typed)
--------------------------------------------------------------------------------------------------------
Code for optimized function 'file:///..._::_forLoopBasic@13220660'  Code for optimized function 'file:///..._::_forLoopTyped@13220660'
        ;; B0                                                          ;; B0
        ;; B1                                                          ;; B1
        ;; Enter frame                                                 ;; Enter frame
0x1053b2b90    a9bf79fd               stp fp, lr, [sp, #-16]!         0x105c32b40    a9bf79fd               stp fp, lr, [sp, #-16]!
0x1053b2b94    aa0f03fd               mov fp, sp                      0x105c32b44    aa0f03fd               mov fp, sp
        ;; CheckStackOverflow:8(stack=0, loop=0)                      ;; CheckStackOverflow:8(stack=0, loop=0)
0x1053b2b98    f9401f50               ldr tmp, [thr, #56]             0x105c32b48    f9401f50               ldr tmp, [thr, #56]
0x1053b2b9c    eb1001ff               cmp sp, tmp                     0x105c32b4c    eb1001ff               cmp sp, tmp
0x1053b2ba0    54000369               bls 0x1053b2c0c                 0x105c32b50    540002e9               bls 0x105c32bac
        ;; v21 <- LoadField:32(v2 T{_GrowableList} . Growable...      ;; v20 <- LoadField:32(v2 T{_Int64List} . TypedDataBase.len...
0x1053b2ba4    f840f020               ldr r0, [r1, #15]               0x105c32b54    f840f020               ldr r0, [r1, #15]
        ;; v29 <- UnboxInt64:32([non-speculative], v21 T{_Smi})...    ;; v28 <- UnboxInt64:32([non-speculative], v20 T{_Smi})...
0x1053b2ba8    9341fc02               asr r2, r0, #1                  0x105c32b58    9341fc02               asr r2, r0, #1
        ;; v15 <- LoadField:32(v2 . GrowableObjectArray.data) T{_List} 
0x1053b2bac    f8417020               ldr r0, [r1, #23]               
        ;; ParallelMove r3 <- C, r1 <- C goto:32 B5                   ;; ParallelMove r3 <- C, r0 <- C goto:32 B5
0x1053b2bb0    d2800003               movz r3, #0x0                   0x105c32b5c    d2800003               movz r3, #0x0
0x1053b2bb4    d2800001               movz r1, #0x0                   0x105c32b60    d2800000               movz r0, #0x0
        ;; B5                                                          ;; B5
        ;;   Loop 0                                                    ;;   Loop 0
        ;;   Loop Header                                               ;;   Loop Header
        ;; CheckStackOverflow:36(stack=0, loop=1)                     ;; CheckStackOverflow:36(stack=0, loop=1)
0x1053b2bb8    f9401f50               ldr tmp, [thr, #56]             0x105c32b64    f9401f50               ldr tmp, [thr, #56]
0x1053b2bbc    eb1001ff               cmp sp, tmp                     0x105c32b68    eb1001ff               cmp sp, tmp
0x1053b2bc0    540002a9               bls 0x1053b2c14                 0x105c32b6c    54000249               bls 0x105c32bb4
        ;; Branch if RelationalOp(<, v5, v29 T{_Smi}) T{bool}...      ;; Branch if RelationalOp(<, v5, v28 T{_Smi}) T{bool}...
0x1053b2bc4    eb02003f               cmp r1, r2                      0x105c32b70    eb02001f               cmp r0, r2
0x1053b2bc8    5400016a               bge 0x1053b2bf4                 0x105c32b74    5400010a               bge 0x105c32b94
        ;; B3                                                          ;; B3
        ;;   Loop 0                                                    ;;   Loop 0
        ;; v34 <- LoadIndexed:26([_List] v15, v5 T{int}) [-92233...    ;; v32 <- LoadIndexed:26([_Int64List] v2, v5 T{int}) [-9...
0x1053b2bcc    8b010c10               add tmp, r0, r1 lsl #3          0x105c32b78    8b000c30               add tmp, r1, r0 lsl #3
0x1053b2bd0    f8417204               ldr r4, [tmp, #23]              0x105c32b7c    f8417204               ldr r4, [tmp, #23]
        ;; v32 <- UnboxInt64([non-speculative], v34 T{int}) [-9...    
0x1053b2bd4    9341fc85               asr r5, r4, #1                  
0x1053b2bd8    36000044               tbzw r4, #0, 0x1053b2be0        
0x1053b2bdc    f8407085               ldr r5, [r4, #7]                
        ;; v10 <- BinaryInt64Op(+ [tr], v4, v32 T{int}) [-92233...    ;; v10 <- BinaryInt64Op(+ [tr], v4, v32 T{int}) [-92233...
0x1053b2be0    8b050064               add r4, r3, r5                  0x105c32b80    8b040065               add r5, r3, r4
        ;; v12 <- BinaryInt64Op(+ [tr], v5 T{int}, v35 T{_Smi})...    ;; v12 <- BinaryInt64Op(+ [tr], v5 T{int}, v33 T{_Smi})...
0x1053b2be4    91000425               add r5, r1, #0x1                0x105c32b84    91000404               add r4, r0, #0x1
        ;; ParallelMove r3 <- r4, r1 <- r5 goto:34 B5                 ;; ParallelMove r3 <- r5, r0 <- r4 goto:34 B5
0x1053b2be8    aa0403e3               mov r3, r4                      0x105c32b88    aa0503e3               mov r3, r5
0x1053b2bec    aa0503e1               mov r1, r5                      0x105c32b8c    aa0403e0               mov r0, r4
0x1053b2bf0    17fffff2               b 0x1053b2bb8                   0x105c32b90    17fffff5               b 0x105c32b64
        ;; B4                                                          ;; B4
        ;; ParallelMove r1 <- r3                                       ;; ParallelMove r1 <- r3
0x1053b2bf4    aa0303e1               mov r1, r3                      0x105c32b94    aa0303e1               mov r1, r3
        ;; StaticCall:38( doNotOptimizeAway<0> v4)                    ;; StaticCall:38( doNotOptimizeAway<0> v4)
0x1053b2bf8    94000000               bl 0x1053b2bf8                  0x105c32b98    94000000               bl 0x105c32b98
        ;; ParallelMove r0 <- C                                        ;; ParallelMove r0 <- C
0x1053b2bfc    aa1603e0               mov r0, nr                      0x105c32b9c    aa1603e0               mov r0, nr
        ;; DartReturn:40(v0)                                           ;; DartReturn:40(v0)
0x1053b2c00    aa1d03ef               mov sp, fp                      0x105c32ba0    aa1d03ef               mov sp, fp
0x1053b2c04    a8c179fd               ldp fp, lr, [sp], #16 !         0x105c32ba4    a8c179fd               ldp fp, lr, [sp], #16 !
0x1053b2c08    d65f03c0               ret                             0x105c32ba8    d65f03c0               ret
        ;; CheckStackOverflowSlowPath                                  ;; CheckStackOverflowSlowPath
0x1053b2c0c    94000000               bl 0x1053b2c0c                  0x105c32bac    94000000               bl 0x105c32bac
0x1053b2c10    17ffffe5               b 0x1053b2ba4                   0x105c32bb0    17ffffe9               b 0x105c32b54
        ;; CheckStackOverflowSlowPath                                  ;; CheckStackOverflowSlowPath
0x1053b2c14    94000000               bl 0x1053b2c14                  0x105c32bb4    94000000               bl 0x105c32bb4
0x1053b2c18    17ffffeb               b 0x1053b2bc4                   0x105c32bb8    17ffffee               b 0x105c32b70
}

It looks like the typed-data variant has slightly less code, and I don’t see any significant differences between the two assembly listings. In fact, the basic variant seems to be unboxing inside the loop while the typed one isn’t, which, if I understand correctly, should make basic slower (as one would expect anyway). That said, the last time I did anything with assembly was decades ago, and I barely know the basic operations.

If someone more knowledgeable than I (wink wink @mraleph) could give this a quick look-see, I’d be immensely grateful. I must assume I’m doing something incredibly stupid, but I don’t know what.


Last year I came across this kind of problem when trying to compare performance between Dart and native C. I was seeing too big a difference between C and Dart in an n-body simulation, and @mraleph helped me understand this (a little bit) on Twitter.

There is something related to boxing/unboxing here which @mraleph will probably explain. Anyway, if you use a List of a class that wraps your int, you will see what you expected to happen.
For example, if you use

class IntBasic {
  int value;
  IntBasic(this.value);
}

and use List<IntBasic> instead of List<int>, you will notice that boxing/unboxing degrades performance. The variant=basic here goes from about 8000 ms to 12000 ms.

This happens, for example, when using the Vector class.

I know there are tons of papers on this topic, but my knowledge is only the tip of the iceberg.

Update: briefly, List<int> still behaves like a list of value types, while List<IntBasic> is a list of references.


In general, measuring once is bound to yield confusing results. There are various sources of performance jitter, including CPU frequency scaling and migration between performance and efficiency cores.

Try measuring repeatedly, though measuring repeatedly has its own challenges: e.g. if your input data is the same every time, your measurement might not be representative of a real-world scenario, because the CPU will prime itself on the input data and hide various overheads which might otherwise be present in real-world scenarios (branch mispredictions, cache misses).

Going back to the original code in question: when comparing Int64List vs List<int>, it is important to understand the difference in data representation before diving into the code, because it is the difference in representation that explains the difference in generated code and, subsequently, possible differences in performance. You need to understand that difference before you can even write a benchmark that showcases it.

  • The default List implementation (e.g. what List.generate, List.filled, list literals and so on produce) is a generic container which stores all elements by reference and might be growable or fixed-length.
  • Int64List is a specialized container which directly stores 64-bit integer values.

Int64List is simple: if you have a 64-bit integer and you want to put it into an Int64List, or if you have an Int64List and you load an element, it is pretty clear how this happens[1].

List is more complicated[2]: you need to box integer values to store them as references. The Dart VM uses tagged pointers to avoid allocating boxes for “small” integer values. These values are only “small” by name: in reality a small integer spans from -2^(ptrBits-2) to 2^(ptrBits-2)-1, where ptrBits is the size of a pointer in bits:

  • 32 on 32-bit platforms
  • 32 on 64-bit platforms with compressed pointers
  • 64 on 64-bit platforms

These small integers are tagged by shifting them left by 1 (and untagged by an arithmetic right shift by 1). If the value does not fit into the smi (small integer) range, then a proper box is allocated for it on the heap instead. The arithmetic required for tagging and untagging is rather cheap in terms of CPU cycles, though the fallback (which handles boxed values) introduces a branch+load (untagging) or a branch+allocation+store (tagging). But if you never hit large integers, the CPU learns that these branches are never taken and they kinda sorta fade into the background[3].
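
For intuition, the tagging arithmetic looks roughly like this (an illustrative sketch only; the real thing happens in the VM’s generated code, not in Dart source):

// Illustrative only: a smi is stored shifted left by 1 (low bit clear),
// and untagging is a single arithmetic right shift.
int tagSmi(int value) => value << 1;
int untagSmi(int tagged) => tagged >> 1;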

This should be enough background to explain why there is not that much difference between List<int> and Int64List for this particular benchmark:

  • everything is nice and monomorphic: the compiler knows the concrete implementations which flow into particular access sites, so it is able to specialize them well.
  • values contained in the list fit into the smi range, so boxing is not involved.
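
One hypothetical way to see boxing actually kick in with the benchmark above is to fill the list with values outside the smi range, for example:

// Hypothetical tweak to runBasic: 2^62 is just outside the smi range on
// 64-bit platforms (and far outside it with compressed pointers), so storing
// these values in a List<int> forces heap-allocated boxes.
final collection =
    List<int>.generate(n, (index) => 0x4000000000000000 + index);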

  1. Though I am intentionally omitting one subtle detail here: in reality Int64List has multiple implementations: an Int64List can contain its own elements, be a view into another typed list, or be a view into native memory (a so-called external typed list). You might notice that the code generated for int64List[i] changes depending on what the compiler knows about the concrete representation of int64List. If the compiler does not know anything, it will fall back on the most generic way of accessing typed lists, which is possible because they use a special uniform layout. But anyway, I digress. ↩︎

  2. I omit the difference between growable and non-growable lists here. In reality this by itself is a massive wrench which can make things significantly worse when working with List<T>: growable and non-growable lists don’t have a uniform data layout, and if the compiler does not know which list flows into a particular list[i] access, it will just generate a virtual call. Try benchmarking something like list = List.generate(..., growable: someRuntimeCondition) (a sketch follows after these footnotes). ↩︎

  3. Again, I am simplifying things. Branch density matters for some CPUs: you might not want an excessive number of branches too close together, even if they are never taken. ↩︎
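
A minimal sketch of the experiment footnote 2 suggests (hypothetical code, reusing _forLoopBasic from the original post): when growability depends on a runtime value, the compiler can no longer tell which list layout flows into the loop.

// Hypothetical: `growable` is only known at run time, so the compiler cannot
// specialize collection[i] inside _forLoopBasic for a single list layout.
final growable = DateTime.now().millisecondsSinceEpoch.isEven;
final collection =
    List<int>.generate(n, (index) => index, growable: growable);
_forLoopBasic(collection);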


Thanks! This is really helpful.

You’re right: when I tried benchmarking the more general List.generate(..., growable: someRuntimeCondition), it was significantly slower (e.g. 380ms vs 71ms for typed data; the times are different because I now run many iterations, as suggested, though my implementation is very naive at the moment). Anyway, this is what I was trying to understand, and it seems that Dart, when given a non-growable List<int>, can generate code that’s comparable in speed to using a chunk of contiguous data (because that’s what the List<int> becomes, if I understand correctly).

I still see List<int> (the non-growable kind) consistently beating Int64List on speed, even after adding more iterations. I know it shouldn’t, but it kind of bothers me, so I’ll try to make the harness a bit more robust, randomize things, etc.
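
For example, a more robust harness might look something like this (just a sketch, not the code I’m actually running): a few warm-up runs, then repeated timed runs, keeping the minimum.

// Sketch: warm up, then time repeated runs and report the minimum,
// which is least affected by measurement noise.
Duration measure(void Function() body, {int warmup = 3, int runs = 20}) {
  for (var i = 0; i < warmup; i++) {
    body();
  }
  var best = const Duration(days: 1);
  for (var i = 0; i < runs; i++) {
    final stopwatch = Stopwatch()..start();
    body();
    stopwatch.stop();
    if (stopwatch.elapsed < best) best = stopwatch.elapsed;
  }
  return best;
}

// Usage: measure(() => _forLoopBasic(collection));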

@MarcoB Thanks for the link and the idea with an explicit boxed int!

My next question is: where could one start to learn more? For example, when I encounter Smi or LoadIndexed or CheckStackOverflow in the generated comments, is there a place where I can search for these strings? I know I can search github.com/dart-lang/sdk as a whole, but maybe there’s a more targeted place, a subdirectory of the repo maybe? Just searching through everything gives a lot of results to weed through.

Now, don’t get me wrong, I love to use the “just ask Slava” approach but I don’t think it scales very well.

Also, if there are more general resources to look into or even books or university textbooks that are required or recommended reading, please let me know. At the moment, I’m lightly brushing up on asm and C++ but clearly I could do much more.


That is correct. The default implementation of List<T> uses contiguous storage. There are two main differences between the default List<T> implementations and typed data:

  1. default implementations of List<T> store references, which requires boxing for large integer values (doubles are always boxed in List<double>). Typed data always stores values unboxed.
  2. default implementations of List<T> use the full reference size for storing values, while with typed data you can select a smaller integer size (e.g. go as low as 1 byte per element with Int8List) if you know that your values fit into it.
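
As a quick illustration of the second point (a hypothetical snippet, separate from the benchmark above):

import 'dart:typed_data';

void main() {
  // A million elements: 1 byte each vs 8 bytes each.
  final narrow = Int8List(1000000);
  final wide = Int64List(1000000);
  print('${narrow.lengthInBytes} bytes vs ${wide.lengthInBytes} bytes');
  // Prints: 1000000 bytes vs 8000000 bytes
}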

I don’t see that when I simply repeat the benchmark in a loop:

  for (var i = 0; i < 100; i++) {
    final stopwatch = Stopwatch()..start();
    _forLoopTyped(collection);
    stopwatch.stop();
    print('${stopwatch.elapsedMicroseconds}');
  }

Though there might be all kinds of strange effects, including effects from loop header alignment, even though Apple’s official Apple Silicon optimization guide advises against aligning loops, last time I checked.

There are fewer instructions and fewer branches in the typed variant, and the core is pretty much the same code, so the performance difference you are seeing is some artifact of measurement.

A good place to start is to read the docs in the runtime/docs folder. Then try to read the code.

Compiler instruction classes live in il.h, so if you want to find CheckStackOverflow you look for class CheckStackOverflowInstr, etc.


Is the readme here the same as this blog? Based on a quick glance it seems to be the same, but I’m wondering if the readme has stuff the blog doesn’t go into, since it’s been updated much more recently.

The linked page is produced from markdown documents in the repo; the markdown documents are somewhat newer.