I’m trying to study Dart performance in more depth and I think the next step is to do what @mraleph does in blogposts like this one and talks like this one: look at assembly code generated from dart compile
.
Here’s my initial experimentation. I decided comparing iteration over a normal Dart List<int>
with iteration over a typed-data Int64List
. I assumed that the typed-data list would be as fast or faster than the generic Dart list. To my surprise, the generic Dart list seems to be faster. (Remember, this is a toy microbenchmark, and i_have_no_idea_what_im_doing.png, so this is likely wrong and even if it isn’t, it should inform any real-world engineering decisions.)
Here’s the code:
import 'dart:io';
import 'dart:typed_data';
void main(List<String> args) {
final n = int.parse(args[0]);
const variant = String.fromEnvironment('variant');
switch (variant) {
case 'typed':
runTyped(n);
case 'basic':
runBasic(n);
default:
stderr.writeln("Must provide variant to run during compilation.");
exit(1);
}
}
/// A trick from C++ benchmark libraries to prevent the compiler from
/// optimizing away an unused value.
@pragma('vm:never-inline')
void doNotOptimizeAway(int value) {
// This will never be true but the compiler can't prove it.
if (DateTime.timestamp().millisecondsSinceEpoch == 0) {
print(value);
}
}
void runBasic(int n) {
final collection = List<int>.generate(n, (index) => index);
final stopwatch = Stopwatch()..start();
_forLoopBasic(collection);
stopwatch.stop();
print('List.generate(): ${stopwatch.elapsedMicroseconds}us');
}
void runTyped(int n) {
final collection = Int64List(n);
for (int i = 0; i < n; i++) {
collection[i] = i;
}
final stopwatch = Stopwatch()..start();
_forLoopTyped(collection);
stopwatch.stop();
print('TypedData: ${stopwatch.elapsedMicroseconds}us');
}
@pragma('vm:never-inline')
void _forLoopBasic(List<int> collection) {
var result = 0;
for (int i = 0; i < collection.length; i++) {
result += collection[i];
}
doNotOptimizeAway(result);
}
@pragma('vm:never-inline')
void _forLoopTyped(Int64List collection) {
var result = 0;
for (int i = 0; i < collection.length; i++) {
result += collection[i];
}
doNotOptimizeAway(result);
}
Now, compiling an exe and running it with n=10,000,000 gives me about 6700—7600us for basic
, and 8847—13500us for typed
.
I compile with the following command (taken from Slava’s talk linked above):
dart compile exe --define=variant=basic \
--extra-gen-snapshot-options --disassemble-optimized \
--extra-gen-snapshot-options --code-comments \
--extra-gen-snapshot-options --print-flow-graph-filter=forLoopBasic \
-v bin/test_collection.dart
This gives me very similar output for both variants.
basic
Specializing Platform getters for target OS macos.
Generating AOT kernel dill.
Compiling /Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart to /Users/filiph/dev/benchmarkhor/example/bin/test_collection.exe using format Kind.exe:
Generating AOT snapshot. /Users/filiph/fvm/versions/stable/bin/cache/dart-sdk/bin/utils/gen_snapshot [--disassemble-optimized, --code-comments, --print-flow-graph-filter=forLoopBasic]
Code for optimized function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' (RegularFunction) {
;; B0
;; B1
;; Enter frame
0x1053b2b90 a9bf79fd stp fp, lr, [sp, #-16]!
0x1053b2b94 aa0f03fd mov fp, sp
;; CheckStackOverflow:8(stack=0, loop=0)
0x1053b2b98 f9401f50 ldr tmp, [thr, #56]
0x1053b2b9c eb1001ff cmp sp, tmp
0x1053b2ba0 54000369 bls 0x1053b2c0c
;; v21 <- LoadField:32(v2 T{_GrowableList} . GrowableObjectArray.length) [0, 576460752303423487] T{_Smi}
0x1053b2ba4 f840f020 ldr r0, [r1, #15]
;; v29 <- UnboxInt64:32([non-speculative], v21 T{_Smi}) [0, 576460752303423487] int64
0x1053b2ba8 9341fc02 asr r2, r0, #1
;; v15 <- LoadField:32(v2 . GrowableObjectArray.data) T{_List}
0x1053b2bac f8417020 ldr r0, [r1, #23]
;; ParallelMove r3 <- C, r1 <- C goto:32 B5
0x1053b2bb0 d2800003 movz r3, #0x0
0x1053b2bb4 d2800001 movz r1, #0x0
;; B5
;; Loop 0
;; Loop Header
;; CheckStackOverflow:36(stack=0, loop=1)
0x1053b2bb8 f9401f50 ldr tmp, [thr, #56]
0x1053b2bbc eb1001ff cmp sp, tmp
0x1053b2bc0 540002a9 bls 0x1053b2c14
;; Branch if RelationalOp(<, v5, v29 T{_Smi}) T{bool} goto (3, 4)
0x1053b2bc4 eb02003f cmp r1, r2
0x1053b2bc8 5400016a bge 0x1053b2bf4
;; B3
;; Loop 0
;; v34 <- LoadIndexed:26([_List] v15, v5 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
0x1053b2bcc 8b010c10 add tmp, r0, r1 lsl #3
0x1053b2bd0 f8417204 ldr r4, [tmp, #23]
;; v32 <- UnboxInt64([non-speculative], v34 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x1053b2bd4 9341fc85 asr r5, r4, #1
0x1053b2bd8 36000044 tbzw r4, #0, 0x1053b2be0
0x1053b2bdc f8407085 ldr r5, [r4, #7]
;; v10 <- BinaryInt64Op(+ [tr], v4, v32 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x1053b2be0 8b050064 add r4, r3, r5
;; v12 <- BinaryInt64Op(+ [tr], v5 T{int}, v35 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
0x1053b2be4 91000425 add r5, r1, #0x1
;; ParallelMove r3 <- r4, r1 <- r5 goto:34 B5
0x1053b2be8 aa0403e3 mov r3, r4
0x1053b2bec aa0503e1 mov r1, r5
0x1053b2bf0 17fffff2 b 0x1053b2bb8
;; B4
;; ParallelMove r1 <- r3
0x1053b2bf4 aa0303e1 mov r1, r3
;; StaticCall:38( doNotOptimizeAway<0> v4)
0x1053b2bf8 94000000 bl 0x1053b2bf8
;; ParallelMove r0 <- C
0x1053b2bfc aa1603e0 mov r0, nr
;; DartReturn:40(v0)
0x1053b2c00 aa1d03ef mov sp, fp
0x1053b2c04 a8c179fd ldp fp, lr, [sp], #16 !
0x1053b2c08 d65f03c0 ret
;; CheckStackOverflowSlowPath
0x1053b2c0c 94000000 bl 0x1053b2c0c
0x1053b2c10 17ffffe5 b 0x1053b2ba4
;; CheckStackOverflowSlowPath
0x1053b2c14 94000000 bl 0x1053b2c14
0x1053b2c18 17ffffeb b 0x1053b2bc4
}
Source positions for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' {
1053b2b90-1053b2bfb: Function '_forLoopBasic@13220660': static.@58
1053b2bfc-1053b2c0f: Function '_forLoopBasic@13220660': static.@52
1053b2c10-1053b2c17: Function '_forLoopBasic@13220660': static.@55
}
StackMaps for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' {
0x1053b2bfc:
0x1053b2c10: 000000000000000000010
0x1053b2c18: 000000000000000000001
}
Catch entry moves for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' {
}
Entry points for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopBasic@13220660' {
[code+0x07] 1053b2b90 kNormal
[code+0x0f] 1053b2b90 kMonomorphic
[code+0x17] 1053b2b90 kUnchecked
[code+0x1f] 1053b2b90 kMonomorphicUnchecked
}
Static call target functions {
0x1053b2bfc: file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::_doNotOptimizeAway, (pc-relative-call)
0x1053b2c10: [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub, (pc-relative-call)
0x1053b2c18: [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub, (pc-relative-call)
}
Generating executable.
Marking binary executable.
Generated: /Users/filiph/dev/benchmarkhor/example/bin/test_collection.exe
typed
Specializing Platform getters for target OS macos.
Generating AOT kernel dill.
Compiling /Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart to /Users/filiph/dev/benchmarkhor/example/bin/test_collection.exe using format Kind.exe:
Generating AOT snapshot. /Users/filiph/fvm/versions/stable/bin/cache/dart-sdk/bin/utils/gen_snapshot [--disassemble-optimized, --code-comments, --print-flow-graph-filter=forLoopTyped]
Code for optimized function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' (RegularFunction) {
;; B0
;; B1
;; Enter frame
0x105c32b40 a9bf79fd stp fp, lr, [sp, #-16]!
0x105c32b44 aa0f03fd mov fp, sp
;; CheckStackOverflow:8(stack=0, loop=0)
0x105c32b48 f9401f50 ldr tmp, [thr, #56]
0x105c32b4c eb1001ff cmp sp, tmp
0x105c32b50 540002e9 bls 0x105c32bac
;; v20 <- LoadField:32(v2 T{_Int64List} . TypedDataBase.length {final}) [0, 4611686018427387903] T{_Smi}
0x105c32b54 f840f020 ldr r0, [r1, #15]
;; v28 <- UnboxInt64:32([non-speculative], v20 T{_Smi}) [0, 4611686018427387903] int64
0x105c32b58 9341fc02 asr r2, r0, #1
;; ParallelMove r3 <- C, r0 <- C goto:32 B5
0x105c32b5c d2800003 movz r3, #0x0
0x105c32b60 d2800000 movz r0, #0x0
;; B5
;; Loop 0
;; Loop Header
;; CheckStackOverflow:36(stack=0, loop=1)
0x105c32b64 f9401f50 ldr tmp, [thr, #56]
0x105c32b68 eb1001ff cmp sp, tmp
0x105c32b6c 54000249 bls 0x105c32bb4
;; Branch if RelationalOp(<, v5, v28 T{_Smi}) T{bool} goto (3, 4)
0x105c32b70 eb02001f cmp r0, r2
0x105c32b74 5400010a bge 0x105c32b94
;; B3
;; Loop 0
;; v32 <- LoadIndexed:26([_Int64List] v2, v5 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x105c32b78 8b000c30 add tmp, r1, r0 lsl #3
0x105c32b7c f8417204 ldr r4, [tmp, #23]
;; v10 <- BinaryInt64Op(+ [tr], v4, v32 T{int}) [-9223372036854775808, 9223372036854775807] int64
0x105c32b80 8b040065 add r5, r3, r4
;; v12 <- BinaryInt64Op(+ [tr], v5 T{int}, v33 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
0x105c32b84 91000404 add r4, r0, #0x1
;; ParallelMove r3 <- r5, r0 <- r4 goto:34 B5
0x105c32b88 aa0503e3 mov r3, r5
0x105c32b8c aa0403e0 mov r0, r4
0x105c32b90 17fffff5 b 0x105c32b64
;; B4
;; ParallelMove r1 <- r3
0x105c32b94 aa0303e1 mov r1, r3
;; StaticCall:38( doNotOptimizeAway<0> v4)
0x105c32b98 94000000 bl 0x105c32b98
;; ParallelMove r0 <- C
0x105c32b9c aa1603e0 mov r0, nr
;; DartReturn:40(v0)
0x105c32ba0 aa1d03ef mov sp, fp
0x105c32ba4 a8c179fd ldp fp, lr, [sp], #16 !
0x105c32ba8 d65f03c0 ret
;; CheckStackOverflowSlowPath
0x105c32bac 94000000 bl 0x105c32bac
0x105c32bb0 17ffffe9 b 0x105c32b54
;; CheckStackOverflowSlowPath
0x105c32bb4 94000000 bl 0x105c32bb4
0x105c32bb8 17ffffee b 0x105c32b70
}
Source positions for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' {
105c32b40-105c32b9b: Function '_forLoopTyped@13220660': static.@67
105c32b9c-105c32baf: Function '_forLoopTyped@13220660': static.@61
105c32bb0-105c32bb7: Function '_forLoopTyped@13220660': static.@64
}
StackMaps for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' {
0x105c32b9c:
0x105c32bb0: 000000000000000000010
0x105c32bb8: 000000000000000000010
}
Catch entry moves for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' {
}
Entry points for function 'file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::__forLoopTyped@13220660' {
[code+0x07] 105c32b40 kNormal
[code+0x0f] 105c32b40 kMonomorphic
[code+0x17] 105c32b40 kUnchecked
[code+0x1f] 105c32b40 kMonomorphicUnchecked
}
Static call target functions {
0x105c32b9c: file:///Users/filiph/dev/benchmarkhor/example/bin/test_collection.dart_::_doNotOptimizeAway, (pc-relative-call)
0x105c32bb0: [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub, (pc-relative-call)
0x105c32bb8: [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub, (pc-relative-call)
}
Generating executable.
Marking binary executable.
Generated: /Users/filiph/dev/benchmarkhor/example/bin/test_collection.exe
I used an LLM (ChatGPT 4o) to help me compare the two outputs side by side. I didn’t actually fully audit the LLM’s output yet, so I have to take it with a grain of salt. But it seems correct.
Output 1 (basic) Output 2 (typed)
--------------------------------------------------------------------------------------------------------
Code for optimized function 'file:///..._::_forLoopBasic@13220660' Code for optimized function 'file:///..._::_forLoopTyped@13220660'
;; B0 ;; B0
;; B1 ;; B1
;; Enter frame ;; Enter frame
0x1053b2b90 a9bf79fd stp fp, lr, [sp, #-16]! 0x105c32b40 a9bf79fd stp fp, lr, [sp, #-16]!
0x1053b2b94 aa0f03fd mov fp, sp 0x105c32b44 aa0f03fd mov fp, sp
;; CheckStackOverflow:8(stack=0, loop=0) ;; CheckStackOverflow:8(stack=0, loop=0)
0x1053b2b98 f9401f50 ldr tmp, [thr, #56] 0x105c32b48 f9401f50 ldr tmp, [thr, #56]
0x1053b2b9c eb1001ff cmp sp, tmp 0x105c32b4c eb1001ff cmp sp, tmp
0x1053b2ba0 54000369 bls 0x1053b2c0c 0x105c32b50 540002e9 bls 0x105c32bac
;; v21 <- LoadField:32(v2 T{_GrowableList} . Growable... ;; v20 <- LoadField:32(v2 T{_Int64List} . TypedDataBase.len...
0x1053b2ba4 f840f020 ldr r0, [r1, #15] 0x105c32b54 f840f020 ldr r0, [r1, #15]
;; v29 <- UnboxInt64:32([non-speculative], v21 T{_Smi})... ;; v28 <- UnboxInt64:32([non-speculative], v20 T{_Smi})...
0x1053b2ba8 9341fc02 asr r2, r0, #1 0x105c32b58 9341fc02 asr r2, r0, #1
;; v15 <- LoadField:32(v2 . GrowableObjectArray.data) T{_List}
0x1053b2bac f8417020 ldr r0, [r1, #23]
;; ParallelMove r3 <- C, r1 <- C goto:32 B5 ;; ParallelMove r3 <- C, r0 <- C goto:32 B5
0x1053b2bb0 d2800003 movz r3, #0x0 0x105c32b5c d2800003 movz r3, #0x0
0x1053b2bb4 d2800001 movz r1, #0x0 0x105c32b60 d2800000 movz r0, #0x0
;; B5 ;; B5
;; Loop 0 ;; Loop 0
;; Loop Header ;; Loop Header
;; CheckStackOverflow:36(stack=0, loop=1) ;; CheckStackOverflow:36(stack=0, loop=1)
0x1053b2bb8 f9401f50 ldr tmp, [thr, #56] 0x105c32b64 f9401f50 ldr tmp, [thr, #56]
0x1053b2bbc eb1001ff cmp sp, tmp 0x105c32b68 eb1001ff cmp sp, tmp
0x1053b2bc0 540002a9 bls 0x1053b2c14 0x105c32b6c 54000249 bls 0x105c32bb4
;; Branch if RelationalOp(<, v5, v29 T{_Smi}) T{bool}... ;; Branch if RelationalOp(<, v5, v28 T{_Smi}) T{bool}...
0x1053b2bc4 eb02003f cmp r1, r2 0x105c32b70 eb02001f cmp r0, r2
0x1053b2bc8 5400016a bge 0x1053b2bf4 0x105c32b74 5400010a bge 0x105c32b94
;; B3 ;; B3
;; Loop 0 ;; Loop 0
;; v34 <- LoadIndexed:26([_List] v15, v5 T{int}) [-92233... ;; v32 <- LoadIndexed:26([_Int64List] v2, v5 T{int}) [-9...
0x1053b2bcc 8b010c10 add tmp, r0, r1 lsl #3 0x105c32b78 8b000c30 add tmp, r1, r0 lsl #3
0x1053b2bd0 f8417204 ldr r4, [tmp, #23] 0x105c32b7c f8417204 ldr r4, [tmp, #23]
;; v32 <- UnboxInt64([non-speculative], v34 T{int}) [-9...
0x1053b2bd4 9341fc85 asr r5, r4, #1
0x1053b2bd8 36000044 tbzw r4, #0, 0x1053b2be0
0x1053b2bdc f8407085 ldr r5, [r4, #7]
;; v10 <- BinaryInt64Op(+ [tr], v4, v32 T{int}) [-92233... ;; v10 <- BinaryInt64Op(+ [tr], v4, v32 T{int}) [-92233...
0x1053b2be0 8b050064 add r4, r3, r5 0x105c32b80 8b040065 add r5, r3, r4
;; v12 <- BinaryInt64Op(+ [tr], v5 T{int}, v35 T{_Smi})... ;; v12 <- BinaryInt64Op(+ [tr], v5 T{int}, v33 T{_Smi})...
0x1053b2be4 91000425 add r5, r1, #0x1 0x105c32b84 91000404 add r4, r0, #0x1
;; ParallelMove r3 <- r4, r1 <- r5 goto:34 B5 ;; ParallelMove r3 <- r5, r0 <- r4 goto:34 B5
0x1053b2be8 aa0403e3 mov r3, r4 0x105c32b88 aa0503e3 mov r3, r5
0x1053b2bec aa0503e1 mov r1, r5 0x105c32b8c aa0403e0 mov r0, r4
0x1053b2bf0 17fffff2 b 0x1053b2bb8 0x105c32b90 17fffff5 b 0x105c32b64
;; B4 ;; B4
;; ParallelMove r1 <- r3 ;; ParallelMove r1 <- r3
0x1053b2bf4 aa0303e1 mov r1, r3 0x105c32b94 aa0303e1 mov r1, r3
;; StaticCall:38( doNotOptimizeAway<0> v4) ;; StaticCall:38( doNotOptimizeAway<0> v4)
0x1053b2bf8 94000000 bl 0x1053b2bf8 0x105c32b98 94000000 bl 0x105c32b98
;; ParallelMove r0 <- C ;; ParallelMove r0 <- C
0x1053b2bfc aa1603e0 mov r0, nr 0x105c32b9c aa1603e0 mov r0, nr
;; DartReturn:40(v0) ;; DartReturn:40(v0)
0x1053b2c00 aa1d03ef mov sp, fp 0x105c32ba0 aa1d03ef mov sp, fp
0x1053b2c04 a8c179fd ldp fp, lr, [sp], #16 ! 0x105c32ba4 a8c179fd ldp fp, lr, [sp], #16 !
0x1053b2c08 d65f03c0 ret 0x105c32ba8 d65f03c0 ret
;; CheckStackOverflowSlowPath ;; CheckStackOverflowSlowPath
0x1053b2c0c 94000000 bl 0x1053b2c0c 0x105c32bac 94000000 bl 0x105c32bac
0x1053b2c10 17ffffe5 b 0x1053b2ba4 0x105c32bb0 17ffffe9 b 0x105c32b54
;; CheckStackOverflowSlowPath ;; CheckStackOverflowSlowPath
0x1053b2c14 94000000 bl 0x1053b2c14 0x105c32bb4 94000000 bl 0x105c32bb4
0x1053b2c18 17ffffeb b 0x1053b2bc4 0x105c32bb8 17ffffee b 0x105c32b70
}
It looks like the typed-data variant has slightly less code, and I don’t see any significant differences in the two assembly sources. In fact, the basic
variant seems to be unboxing during the loop while the typed
isn’t, which — if I understand correctly — should make basic
slower (as one would expect anyway). That said, the last time I did anything with assembly was decades ago and I barely know the basic operations.
If someone more knowledgeable than I (wink wink @mraleph) could give this a quick look-see, I’d immensely be grateful. I must assume I’m doing something incredibly stupid but I don’t know what.