The `input` and `output` arrays, used in the sink / source, are defined as extern. The source is reading from `input` and the sink is writing to `output`.
If we look at the asm code generated with `-Ofast` by armclang `AC6` for one iteration of the schedule, we get:
```txt
PUSH {r4-r6,lr}
MOVW r5,#0x220
MOVW r1,#0x620
MOVT r5,#0x3000
MOV r4,r0
MOVT r1,#0x3000
MOV r0,r5
MOV r2,#0x200
BL __aeabi_memcpy4 ; 0x10000a94
MOVW r6,#0x420
MOV r0,r5
MOVT r6,#0x3000
MOVS r2,#0x80
VMOV.F32 s0,#0.5
MOV r1,r6
BL arm_offset_f32 ; 0x10002cd0
MOV r0,#0x942c
MOV r1,r6
MOVT r0,#0x3000
MOV r2,#0x200
BL __aeabi_memcpy4 ; 0x10000a94
MOVS r1,#0
MOVS r0,#1
STR r1,[r4,#0]
POP {r4-r6,pc}
```
It is the code you would get if you wrote the call to the corresponding CMSIS-DSP function by hand. All the C++ templates have disappeared. The switch / case used to implement the scheduler has also been removed.
The code was generated with `memoryOptimization` enabled, and in this case the Python script detected that the FIFOs are used as arrays. As a consequence, there is no FIFO update code: the FIFOs are accessed as normal arrays.
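For comparison, here is a minimal sketch of what a hand-written version of one schedule iteration might look like, based only on the disassembly above (two 0x200-byte copies around one call processing 0x80 samples with an offset of 0.5). The names `scheduler_by_hand`, `buf0`, `buf1` and `kBlockSize` are hypothetical, and `offset_f32` is a host-side stand-in that assumes the same argument order as the CMSIS-DSP `arm_offset_f32` function. The `input` and `output` arrays are defined locally here so that the sketch links on its own, whereas the example declares them `extern`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// 0x80 samples per iteration, i.e. 0x200 bytes of float data (values read from the asm).
constexpr std::uint32_t kBlockSize = 128;

// Host-side stand-in (assumption) for the CMSIS-DSP arm_offset_f32 function:
// adds a constant offset to every sample of the block.
static void offset_f32(const float *src, float offset, float *dst, std::uint32_t blockSize)
{
    for (std::uint32_t i = 0; i < blockSize; ++i)
    {
        dst[i] = src[i] + offset;
    }
}

// Defined here so the sketch is self-contained; the example declares them extern.
float input[kBlockSize];
float output[kBlockSize];

// Hypothetical names for the two FIFOs, used as plain arrays.
static float buf0[kBlockSize];
static float buf1[kBlockSize];

// One iteration of the schedule written by hand: source, processing node, sink.
std::uint32_t scheduler_by_hand(int *error)
{
    assert(error != nullptr);
    std::memcpy(buf0, input, sizeof(buf0));    // source: input -> first buffer
    offset_f32(buf0, 0.5f, buf1, kBlockSize);  // processing node
    std::memcpy(output, buf1, sizeof(buf1));   // sink: second buffer -> output
    *error = 0;
    return 1;                                  // number of schedule iterations executed
}
```

The structure mirrors the disassembly: there is no FIFO bookkeeping and no dispatch loop, only two copies around a single function call.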
The end of the generated scheduler, which contains the switch / case dispatching to the nodes, is:
```C++
            switch(schedule[id])
            {
                case 0:
                {
                   cgStaticError = proc.run();
                }
                break;
                case 1:
                {
                   cgStaticError = sink.run();
                }
                break;
                case 2:
                {
                   cgStaticError = source.run();
                }
                break;
                default:
                break;
            }
            CG_AFTER_NODE_EXECUTION;
            CHECKERROR;
        }
       debugCounter--;
       CG_AFTER_ITERATION;
       nbSchedule++;
    }
errorHandling:
    CG_AFTER_SCHEDULE;
    *error=cgStaticError;
    return(nbSchedule);
}
```
The generated code is as efficient as something manually coded.
The source and the sink have each been replaced by a `memcpy`. Calling the CMSIS-DSP function only requires loading the registers and branching to it.
The `input` buffer is at address `0x30000620`.
The `output` buffer is at address `0x3000942c`.
We can see this address being loaded in the disassembly (produced by the uVision IDE for a Cortex-M7 build with armclang `AC6.19`):
```
MOV r0,#0x942c
...
MOVT r0,#0x3000
```
just before the second `memcpy` call.
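The way the two instructions build the 32-bit address can be checked with a short sketch. It models only the register effect of the instruction pair: the first instruction loads a zero-extended 16-bit immediate, and `MOVT` overwrites the upper 16 bits while leaving the lower half unchanged. The helper names `movw` and `movt` and the constants `outputAddr` and `inputAddr` are hypothetical and used for illustration only.

```cpp
#include <cstdint>

// Effect of MOVW (shown as MOV with a 16-bit immediate in the disassembly):
// the register receives the zero-extended immediate.
constexpr std::uint32_t movw(std::uint16_t imm16)
{
    return imm16;
}

// Effect of MOVT: the upper halfword is replaced, the lower halfword is preserved.
constexpr std::uint32_t movt(std::uint32_t reg, std::uint16_t imm16)
{
    return (reg & 0x0000FFFFu) | (static_cast<std::uint32_t>(imm16) << 16);
}

// MOV r0,#0x942c followed by MOVT r0,#0x3000
constexpr std::uint32_t outputAddr = movt(movw(0x942c), 0x3000);

// MOVW r1,#0x620 followed by MOVT r1,#0x3000
constexpr std::uint32_t inputAddr = movt(movw(0x0620), 0x3000);

static_assert(outputAddr == 0x3000942cu, "output buffer address");
static_assert(inputAddr == 0x30000620u, "input buffer address");
```

Both assertions are evaluated at compile time, so the sketch builds only if the composed values match the buffer addresses quoted above.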
The result is not always as ideal as in this example, but it demonstrates that combining C++ templates with a Python code generator enables a low-overhead solution to the problem of streaming and compute graphs.