Improved compute graph documentation

More details about the new dynamic mode.
pull/94/head
Christophe Favergeon 3 years ago
parent fc38d7d21f
commit 1378fa94f5

@ -0,0 +1,46 @@
# Cyclo-static scheduling
Beginning with version `1.7.0` of the Python wrapper and version >= `1.12` of CMSIS-DSP, cyclo-static scheduling has been added.
## What problem is it trying to solve?
Let's consider a sample rate converter from 48 kHz to 44.1 kHz.
For each input sample, on average it produces 44.1 / 48 = 0.91875 samples.
There are two ways to do this:
- One can observe that 48000/44100 = 160/147. So each time 160 samples are consumed, 147 samples are produced
- The number of samples produced can vary from one execution of the node to the next so that, on average, 0.91875 samples are generated per execution
In the first case, the graph is synchronous but you need to wait for 160 input samples before any processing can take place. This introduces latency and, depending on the sample rate and use case, this latency may be too big. We need more flexibility.
In the second case, we have the flexibility but it is no longer synchronous because the resampler is not producing the same amount of samples at each execution.
But we can observe that even if it is no longer stationary, it is periodic: after consuming 160 samples, the behavior should repeat.
One can use the resampler from the [SpeexDSP](https://gitlab.xiph.org/xiph/speexdsp) project to test this. If we decide to consume only 40 input samples to have less latency, then the SpeexDSP resampler will produce 37, 37, 37 and 36 samples for the first 4 executions.
And (40+40+40+40)/(37+37+37+36) = 160 / 147.
So the flow of data on the output is not static but it is periodic.
This is now supported in the CMSIS-DSP compute graph and on each IO one can define a period. For this example, it could be:
```python
b=Sampler("sampler",floatType,40,[37,37,37,36])
```
Note that in the C++ class, the template parameters giving the number of samples produced or consumed on an IO can no longer be used in this case. The value is still generated but now represents the maximum over a period.
And, in the run function, you need to pass the number of samples read or written to the read and write buffer functions:
```c
this->getWriteBuffer(nbOfSamplesForCurrentExecution)
```
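As an illustration, here is a minimal sketch (not the generated code) of how a cyclo-static source could track its position in the period. It assumes the `GenericSource` / `FIFOBase` classes shown in the examples of the main documentation; the `Sampler` period handling itself is purely illustrative:
```C++
template<typename OUT, int outputSize> // outputSize is the maximum over the period (37)
class Sampler: public GenericSource<OUT,outputSize>
{
public:
    Sampler(FIFOBase<OUT> &dst):GenericSource<OUT,outputSize>(dst),mPhase(0){};

    int run() override
    {
        static const int period[4] = {37, 37, 37, 36};
        const int nb = period[mPhase];     // samples produced by this execution
        mPhase = (mPhase + 1) % 4;         // advance inside the period
        OUT *b = this->getWriteBuffer(nb); // pass the actual number of samples
        // ... generate nb samples into b ...
        return(0);
    };
private:
    int mPhase;
};
```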
For synchronous nodes, nothing changes and they are coded as in the previous versions.
The drawback of cyclo-static scheduling is that the schedule length is increased. If we take the first example with a source producing 5 samples and a node consuming 7 samples, and if the source is replaced by another source producing [5,5], then it is not equivalent. In the first case we can have a single execution of the source. In the second case, the schedule will always contain an even number of executions of the source. So the schedule length will be bigger. But memory usage will be the same (FIFOs of the same size).
Since schedules tend to be bigger with cyclo-static scheduling, a new code generation mode has been introduced and is enabled by default: instead of a sequence of function calls, the schedule is encoded as an array of numbers and a switch / case selects the function to be called. A sketch of this idea is shown below.
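Here is a hedged sketch of this mode (illustrative names and node IDs, not the exact generated code): the schedule is an array of node identifiers dispatched with a switch / case.
```C++
#include <cstddef>
#include <cstdint>

// Each entry identifies the node to run at this step of the schedule.
static const uint8_t schedule[] = {0, 0, 1, 2, 0, 1, 2};

void runOneScheduleIteration()
{
    for (std::size_t id = 0; id < sizeof(schedule); id++)
    {
        switch (schedule[id])
        {
            case 0: /* run the source node */ break;
            case 1: /* run the filter node */ break;
            case 2: /* run the sink node */   break;
            default: break;
        }
    }
}
```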

@ -0,0 +1,147 @@
# Dynamic Data Flow
Versions of the compute graph corresponding to CMSIS-DSP version >= `1.14.3` and Python wrapper version >= `1.10.0` support a new dynamic / asynchronous mode.
With a dynamic flow, the flow of data is potentially changing at each execution. The IOs can generate or consume a different amount of data at each execution of their node (including no data).
This can be useful for sample oriented use cases where not all samples are available but a processing must nevertheless take place each time a subset of samples is available (samples could come from sensors).
With a dynamic flow and scheduling, there is no longer any way to ensure that there won't be FIFO underflow or overflow due to scheduling. As a consequence, the nodes must be able to check for this problem and decide what to do.
* A sink may decide to generate fake data in case of FIFO underflow
* A source may decide to skip some data in case of FIFO overflow
* Another node may decide to do nothing and skip the execution
* Another node may decide to raise an error.
With dynamic scheduling, a node must implement the function `prepareForRunning` and decide what to do.
3 error / status codes are reserved for this. They are defined in the header `cg_status.h`. This header is not included by default, but if you define your own error codes, they should be coherent with `cg_status.h` and use the same values for the 3 status / error codes which are used in dynamic mode:
* `CG_SUCCESS` = 0 : Node can execute
* `CG_SKIP_EXECUTION` = -5 : Node will skip the execution
* `CG_BUFFER_ERROR` = -6 : Unrecoverable error due to FIFO underflow / overflow (only raised for pure functions, like CMSIS-DSP ones called directly)
Any other returned value will stop the execution.
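For illustration, a minimal sketch of a `prepareForRunning` using those codes inside a node class (it assumes the `willUnderflow` helper described in this document):
```C++
// Sketch only, inside your node class:
int prepareForRunning() override
{
    if (this->willUnderflow())
    {
        return(CG_SKIP_EXECUTION); // the node will skip this execution
    }
    return(CG_SUCCESS);            // the node can execute
}
```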
The dynamic mode (also named asynchronous) is enabled with the option `asynchronous`.
The system will still compute a scheduling and FIFO sizes as if the flow was static. We can see the static flow as an average of the dynamic flow. In dynamic mode, the FIFOs may need to be bigger than the ones computed in static mode. The static estimation is giving a first idea of what the size of the FIFOs should be. The size can be increased by specifying a percent increase with option `FIFOIncrease`.
For pure compute functions (like CMSIS-DSP ones), which are not packaged into a C++ class, there is no way to customize the decision logic in case of a problem with FIFO. There is a global option : `asyncDefaultSkip`.
When `true`, a pure function that cannot run will just skip the execution. With `false`, the execution will stop. For any other decision algorithm, the pure function needs to be packaged in a C++ class.
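As a sketch of what such a packaging could look like, following the `GenericNode` pattern used by the examples in this documentation (the `Scaler` class and the choice of `arm_scale_f32` are illustrative assumptions, and the policy here stops instead of skipping):
```C++
template<int inputSize,int outputSize>
class Scaler: public GenericNode<float32_t,inputSize,float32_t,outputSize>
{
public:
    Scaler(FIFOBase<float32_t> &src,FIFOBase<float32_t> &dst):
        GenericNode<float32_t,inputSize,float32_t,outputSize>(src,dst){};

    int prepareForRunning() override
    {
        if (this->willUnderflow() || this->willOverflow())
        {
            return(CG_BUFFER_ERROR); // custom policy: stop instead of skipping
        }
        return(CG_SUCCESS);
    };

    int run() override
    {
        float32_t *a=this->getReadBuffer();
        float32_t *b=this->getWriteBuffer();
        arm_scale_f32(a,0.5f,b,inputSize); // the wrapped pure function
        return(CG_SUCCESS);
    };
};
```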
`Duplicate` nodes are skipping the execution in case of problems with FIFOs. If this is not the wanted behavior, you can either:
* Replace the Duplicate class by a custom one by changing the class name with option `duplicateNodeClassName` on the graph.
* Don't use the automatic duplication feature and introduce your duplicate nodes in the compute graph
When you don't want to generate or consume data in a node, just don't call the functions `getReadBuffer` or `getWriteBuffer` for your IOs.
## prepareForRunning
The method `prepareForRunning` is needed to check if the node execution is going to be possible.
Inside this function, you have access to methods like:
* `willOverflow`
* `willUnderflow`
In case of several IOs, you may also have:
* `willOverflow1`
* `willOverflow2`
etc ...
The functions have an interface like:
```C++
bool willOverflow(int nb = outputSize)
```
or
```C++
bool willUnderflow(int nb = inputSize)
```
The `inputSize` and `outputSize` are coming from the template arguments. So, by default the node is using the parameters of the static compute graph.
You may want to read or write more or less than what is defined in the static compute graph. But it must be coherent with the `run` function.
If you use `willOverflow(4)` to check if you can write `4` samples to the output in the `prepareForRunning` function, then in the `run` function you must access the write buffer by requesting `4` samples with `getWriteBuffer(4)`.
If you don't want to write or read on an IO, just don't use the function `getWriteBuffer` and `getReadBuffer` in the `run` function.
It is also possible to use the functions `willOverflow`, `willUnderflow` in the `run` function. It can be used to avoid calling the `getReadBuffer` and `getWriteBuffer` when you nevertheless want to run the node although some FIFOs cannot be used.
**WARNING**: You are responsible for checking if a FIFO is going to underflow or overflow **before** using `getReadBuffer` or `getWriteBuffer`.
If `getReadBuffer` or `getWriteBuffer` causes an underflow or overflow of the FIFO, you'll get memory corruption and the compute graph will no longer work.
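A sketch of a coherent pair of functions, where the same sample count (here `4`) is used both for the check and for the buffer access:
```C++
// Sketch only: the count checked in prepareForRunning must match
// the count requested in run.
int prepareForRunning() override
{
    if (this->willOverflow(4))        // can we write 4 samples?
    {
        return(CG_SKIP_EXECUTION);
    }
    return(CG_SUCCESS);
}

int run() override
{
    OUT *b = this->getWriteBuffer(4); // request exactly the 4 checked samples
    // ... write 4 samples to b ...
    return(CG_SUCCESS);
}
```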
## Graph constraints
The dynamic / asynchronous mode is using a synchronous graph as the average / ideal case. But it is important to understand that we are no longer in static / synchronous mode and some static graphs may be too complex for the dynamic mode. Let's take the following graph as an example:
![async_topological2](documentation/async_topological2.png)
The generated schedule is:
```
src
src
src
src
src
filter
sink
sink
sink
sink
sink
```
If we use a strategy of skipping the execution of a node in case of overflow / underflow, what will happen is:
* Schedule execution 1
* First `src` node execution is successful since there is a sample
* All other execution attempts will be skipped
* Schedule execution 2
* First `src` node execution is successful since there is a sample
* All other execution attempts will be skipped
* ...
* Schedule execution 5:
* First `src` node execution is successful since there is a sample
* 4 other `src` node executions are skipped
* The `filter` execution can finally take place since enough data has been generated
In summary, it is useless in asynchronous mode to attempt to run the same node several times in the same scheduling iteration, except if we are sure there will always be enough data. In the previous example, we see that only the first attempt at running `src` is doing something. Other attempts are always skipped.
Instead, one could try the following graph:
![async_topological1](documentation/async_topological1.png)
With this graph, each node execution will be attempted only once during an execution.
But the `filter` needs 5 samples, so we need to increase the size of the FIFOs from `1` to `5` or the `filter` node will never be executed.
It is possible with the option `FIFOIncrease` but it is better to make it explicit with the following graph:
![async_topological3](documentation/async_topological3.png)
In this case, the FIFO is big enough. `src` node will be executed each time there is a sample. `filter` will execute only when 5 samples have been accumulated in the FIFO. Each node execution is only attempted once during a schedule.
As a consequence, the recommendation in dynamic / asynchronous mode is to:
* Ensure that the amount of data produced and consumed on each FIFO end is the same (so that each node execution is attempted only once during a schedule)
* Use the maximum amount of samples required on both ends of the FIFO
  * Here `src` is generating at most `1` sample and `filter` needs `5`. So we use `5` on both ends of the FIFO
* More complex graphs will create useless overhead in dynamic / asynchronous mode

@ -21,15 +21,13 @@ The following matrix `M` is created from the previous graph. The first column re
![math-matrix1](documentation/math-matrix1.png)
The first row means that an execution of the filter is consuming 7 samples on the first edge and an execution of the source is producing 5 samples. The sink is not connected to the first edge so the value is 0.
If a node is run `nb` times, then the matrix can be used to compute the state of the edges after these executions.
A vector `s` can be used to represent how many times each node is executed. Then `M . s` is the amount of data produced / consumed on each edge.
If `f` is the state of the edges (amount of data on each edge) then, after execution of the nodes as described with `s`, we have:
`f' = M . s + f`
@ -37,13 +35,23 @@ where `f'` is the new state after the execution of the nodes.
If we want to find a scheduling of this graph allowing to stream samples from the source to the sink, then a periodic solution must be found. It is equivalent to finding a solution of:
`M . s = 0`
The theory shows that if the graph is schedulable, the space of solutions has dimension 1. So we can find a solution with minimal integer values for the coefficients by just scaling any solution:
* Converting the solution (which may be rational) to integers
* Using the greatest common divisor to find the smallest solution
In the above example, we find the scheduling vector `s = {5,5,7}`.
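As a worked check (the column order filter, sink, source and the sign convention — production positive, consumption negative — are assumptions consistent with the description above):
```LaTeX
M = \begin{pmatrix} -7 & 0 & 5 \\ 5 & -5 & 0 \end{pmatrix},
\qquad
s = \begin{pmatrix} 5 \\ 5 \\ 7 \end{pmatrix},
\qquad
M \cdot s = \begin{pmatrix} -7 \cdot 5 + 5 \cdot 7 \\ 5 \cdot 5 - 5 \cdot 5 \end{pmatrix}
          = \begin{pmatrix} 0 \\ 0 \end{pmatrix}
```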
Once we know how many times each node must be executed, we can try to find a schedule minimizing the memory usage. The algorithm computes a topological sort of the graph and starts from the sinks. A node is scheduled if it has enough data on its edges: a normalized measure is used on each edge. The amount of data is not used directly but is normalized by the amount of data read or produced by the node in a given execution. The idea is to run the node as soon as enough data is available to make the execution of the node possible:
For instance, the 2 following cases are equivalent for the algorithm:
* A FIFO containing 128 samples and connected to a node consuming 128 samples
* A FIFO containing 1 sample and connected to a node consuming 1 sample
The algorithm is considering those 2 FIFOs as filled in the same way.
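One way to express this normalized measure on an edge `e` feeding a node `n` (an assumption consistent with the two equivalent cases above):
```LaTeX
\text{fill}(e) = \frac{\text{number of samples available on } e}{\text{number of samples consumed from } e \text{ by one execution of } n}
```
Both FIFOs above have a normalized fill of 1, so the algorithm treats them identically.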
The graph is structured in layers: nodes are in the same layer if their distance to the sinks is the same.
@ -67,5 +75,5 @@ So we can reuse the previous theory if we assume that each node execution is in
Once we have computed the matrix and the scheduling solution, the details of the schedule are computed using a different granularity : the cycles are no more considered as a whole but instead each execution step inside each cycle is used.
As a consequence, the effect of the cyclo-static scheduling is just to increase the length of the final scheduling sequence, since each node will have to be executed a number of times which is constrained by the least common multiple of the periods of the connected nodes.

@ -4,7 +4,7 @@
Embedded systems are often used to implement streaming solutions: the software is processing and / or generating streams of samples. The software is made of components that have no concept of streams: they are working with buffers. As a consequence, implementing a streaming solution forces the developer to think about scheduling questions, FIFO sizing, etc.
The CMSIS-DSP compute graph is a **low overhead** solution to this problem: it makes it easier to build streaming solutions by connecting components and computing a scheduling at **build time**. The use of C++ templates also enables the compiler to have more information about the components for better code generation.
A dataflow graph is a representation of how compute blocks are connected to implement a streaming processing.
@ -16,8 +16,6 @@ Here is an example with 3 nodes:
Each node is producing and consuming some amount of samples. For instance, the source node is producing 5 samples each time it is run. The filter node is consuming 7 samples each time it is run.
The FIFO lengths are represented on each edge of the graph: 11 samples for the leftmost FIFO and 5 for the other one.
In blue, the amount of samples generated or consumed by a node each time it is called.
@ -28,14 +26,12 @@ When the processing is applied to a stream of samples then the problem to solve
> **how the blocks must be scheduled and the FIFOs connecting the block dimensioned**
The general problem can be very difficult. But, if some constraints are applied to the graph, then some algorithms can compute a static schedule at build time.
When the following constraints are satisfied, we say we have a Synchronous / Static Dataflow Graph:
- Static graph : graph topology is not changing
- Each node is always consuming and producing the same number of samples (static flow)
The CMSIS-DSP Compute Graph Tools are a set of Python scripts and C++ classes with the following features:
@ -49,7 +45,7 @@ The CMSIS-DSP Compute Graph Tools are a set of Python scripts and C++ classes wi
- The Python script will generate a C++ implementation of the static schedule
- The Python script can also generate a Python implementation of the static schedule (for use with the CMSIS-DSP Python wrapper)
There is no FIFO underflow or overflow due to the scheduling. If there are not enough cycles to run the processing, the real-time constraint will be broken and the solution won't work. But this problem is independent from the scheduling itself.
## Why it is useful
@ -73,27 +69,34 @@ The periodic schedule generated for this graph has a length of 19. It is big for
The schedule is (the sizes of the FIFOs after the execution of each node are displayed in brackets):
```
source [ 5 0]
source [10 0]
filter [ 3 5]
sink [ 3 0]
source [ 8 0]
filter [ 1 5]
sink [ 1 0]
source [ 6 0]
source [11 0]
filter [ 4 5]
sink [ 4 0]
source [ 9 0]
filter [ 2 5]
sink [ 2 0]
source [ 7 0]
filter [ 0 5]
sink [ 0 0]
```
At the end, both FIFOs are empty so the schedule can be run again: it is periodic!
The compute graph is focusing on the synchronous / static case but some extensions have been introduced for more flexibility:
* A [cyclo-static scheduling](CycloStatic.md) (nearly static)
* A [dynamic/asynchronous](Dynamic.md) mode
Here is a summary of the different configurations supported by the compute graph. The cyclo-static scheduling is part of the static flow mode.
![supported_configs](documentation/supported_configs.png)
@ -117,7 +120,7 @@ In this file, you can describe new type of blocks that you need in the compute g
Finally, you can execute `graph.py` to generate the C++ files.
The generated files need to include `ComputeGraph/cg/src/GenericNodes.h` and the nodes used in the graph, which can be found in `cg/nodes/cpp`. Those headers are part of the CMSIS-DSP Pack. They are optional so you'll need to select the compute graph extension in the pack.
If you have declared new nodes in `graph.py` then you'll need to provide an implementation.
@ -127,6 +130,7 @@ More details and explanations can be found in the documentation for the examples
* [Example 2 : More complex example with delay and CMSIS-DSP](documentation/example2.md)
* [Example 3 : Working example with CMSIS-DSP and FFT](documentation/example3.md)
* [Example 4 : Same as example 3 but with the CMSIS-DSP Python wrapper](documentation/example4.md)
* [Example 10 : The asynchronous mode](documentation/example10.md)
Examples 5 and 6 are showing how to use the CMSIS-DSP MFCC with a synchronous data flow.
@ -134,108 +138,30 @@ Example 7 is communicating with OpenModelica. The Modelica model (PythonTest) in
Example 8 is showing how to define a new custom datatype for the IOs of the nodes. Example 8 is also demonstrating a new feature where an IO can be connected up to 3 inputs and the static scheduler will automatically generate duplicate nodes.
## Frequently asked questions:
There is a [FAQ](FAQ.md) document.
## Options
Several options can be used in the Python script to control the schedule generation. Some options are used by the scheduling algorithm and other options are used by the code generators or the graphviz generator:
### Options for the graph
Those options need to be used on the graph object created with `Graph()`. For instance:
```python
g = Graph()
g.defaultFIFOClass = "FIFO"
```
#### defaultFIFOClass (default = "FIFO")
Class used for FIFOs by default. It can also be customized for each connection (`connect` or `connectWithDelay` call) with something like:
`g.connect(src.o,b.i,fifoClass="FIFOClassNameForThisConnection")`
#### duplicateNodeClassName (default = "Duplicate")
@ -243,6 +169,16 @@ Prefix used to generate the duplicate node classes like `Duplicate2`, `Duplicate
### Options for the scheduling
Those options need to be used on a configuration object passed as an argument of the scheduling function. For instance:
```python
conf = Configuration()
conf.debugLimit = 10
sched = g.computeSchedule(config = conf)
```
Note that the configuration object also contains options for the code generators.
#### memoryOptimization (default = False)
When the amount of data written to a FIFO and read from the FIFO is the same, the FIFO is just an array. In this case, depending on the scheduling, the memory used by different arrays may be reused if those arrays are not needed at the same time.
@ -253,7 +189,7 @@ This option is enabling an analysis to optimize the memory usage by merging some
Try to prioritize the scheduling of the sinks to minimize the latency between sources and sinks.
When this option is enabled, the tool may not be able to find a schedule in all cases. If it can't find a schedule, it will raise a `DeadLock` exception.
#### displayFIFOSizes (default = False)
@ -273,7 +209,7 @@ When `debugLimit` is > 0, the number of iterations of the scheduling is limited
When true, generate some code to dump the FIFO content at runtime. Only useful for debug.
In C++ code generation, it is only available when using the mode `codeArray == False`.
When this mode is enabled, the first line of the scheduler file is :
@ -287,25 +223,23 @@ Name of the scheduler function used in the generated code.
#### prefix (default = "")
Prefix to add before the FIFO buffer definitions. Those buffers are not static and are global. If you want to use several schedulers in your code, the buffer names used by each should be different.
Another possibility would be to make the buffers static by redefining the macro `CG_BEFORE_BUFFER`.
#### Options for C Code Generation only
##### cOptionalArgs (default = "")
Optional arguments to pass to the C API of the scheduler function
##### codeArray (default = True)
When true, the scheduling is defined as an array. Otherwise, a list of function calls is generated.
A list of function calls may be easier to read but if the schedule is long, it is not good for code size. In that case, it is better to encode the schedule as an array rather than a list of functions.
When `codeArray` is `True`, the option `switchCase` can also be used.
##### switchCase (default = True)
@ -315,7 +249,11 @@ When the schedule is encoded as an array, it can either be an array of function
##### eventRecorder (default = False)
Enable the generation of `CMSIS EventRecorder` instrumentation in the code. The CMSIS-DSP Pack provides definitions of 3 events:
* Schedule iteration
* Node execution
* Error
##### customCName (default = "custom.h")
@ -357,9 +295,11 @@ This implies `codeArray` and `switchCase`. This disables `memoryOptimizations`.
Synchronous FIFOs that are just buffers will be considered as FIFOs in asynchronous mode.
More information is available in the documentation for [this mode](Dynamic.md).
##### FIFOIncrease (default 0)
In case of dynamic / asynchronous scheduling, the FIFOs may need to be bigger than what is computed assuming a static / synchronous scheduling. This option is used to increase the FIFO size. It represents a percent increase.
For instance, a value of 10 means the FIFOs will have their size updated from `oldSize` to `1.1 * oldSize`, which is `(1 + 10%) * oldSize`.
@ -369,8 +309,6 @@ Behavior of a pure function (like CMSIS-DSP) in asynchronous mode. When `True`,
If another error recovery is needed, the function must be packaged into a C++ class implementing a `prepareForRunning` function.
#### Options for Python code generation only
##### pyOptionalArgs (default = "")
@ -441,33 +379,19 @@ It will generate the C++ files for the schedule and a pdf representation of the
Note that the Python code relies on the CMSIS-DSP PythonWrapper, which now also contains the Python scripts for the Synchronous Data Flow.
To build the C examples:
* CMSIS-DSP must be built,
* the .cpp file contained in the example must be built
* the include folder `cg/static/src` must be added
For `example3` which is using an input file, `cmake` should have copied the input test pattern `input_example3.txt` inside the build folder. The output file will also be generated in the build folder.
`example4` is like `example3` but in pure Python and using the CMSIS-DSP Python wrapper (which must already be installed before trying the example). To run a Python example, you need to go into an example folder and type:
```bash
python graph.py
python main.py
```
The first line is generating the schedule in Python. The second line is executing the schedule.
`example7` is communicating with `OpenModelica`. You need to install the VHTModelica blocks from the [VHT-SystemModeling](https://github.com/ARM-software/VHT-SystemModeling) project on our GitHub
## Limitations
It is a first version and there are lots of limitations and probably bugs:
- The code generation is using [Jinja](https://jinja.palletsprojects.com/en/3.0.x/) templates in `cg/static/templates`. They must be cleaned to be more readable. You can modify the templates according to your needs ;
- CMSIS-DSP integration must be improved to make it easier
- Some optimizations are missing
- Some checks are missing : for instance you can connect several nodes to the same io port. An io port must be connected to only one other io port, and this is not checked by the script.
- The code is requiring a lot more comments and cleaning
- A C version of the code generator is missing
- The code generation could provide more flexibility for memory allocation with a choice between:
@ -487,13 +411,13 @@ Here is a list of the nodes supported by default. More can be easily added:
- The name must not contain the prefix `arm` nor the type suffix
- For instance, use `Dsp("mult",CType(F32),NBSAMPLES)` to use `arm_mult_f32`
- Other CMSIS-DSP function (with an instance variable) are requiring the creation of a Node if it is not already provided
- CFFT / ICFFT : Use of CMSIS-DSP CFFT. Currently only F32, F16 and Q15
- Zip / Unzip : To zip / unzip streams
- ToComplex : Map a real stream onto a complex stream
- ToReal : Extract real part of a complex stream
- FileSource and FileSink : Read/write float to/from a file (Host only)
- NullSink : Do nothing. Useful for debug
- InterleavedStereoToMono : Interleaved stereo converted to mono with scaling to avoid saturation of the addition
- Python only nodes:
- WavSink and WavSource to use wav files for testing
- VHTSDF : To communicate with OpenModelica using VHTModelica blocks


@ -179,7 +179,6 @@ There are other fields for the configuration:
- `cOptionalArgs` and `pyOptionalArgs` for passing additional arguments to the scheduling function
- `prefix` to prefix the names of the global buffers
- `memoryOptimization` : Experimental. It is attempting to reuse buffer memory and share it between several FIFOs
- `codeArray` : Experimental. When a schedule is very long, representing it as a sequence of function calls is not good for the code size of the generated solution. When this option is enabled, the schedule is described with an array. It implies that the pure function calls cannot be inlined any more and are replaced by new nodes which are automatically generated.
- `eventRecorder` : Enable the support for the CMSIS Event Recorder.
@ -254,7 +253,17 @@ class Sink: public GenericSink<IN, inputSize>
public:
    Sink(FIFOBase<IN> &src):GenericSink<IN,inputSize>(src){};

    int prepareForRunning() override
    {
        if (this->willUnderflow())
        {
            return(CG_SKIP_EXECUTION_ID_CODE); // Skip execution
        }
        return(0);
    };

    int run() override
    {
        IN *b=this->getReadBuffer();
        printf("Sink\n");
@ -272,11 +281,13 @@ public:
The `Sink` is inheriting from the `GenericSink`. In the constructor we pass the FIFOs: input FIFOs first, followed by the output FIFOs when they are used (for a sink, we have no output FIFOs).
In the template parameters, we pass the type/length for each io: inputs first, followed by outputs (when there are some outputs).
The node must have a `run` function which is implementing the processing.
The `prepareForRunning` function is used only in dynamic / asynchronous mode. But it must be defined (even if not used) in static / synchronous mode or the code won't build.
Here the sink is just dumping to stdout the content of the buffer. The amount of data read by `getReadBuffer` is defined in the `GenericSink` and is coming from the template parameter.
The `Source` definition is very similar:
@ -287,7 +298,18 @@ class Source: GenericSource<OUT,outputSize>
public:
    Source(FIFOBase<OUT> &dst):GenericSource<OUT,outputSize>(dst),mCounter(0){};

    int prepareForRunning() override
    {
        if (this->willOverflow())
        {
            return(CG_SKIP_EXECUTION_ID_CODE); // Skip execution
        }
        return(0);
    };

    int run() override
    {
        OUT *b=this->getWriteBuffer();
        printf("Source\n");
@ -318,7 +340,20 @@ class ProcessingNode: public GenericNode<IN,inputSize,OUT,outputSize>
public:
    ProcessingNode(FIFOBase<IN> &src,FIFOBase<OUT> &dst,int,const char*,int):GenericNode<IN,inputSize,OUT,outputSize>(src,dst){};

    int prepareForRunning() override
    {
        if (this->willOverflow() ||
            this->willUnderflow())
        {
            return(CG_SKIP_EXECUTION_ID_CODE); // Skip execution
        }
        return(0);
    };

    int run() override
    {
        printf("ProcessingNode\n");
        IN *a=this->getReadBuffer();
        OUT *b=this->getWriteBuffer();
@ -331,7 +366,7 @@ public:
The processing node is (very arbitrarily) copying the value at index 3 of the input to index 0 of the output.
The processing node is taking 3 arguments after the FIFOs in the constructor because the Python script is defining 3 additional arguments for this node: an `int`, a `string` and another `int` passed through a variable in the scheduler.
### scheduler.cpp
@ -377,6 +412,8 @@ A value `<0` in `error` means there was an error during the execution.
The returned value is the number of schedules fully executed when the error occurred.
The `someVariable` is defined in the Python script. The Python script can add as many arguments as needed with whatever type is needed.
The scheduling function is starting with a definition of some variables used for debug and statistics:
```C++
@ -406,6 +443,8 @@ Sink<float32_t,5> sink(fifo1);
Source<float32_t,5> source(fifo0);
```
One can see that the processing node has 3 additional arguments in addition to the FIFOs. Those arguments are defined in the Python script. The third argument is `someVariable` and this variable must be in scope. That's why the Python script is adding an argument `someVariable` to the scheduler API. So, one can pass information to any node from outside of the scheduler using those additional arguments.
And finally, the function is entering the scheduling loop:
```C++
@ -417,7 +456,7 @@ And finally, the function is entering the scheduling loop:
CHECKERROR;
```
`CHECKERROR` is a macro defined in `Sched.h`. It is just testing if `cgStaticError < 0` and breaking out of the loop if it is the case. It can be redefined by the user.
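A plausible definition would look like the following sketch (an assumption; the actual macro is in the generated `Sched.h`):
```C++
#define CHECKERROR {if (cgStaticError < 0) break;}
```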
Since an application may want to use several SDF graphs, the name of the `sched` and `customInit` functions can be customized in the `configuration` object on the Python side:
@ -436,8 +475,9 @@ config.prefix="bufferPrefix"
It looks complex because there is a lot of information but the process is always the same:
1. You define new kind of nodes in the Python. They define the IOs, type and amount of data read/written on each IO
2. You create Python instances of those new kinds of nodes
3. You connect them in a graph and generate a schedule
4. In your `AppNodes.h`, you implement the new kinds of nodes with a C++ template:
   1. The template is generally defining the IO and the function to call when run
   1. It should be minimal: the template is just a wrapper. Don't forget those nodes are created on the stack in the scheduler function, so they should not be too big. They should just be simple wrappers
5. If you need more control on the initialization, it is possible to pass additional arguments to the nodes constructors and to the scheduler function.

@ -0,0 +1,27 @@
# Example 10
This example implements the dynamic / asynchronous mode.
It is enabled in `graph.py` with:
`conf.asynchronous = True`
The FIFO sizes are doubled with:
`conf.FIFOIncrease = 100`
The graph implemented in this example is:
![graph10](graph10.png)
There is a global iteration count corresponding to one execution of the schedule.
The odd source is generating a value only when the count is odd.
The even source is generating a value only when the count is even.
The processing is adding its inputs. If no data is available on an input, 0 is used.
In case of FIFO overflow or underflow, any node will skip its execution.
All nodes are generating or consuming one sample but the FIFOs have a size of 2 because of the 100% increase requested in the configuration settings.
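For illustration, the adding logic described above could look like the following sketch inside the processing node's `run` function. The numbered accessors (`willUnderflow1`, `getReadBuffer1`, ...) are an assumption, by analogy with the numbered `willOverflow1` / `willOverflow2` helpers described in the dynamic mode documentation:
```C++
// Sketch only (assumed names): each input contributes its sample when
// available, otherwise 0 is used, and one sample is always produced.
int run() override
{
    float32_t a = 0.0f, b = 0.0f;
    if (!this->willUnderflow1())          // odd source produced a sample
    {
        a = this->getReadBuffer1(1)[0];
    }
    if (!this->willUnderflow2())          // even source produced a sample
    {
        b = this->getReadBuffer2(1)[0];
    }
    float32_t *out = this->getWriteBuffer(1);
    out[0] = a + b;
    return(CG_SUCCESS);
}
```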

@ -77,7 +77,19 @@ public:
    status=arm_cfft_init_f32(&sfft,inputSize>>1);
};

    int prepareForRunning() override
    {
        if (this->willOverflow() ||
            this->willUnderflow())
        {
            return(CG_SKIP_EXECUTION_ID_CODE); // Skip execution
        }
        return(0);
    };

    int run() override
    {
        IN *a=this->getReadBuffer();
        OUT *b=this->getWriteBuffer();
        memcpy((void*)b,(void*)a,outputSize*sizeof(IN));
@ -109,6 +121,7 @@ It can be used by just doing in your `AppNodes.h` file :
From Python side it would be:
```python
from cmsisdsp.cg.nodes.CFFT import *
from cmsisdsp.cg.scheduler import *
```
The scheduler module is automatically including the default nodes.

@ -30,12 +30,12 @@ This file is defining the new nodes which were used in `graph.py`. In `graph.py`
In `appnodes.py` we are including new kinds of nodes for simulation purposes:
```python
from cmsisdsp.cg.nodes.CFFT import *
from cmsisdsp.cg.scheduler import *
```
The CFFT is very similar to the C++ version of example 3. But there is no `prepareForRunning`: the dynamic / asynchronous mode is not implemented for Python.
```python
class CFFT(GenericNode):


@ -142,14 +142,14 @@ uint32_t scheduler(int *error)
/*
Create FIFOs objects
*/
FIFO<float32_t,FIFOSIZE0,0,0> fifo0(buf1);
FIFO<float32_t,FIFOSIZE1,1,0> fifo1(buf2);
FIFO<float32_t,FIFOSIZE2,1,0> fifo2(buf3);
FIFO<float32_t,FIFOSIZE3,1,0> fifo3(buf4);
FIFO<float32_t,FIFOSIZE4,1,0> fifo4(buf5);
FIFO<float32_t,FIFOSIZE5,1,0> fifo5(buf6);
FIFO<float32_t,FIFOSIZE6,1,0> fifo6(buf7);
FIFO<float32_t,FIFOSIZE7,0,0> fifo7(buf8);
CG_BEFORE_NODE_INIT;
/*
@ -174,84 +174,6 @@ uint32_t scheduler(int *error)
{
CG_BEFORE_NODE_EXECUTION;
cgStaticError = 0;
switch(schedule[id])
{
case 0:
