Small fixup patch so the 'ico' and 'act' scheduling sections could be
ordered as multi-threaded. However, we still only order these single
threaded at the moment (but switching them to multi-threaded now works).
Before this change, some forked processes were being inlined in
`V3Timing` because they contained no `CAwait`s. This only works under
the assumption that no `CAwait`s will be added there later, which is not
true, as a function called by a forked process could be turned into a
coroutine later. The call would be wrapped in a new `CAwait`, but the
process itself would have already been inlined at this point.
This commit moves the inlining to `transformForks` in `V3SchedTiming`,
which is called at a point when all `CAwait`s are already in place.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
The recent patch to defer substitutions on V3Gate crashes on circular
logic that has cycle length >= 3 with all inlineable signals (cycle
length 2 is detected correctly and is not inlined). Fix by stopping
recursion at the loop-back edge.
Fixes#3543
This is detritus from when V3TraceDecl used to run after V3Gate, today
V3TraceDecl runs before V3Gate and this value has no function at all.
No functional change intended.
dynamic_cast is not free. Replace obvious instances (where the result is
unconditionally dereferenced) with static_cast in contexts with
performance implications.
Replace std::set<SiblingMC> with V3Lists to keep track of SiblingMCs
associated with MTasks, use a std::set<LogicMTask*> for ensuring
uniqueness. This yields a bit more speed in PartContraction.
- Use modern C++
- Implement OrderLogicVertex->LogicMTask map with
OrderLogicVertex::userp(), insteas of std::unordered_map
- Simplify data structures
- Simplify code and assert properties
No functional change.
Refactor ProcessMoveBuildGraph utilizing the fact that OrderGraph is a
bipartite graph, also remove unnecessary unordered_map and distribute
variable domain map. No functional change.
Adds timing support to Verilator. It makes it possible to use delays,
event controls within processes (not just at the start), wait
statements, and forks.
Building a design with those constructs requires a compiler that
supports C++20 coroutines (GCC 10, Clang 5).
The basic idea is to have processes and tasks with delays/event controls
implemented as C++20 coroutines. This allows us to suspend and resume
them at any time.
There are five main runtime classes responsible for managing suspended
coroutines:
* `VlCoroutineHandle`, a wrapper over C++20's `std::coroutine_handle`
with move semantics and automatic cleanup.
* `VlDelayScheduler`, for coroutines suspended by delays. It resumes
them at a proper simulation time.
* `VlTriggerScheduler`, for coroutines suspended by event controls. It
resumes them if its corresponding trigger was set.
* `VlForkSync`, used for syncing `fork..join` and `fork..join_any`
blocks.
* `VlCoroutine`, the return type of all verilated coroutines. It allows
for suspending a stack of coroutines (normally, C++ coroutines are
stackless).
There is a new visitor in `V3Timing.cpp` which:
* scales delays according to the timescale,
* simplifies intra-assignment timing controls and net delays into
regular timing controls and assignments,
* simplifies wait statements into loops with event controls,
* marks processes and tasks with timing controls in them as
suspendable,
* creates delay, trigger scheduler, and fork sync variables,
* transforms timing controls and fork joins into C++ awaits
There are new functions in `V3SchedTiming.cpp` (used by `V3Sched.cpp`)
that integrate static scheduling with timing. This involves providing
external domains for variables, so that the necessary combinational
logic gets triggered after coroutine resumption, as well as statements
that need to be injected into the design eval function to perform this
resumption at the correct time.
There is also a function that transforms forked processes into separate
functions.
See the comments in `verilated_timing.h`, `verilated_timing.cpp`,
`V3Timing.cpp`, and `V3SchedTiming.cpp`, as well as the internals
documentation for more details.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.