The --trace-threads option can now be used to perform tracing on a
thread separate from the main thread when using VCD tracing (with
--trace-threads 1). For FST tracing, --trace-threads can be 1 or 2, and
--trace-fst --trace-threads 1 is the same as what --trace-fst-thread
used to be (which is now deprecated).
Performance numbers on SweRV EH1 CoreMark, clang 6.0.0, Intel i7-3770 @
3.40GHz, IO to ramdisk, with numactl set to schedule threads on different
physical cores. Relative speedup:
  --trace     -> --trace --trace-threads 1       +22%
  --trace-fst -> --trace-fst --trace-threads 1   +38%  (equivalent to the old --trace-fst-thread)
  --trace-fst -> --trace-fst --trace-threads 2   +93%
Run time relative to --trace with no threaded tracing (lower is better):
  --trace                          1.00 x
  --trace --trace-threads 1        0.82 x
  --trace-fst                      1.79 x
  --trace-fst --trace-threads 1    1.23 x
  --trace-fst --trace-threads 2    0.87 x
This means FST tracing with 2 extra threads is now faster than single
threaded VCD tracing, and is on par with threaded VCD tracing. You do
pay for it in total compute though, as --trace-fst --trace-threads 2 uses
about 240% CPU, vs 150% for --trace-fst --trace-threads 1 and 155% for
--trace --trace-threads 1. Still, for interactive use it should be
helpful with large designs.
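These flags only change how verilator builds the tracing code inside the
model; the harness-side trace API should be unchanged. As a minimal sketch
(assuming a top module named "top", hence a generated Vtop class with a
single clock input clk; error handling omitted), an FST-tracing harness
looks roughly like this:

    #include "verilated.h"
    #include "verilated_fst_c.h"
    #include "Vtop.h"  // Generated from the assumed top module "top"

    int main(int argc, char** argv) {
        Verilated::commandArgs(argc, argv);
        Verilated::traceEverOn(true);       // Enable tracing in the runtime
        Vtop* top = new Vtop;
        VerilatedFstC* tfp = new VerilatedFstC;
        top->trace(tfp, 99);                // Trace 99 levels of hierarchy
        tfp->open("dump.fst");
        for (vluint64_t t = 0; t < 1000 && !Verilated::gotFinish(); ++t) {
            top->clk = !top->clk;           // Assumed clock input
            top->eval();
            tfp->dump(t);                   // Record signal values at time t
        }
        tfp->close();
        delete tfp;
        delete top;
        return 0;
    }

For VCD tracing the harness would use verilated_vcd_c.h and VerilatedVcdC
instead; the threading options are given on the verilator command line
when the model is generated.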
Includes support for `timescale, $printtimescale, and $timeformat.
VL_TIME_MULTIPLIER, VL_TIME_PRECISION, and VL_TIME_UNIT have been removed,
and the time precision must now match the SystemC time precision.
To get behavior closer to that of older versions, use e.g.
--timescale-override "1ps/1ps".
* Improve tracing performance.
Various tactics were used to improve the performance of both VCD and FST
tracing:
- Both: Change the tracing functions into templates that take the signal
width as a template parameter. For VCD, these are then specialized for the
widths used by Verilator. This avoids redundant instructions and
hard-to-predict branches (see the sketch after this list).
- Both: Check for value changes via direct pointer access into the
previous signal value buffer. This eliminates a lot of simple pointer
arithmetic instructions from the tracing code.
- Both: Verilator provides clean input, so there is no need to mask off
unused bits.
- VCD: Pre-compute identifier codes and use a memory copy instead of
re-computing them every time a code is emitted. This saves a lot of
instructions and hard-to-predict branches. The added D-cache misses
are cheaper than the removed branches/instructions.
- VCD: Re-write the routines emitting the changes to be more efficient.
- FST: Use the previous signal value buffer the same way as the VCD
tracing code, and only call the FST API when a change is detected.
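As a rough illustration of the first, second, and fourth tactics above
(a simplified sketch with invented names, not Verilator's actual tracing
code):

    // Simplified sketch only: class and member names are invented and do
    // not match Verilator's internals.
    #include <cstdint>
    #include <cstring>

    class SketchVcdTracer {
        uint32_t* m_prevValues;  // Previous value of each signal, 32-bit words
        char* m_idCodes;         // Pre-rendered VCD identifier code per trace code
        char* m_writep;          // Current write position in the output buffer

    public:
        // The width is a template parameter, so loop bounds are compile-time
        // constants; common widths can be specialized, removing per-signal
        // branches on the width.
        template <int T_Bits>
        void chgBus(uint32_t code, uint32_t newval) {
            uint32_t* prevp = m_prevValues + code;  // Direct pointer into buffer
            if (*prevp != newval) {                 // Emit only on an actual change
                *prevp = newval;
                emitBus<T_Bits>(code, newval);
            }
        }

        template <int T_Bits>
        void emitBus(uint32_t code, uint32_t newval) {
            *m_writep++ = 'b';
            // Input is already clean, so no masking of unused bits is needed.
            for (int bit = T_Bits - 1; bit >= 0; --bit) {
                *m_writep++ = '0' + ((newval >> bit) & 1);
            }
            *m_writep++ = ' ';
            // Copy the pre-computed identifier code (assumed here to be at
            // most 7 characters, stored NUL-padded in an 8-byte slot) instead
            // of re-encoding it on every change.
            std::memcpy(m_writep, m_idCodes + code * 8, 8);
            m_writep += std::strlen(m_idCodes + code * 8);
            *m_writep++ = '\n';
        }
    };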
Performance as measured on SweRV EH1, with the pre-canned CoreMark
benchmark running from DCCM/ICCM, clang 6.0.0, Intel i7-3770 @ 3.40GHz,
and IO to ramdisk:
            +--------------+---------------+----------------------+
            |     VCD      |      FST      | FST separate thread  |
            |  (--trace)   | (--trace-fst) | (--trace-fst-thread) |
------------+--------------+---------------+----------------------+
Before      |    30.2 s    |    121.1 s    |        69.8 s        |
------------+--------------+---------------+----------------------+
After       |    24.7 s    |     45.7 s    |        32.4 s        |
------------+--------------+---------------+----------------------+
Speedup     |     22 %     |     256 %     |        215 %         |
------------+--------------+---------------+----------------------+
Rel. to VCD |     1 x      |    1.85 x     |        1.31 x        |
------------+--------------+---------------+----------------------+
In addition, the FST trace size for the above was reduced by 48%.
This looks like a bits/bytes bug. The affected m_codeInc member
determines how many 32-bit words to allocate in the buffer used to store
previous signal values, but it was off by a factor of 8, so we used to
allocate far too much memory.
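In rough terms (the arithmetic below is only illustrative; the helper
names are not Verilator's), the bug amounts to sizing the 32-bit word
buffer from the wrong unit, which differs by exactly a factor of 8:

    #include <cstdint>

    // Words of 32 bits needed to store the previous value of one signal.
    uint32_t wordsNeeded(uint32_t widthBits) {
        return (widthBits + 31) / 32;  // Correct: 32 bits per word
    }

    uint32_t wordsNeededBuggy(uint32_t widthBits) {
        // Wrong: treats the bit count as a byte count (4 bytes per word),
        // reserving 8x more words than necessary.
        return (widthBits + 3) / 4;
    }

For a 64-bit signal this is 2 words versus 16 words; the values still fit,
but the previous-value buffer is bloated and cache behavior suffers.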
SweRV VCD tracing speed +6.5% (excluding IO, clang 6.0), due mainly to
reduced D-cache misses.
When using the __ALL*.cpp based single-compile mode (i.e.: without
VM_PARALLEL_BUILDS), the fast-path tracing code used to be included in
__ALLsup.cpp, which was compiled with OPT_SLOW, severely harming tracing
performance. We now generate __ALLfast.cpp and __ALLslow.cpp instead of
__ALLcls.cpp and __ALLsup.cpp, so the fast support code can be compiled
with OPT_FAST as well.
This patch eliminates a major piece of inefficiency in the FST tracing
support by using an array to look up the fstHandle values corresponding
to trace codes, instead of a tree-based std::map. With this change, FST
tracing is now only about 3x slower than VCD tracing. This requires more
memory for the symbol lookup table, but that table is still small, and
well worth it for the speed benefit.
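A rough before/after sketch of that lookup (the structure and member
names are invented for illustration; fstHandle itself comes from the FST
API): trace codes are small, densely numbered integers, so a vector
indexed directly by code turns each lookup into a single load instead of
a tree walk:

    #include <cstdint>
    #include <map>
    #include <vector>

    typedef uint32_t fstHandle;  // Stand-in for the handle type of the FST API

    // Before: every value change paid for an O(log n) tree-based map lookup.
    struct FstLookupBefore {
        std::map<uint32_t, fstHandle> m_code2symbol;
        fstHandle handle(uint32_t code) const { return m_code2symbol.at(code); }
    };

    // After: a flat table sized to the number of trace codes, indexed directly.
    struct FstLookupAfter {
        std::vector<fstHandle> m_symbolp;
        fstHandle handle(uint32_t code) const { return m_symbolp[code]; }
    };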