The --trace-threads option can now be used to perform tracing on a
thread separate from the main thread when using VCD tracing (with
--trace-threads 1). For FST tracing --trace-threads can be 1 or 2, and
--trace-fst --trace-threads 1 is the same a what --trace-fst-threads
used to be (which is now deprecated).
Performance numbers on SweRV EH1 CoreMark, clang 6.0.0, Intel i7-3770 @
3.40GHz, IO to ramdisk, with numactl set to schedule threads on different
physical cores. Relative speedup:
--trace -> --trace --trace-threads 1 +22%
--trace-fst -> --trace-fst --trace-threads 1 +38% (as --trace-fst-thread)
--trace-fst -> --trace-fst --trace-threads 2 +93%
Speed relative to --trace with no threaded tracing:
--trace 1.00 x
--trace --trace-threads 1 0.82 x
--trace-fst 1.79 x
--trace-fst --trace-threads 1 1.23 x
--trace-fst --trace-threads 2 0.87 x
This means FST tracing with 2 extra threads is now faster than single
threaded VCD tracing, and is on par with threaded VCD tracing. You do
pay for it in total compute though as --trace-fst --trace-threads 2 uses
about 240% CPU vs 150% for --trace-fst --trace-threads 1, and 155% for
--trace --trace threads 1. Still for interactive use it should be
helpful with large designs.
This patch de-duplicates common functionality between the VCD and FST
trace implementation. It also enables adding new trace formats more
easily and consistently.
No functional nor performance change intended.
The FST trace timescale used to be set in the constructor via
set_time_unit, but at that point we haven't normally opened the
file yet so it was just dropped. On top of that, we actually want
to use set_time_resolution... FST trace timescales now match the VCD.
If the first dump was not at time zero, then the FST trace used
to contain the initial values as if they were set at time zero. Now
they only appear at the time the first dump call is actually made,
and hence match the VCD trace exactly.
Includes `timescale, $printtimescale, $timeformat.
VL_TIME_MULTIPLIER, VL_TIME_PRECISION, VL_TIME_UNIT have been removed
and the time precision must now match the SystemC time precision.
To get closer behavior to older versions, use e.g. --timescale-override
"1ps/1ps".
* Improve tracing performance.
Various tactics used to improve performance of both VCD and FST tracing:
- Both: Change tracing functions to templates to take variable widths as
template parameters. For VCD, subsequently specialize these to the
values used by Verilator. This avoids redundant instructions and hard
to predict branches.
- Both: Check for value changes via direct pointer access into the
previous signal value buffer. This eliminates a lot of simple pointer
arithmetic instructions form the tracing code.
- Both: Verilator provides clean input, no need to mask out used bits.
- VCD: pre-compute identifier codes and use memory copy instead of
re-computing them every time a code is emitted. This saves a lot of
instructions and hard to predict branches. The added D-cache misses
are cheaper than the removed branches/instructions.
- VCD: re-write the routines emitting the changes to be more efficient.
- FST: Use previous signal value buffer the same way as the VCD tracing
code, and only call the FST API when a change is detected.
Performance as measured on SweRV EH1, with the pre-canned CoreMark
benchmark running from DCCM/ICCM, clang 6.0.0, Intel i7-3770 @ 3.40GHz,
and IO to ramdisk:
+--------------+---------------+----------------------+
| VCD | FST | FST separate thread |
| (--trace) | (--trace-fst) | (--trace-fst-thread) |
------------+-----------------------------------------------------+
Before | 30.2 s | 121.1 s | 69.8 s |
============+==============+===============+======================+
After | 24.7 s | 45.7 s | 32.4 s |
------------+--------------+---------------+----------------------+
Speedup | 22 % | 256 % | 215 % |
------------+--------------+---------------+----------------------+
Rel. to VCD | 1 x | 1.85 x | 1.31 x |
------------+--------------+---------------+----------------------+
In addition, FST trace size for the above reduced by 48%.
This looks like a bits/bytes bug. The affected m_codeInc member
determines how many 32-bit words to allocate in a buffer used to store
previous values of the signal, but this was off by a factor of 8, so
we used to use too much memory.
SweRV VCD tracing speed +6.5% (excluding IO, clang 6.0), due mainly to
reduced D cache misses.