.. Copyright 2003-2022 by Wilson Snyder.
|
|
.. SPDX-License-Identifier: LGPL-3.0-only OR Artistic-2.0
|
|
|
|
.. _Simulating:
|
|
|
|
************************************
|
|
Simulating (Verilated-Model Runtime)
|
|
************************************
|
|
|
|
This section describes items related to simulating, that is, using a Verilated
model's executable. For the runtime arguments to a simulated model, see
:ref:`Simulation Runtime Arguments`.
|
|
|
|
|
|
.. _Benchmarking & Optimization:
|
|
|
|
Benchmarking & Optimization
|
|
===========================
|
|
|
|
For best performance, run Verilator with the :vlopt:`-O3`
:vlopt:`--x-assign fast <--x-assign>` :vlopt:`--x-initial fast
<--x-initial>` :vlopt:`--noassert <--assert>` options. The :vlopt:`-O3`
option makes Verilator itself take longer to run, and :vlopt:`--x-assign
fast <--x-assign>` :vlopt:`--x-initial fast <--x-initial>` may increase the
risk of reset bugs in exchange for performance; see the documentation
for these options.
|
|
|
|
If using multithreaded Verilated models, use ``numactl`` to ensure you are
using non-conflicting hardware resources. See :ref:`Multithreading`. Also
consider using profile-guided optimization; see :ref:`Thread PGO`.
|
|
|
|
Minor Verilog code changes can also give big wins. You should not have any
|
|
UNOPTFLAT warnings from Verilator. Fixing these warnings can result in
|
|
huge improvements; one user fixed their one UNOPTFLAT warning by making a
|
|
simple change to a clock latch used to gate clocks and gained a 60%
|
|
performance improvement.
|
|
|
|
Beyond that, the performance of a Verilated model depends mostly on your
|
|
C++ compiler and size of your CPU's caches. Experience shows that large
|
|
models are often limited by the size of the instruction cache, and as such
|
|
reducing code size if possible can be beneficial.
|
|
|
|
The supplied $VERILATOR_ROOT/include/verilated.mk file uses the OPT,
|
|
OPT_FAST, OPT_SLOW and OPT_GLOBAL variables to control optimization. You
|
|
can set these when compiling the output of Verilator with Make, for
|
|
example:
|
|
|
|
.. code-block:: bash

   make OPT_FAST="-Os -march=native" -f Vour.mk Vour__ALL.a
|
|
|
|
OPT_FAST specifies optimization options for those parts of the model that
|
|
are on the fast path. This is mostly code that is executed every
|
|
cycle. OPT_SLOW applies to slow-path code, which executes rarely, often
|
|
only once at the beginning or end of simulation. Note that OPT_SLOW is
|
|
ignored if VM_PARALLEL_BUILDS is not 1, in which case all generated code
|
|
will be compiled in a single compilation unit using OPT_FAST. See also the
|
|
Verilator :vlopt:`--output-split` option. The OPT_GLOBAL variable applies
|
|
to common code in the runtime library used by Verilated models (shipped in
|
|
$VERILATOR_ROOT/include). Additional C++ files passed on the verilator
|
|
command line use OPT_FAST. The OPT variable applies to all compilation
|
|
units in addition to the specific "OPT" variables described above.
|
|
|
|
You can also use the :vlopt:`-CFLAGS` and/or :vlopt:`-LDFLAGS` options on
|
|
the verilator command line to pass arguments directly to the compiler or
|
|
linker.
|
|
|
|
The default values of the "OPT" variables are chosen to yield good
simulation speed with reasonable C++ compilation times. To this end,
OPT_FAST is set to "-Os" by default. Higher optimization such as "-O2" or
"-O3" may help (though often they provide only a very small performance
benefit), but compile times may be excessively large even with medium-sized
designs. Compilation times can be improved at the expense of simulation
speed by reducing optimization, for example with OPT_FAST="-O0". Often good
simulation speed can be achieved with OPT_FAST="-O1 -fstrict-aliasing" but
with improved compilation times.

Files controlled by OPT_SLOW have little effect on performance, and
therefore OPT_SLOW is empty by default (equivalent to "-O0") for improved
compilation speed. In common use cases there should be little benefit in
changing OPT_SLOW.

OPT_GLOBAL is set to "-Os" by default, and there should rarely be a need to
change it. As the runtime library is small in comparison to many Verilated
models, disabling optimization on the runtime library should not have a
serious effect on overall compilation time, but may have a detrimental
effect on simulation speed, especially with tracing.

In addition to the above, for best results use OPT="-march=native", the
latest Clang compiler (about 10% faster than GCC), and link statically.
|
|
|
|
Generally, the answer to which optimization level gives the best user
experience depends on the use case, and some experimentation can pay
dividends. For a speedy debug cycle during development, especially on large
designs where C++ compilation speed can dominate, consider using lower
optimization to get to an executable faster. For throughput-oriented use
cases, for example regressions, it is usually worth spending extra
compilation time to reduce total CPU time.
|
|
|
|
If you will be running many simulations on a single model, you can
investigate profile-guided optimization. See :ref:`Compiler PGO`.
|
|
|
|
Modern compilers also support link-time optimization (LTO), which can help
|
|
especially if you link in DPI code. To enable LTO on GCC, pass "-flto" in
|
|
both compilation and link. Note LTO may cause excessive compile times on
|
|
large designs.
|
|
|
|
Unfortunately, using the optimizer with SystemC files can result in
|
|
compilation taking several minutes. (The SystemC libraries have many little
|
|
inlined functions that drive the compiler nuts.)
|
|
|
|
If you are using your own makefiles, you may want to compile the Verilated
code with ``--MAKEFLAGS -DVL_INLINE_OPT=inline``. This will inline
functions; however, it requires that all cpp files be compiled in a single
compiler run.
|
|
|
|
You may uncover further tuning possibilities by profiling the Verilog code.
|
|
See :ref:`profiling`.
|
|
|
|
When done optimizing, please let the author know the results. We like to
|
|
keep tabs on how Verilator compares, and may be able to suggest additional
|
|
improvements.
|
|
|
|
|
|
.. _Coverage Analysis:
|
|
|
|
Coverage Analysis
|
|
=================
|
|
|
|
Verilator supports adding code to the Verilated model to support
|
|
SystemVerilog code coverage. With :vlopt:`--coverage`, Verilator enables
|
|
all forms of coverage:
|
|
|
|
* :ref:`User Coverage`
|
|
* :ref:`Line Coverage`
|
|
* :ref:`Toggle Coverage`
|
|
|
|
When a model with coverage is executed, it will create a coverage file for
collection and later analysis; see :ref:`Coverage Collection`.
|
|
|
|
|
|
.. _User Coverage:
|
|
|
|
Functional Coverage
|
|
-------------------
|
|
|
|
With :vlopt:`--coverage` or :vlopt:`--coverage-user`, Verilator will
translate functional coverage points that the user has manually inserted
in the SystemVerilog design into the Verilated model.

Currently, all functional coverage points are specified using SystemVerilog
assertion syntax, which must be separately enabled with :vlopt:`--assert`.
|
|
|
|
For example, the following SystemVerilog statement will add a coverage
|
|
point, under the coverage name "DefaultClock":
|
|
|
|
.. code-block:: sv

   DefaultClock: cover property (@(posedge clk) cyc==3);
|
|
|
|
|
|
.. _Line Coverage:
|
|
|
|
Line Coverage
|
|
-------------
|
|
|
|
With :vlopt:`--coverage` or :vlopt:`--coverage-line`, Verilator will
|
|
automatically add coverage analysis at each code flow change point (e.g. at
|
|
branches). At each such branch a unique counter is incremented. At the
|
|
end of a test, the counters along with the filename and line number
|
|
corresponding to each counter are written into the coverage file.
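
As a rough mental model of the mechanism described above (not Verilator's
generated code), each branch can be pictured as owning a counter tagged with
its source location. The names ``CoverPoint``, ``Dut``, and
``uncovered_lines``, and the file name ``example.v``, are hypothetical and
exist only for this sketch.

.. code-block:: C++

   #include <cstdint>
   #include <string>
   #include <vector>

   // Hypothetical record: one counter per code-flow change point.
   struct CoverPoint {
       std::string filename;
       int line;
       std::uint64_t count;
   };

   // Hand-instrumented stand-in for:  if (en) q = d; else q = 0;
   struct Dut {
       std::vector<CoverPoint> points{{"example.v", 10, 0},   // "if" branch
                                      {"example.v", 12, 0}};  // "else" branch
       int q = 0;

       void eval(bool en, int d) {
           if (en) {
               ++points[0].count;  // counter for the "if" branch
               q = d;
           } else {
               ++points[1].count;  // counter for the "else" branch
               q = 0;
           }
       }
   };

   // Lines whose counter is still zero, i.e. branches that were never taken.
   inline std::vector<int> uncovered_lines(const Dut& dut) {
       std::vector<int> lines;
       for (const CoverPoint& p : dut.points) {
           if (p.count == 0) lines.push_back(p.line);
       }
       return lines;
   }

If a test only ever drives ``en`` high, ``uncovered_lines`` reports the
``else`` branch, which is the kind of gap a coverage report is meant to
expose.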
|
|
|
|
Verilator automatically disables coverage of branches that have a $stop in
them, as it is assumed $stop branches contain an error check that should
not occur. A :option:`/*verilator&32;coverage_block_off*/` metacomment
will perform a similar function on any code in that block or below.
Alternatively, :option:`/*verilator&32;coverage_off*/` and
:option:`/*verilator&32;coverage_on*/` will disable and enable coverage,
respectively, around a block of code.
|
|
|
|
Verilator may over-count combinatorial (non-clocked) blocks when those
blocks receive signals which have had the UNOPTFLAT warning disabled; for
the most accurate results, do not disable this warning when using coverage.
|
|
|
|
|
|
.. _Toggle Coverage:
|
|
|
|
Toggle Coverage
|
|
---------------
|
|
|
|
With :vlopt:`--coverage` or :vlopt:`--coverage-toggle`, Verilator will
|
|
automatically add toggle coverage analysis into the Verilated model.
|
|
|
|
Every bit of every signal in a module has a counter inserted. The counter
|
|
will increment on every edge change of the corresponding bit.
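
The per-bit counting described above can be pictured with the following
conceptual sketch. It is not the code Verilator emits; the class name
``ToggleCounters`` and the 64-bit limit are assumptions made to keep the
example small.

.. code-block:: C++

   #include <array>
   #include <cstddef>
   #include <cstdint>

   // Conceptual model only: one counter per bit of a signal, incremented
   // whenever that bit differs between two consecutive samples.
   template <std::size_t Width>
   class ToggleCounters {
       static_assert(Width >= 1 && Width <= 64, "sketch handles 1 to 64 bits");

   public:
       // Record a new sample; the first sample only sets the baseline.
       void sample(std::uint64_t value) {
           if (m_haveLast) {
               const std::uint64_t changed = value ^ m_last;
               for (std::size_t bit = 0; bit < Width; ++bit) {
                   if ((changed >> bit) & 1U) ++m_counts[bit];
               }
           }
           m_last = value;
           m_haveLast = true;
       }

       // Number of edges (rising or falling) seen so far on the given bit.
       std::uint64_t count(std::size_t bit) const { return m_counts.at(bit); }

   private:
       std::array<std::uint64_t, Width> m_counts{};
       std::uint64_t m_last = 0;
       bool m_haveLast = false;
   };

Sampling the values 0b0000, 0b0101, and 0b0100 in sequence leaves bit 0 with
two edges, bit 2 with one edge, and bits 1 and 3 with none, so the last two
would show up as untoggled.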
|
|
|
|
Signals that are part of tasks or begin/end blocks are considered local
|
|
variables and are not covered. Signals that begin with underscores (see
|
|
:vlopt:`--coverage-underscore`), are integers, or are very wide (>256 bits
|
|
total storage across all dimensions, see :vlopt:`--coverage-max-width`) are
|
|
also not covered.
|
|
|
|
Hierarchy is compressed, such that if a module is instantiated multiple
|
|
times, coverage will be summed for that bit across **all** instantiations
|
|
of that module with the same parameter set. A module instantiated with
|
|
different parameter values is considered a different module, and will get
|
|
counted separately.
|
|
|
|
Verilator makes a minimally-intelligent decision about what clock domain
|
|
the signal goes to, and only looks for edges in that clock domain. This
|
|
means that edges may be ignored if it is known that the edge could never be
|
|
seen by the receiving logic. This algorithm may improve in the future.
|
|
The net result is coverage may be lower than what would be seen by looking
|
|
at traces, but the coverage is a more accurate representation of the
|
|
quality of stimulus into the design.
|
|
|
|
There may be edges counted near time zero while the model stabilizes. It's
|
|
a good practice to zero all coverage just before releasing reset to prevent
|
|
counting such behavior.
|
|
|
|
A :option:`/*verilator&32;coverage_off*/`
|
|
:option:`/*verilator&32;coverage_on*/` metacomment pair can be used around
|
|
signals that do not need toggle analysis, such as RAMs and register files.
|
|
|
|
|
|
.. _Coverage Collection:
|
|
|
|
Coverage Collection
|
|
-------------------
|
|
|
|
When any coverage flag is used to Verilate, Verilator will add appropriate
coverage point insertions into the model and collect the coverage data.
|
|
|
|
To get the coverage data from the model, in the user wrapper code,
typically at the end once a test passes, call
:code:`Verilated::threadContextp()->coveragep()->write` with the filename of
the coverage data file to write (typically "logs/coverage.dat").
|
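
A minimal sketch of such a call, wrapped in a helper, might look as follows.
The helper name ``write_coverage`` is hypothetical, and it is templated on
the context type only so that the sketch compiles without the Verilator
headers; in a real wrapper the argument would be the pointer returned by
``Verilated::threadContextp()``.

.. code-block:: C++

   #include <string>

   // Hypothetical helper. ContextT is any type exposing
   // coveragep()->write(filename), the call chain named in the text above.
   template <class ContextT>
   void write_coverage(ContextT* contextp,
                       const std::string& filename = "logs/coverage.dat") {
       // Write all collected coverage data to the given file.
       contextp->coveragep()->write(filename.c_str());
   }

In a wrapper this could be invoked as
``write_coverage(Verilated::threadContextp());`` once the test has passed.
The sketch assumes the ``logs`` directory already exists.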
|
|
|
Run each of your tests in different directories, potentially in parallel.
|
|
Each test will create a :file:`logs/coverage.dat` file.
|
|
|
|
After running all of the tests, execute the :command:`verilator_coverage`
command, passing arguments pointing to the filenames of all of the
individual coverage files. :command:`verilator_coverage` will read the
:file:`logs/coverage.dat` file(s), and create an annotated source code
listing showing code coverage details.
|
|
|
|
:command:`verilator_coverage` may also be used for test grading, that is,
computing which tests are important to fully cover the design.
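
To illustrate what test grading means, the sketch below ranks tests greedily
by how many not-yet-covered points each one adds. This is a generic
illustration and is not claimed to be the algorithm used by
:command:`verilator_coverage`; the ``TestCoverage`` type, the function name
``rank_tests``, and the integer point identifiers are all hypothetical.

.. code-block:: C++

   #include <cstddef>
   #include <map>
   #include <set>
   #include <string>
   #include <vector>

   // Hypothetical input: test name -> set of coverage point ids it hit.
   using TestCoverage = std::map<std::string, std::set<int>>;

   // Greedy ranking: repeatedly pick the test that covers the most points
   // not covered by earlier picks; stop when no test adds anything new.
   // Ties go to the alphabetically first test name.
   inline std::vector<std::string> rank_tests(const TestCoverage& tests) {
       std::set<int> covered;
       std::vector<std::string> order;
       while (true) {
           std::string best;
           std::size_t bestGain = 0;
           for (const auto& entry : tests) {
               std::size_t gain = 0;
               for (int point : entry.second) {
                   if (covered.count(point) == 0) ++gain;
               }
               if (gain > bestGain) {
                   bestGain = gain;
                   best = entry.first;
               }
           }
           if (bestGain == 0) break;
           order.push_back(best);
           const std::set<int>& hit = tests.at(best);
           covered.insert(hit.begin(), hit.end());
       }
       return order;
   }

Tests that never appear in the returned list contribute no coverage beyond
the listed ones, which makes them candidates for pruning from a regression.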
|
|
|
|
For an example, see the :file:`examples/make_tracing_c/logs` directory.
|
|
Grep for lines starting with '%' to see what lines Verilator believes need
|
|
more coverage.
|
|
|
|
Additional options of :command:`verilator_coverage` allow for merging of
|
|
coverage data files or other transformations.
|
|
|
|
Info files can be written by :command:`verilator_coverage` for import to
:command:`lcov`. This enables use of :command:`genhtml` for HTML reports
and importing reports to sites such as `https://codecov.io
<https://codecov.io>`_.
|
|
|
|
|
|
.. _Profiling:
|
|
|
|
Code Profiling
|
|
==============
|
|
|
|
The Verilated model may be code-profiled using GCC or Clang's C++ profiling
|
|
mechanism. Verilator provides additional flags to help map the resulting
|
|
C++ profiling results back to the original Verilog code responsible for the
|
|
profiled C++ code functions.
|
|
|
|
To use profiling:
|
|
|
|
#. Use Verilator's :vlopt:`--prof-cfuncs`.
#. Build and run the simulation model.
#. The model will create gmon.out.
#. Run :command:`gprof` to see where in the C++ code the time is spent.
#. Run the gprof output through the :command:`verilator_profcfunc` program,
   and it will tell you on which Verilog line numbers most of the time
   is being spent.
|
|
|
|
|
|
.. _Execution Profiling:
|
|
|
|
Execution Profiling
|
|
===================
|
|
|
|
For performance optimization, it is useful to see statistics and visualize how
execution time is distributed in a Verilated model.
|
|
|
|
With the :vlopt:`--prof-exec` option, Verilator will:
|
|
|
|
* Add code to the Verilated model to record execution flow.

* Add code to save profiling data in non-human-friendly form to the file
  specified with :vlopt:`+verilator+prof+exec+file+\<filename\>`.

* In multi-threaded models, add code to record the start and end time of each
  macro-task across a number of calls to eval. (What is a macro-task? See the
  Verilator internals document, :file:`docs/internals.rst` in the
  distribution.)
|
|
|
|
The :command:`verilator_gantt` program may then be run to transform the
|
|
saved profiling file into a nicer visual format and produce some related
|
|
statistics.
|
|
|
|
.. figure:: figures/fig_gantt_min.png

   Example verilator_gantt output, as viewed with GTKWave.
|
|
|
|
The measured_parallelism shows the number of CPUs being used at a given moment.
|
|
|
|
The cpu_thread section shows which thread is executing on each of the physical CPUs.
|
|
|
|
The thread_mtask section shows which macro-task is running on a given thread.
|
|
|
|
For more information see :command:`verilator_gantt`.
|
|
|
|
|
|
.. _Profiling ccache efficiency:
|
|
|
|
Profiling ccache efficiency
|
|
===========================
|
|
|
|
The Verilator generated Makefile provides support for basic profiling of
|
|
ccache behavior during the build. This can be used to track down files that
|
|
might be unnecessarily rebuilt, though as of today even small code changes
|
|
will usually require rebuilding a large number of files. Improving ccache
|
|
efficiency during the edit/compile/test loop is an active area of
|
|
development.
|
|
|
|
To get a basic report of how well ccache is doing, add the `ccache-report`
|
|
target when invoking the generated Makefile:
|
|
|
|
.. code-block:: bash

   make -C obj_dir -f Vout.mk Vout ccache-report
|
|
|
|
This will print a report based on all executions of ccache during this
|
|
invocation of Make. The report is also written to a file, in this example
|
|
`obj_dir/Vout__cache_report.txt`.
|
|
|
|
To use the `ccache-report` target, at least one other explicit build target
|
|
must be specified, and OBJCACHE must be set to 'ccache'.
|
|
|
|
This feature is currently experimental and might change in subsequent
|
|
releases.
|
|
|
|
.. _Save/Restore:
|
|
|
|
Save/Restore
|
|
============
|
|
|
|
The intermediate state of a Verilated model may be saved, so that it may
|
|
later be restored.
|
|
|
|
To enable this feature, use :vlopt:`--savable`. There are limitations in
what language features are supported along with :vlopt:`--savable`; if you
attempt to use an unsupported feature, Verilator will throw an error.
|
|
|
|
To use save/restore, the user wrapper code must create a VerilatedSerialize
or VerilatedDeserialize object and then call the :code:`<<` or :code:`>>`
operators on the generated model and any other data the process needs
saved/restored. These functions are not thread safe, and are typically
called only by a main thread.
|
|
|
|
For example:
|
|
|
|
.. code-block:: C++

   // main_time and topp are the wrapper's own simulation time and model pointer.
   void save_model(const char* filenamep) {
       VerilatedSave os;
       os.open(filenamep);
       os << main_time;  // user code must save the timestamp, etc
       os << *topp;
   }
   void restore_model(const char* filenamep) {
       VerilatedRestore os;
       os.open(filenamep);
       os >> main_time;  // restore in the same order as saved
       os >> *topp;
   }
|
|
|
|
|
|
Profile-Guided Optimization
|
|
===========================
|
|
|
|
Profile-guided optimization is the technique where profiling data is
|
|
collected by running your simulation executable, then this information is
|
|
used to guide the next Verilation or compilation.
|
|
|
|
There are two forms of profile-guided optimization. Unfortunately, for
best results they must each be performed from the highest level code to the
lowest, which means performing them separately and in this order:
|
|
|
|
* :ref:`Thread PGO`
|
|
* :ref:`Compiler PGO`
|
|
|
|
Other forms of PGO may be supported in the future, such as clock and reset
|
|
toggle rate PGO, branch prediction PGO, statement execution time PGO, or
|
|
others as they prove beneficial.
|
|
|
|
|
|
.. _Thread PGO:
|
|
|
|
Thread Profile-Guided Optimization
|
|
----------------------------------
|
|
|
|
Verilator supports profile-guided optimization (Verilation) of multi-threaded
|
|
models (Thread PGO) to improve performance.
|
|
|
|
When using multithreading, Verilator computes how long macro-tasks take and
tries to balance those across threads. (What is a macro-task? See the
Verilator internals document, :file:`docs/internals.rst` in the
distribution.) If the estimations are incorrect, the threads will not be
balanced, leading to decreased performance. Thread PGO allows collecting
profiling data to replace the estimates and better optimize these
decisions.
|
|
|
|
To use Thread PGO, Verilate the model with the :vlopt:`--prof-pgo` option. This
will add code to the Verilated model to save profiling data for profile-guided
optimization.
|
|
|
|
Run the model executable. When the executable exits, it will create a
|
|
profile.vlt file.
|
|
|
|
Rerun Verilator, optionally omitting the :vlopt:`--prof-pgo` option,
|
|
and adding the profile.vlt generated earlier to the command line.
|
|
|
|
Note there is no Verilator equivalent to GCC's -fprofile-use. Verilator's
profile data file (profile.vlt) can be placed on the verilator command line
directly without any prefix.
|
|
|
|
If results from multiple simulations are to be used in generating the
optimization, the profile.vlt files from those simulations may be
concatenated externally, or each of the files may be fed as separate command
line options into Verilator. Verilator will sum the profile results, so a
longer running test will have proportionally more weight for optimization
than a shorter running test.
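
The weighting effect of summing can be seen in the small sketch below. It
does not parse profile.vlt and does not reflect Verilator's internals; the
``Profile`` type, the function name ``sum_profiles``, and the macro-task
names are hypothetical.

.. code-block:: C++

   #include <cstdint>
   #include <map>
   #include <string>
   #include <vector>

   // Hypothetical in-memory form of one run's profile:
   // macro-task name -> measured cost accumulated over the whole run.
   using Profile = std::map<std::string, std::uint64_t>;

   // Sum the per-macro-task costs across all runs. Because costs are added
   // without normalization, a longer run contributes proportionally more.
   inline Profile sum_profiles(const std::vector<Profile>& runs) {
       Profile total;
       for (const Profile& run : runs) {
           for (const auto& entry : run) total[entry.first] += entry.second;
       }
       return total;
   }

For example, a short run measuring costs of 10 and 30 for two macro-tasks,
combined with a long run measuring 900 and 100, sums to 910 and 130: the
long run's view of which macro-task is more expensive wins.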
|
|
|
|
If you provide any profile feedback data to Verilator, and it cannot use
|
|
it, it will issue the :option:`PROFOUTOFDATE` warning that threads were
|
|
scheduled using estimated costs. This usually indicates that the profile
|
|
data was generated from different Verilog source code than Verilator is
|
|
currently running against. Therefore, repeat the data collection phase to
|
|
create new profiling data, then rerun Verilator with the same input source
|
|
files and that new profiling data.
|
|
|
|
|
|
.. _Compiler PGO:
|
|
|
|
Compiler Profile-Guided Optimization
|
|
------------------------------------
|
|
|
|
GCC and Clang support compiler profile-guided optimization (PGO). This
|
|
optimizes any C/C++ program including Verilated code. Using compiler PGO
|
|
typically yields improvements of 5-15% on both single-threaded and
|
|
multi-threaded models.
|
|
|
|
To use compiler PGO with GCC or Clang, please see the appropriate compiler
|
|
documentation. The process in GCC 10 was as follows:
|
|
|
|
1. Compile the Verilated model with the compiler's "-fprofile-generate"
   flag:

   .. code-block:: bash

      verilator [whatever_flags] --build \
          -CFLAGS -fprofile-generate -LDFLAGS -fprofile-generate

   or, if calling make yourself, add -fprofile-generate appropriately to your
   Makefile.

2. Run your simulation. This will create \*.gcda file(s) in the same
   directory as the source files.

3. Recompile the model with -fprofile-use. The compiler will read the
   \*.gcda file(s).

   For GCC:

   .. code-block:: bash

      verilator [whatever_flags] --build \
          -CFLAGS "-fprofile-use -fprofile-correction"

   For Clang:

   .. code-block:: bash

      llvm-profdata merge -output default.profdata *.profraw
      verilator [whatever_flags] --build \
          -CFLAGS "-fprofile-use -fprofile-correction"

   or, if calling make yourself, add these CFLAGS switches appropriately to
   your Makefile.
|
|
|
|
Clang and GCC also support -fauto-profile which uses sample-based
|
|
feedback-directed optimization. See the appropriate compiler
|
|
documentation.
|