This provides minor simulation performance benefit, but can provide
large C++ compilation time improvement, notably with Clang (4x).
This patch implements #2366 .
This adds the flag --generate-waivefile <filename>. This will generate
a verilator config file with the proper lint_off statemens to turn off
warnings emitted during this particular run.
This feature can be used to start with using Verilator as linter and
systematically capture all known lint warning for further
elimination. It hopefully helps people turning of -Wno-fatal or
-Wno-lint and gradually improve their code base.
Signed-off-by: Stefan Wallentowitz <stefan.wallentowitz@hm.edu>
--output-split is now on by default with value 20000.
--output-split-cfuncs and --output-split-ctrace now defaults to the
value of --output-split unless explicitly specified.
Instead of __ALLfast.cpp and __ALLslow.cpp, we now create only a single
__ALL.cpp and compile it with OPT_FAST, this speeds up small builds
where the C compiler does not dominate. A separate patch will follow
turning VM_PARALLEL_BUILDS on by default at a certain size.
Given this change to the build there is now no point in emitting both
fast and slow routines into the same .cpp file when --output-split is
not set as they will be just included in the same __ALL.cpp file. To
keep things simpler and the output easier to comprehend, V3EmitC has
also been changed to always emit the fast and slow files separately.
Also change verilated.mk to apply OPT_SLOW to all slow files, not just
ones called *__Slow.cpp. This change in particular ensures __Syms.cpp
is build as slow.
Part of #2360.
- Improve flag pruning heuristic
- Set all trace activity flags in slow code. This in turns enables us
to remove checking the slow flag on the fast path.
Firstly, we always use a byte array for fine grained activity flags
instead of a bit vector (we used to use a byte array only if we had
parallel mtasks). The byte vector can be set more cheaply in eval,
closing about 1/3 of the gap in performance between compiling with
or without --trace on SweRV EH1. The speed of tracing itself is not
measurably different.
Secondly, we prune the activity tracking such that if a set of activity
flag combinations only guard a small number of signals, we will turn
those signals into awayls traced signals. This avoids code which
sometimes tests dozens of activity flags just to subsequently check one
signal and dump it if it's value changed. We can just check the signal
state straight instead, and not bother with the flags. This removes
about 30% of activity flags in SweRV EH1, and makes both single threaded
VCD and FST tracing 8-9% faster.
The main goal of this patch is to enable splitting the full and
incremental tracing functions into multiple functions, which can then be
run in parallel at a later stage. It also simplifies further
experimentation as all of the interesting trace code construction now
happens in V3Trace. No functional change is intended by this patch, but
there are some implementation changes in the generated code.
Highlights:
- Pass symbol table directly to trace callbacks for simplicity.
- A new traceRegister function is generated which adds each trace
function as an individual callback, which means we can have multiple
callbacks for each trace function type.
- A new traceCleanup function is generated which clears the activity
flags, as the trace callbacks might be implemented as multiple functions.
- Re-worked sub-function handling so there is no separate sub-function
for each trace activity class. Sub-functions are generate when required
by splitting.
- traceFull/traceChg are now created in V3Trace rather than V3TraceDecl,
this requires carrying the trace value tree in TraceDecl until it
reaches V3Trace where the TraceInc nodes are created (previously a
TraceInc was also created in V3TraceDecl which carries the value).
- Allow arbitrary number of open array dimensions, not just 3. Note
right now this only works with the array querying functions specified
in IEEE 1800-2017 H.12.2
- Issue error when passing dynamic array or queue as DPI open array
(currently unsupported)
- Also tweaked AstVar::vlArgTypeRecurse, which should now error or fail
for unsupported types.
We used to iterate the m_callMmap std::multimap by getting limit
iterators from equal_range, but we also modify the same map in the loop
which invalidates those limit iterators. Note this only caused actual
problems if the new AstCCall inserted via 'addCall' in the loop had a
memory address (which is used as the key) which fell into the range
returned by equal_range, so was pretty hard to trigger.
- Issue an error when --build is used together with --make
- When given --build, always use GNU Make to perform the build
- Update documentation (examples were good as they were)
- Remove the broken t_flag_build_cmake test
Fixes#2280
Use SIMD intrinsics to render VCD traces.
I have measured 10-40% single threaded performance increase with VCD
tracing on SweRV EH1 and lowRISC Ibex using SSE2 intrinsics to render
the trace. Also helps a tiny bit with FST, but now almost all of the FST
overhead is in the FST library.
I have reworked the tracing routines to use more precisely sized
arguments. The nice thing about this is that the performance without the
intrinsics is pretty much the same as it was before, as we do at most 2x
as much work as necessary, but in exchange there are no data dependent
branches at all.
Convert trace buffer to 32-bit entries, rather than a union containing a
pointer type. Also tweaked trace entry layouts for a bit more
performance. This gains another 10% on SweRV EH1 CoreMark.
- Change templated trace routines to branch table.
Removed templating from trace chgBus and fullBus and replaced them with
a branch table like the other there is a very small (< 1%) penalty for
this on SwerRV EH1 CoreMark, but this is less than the variability of
disk IO so it's worth it to keep the code simpler and smaller.
- Prefetch VCD suffix buffer at the top of emit*
- Increase ILP in VCD emit* routines
- Use a 64-bit unaligned store to emit the VCD suffix (on x86 only)
The performance difference with these is very small, but the changes
hopefully make this code more performance-portable across various
micro-architectures.
We used to include a .cpp file on the link line for the shared library,
which was ignored, but generated a .d file for the .so which contained
the header files required by the .cpp file. This then caused a rebuild
where we included the .d in verilated.mk to included in the .h headers
among the prerequisites of the .so, yielding a clang error about treating
.h files as c++-header rather than c-header... Long story short, we don't
do that anymore. This used t cause t_a4_examples to fail on occasion.
Note there is no need for a separate compilation rule for the
<--protect-lib>.cpp, as it will jsut pick up the standard OPT_FAST rule.
The --trace-threads option can now be used to perform tracing on a
thread separate from the main thread when using VCD tracing (with
--trace-threads 1). For FST tracing --trace-threads can be 1 or 2, and
--trace-fst --trace-threads 1 is the same a what --trace-fst-threads
used to be (which is now deprecated).
Performance numbers on SweRV EH1 CoreMark, clang 6.0.0, Intel i7-3770 @
3.40GHz, IO to ramdisk, with numactl set to schedule threads on different
physical cores. Relative speedup:
--trace -> --trace --trace-threads 1 +22%
--trace-fst -> --trace-fst --trace-threads 1 +38% (as --trace-fst-thread)
--trace-fst -> --trace-fst --trace-threads 2 +93%
Speed relative to --trace with no threaded tracing:
--trace 1.00 x
--trace --trace-threads 1 0.82 x
--trace-fst 1.79 x
--trace-fst --trace-threads 1 1.23 x
--trace-fst --trace-threads 2 0.87 x
This means FST tracing with 2 extra threads is now faster than single
threaded VCD tracing, and is on par with threaded VCD tracing. You do
pay for it in total compute though as --trace-fst --trace-threads 2 uses
about 240% CPU vs 150% for --trace-fst --trace-threads 1, and 155% for
--trace --trace threads 1. Still for interactive use it should be
helpful with large designs.
Includes `timescale, $printtimescale, $timeformat.
VL_TIME_MULTIPLIER, VL_TIME_PRECISION, VL_TIME_UNIT have been removed
and the time precision must now match the SystemC time precision.
To get closer behavior to older versions, use e.g. --timescale-override
"1ps/1ps".
* Improve tracing performance.
Various tactics used to improve performance of both VCD and FST tracing:
- Both: Change tracing functions to templates to take variable widths as
template parameters. For VCD, subsequently specialize these to the
values used by Verilator. This avoids redundant instructions and hard
to predict branches.
- Both: Check for value changes via direct pointer access into the
previous signal value buffer. This eliminates a lot of simple pointer
arithmetic instructions form the tracing code.
- Both: Verilator provides clean input, no need to mask out used bits.
- VCD: pre-compute identifier codes and use memory copy instead of
re-computing them every time a code is emitted. This saves a lot of
instructions and hard to predict branches. The added D-cache misses
are cheaper than the removed branches/instructions.
- VCD: re-write the routines emitting the changes to be more efficient.
- FST: Use previous signal value buffer the same way as the VCD tracing
code, and only call the FST API when a change is detected.
Performance as measured on SweRV EH1, with the pre-canned CoreMark
benchmark running from DCCM/ICCM, clang 6.0.0, Intel i7-3770 @ 3.40GHz,
and IO to ramdisk:
+--------------+---------------+----------------------+
| VCD | FST | FST separate thread |
| (--trace) | (--trace-fst) | (--trace-fst-thread) |
------------+-----------------------------------------------------+
Before | 30.2 s | 121.1 s | 69.8 s |
============+==============+===============+======================+
After | 24.7 s | 45.7 s | 32.4 s |
------------+--------------+---------------+----------------------+
Speedup | 22 % | 256 % | 215 % |
------------+--------------+---------------+----------------------+
Rel. to VCD | 1 x | 1.85 x | 1.31 x |
------------+--------------+---------------+----------------------+
In addition, FST trace size for the above reduced by 48%.
This looks like a bits/bytes bug. The affected m_codeInc member
determines how many 32-bit words to allocate in a buffer used to store
previous values of the signal, but this was off by a factor of 8, so
we used to use too much memory.
SweRV VCD tracing speed +6.5% (excluding IO, clang 6.0), due mainly to
reduced D cache misses.
* v3Os: include <windows.h> instead of <winnt.h>
The windows.h header file should be included prior to any other headers,
in order to ensure all definitions are available. By only including
some headers, such as winnt.h, many "undefined symbol" messages are
generated.
Include "windows.h" to fix the build on msys2 under mingw64.
Signed-off-by: Sean Cross <sean@xobs.io>
* configure: check for bcrypt and psapi on windows
These two libraries must be linked in order to have access to
BCryptGenRandom and GetProcessMemoryInfo respectively.
Signed-off-by: Sean Cross <sean@xobs.io>
* Add +verilator+noassert flag
This allows to disable the assert check per simulation argument.
* Add AssertOn check for assert
Insert the check AssertOn to allow disabling of asserts.
Asserts can be disabled by not using the `--assert` flag or by calling
`AssertOn(false)`, or passing the "+verilator+noassert" runtime flag.
Add tests for this behavior.
Bad tests check that the assert still causes a stop.
Non bad tests check that asserts are properly disabled and cause no stop
of the simulation.
Fixes#2162.
Signed-off-by: Tobias Wölfel <tobias.woelfel@mailbox.org>
* Correct file location
Signed-off-by: Tobias Wölfel <tobias.woelfel@mailbox.org>
* Add description for single test execution
Without this description it is not obvious how to run a single test from
the regression test suite.
Signed-off-by: Tobias Wölfel <tobias.woelfel@mailbox.org>
The build is now by default configured to link performance critical
libraries (libgcc, libstdc++, libtcmalloc) statically. This improves
Verilation speed by between 4.5-7% based on my measurements as it
eliminates approx 20% of the mispredicted branches from the execution.
With partial static linking, the size of the .text section in
verilator_bin is increased by about 14%, and the binary is itself only
about 800KB bigger on disk, so hopefully this is not a big issue in
exchange for the faster compilation speed. A configure option
"--disable-partial-static" is provided to restore the old behaviour of
linking everything dynamically.
Note: This patch also changes to use libtcmalloc_minimal, which is all
we really need and itself has fewer dependencies.
Replace the virtual type() method on AstNode with a non-virtual, inlined
accessor to a const member variable m_type. This means that in order to be
able to use this for type testing, it needs to be initialized based on the
final type of the node. This is achieved by passing the relevant AstType
value back through the constructor call chain. Most of the boilerplate
involved is auto generated by first feeding V3AstNodes.h through astgen to
get V3AstNodes__gen.h, which is then included in V3Ast.h. No client code
needs to be aware and there is no functional change intended.
Eliminating the virtual function call to fetch the node type identifier
results in measured compilation speed improvement of 5-10% as it
eliminates up to 20% of all mispredicted branches from the execution.