Simulator Design Considerations
===============================

Design History
--------------

Work on the project started in September 2011. The design was
originally inspired by the experience with fREEDA [freeda]_ and
carrot [carrot]_, plus some ideas from other simulators and
improvements that take advantage of the flexibility offered by
Python.

Formulation-Independent Circuit Representation
----------------------------------------------

The internal circuit representation and the device library attempt to
be independent of the formulation used for simulation. The
``Circuit`` class has no methods to obtain nodal voltages or to
calculate the Jacobian of the circuit. This is delegated to other
classes that handle particular circuit analysis approaches such as
nodal, port-based, linear/nonlinear separation, *etc.*

Element Interfaces
++++++++++++++++++

All elements but frequency-defined ones are modeled using the
current-source approach [aplac2]_. The interface essentially
describes a subcircuit composed of independent and voltage-controlled
current/charge sources. If the nodal approach is used, this has the
advantage that an analysis can be implemented by considering only a
few fundamental components. For example, there is no need for MNAM
stamps because the NAM can handle all elements. This approach is also
very flexible: both fREEDA-style state-variable and SPICE-style
nonlinear models have been implemented.

Frequency-defined elements are modeled by their Y matrix (or an
equivalent; see the S-parameter-based transmission line model). The
interface returns a 3-D matrix with the parameters at all
frequencies. It is possible to conceive a more compact hybrid
representation for some devices that combines the current-source
approach with smaller Y matrices, but this may unnecessarily
complicate the implementation of simpler devices. This interface is
being reviewed and may change in the future.

Internal Terminal Handling
++++++++++++++++++++++++++

Internal terminals are not tracked directly by the circuit. One
advantage of this is that a device can process parameters
independently of the containing circuit (a reference to the circuit
is no longer needed in ``process_params()``). Another advantage is
that the terminal name does not need to be unique. The ``Circuit``
class now has a function to retrieve all internal terminals, which
(as explained above) are not present in the ``termDict``
dictionary. By default there is no local reference terminal. A local
reference terminal can be created by calling
``self.add_reference_term()``.

Internal Terminal Indexing
^^^^^^^^^^^^^^^^^^^^^^^^^^

Internal terminals could be indexed by their position in the terminal
list, but for devices derived from a base device (such as autothermal
devices) the internal terminal indexes would change whenever the
number of external terminals changes. This problem can be avoided by
indexing internal terminals separately and referring to them as
``(self.numTerms + number)`` instead of using fixed numbers. A more
convenient solution, currently implemented, is to have the
``add_internal_terminal()`` and ``add_reference_term()`` functions
return the internal terminal index.
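For illustration, the following is a minimal sketch of how a device
model might use the returned indexes in its ``process_params()``
method. Only the two function names above come from the interface
described here; the argument lists and the ``_intPorts`` attribute
are hypothetical::

    def process_params(self):
        # Local reference terminal for the internal sources
        tref = self.add_reference_term()
        # Extra internal node: the returned index stays valid even if
        # a derived (e.g., autothermal) version of the device changes
        # the number of external terminals
        ti = self.add_internal_terminal()
        # Use the returned indexes (instead of hard-coded numbers)
        # when describing the internal source connections
        self._intPorts = [(ti, tref)]
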
This solution, however, would not work for a 'hardwired' subcircuit.
Hardwired subcircuits would also present other problems with
parameter lists. As hardwired subcircuits can be replaced with
regular subcircuits, no support for them is planned.

Just for the record, a possible solution is presented here: have a
function in the ``Element`` class that returns terminal indexes.
External terminal indexes would also change in this case::

    class Element:
        ...
        def term_idx(self, n):
            # Index of the n-th external terminal
            return self.baseIndex + n

        def int_term_idx(self, n):
            # Index of the n-th internal terminal
            return self.baseIndex + self.numTerms + n

``baseIndex`` is set to the starting terminal number for a particular
element.

Separation of current and charge return vectors in eval_cqs()
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The return value of ``eval_cqs()`` for device models is normally a
tuple of the form ``(iVec, qVec)``, where ``iVec`` is a vector of
currents and ``qVec`` a vector of charges. Either vector may be empty
when there are no currents or charges to calculate. For this reason
this approach sometimes introduces some overhead compared to having
both currents and charges combined in a single vector, but the
current output format is easier to handle at a higher level and is
thus preferred. Still, the ``eval()`` and ``eval_and_deriv()``
functions lump currents and charges in a single vector because this
is a lot more efficient when implemented using AD tapes (at least
with the current interface). However, the analysis code could bypass
those functions completely and generate custom AD tapes for greater
efficiency.

Operating Point Information
+++++++++++++++++++++++++++

When the ``getOP`` flag is set in ``eval_cqs()``, additional
operating point variables may be calculated. Originally these
variables were returned in an array so they could be taped by the
automatic differentiation library. However, keeping track of many
different variables in a vector is prone to mistakes, and there is no
clear advantage in making operating point variables (other than
output currents/charges) available for differentiation. For this
reason operating point variables are now returned in a dictionary so
they can be accessed by name. Only the dictionary is returned when
``getOP`` is True. This simplifies the implementation, as the
currents and charges can be obtained using the (faster) AD tapes.

Why is ``get_OP()`` a separate function? Answer: the Jacobian is
often needed to calculate operating point variables
(transconductance, capacitance), and we can not evaluate the Jacobian
(using AD) from within ``eval_cqs()``. The current interface requires
the ``get_OP()`` function to work correctly for both the regular and
electrothermal versions of a given model. This is necessary if AD
tapes are to be used: for electrothermal models the AD tape
input/output variables include temperature/power.

Model, netlist variables and sensitivities considerations
----------------------------------------------------------

Currently device parameters can be initially set by three different
means, in order of priority:

1. Explicitly specified in the device line
2. Specified in a global model
3. The default value

The explicitly-specified parameters are kept in a separate dictionary
in each element. Global netlist variables can be used to set the
values of parameters specified in the device line or the model line:

1. If a string is specified as the value of a numeric parameter, it
   is marked as a potential variable.

2. Variables are specified in a ``.vars`` statement in the netlist
   and are assumed to be numeric values or vectors.

3. When the circuit/analysis is initialized, elements, models and
   analyses check the global netlist variable dictionary to find and
   set the variable value, as sketched below. If the variable is not
   found, an exception is raised.

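A minimal sketch of the look-up performed in step 3 is shown below.
All names used here (``resolve_variables``, ``paramDict``,
``netVars``, *etc.*) are hypothetical and do not correspond to the
actual simulator API::

    def resolve_variables(element, netVars):
        # netVars: global dictionary built from the .vars statements
        for name, value in element.paramDict.items():
            if isinstance(value, str):
                # Parameter was marked as a potential variable (step 1)
                try:
                    element.paramDict[name] = netVars[value]
                except KeyError:
                    raise ValueError(
                        'Undefined netlist variable "{0}" in element {1}'.format(
                            value, element.name))
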
One problem with this is that by that time the netlist line number is
lost and the diagnostic message is not as good. Another difficulty is
how to update dependent parameters when a variable value is
changed. This would require repeating the whole process for all
models/elements, as there is no way to know which ones are
affected. A change in variable/model/element parameters is likely to
happen in sweeps, sensitivity and optimization calculations. From the
above considerations, the current solution requires re-visiting all
elements and re-generating all equations. One workaround is to keep,
in the analysis code, a list of the elements to be updated when
needed.

Efficiency notes for nodal analysis
-----------------------------------

For efficiency, indexing individual array elements from Python should
be avoided as much as possible. Advanced numpy indexing to avoid
loops over each element of the matrices was tried, but unfortunately
``G[obj] += Jac`` does not work when some of the elements selected by
``obj`` are repeated (see the tentative numpy tutorial at
http://www.scipy.org/NumPy_Tutorial); a short demonstration is
included at the end of this section, before the profiler
results. Possible approaches to overcome this are the following:

1. Use a few optimized functions for the inner loops (perhaps using
   cython) or implement fancy indexing with sparse matrices.

2. Create a giant AD tape for the whole circuit. The nodalAD module
   implements this. This approach is simpler, but in practice test
   results (see the nodalAD module) show that it does not improve
   efficiency. Also, tape generation seems to require a dense matrix
   multiplication (G * x), which rules out this approach for large
   circuits.

Currently the first approach is used for dense matrices and is
described here: nonlinear (and frequency-defined) elements are given
new attribute vectors, ``nD_?pos`` and ``nD_?neg``, that contain the
non-reference terminal RC numbers where the internal current sources
are connected. This requires some pre-processing, but in this way we
can avoid ``if`` statements in inner-loop functions. For regular
linear transconductances this does not seem necessary, as we only
have to fill the matrix once.

For the sparse-matrix nodal implementation, further pre-processing is
needed to create the index vectors used to fill the main matrix
directly. The main matrix is built in triplet format (scipy
``coo_matrix`` format). Values from nonlinear elements are inserted
directly into the ``coo_matrix`` ``data`` array::

    # Positive Jacobian contributions for this element
    M.data[mpidx + self._mbase] = Jac.flat[jacpidx]
    self._mbase += len(mpidx)
    # Sign-reversed contributions follow in the triplet data array
    M.data[mnidx + self._mbase] = -Jac.flat[jacnidx]
    self._mbase += len(mnidx)

It seems that this method is the fastest that can be achieved without
resorting to compilation. According to my tests, fancy indexing is a
little faster (very little) than using ``numpy.put()`` and
``numpy.take()`` to fill the matrix. This does not agree with the
comments in http://wesmckinney.com/blog/?p=215 . Perhaps this is
different for other architectures or software versions? Further
optimization would probably require access to the low-level interface
of the C libraries plus compilation of some of the functions used in
inner loops.
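As a quick illustration of the fancy-indexing limitation mentioned at
the beginning of this section, the following self-contained snippet
(not simulator code) shows that repeated indices do not accumulate::

    import numpy as np

    G = np.zeros(3)
    obj = np.array([0, 1, 1])          # index 1 is repeated
    G[obj] += np.array([1., 1., 1.])
    # G is now [1., 1., 0.]: the second contribution to G[1] was lost
    # instead of being accumulated (the desired result is [1., 2., 0.])

More recent numpy releases provide ``numpy.add.at()``, which does
accumulate repeated indices, but it is usually reported to be
significantly slower than pre-processed filling such as the triplet
scheme shown above.
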
Profiler results for soliton.net
++++++++++++++++++++++++++++++++

A profiled transient analysis of the soliton line with a matrix size
of 3022 was performed using scipy matrices, optimized matrix filling
and SuperLU for matrix decomposition. The number of time steps is
501. The results shown in the table were obtained on a netbook with
an Intel Atom processor:

====== ======= ======= ======= ======= ===========================
ncalls tottime percall cumtime percall filename:lineno(function)
====== ======= ======= ======= ======= ===========================
     1   0.000   0.000  59.380  59.380 cardoon:27(run_analyses)
     1   0.152   0.152  59.380  59.380 tran.py:73(run)
   501   0.040   0.000  44.099   0.088 fsolve.py:23(solve)
   501   0.016   0.000  44.059   0.088 nodal.py:420(solve_simple)
   501   0.260   0.001  44.043   0.088 fsolve.py:59(fsolve_Newton)
   505   0.068   0.000  40.807   0.081 nodal.py:423(get_deltax)
   501   7.340   0.015  25.678   0.051 nodalSP.py:731(get_i_Jac)
   505   0.132   0.000  14.793   0.029 nodalSP.py:204(_get_deltax)
 47282   7.824   0.000   9.421   0.000 nodalSP.py:165(_set_Jac)
   505   0.036   0.000   9.129   0.018 linsolve.py:131(factorized)
   505   0.096   0.000   9.093   0.018 linsolve.py:103(splu)
   505   8.597   0.017   8.597   0.017 :0(dgstrf)
   501   2.756   0.006   7.720   0.015 nodalSP.py:651(update_q)
  1004   4.904   0.005   4.904   0.005 :0(solve)
 23782   1.996   0.000   4.236   0.000 cppaddev.py:143(eval_and_deriv)
230703   3.812   0.000   3.812   0.000 :0(len)
   505   0.136   0.000   3.084   0.006 coo.py:238(tocsc)
  1020   2.680   0.003   2.680   0.003 :0(max)
   499   0.036   0.000   2.492   0.005 nodalSP.py:223(get_chord_deltax)
  8013   0.408   0.000   2.476   0.000 nodalSP.py:43(set_quad)
 70829   2.388   0.000   2.388   0.000 nodal.py:52(set_i)
 23782   1.604   0.000   2.100   0.000 adfun.py:208(jacobian)
 18224   1.252   0.000   2.076   0.000 nodalSP.py:34(triplet_append)
====== ======= ======= ======= ======= ===========================

There seem to be no obvious major bottlenecks. Most of the simulation
time is spent building the matrix and evaluating nonlinear devices
(``get_i_Jac``), followed by the linear system solution
(``_get_deltax``). Note that nonlinear device evaluation
(``eval_and_deriv``) takes a small percentage of the total simulation
time, in part because there are few nonlinear devices in this
circuit.

Profiler results for sum741_profile.net
+++++++++++++++++++++++++++++++++++++++

A profiled transient analysis of this circuit with a matrix size of
189 was performed using scipy matrices, optimized matrix filling and
SuperLU for matrix decomposition. The number of time steps is 101.
The results shown in the table were obtained on a netbook with an
Intel Atom processor:

====== ======= ======= ======= ======= ===========================
ncalls tottime percall cumtime percall filename:lineno(function)
====== ======= ======= ======= ======= ===========================
     1   0.000   0.000  13.149  13.149 cardoon:27(run_analyses)
     1   0.020   0.020  13.149  13.149 tran.py:73(run)
   101   0.012   0.000  11.137   0.110 fsolve.py:23(solve)
   101   0.012   0.000  11.125   0.110 nodal.py:420(solve_simple)
   101   0.084   0.001  11.113   0.110 fsolve.py:59(fsolve_Newton)
   257   0.064   0.000  10.849   0.042 nodal.py:423(get_deltax)
   203   1.844   0.009   6.748   0.033 nodalSP.py:731(get_i_Jac)
   257   0.032   0.000   2.716   0.011 nodalSP.py:204(_get_deltax)
 11960   2.076   0.000   2.424   0.000 nodalSP.py:165(_set_Jac)
  6708   0.604   0.000   1.660   0.000 cppaddev.py:143(eval_and_deriv)
    54   0.192   0.004   1.320   0.024 nodalSP.py:438(get_i_Jac)
   257   0.088   0.000   1.320   0.005 coo.py:238(tocsc)
   257   0.008   0.000   1.284   0.005 linsolve.py:131(factorized)
   257   0.040   0.000   1.276   0.005 linsolve.py:103(splu)
   257   1.008   0.004   1.008   0.004 :0(dgstrf)
   101   0.284   0.003   0.992   0.010 nodalSP.py:651(update_q)
  6708   0.764   0.000   0.888   0.000 adfun.py:208(jacobian)
 54474   0.800   0.000   0.800   0.000 :0(len)
 14586   0.768   0.000   0.768   0.000 nodal.py:52(set_i)
   358   0.088   0.000   0.740   0.002 base.py:270(__mul__)
   260   0.112   0.000   0.528   0.002 compressed.py:20(__init__)
   358   0.064   0.000   0.384   0.001 compressed.py:259(_mul_vector)
  9334   0.376   0.000   0.376   0.000 nodal.py:41(set_xin)
====== ======= ======= ======= ======= ===========================

Similar observations can be made for this case, except that in this
circuit most of the devices are nonlinear.

Profiler results using pysparse library (old implementation)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Note: the code used for this profile is not the default currently
used in the program.

A profiled transient analysis of the soliton line with a matrix size
of 3022 (using pysparse) seems to indicate that about half of the
time is spent building the matrix and the other half factoring it. At
this time the main Jacobian is (almost) created and factored from
scratch at every iteration. The results shown in the table were
obtained on a netbook with an Intel Atom processor. Note that the
time spent evaluating the nonlinear models (``eval_and_deriv``) is
only about 25% of the time needed to build the matrix.

====== ======= ======= ======= ======= ===========================
ncalls tottime percall cumtime percall filename:lineno(function)
====== ======= ======= ======= ======= ===========================
     1   0.000   0.000  23.301  23.301 :1()
     1   0.000   0.000  23.301  23.301 profile:0(run_analyses(analysisQueue))
     1   0.000   0.000  23.301  23.301 cardoon:26(run_analyses)
     1   0.036   0.036  23.301  23.301 tran.py:66(run)
   100   0.000   0.000  19.181   0.192 fsolve.py:23(solve)
   100   0.012   0.000  19.181   0.192 nodalSP.py:257(solve_simple)
   100   0.132   0.001  19.169   0.192 fsolve.py:59(fsolve_Newton)
   257   0.056   0.000  18.197   0.071 nodalSP.py:260(get_deltax)
   253   3.516   0.014   8.993   0.036 nodalSP.py:729(get_i_Jac)
   257   0.016   0.000   8.921   0.035 nodalSP.py:246(_get_deltax)
   257   8.453   0.033   8.453   0.033 :0(factorize)
 12126   0.924   0.000   2.096   0.000 cppaddev.py:143(eval_and_deriv)
====== ======= ======= ======= ======= ===========================

Of course things may be different for other circuits, but for this
case the simulator could be expected to speed up by at least a factor
of two if matrix creation and factorization were optimized.

Currently voltages are stored in terminals only after the final
solution is found.
The main reason for this is efficiency: it is less work to operate
directly on the vector of unknowns in the equation-solving routine.

Temperature handling in electrothermal simulations
--------------------------------------------------

The nodal voltage at the thermal port of electrothermal models
represents the difference between the device temperature and the
ambient temperature. In this way a zero difference is usually a good
initial guess for the nonlinear solver and the numerical solution is
more robust.

Some issues that should be addressed in the future include:

* Automatically convert the temperature difference into the actual
  temperature for plotting and saving (the same scheme should also
  work, for example, for currents in voltage sources).

* Propagate units across terminals in thermal circuits using an
  algorithm similar to fREEDA's local reference group checking.

.. include:: ../TODO