Mathematical background
Dynamical Systems Reconstruction (DSR)
DSR refers to a broad class of methods aimed at recovering the underlying dynamics of a system from observational data, often without direct access to the governing equations. The central idea is to approximate the system’s evolution based on trajectories observed in a typically high-dimensional state space, while preserving important invariant short- and long-term properties of the system. This task is motivated by the fact that many real-world systems, from climate and neuroscience to fluid dynamics and finance, exhibit complex, nonlinear behavior that defies analytical modeling. In such settings, data-driven approaches offer a powerful alternative, seeking to learn the dynamics directly from measurements.
Modern DSR techniques leverage advances in machine learning and numerical modeling to recover not only the system’s short-term predictive structure but also its long-term statistical and topological properties. The field spans a spectrum of approaches, from interpretable models based on sparse regression or symbolic dynamics to highly expressive deep learning architectures capable of approximating arbitrary nonlinear systems.
DSR Models
For DSR, we employ multiple classes of generative models that serve as universal function approximators, capable of learning complex, nonlinear dynamics directly from time series data. These models evolve an $M$-dimensional latent state $z_t \in \mathbb{R}^M$, which is forward-propagated at each model time step to represent the system's internal dynamics. Observations are generated through a corresponding observation model that maps the latent state to the $N$-dimensional observation $x_t \in \mathbb{R}^N$. Additionally, these models can process an external input $s_t \in \mathbb{R}^K$.
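In generic state-space form (the notation here is only a sketch of the setup described above, with illustrative symbol names), the latent dynamics and the observation model read

$$
z_t = F_\theta(z_{t-1}, s_t), \qquad \hat{x}_t = G_\lambda(z_t),
$$

where $F_\theta$ is one of the latent DSR models listed below and $G_\lambda$ is the observation model.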
We employ different latent DSR models:
Piecewise linear recurrent neural network (PLRNN)
$$
z_t = A z_{t-1} + W \phi(z_{t-1}) + h + C s_t,
$$

where $A \in \mathbb{R}^{M \times M}$ (diagonal), $W \in \mathbb{R}^{M \times M}$ (off-diagonal), $h \in \mathbb{R}^M$, and $C \in \mathbb{R}^{M \times K}$ are weights. The nonlinearity $\phi(z) = \max(0, z)$ is given through the element-wise piecewise linear ReLU function.
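For concreteness, here is a minimal NumPy sketch of a single PLRNN latent step; the function and variable names are illustrative and not the repository's API:

```python
import numpy as np

def plrnn_step(z, s, A_diag, W, h, C):
    """One PLRNN latent step: z_t = A z_{t-1} + W relu(z_{t-1}) + h + C s_t.

    A_diag : (M,)   diagonal of the linear self-connection matrix A
    W      : (M, M) off-diagonal connectivity (diagonal assumed zero)
    h      : (M,)   bias
    C      : (M, K) external-input weights
    """
    relu = np.maximum(z, 0.0)              # piecewise linear ReLU nonlinearity
    return A_diag * z + W @ relu + h + C @ s
```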
Shallow PLRNN (shPLRNN)
where , , , with .
Almost-linear recurrent neural network (ALRNN)
$$
z_t = A z_{t-1} + W \phi^P(z_{t-1}) + h,
$$

with the almost-linear activation function

$$
\phi^P(z) = \left( z_1, \ldots, z_{M-P}, \max(0, z_{M-P+1}), \ldots, \max(0, z_M) \right)^\top,
$$

where $P$ denotes the number of nonlinear (ReLU) units and $A$, $W$, $h$ are again defined as for the PLRNN. For $P = M$, the ALRNN is equal to the PLRNN.
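A sketch of the almost-linear activation, with the ReLU applied only to the last $P$ units as in the formula above (illustrative code, not the repository's implementation):

```python
import numpy as np

def almost_linear_relu(z, P):
    """Apply ReLU to the last P units of z; keep the first M - P units linear."""
    out = z.copy()
    if P > 0:
        out[-P:] = np.maximum(out[-P:], 0.0)  # nonlinear (ReLU) units
    return out

def alrnn_step(z, A_diag, W, h, P):
    """One ALRNN step: z_t = A z_{t-1} + W phi^P(z_{t-1}) + h (P = M recovers the PLRNN)."""
    return A_diag * z + W @ almost_linear_relu(z, P) + h
```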
Observation Models
For the common case of Gaussian observations, we generally employ a linear model that maps the $M$-dimensional latent states to the $N$-dimensional observations, i.e. $\hat{x}_t = B z_t$ with $B \in \mathbb{R}^{N \times M}$. The multimodal framework allows for more complex, modality-specific decoder models, which specialize the observation mapping to the data at hand.
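As a minimal sketch, the latent models above can be rolled out and mapped through the linear observation model as follows; `step_fn` is assumed to close over the model parameters and any external input, and all names are illustrative:

```python
import numpy as np

def generate_trajectory(step_fn, z0, B, T):
    """Roll out a latent DSR model for T steps and map each latent state to
    observation space via the linear observation model x_hat_t = B z_t."""
    z = z0
    xs = []
    for _ in range(T):
        z = step_fn(z)        # latent step of any of the models above
        xs.append(B @ z)      # linear Gaussian observation (mean)
    return np.stack(xs)
```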
Teacher Forcing
Teacher forcing is a training strategy for RNNs which facilitates learning chaotic dynamics. In this repository, two specific variants of teacher forcing are implemented: Sparse and Generalized Teacher Forcing (STF and GTF, respectively). Both are combined with vanilla backpropagation through time (BPTT) training of RNNs. The purpose of teacher forcing is to control the RNN dynamics during training in order to 1) mitigate the problem of chaos-induced exploding gradients and 2) smoothen the loss landscape and hence ease optimization. Teacher forcing achieves this by actively modulating the RNN state during evolution, either by mixing the RNN state with some ideal state (GTF) or by replacing the state altogether at specific intervals (STF). The target state is in the simplest case given by the data itself (i.e. our observation model is an identity mapping from a subset of RNN units) or obtained by inversion of the observation model. Figuratively speaking, the RNN trajectory is "pulled" towards the target trajectory in latent space at intervals of $\tau$ time steps for STF, and constantly but with smaller magnitude, controlled by a parameter $\alpha$, in case of GTF. Mathematically, this comes down to
$$
\tilde{z}_t = \alpha \, d_t + (1 - \alpha) \, z_t,
$$

where $d_t$ is the teacher signal and $\alpha \in [0, 1]$. STF corresponds to the special case in which $\alpha = 1$ at every $\tau$-th time step and $\alpha = 0$ otherwise.
STF and GTF are able to mitigate exploding gradients by actively modulating the norm of the model Jacobian during training. The choice of $\tau$ or $\alpha$ is highly dependent on the data, its chaoticity, and the specific model used. In theory, the optimal choice to mitigate exploding gradients can be determined from the maximum Lyapunov exponent of the observed system. However, since this quantity is generally not accessible when dealing with real-world data, we often fall back on training heuristics (e.g. annealing strategies) or simple grid search. While GTF provides more theoretical guarantees, the methods are roughly on par given the right choice of $\tau$ and $\alpha$. For more details, we refer the reader to Hess et al. (2023) for GTF, and to Mikhaeil et al. (2022) as well as Brenner et al. (2022) for STF.
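The following sketch illustrates how GTF-style state mixing (with STF as the $\alpha \in \{0, 1\}$ special case) could be interleaved with a latent rollout during training; all names are illustrative and not the repository's API:

```python
import numpy as np

def forced_rollout(step_fn, d, alpha):
    """Roll out a latent model while pulling each state towards the teacher signal d_t.

    step_fn : latent step function z_{t-1} -> z_t
    d       : (T, M) teacher signal (e.g. the data or inverted observations)
    alpha   : GTF mixing strength in [0, 1]; for STF, set alpha = 1 at every
              tau-th step and 0 otherwise.
    """
    z = d[0]                                   # initialize from the teacher signal
    zs = [z]
    for t in range(1, d.shape[0]):
        z = step_fn(z)                         # free-running latent step
        z = alpha * d[t] + (1 - alpha) * z     # pull towards the teacher trajectory
        zs.append(z)
    return np.stack(zs)
```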
Multimodal Teacher Forcing (MTF)
For more details we refer to Brenner, Hess et al. (2024).
In the previous sections we have made the assumption that the data are described by a Gaussian distribution (with mean $\hat{x}_t$ of the measurement). This is also the standard assumption for the RNN equations. In many cases, however, the measurements are not Gaussian, but rather count, ordinal, or even categorical data, as is the case for example with neurophysiological or survey data. In these cases the standard approach is to transform the data into a latent space where they can be described by a Gaussian distribution. This is a non-trivial task, so machine learning methods are used to learn the transformation. The MTF framework is a modified variational autoencoder (VAE), with an encoder transforming the data into the aforementioned latent space and modality-specific decoders that ensure the backward transformation outputs the correct distribution. The encoder and decoders are trained jointly with the DSR model, which acts only on the latent space and serves as the prior for the encoder. This prior enforces consistency between the encoder and the DSR model, which allows for end-to-end training. MTF is an abbreviation for multimodal teacher forcing, as the framework uses the encoded time series as the teacher forcing signal.
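A highly simplified, conceptual sketch of this idea follows; the encoder, decoder, and function names are assumptions for illustration only and do not reflect the actual implementation:

```python
import numpy as np

def mtf_training_loss(encoder, decoder_nlls, step_fn, batch, alpha):
    """Conceptual MTF step (illustration only):
    1) encode the multimodal observations into a latent teacher trajectory,
    2) roll out the DSR model with GTF-style teacher forcing in latent space,
    3) score the forced trajectory under modality-specific decoders."""
    d = encoder(batch)                          # (T, M) latent teacher signal
    z, zs = d[0], [d[0]]
    for t in range(1, d.shape[0]):
        z = alpha * d[t] + (1 - alpha) * step_fn(z)
        zs.append(z)
    zs = np.stack(zs)
    # Each decoder defines the likelihood of its modality (Gaussian, count,
    # ordinal, categorical, ...) given the latent states; their negative
    # log-likelihoods are summed into the joint training loss.
    return sum(nll(zs, x) for nll, x in zip(decoder_nlls, batch))
```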
Hierarchical Approach
For more details we refer to Brenner, Weber et al. (2025).
The hierarchical approach can be motivated with the following setup: Assume we have a set of measurements where it is clear that the measurements are not generated by the same dynamical system, but the systems share some commonalities. A common example would be the Lorenz-63 system with multiple different parameter settings, or a pendulum with different friction coefficients. In these cases the underlying differential equations are the same and only the parameters differ, but the approach can also be used for more complex commonalities between systems.
In the vanilla approach, one would train a separate model for each system. This has multiple disadvantages, such as the need for more data (per system) and the need for more computational resources.
In the hierarchical approach, one instead trains models that share parameters across systems, meaning that the parameters of the PLRNN are composed from a set of shared parameters and a set of system-specific parameters. Furthermore, we choose the dimensionality of the system-specific parameters to be substantially lower than that of the shared parameters, which forces all differences between the systems to be represented in a low-dimensional, and thus interpretable, space. This also allows for more efficient use of the data: the shared parameters are trained on measurements from all systems, while the system-specific parameters are trained only on the respective system, so less data per system is needed to train the model.
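One way such a parameter composition could look is sketched below, assuming a low-dimensional system-specific feature vector that modulates shared weight tensors; the specific composition scheme and all names are illustrative assumptions, not the method's actual parameterization:

```python
import numpy as np

def compose_plrnn_weights(shared, ell):
    """Compose system-specific PLRNN weights from shared parameters and a
    low-dimensional, system-specific feature vector ell (dim P much smaller
    than the number of PLRNN parameters).

    shared['W_base'] : (M, M)    shared connectivity
    shared['W_modes']: (P, M, M) shared basis tensors mixed by ell
    shared['h_base'] : (M,)      shared bias
    shared['h_modes']: (P, M)    shared bias directions mixed by ell
    """
    W = shared['W_base'] + np.tensordot(ell, shared['W_modes'], axes=1)
    h = shared['h_base'] + ell @ shared['h_modes']
    return W, h
```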