The orthogonality of the modality-shared and modality-specific latent spaces may be too strict an assumption to satisfy in real-world scenarios. Figure 1 presents an example of the physiological indicators of diabetic patients, in which brain-related and heart-related signals are observed as time series data. Specifically, Figure 1(a) illustrates the real data generation process. The causal directions from insulin concentration to blood pressure and heart rate demonstrate how diabetes leads to complications such as heart disease and hypertension. As shown in Figure 1(b), existing methods apply orthogonality constraints to the estimated latent variables despite the dependence among the true latent sources, which results in variable entanglement and in turn leads to suboptimal performance on downstream tasks. To address the challenge of dependent latent sources, we propose a multi-modal temporal disentanglement framework that estimates the ground-truth latent variables with identifiability guarantees.
Figure 1. Illustration of physiological indicators of diabetic patients, where brain-related and heart-related signals are the observations. (a) In the true generation process, observations are generated from dependent latent sources. (b) In the estimation process, enforcing orthogonality on the estimated sources can result in entanglement of the latent sources and meaningless noise.
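The orthogonality constraint that existing methods impose can be sketched as a penalty on the cross-covariance between the shared and specific latent blocks. The snippet below is a minimal illustration of this generic idea, not the implementation used by any particular baseline; the function name and shapes are hypothetical:

```python
import torch

def orthogonality_penalty(z_shared, z_specific):
    """Penalize statistical dependence between modality-shared and
    modality-specific latents (both of shape [batch, dim]).

    Existing methods drive this penalty toward zero, which is
    problematic when the true latent sources are dependent.
    """
    # Center each latent block over the batch.
    zs = z_shared - z_shared.mean(dim=0, keepdim=True)
    zp = z_specific - z_specific.mean(dim=0, keepdim=True)
    # Cross-covariance between the two blocks.
    cross = zs.T @ zp / zs.shape[0]
    # Squared Frobenius norm of the cross-covariance as the penalty.
    return (cross ** 2).sum()
```

When the true sources are dependent, as in Figure 1(a), minimizing such a penalty forces the estimated latents away from the true generating factors, which is exactly the entanglement shown in Figure 1(b).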
To show how to learn disentangled representations for multi-modal time series data, we introduce the data generation process shown in Figure 2.
Figure 2. Data generation process of time series data with two modalities. The grey and white nodes denote the observed and latent variables, respectively.
- Firstly, we obtain the modality-shared and modality-specific latent variables through the modality extractor. During this process, several constraints are employed to ensure that the extracted latent variables carry rich semantic information.
- Specifically, we first impose prior constraints on the modality latent variables to guarantee that the extracted variables are semantically meaningful. Secondly, we apply modality-sharing constraints to ensure that the modality-shared latent variables extracted from each modality are consistent.
- Finally, the obtained modality latent variables are used for various downstream tasks.
- An overview of our model is shown in Figure 3.
Figure 3. Illustration of the proposed MATE model. We consider two modalities for ease of understanding; the framework extends readily to more modalities. Modality-specific encoders extract the latent variables of each modality. The specific prior networks and the shared prior network estimate the prior distributions used in the KL divergence terms.
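The training objective implied by the steps above combines reconstruction, KL divergence against the learned priors, and the modality-sharing constraint. The sketch below is a hypothetical illustration of this kind of objective; all function and argument names are ours for exposition and do not reflect the repo's actual API:

```python
import torch
import torch.nn.functional as F

def mate_style_loss(x_rec, x, q_mu, q_logvar, p_mu, p_logvar,
                    z_shared_a, z_shared_b, beta=1.0, gamma=1.0):
    """Sketch of a MATE-style objective (names are illustrative):
    reconstruction + KL(posterior || learned prior) + a consistency
    term aligning the shared latents extracted from each modality.
    """
    # Reconstruction of the observed time series.
    rec = F.mse_loss(x_rec, x)
    # Closed-form KL between two diagonal Gaussians: the encoder
    # posterior q and the prior p estimated by the prior networks.
    kl = 0.5 * (p_logvar - q_logvar
                + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                - 1.0).sum(dim=-1).mean()
    # Modality-sharing constraint: shared latents extracted from
    # modality A and modality B should agree.
    share = F.mse_loss(z_shared_a, z_shared_b)
    return rec + beta * kl + gamma * share
```

Note that nothing here forces the shared and specific latents to be orthogonal; the priors, rather than an orthogonality penalty, shape the latent space.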
- Python 3.8
- torch==2.0.1
- scikit-learn==1.2.2
Dependencies can be installed using the following command:
```shell
pip install -r requirements.txt
```
Please download the dataset from the provided links in the Dataset section.
- Motion:
- WiFi: https://github.com/ermongroup/Wifi_Activity_Recognition
- KETI: https://github.com/Shuheng-Li/Relational-Inference/tree/master/KETI_oneweek
- HumanEva: http://humaneva.is.tue.mpg.de/
- h36m: http://vision.imar.ro/human3.6m/description.php
- D1NAMO: https://github.com/PSI-TAMU/D1NAMO
- UCIHAR: https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones
- PAMAP2: https://archive.ics.uci.edu/dataset/231/pamap2+physical+activity+monitoring
- HAC and EPIC-Kitchens: https://huggingface.co/datasets/hdong51/Human-Animal-Cartoon/tree/main
```shell
python train.py -dataset=[DATASET]
```
The main results are shown in Table 1.
Table 1. Time series classification results on the Motion, D1NAMO, WiFi, and KETI datasets.