Longitudinal Self-Supervised Learning



Exploiting longitudinal measurements to learn better representations

Longitudinal Self-Supervised Learning is an exciting new avenue for learning on large datasets where repeated measures are taken. This is especially fitting for the medical domain (imaging, tabular health record data, etc.), where patients may have several data points over months or years but not all have exact labels (e.g. presence of a specific disease).

Photo by Aron Visuals on Unsplash


Self-supervised learning (SSL) has been one of the hottest areas of deep learning recently. In self-supervised learning, we train a model to perform a task on our data without using labels. This helps us learn a meaningful representation of the data which can then be used for a downstream task (like prediction on the downstream label of interest). In other words, we can do self-supervised learning on large unlabeled datasets, then “fine-tune” or continue training on smaller labeled datasets to extract the maximum performance from a combination of labeled and unlabeled data. This approach aligns with the massive availability of unlabeled data today (videos, free text, medical records, etc.).

The task we choose for SSL is important: it should force the model to learn a semantically meaningful representation. A simple example is the rotation task [3]. We rotate each image by one of a fixed set of angles (Gidaris et al. use 0°, 90°, 180°, and 270°) and train a model to predict which rotation was applied. If our dataset consists mostly of people or faces, the model has to learn meaningful features such as eyes and hair in order to predict the rotation correctly (e.g. it should learn 1) to identify eyes and hair and 2) that hair is above or beside the eyes).
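To make the rotation task concrete, here is a minimal NumPy sketch showing how the labels come for free. It follows Gidaris et al. [3] in using rotations by multiples of 90°; the function name `make_rotation_batch` and the random stand-in images are hypothetical, and a real pipeline would feed the batch to a CNN classifier with a 4-way output.

```python
import numpy as np


def make_rotation_batch(images, rng):
    """Rotation pretext task: rotate each square image by a random multiple of
    90 degrees and return the rotation index (0-3) as a free label."""
    labels = rng.integers(0, 4, size=len(images))  # 0, 1, 2, 3 -> 0, 90, 180, 270 degrees
    rotated = np.stack([np.rot90(img, k=int(k)) for img, k in zip(images, labels)])
    return rotated, labels


rng = np.random.default_rng(0)
images = rng.random((8, 32, 32))  # stand-in for 8 unlabeled grayscale images
x, y = make_rotation_batch(images, rng)
# A classifier trained to predict y from x never needs human labels.
print(x.shape)  # (8, 32, 32)
```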

Rotation task. Image by author, inspired by Gidaris et al., 2018. Photo by Alvan Nee on Unsplash

In contrastive learning, a popular SSL approach, we gather positive and negative samples. We train the model to minimize the embedding distance between positive samples (anchor and positive) while maximizing the distance between negative samples (anchor and negative). In the image domain, the negative sample might be a different image while the positive sample could be an augmented version of the original image such that the semantic meaning of the image stays the same (e.g. cropping, rotating, color distortion).
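A minimal NumPy sketch of one common contrastive objective, the margin-based triplet loss, which matches the anchor/positive/negative description above. This is an illustrative choice, not the only option: methods such as SimCLR use an InfoNCE-style loss over many negatives instead. The `margin` value and the toy embeddings are assumptions.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on batches of embeddings with shape (batch, dim).
    Pulls anchor-positive pairs together and pushes anchor-negative pairs apart
    until the negative is at least `margin` farther away than the positive."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-positive distances
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-negative distances
    return np.maximum(0.0, d_pos - d_neg + margin).mean()


anchor = np.array([[0.0, 0.0]])
close = np.array([[0.1, 0.0]])  # e.g. embedding of an augmented view of the anchor
far = np.array([[3.0, 0.0]])    # e.g. embedding of a different image
print(triplet_loss(anchor, close, far))  # 0.0: the margin constraint is already satisfied
print(triplet_loss(anchor, far, close))  # roles swapped, so the loss is large
```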

Example of contrastive learning. Latent representations of the positive pair (same image with zoom+crop augmentation) are brought closer together while negative pairs are pushed further apart. Note: we are not using label knowledge for self-supervised learning, so the negative image could be another dog. Image by author, photos by Alvan Nee and Alexander London on Unsplash.

Longitudinal Self-Supervised Learning

Data in the medical field offer a new opportunity for self-supervised learning tasks beyond just augmenting a single data point. We have complex data, such as health records or 3D medical images, and we often have many data points from a single subject over time. This allows us to learn representations that capture time-relevant information related to progressive diseases, aging, or development. I will discuss a few approaches to self-supervised learning, first some contrastive approaches and then some non-contrastive ones.

Permutation Contrastive Learning

An early approach to Longitudinal SSL is permutation contrastive learning (PCL) [4]. In PCL, we create positive pairs by taking two consecutive windows, or data points in time, in the correct order from the same subject. For negative pairs, we take two random windows from the same subject. We can then train a binary classifier to predict whether the image pairs are positive or negative. A visual example of positive and negative pairs is shown below:

Example single-patient trajectory with 5 visits. Image by author.
Two possible positive samples for PCL. Positive samples must be consecutive and in the correct order. Image by author.
Two possible negative samples for PCL. Negative samples are pairs chosen at random. Image by author.

Note that the task does not necessarily learn across-subject differences because pairs are always from the same subject. This allows our model to focus on important longitudinal processes.
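The PCL pair construction can be sketched as follows, assuming a subject's visits are stored as an ordered array (here just the visit indices 0 to 4, standing in for the 5-visit trajectory pictured above). `pcl_pairs` is a hypothetical helper. Note that a random negative can coincidentally be a consecutive, correctly ordered pair, which is the weakness OCP addresses.

```python
import numpy as np


def pcl_pairs(trajectory, n_pairs, rng):
    """PCL-style pairs from a single subject's ordered visits.
    Positive (label 1): two consecutive visits in the correct order.
    Negative (label 0): two distinct visits drawn completely at random."""
    T = len(trajectory)
    pairs, labels = [], []
    for _ in range(n_pairs):
        t = int(rng.integers(0, T - 1))  # start index of a consecutive pair
        pairs.append((trajectory[t], trajectory[t + 1]))
        labels.append(1)
        i, j = rng.choice(T, size=2, replace=False)  # may still land on (t, t + 1)
        pairs.append((trajectory[i], trajectory[j]))
        labels.append(0)
    return pairs, np.array(labels)


visits = np.arange(5)  # visit indices 0..4 stand in for one patient's 5 visits
rng = np.random.default_rng(0)
pairs, labels = pcl_pairs(visits, n_pairs=10, rng=rng)
# A binary classifier is then trained to map each pair to its label.
```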

Order Contrastive Pre-training

An issue with PCL is that negative samples are chosen completely at random, so a negative pair may happen to be consecutive and in the correct order yet still be labeled as negative. A more recent approach, called Order Contrastive Pre-training (OCP) [1], improves on PCL by constraining positive samples to be consecutive, correctly ordered windows and negative pairs to be consecutive, incorrectly ordered windows.

Two possible positive samples for OCP. Positive samples must be consecutive and in the correct order. Image by author.
Two possible negative samples for OCP. Negative samples must be consecutive and in the wrong order. Image by author.

The authors of OCP showed, both empirically and theoretically, that it outperforms PCL for longitudinal representation learning. Their examples were 1) representation learning via feature selection (e.g. L1-regularized regression) on electronic health records and 2) NLP on clinical radiology notes.
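OCP changes only how pairs are sampled. A minimal sketch, again treating visit indices as stand-ins for the data: both classes use consecutive windows, and negatives simply swap their order. `ocp_pairs` is hypothetical, and the 50/50 class balance is an assumption rather than a detail taken from [1].

```python
import numpy as np


def ocp_pairs(trajectory, n_pairs, rng):
    """OCP-style pairs: both classes use consecutive visits, so the only signal
    separating them is temporal order.
    Positive (label 1): (x_t, x_{t+1}); negative (label 0): (x_{t+1}, x_t)."""
    T = len(trajectory)
    pairs, labels = [], []
    for _ in range(n_pairs):
        t = int(rng.integers(0, T - 1))  # start index of a consecutive pair
        a, b = trajectory[t], trajectory[t + 1]
        if rng.random() < 0.5:  # assumed 50/50 class balance
            pairs.append((a, b))
            labels.append(1)
        else:
            pairs.append((b, a))
            labels.append(0)
    return pairs, np.array(labels)


visits = np.arange(5)
rng = np.random.default_rng(0)
pairs, labels = ocp_pairs(visits, n_pairs=20, rng=rng)
```

Because negatives can never be accidentally well ordered, the label is always consistent with the pair, unlike in the PCL sampler.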

Latent space alignment

Another class of longitudinal SSL methods takes a completely different approach from the contrastive OCP and PCL methods and has primarily been applied to medical imaging. In this approach [7], we train a standard autoencoder, which encodes an image into a latent representation and decodes it into a reconstructed image.

Standard auto-encoder. Image X is encoded into latent representation Z, then reconstructed back into X. Image by author.

In addition to a standard reconstruction loss (e.g. mean squared error), there is an alignment loss. We learn a unit vector τ in the latent space and encourage the latent representations of same-subject image pairs to differ only along the τ direction, as shown below.

Latent alignment approach to Longitudinal SSL. Representations for each subject are forced to align with the trajectory vector τ. Image by author, inspired by [7].

This is done using a cosine loss term:

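$$\cos\big(g(I^t;\theta) - g(I^s;\theta),\ \tau\big) = \frac{\big(g(I^t;\theta) - g(I^s;\theta)\big)\cdot\tau}{\big\lVert g(I^t;\theta) - g(I^s;\theta)\big\rVert\,\lVert\tau\rVert}$$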

where g is the encoder function, I^s and I^t are two images of the same subject taken at different time points s and t, and θ denotes the model parameters. We try to maximize this term. The intuition is this: make the difference between the same-subject image representations parallel to τ. That means driving the angle between this difference vector and τ toward 0, where the cosine reaches its maximum value of 1.
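The cosine alignment term described above can be computed as follows. This NumPy sketch is for illustration only: in practice it would be written in an autodiff framework so that both the encoder parameters θ and τ receive gradients. The function names, the weight `lam` combining it with the reconstruction MSE, and treating I^s as the earlier scan are illustrative assumptions, not details taken from [7].

```python
import numpy as np


def alignment_cosine(z_s, z_t, tau, eps=1e-8):
    """Mean cosine between latent differences g(I^t) - g(I^s) and direction tau.
    z_s, z_t: (batch, dim) encodings of same-subject images (s assumed earlier)."""
    delta = z_t - z_s
    tau = tau / np.linalg.norm(tau)  # keep tau a unit vector
    cos = (delta @ tau) / (np.linalg.norm(delta, axis=1) + eps)
    return cos.mean()


def total_loss(x, x_hat, z_s, z_t, tau, lam=1.0):
    """Reconstruction MSE minus a weighted alignment term (lam is hypothetical)."""
    recon = np.mean((x - x_hat) ** 2)
    return recon - lam * alignment_cosine(z_s, z_t, tau)  # minimizing this maximizes the cosine


tau = np.array([1.0, 0.0])
z_s = np.zeros((2, 2))
z_t = np.array([[2.0, 0.0],   # moved along tau: cosine close to 1
                [0.0, 3.0]])  # moved orthogonally: cosine 0
print(alignment_cosine(z_s, z_t, tau))  # about 0.5
```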

This alignment lets us learn a meaningful and interpretable longitudinal aging factor (in the context of human subjects) in the latent space that 1) does not require an age or disease label and 2) yields better age-relevant representations than simply training a model to predict age. The self-supervised training also led to better performance on downstream disease prediction tasks with brain images than other popular SSL methods.

A follow-up study [5] to the original longitudinal alignment method proposed a more flexible extension called Longitudinal Neighborhood Embedding (LNE). The big-picture idea is that, rather than enforcing a single global trajectory direction τ, we enforce that similar subjects (based on distance in the latent space) have similar longitudinal trajectories. The loss function for LNE is almost the same as for the global alignment, but instead of maximizing the cosine between the image pair's latent difference and τ, we substitute τ with Δh, a distance-weighted average trajectory of the N nearest embeddings.

Longitudinal Neighborhood Embedding. For an image pair (noted by z1, z2), we encourage their trajectory to match the mean trajectory Δh of their neighborhood (shaded circle). Image by author, inspired by [5].
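A rough NumPy sketch of how the neighborhood target Δh could be computed for a batch of subjects. Several details are assumptions rather than the exact recipe of [5]: neighbors are searched within the batch by distance between earlier-visit embeddings, each subject is excluded from its own neighborhood, and the weights are normalized inverse distances. The function names are hypothetical.

```python
import numpy as np


def lne_targets(z1, z2, n_neighbors=3, eps=1e-8):
    """Delta-h for each subject: distance-weighted mean trajectory (z2 - z1)
    of its n_neighbors nearest subjects, with distances measured between z1."""
    delta = z2 - z1                                                    # (n, d) trajectories
    dists = np.linalg.norm(z1[:, None, :] - z1[None, :, :], axis=-1)   # (n, n) distances
    np.fill_diagonal(dists, np.inf)                                    # exclude the subject itself
    idx = np.argsort(dists, axis=1)[:, :n_neighbors]                   # (n, k) nearest neighbors
    w = 1.0 / (np.take_along_axis(dists, idx, axis=1) + eps)           # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)                                  # weights sum to 1
    return np.einsum("ik,ikd->id", w, delta[idx])                      # (n, d) weighted mean


def lne_cosine(z1, z2, n_neighbors=3, eps=1e-8):
    """Mean cosine between each subject's trajectory and its neighborhood target."""
    delta = z2 - z1
    dh = lne_targets(z1, z2, n_neighbors)
    num = np.sum(delta * dh, axis=1)
    den = np.linalg.norm(delta, axis=1) * np.linalg.norm(dh, axis=1) + eps
    return (num / den).mean()


rng = np.random.default_rng(0)
z1 = rng.normal(size=(6, 4))       # hypothetical latents of each subject's earlier image
z2 = z1 + rng.normal(size=(6, 4))  # hypothetical latents of the later image
print(lne_cosine(z1, z2))  # training would push this value toward 1
```

Swapping the single global τ for a per-subject Δh is what lets different clusters of subjects follow different trajectories.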

This method is again geared towards brain imaging. Brain aging, or change in brain anatomy over time, is highly heterogeneous, even within a single disease like Alzheimer’s disease. However, these pathways can fall into relatively well-defined clusters [6], which makes a trajectory-neighborhood model well suited to modeling the brain aging process.

Image Correction

A final method for longitudinal self-supervised learning takes a pair of images from the same subject over time, damages the second image by patch-scrambling or blacking out portions of it, and uses the two images (intact first image, damaged second image) to predict the corrected second image [2]. The paper that proposed this method applied it to lung CT scans. Both images are used to form the representation, and the encoder can then be used for downstream tasks such as segmentation or pathology classification. (Note: image pairs must be available for the downstream task in this method, as the pre-trained encoder requires two images as input.)

Method for image correction-based longitudinal SSL. Images by author, photos by NCI and CDC on Unsplash, inspired by [2].
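The data side of the image-correction task can be sketched in NumPy as below. The patch size, the fraction of damaged patches, and stacking the two images as input channels are illustrative assumptions, not the configuration used in [2]; the helper names are hypothetical, and the encoder-decoder that maps `x` to `target` is omitted.

```python
import numpy as np


def corrupt(image, patch=8, frac=0.25, mode="blackout", rng=None):
    """Damage a 2D image by blacking out or shuffling a random subset of
    non-overlapping square patches (patch size and fraction are assumptions)."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    h, w = image.shape
    coords = [(r, c) for r in range(0, h - h % patch, patch)
              for c in range(0, w - w % patch, patch)]
    n = max(1, int(frac * len(coords)))
    chosen = rng.choice(len(coords), size=n, replace=False)
    if mode == "blackout":
        for k in chosen:
            r, c = coords[k]
            out[r:r + patch, c:c + patch] = 0.0
    else:  # "scramble": move the chosen patches to each other's positions
        for src, dst in zip(chosen, rng.permutation(chosen)):
            (r, c), (r2, c2) = coords[src], coords[dst]
            out[r2:r2 + patch, c2:c2 + patch] = image[r:r + patch, c:c + patch]
    return out


def make_correction_example(img_early, img_late, rng, mode="blackout"):
    """Input: intact earlier image + damaged later image (stacked as 2 channels,
    an assumption); target: the intact later image."""
    x = np.stack([img_early, corrupt(img_late, mode=mode, rng=rng)])
    return x, img_late


rng = np.random.default_rng(0)
scan_t0, scan_t1 = rng.random((2, 32, 32))  # stand-ins for two CT slices of one patient
x, target = make_correction_example(scan_t0, scan_t1, rng)
print(x.shape)  # (2, 32, 32)
```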


Longitudinal Self-Supervised Learning is a new and exciting approach for learning representations from unlabeled longitudinal data. It is particularly useful when 1) you have many image (or record) pairs from the same subjects over time, and 2) the downstream target of interest (e.g. a chronic disease) changes substantially over time. Large, publicly available longitudinal datasets are becoming increasingly common, and it will be interesting to see the impact that longitudinal SSL has on ML research in these areas.


  1. Agrawal, M., Lang, H., Offin, M., Gazit, L., Sontag, D., 2021. Leveraging Time Irreversibility with Order-Contrastive Pre-training 151.
  2. Czempiel, T., Rogers, C., Keicher, M., Paschali, M., Braren, R., Burian, E., Makowski, M., Navab, N., Wendler, T., Kim, S.T., 2022. Longitudinal Self-Supervision for COVID-19 Pathology Quantification 1–10.
  3. Gidaris, S., Singh, P., Komodakis, N., 2018. Unsupervised representation learning by predicting image rotations. 6th Int. Conf. Learn. Represent. ICLR 2018 — Conf. Track Proc. 1–16.
  4. Hyvärinen, A., Morioka, H., 2017. Nonlinear ICA of temporally dependent stationary sources. Proc. 20th Int. Conf. Artif. Intell. Stat. AISTATS 2017.
  5. Ouyang, J., Zhao, Q., Adeli, E., Sullivan, E. V., Pfefferbaum, A., Zaharchuk, G., Pohl, K.M., 2021. Self-supervised Longitudinal Neighbourhood Embedding, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer International Publishing. https://doi.org/10.1007/978-3-030-87196-3_8
  6. Vogel, J.W., et al., 2021. Four distinct trajectories of tau deposition identified in Alzheimer’s disease. Nat. Med. 27, 871–881. https://doi.org/10.1038/s41591-021-01309-6
  7. Zhao, Q., Liu, Z., Adeli, E., Pohl, K.M., 2021. Longitudinal self-supervised learning. Med. Image Anal. 71, 102051. https://doi.org/10.1016/j.media.2021.102051

