How to Encode Medical Records for Deep Learning



Simplified illustration based on Google Brain’s scalable and accurate deep learning with electronic health records and its supplementary materials.


The healthcare industry started systematically digitizing healthcare data more than a decade ago. The hope is that one day, we’ll be able to harmonize data from disparate sources to form a “longitudinal view” of a patient, which includes everything about the patient’s health, ranging from clinic visits, hospital stays, and medication history to immunization records, family history, and lifestyle observations. It’s widely held that once equipped with the richness and completeness of the patient’s healthcare data, we can greatly improve care quality, reduce operational cost, and advance medical and drug research.

If you pay attention to the recent trends in the healthcare tech world, you’ll see that the hope is starting to come true. In some controlled settings, healthcare providers, payers, and researchers are starting to piece together more and more data from idiosyncratic upstream systems to construct comprehensive patient records. With that exciting development, now comes the million-dollar question: how do we actually go and harness the power of such data records?

While the data analysis or machine learning model development typically draws more attention, the importance of the data engineering leading to that should not be overlooked. This blog post has a very specific focus. It examines how to encode medical records and make them suitable for deep learning, in particular, time series learning. It illustrates with simplified examples and visual graphs the methods proposed in Google Brain’s “scalable and accurate deep learning with electronic health records” and its supplementary materials. If you don’t want to read the original 30+ pages of abstruse content, this blog post is a good alternative for you. Without further ado, let’s dive in.

Settings of the Learning Task

The setting of the learning task is to make predictions for intensive care unit (ICU) patients. Google Brain and many other researchers chose ICU settings because ICU data is usually the most complete and readily available for research compared to other healthcare data. There are many things we can predict. We can predict clinical outcomes, in this case, mortality (death) events. We can predict resource utilization, in this case, a length of stay longer than 7 days. We can predict quality of care, in this case, readmission within 30 days of discharge.

Traditional methods/models rely on handpicked feature variables, evaluated at the prediction time. This blog post includes a summary of some popular ICU analysis models at the end. However, the Google Brain team hypothesized, and later validated with data, that by taking into account all the available patient data and performing a time series analysis, more accurate predictions for all the above outcomes can be achieved. The significance here is that

  • it does not require manual feature selection — the model will learn to weigh and combine features
  • and all of the patient’s historical data is used, as opposed to other popular non-deep-learning models, in which only the current snapshot is used.

Now with the setting of the learning task clarified, let’s dive into the encoding procedures to see exactly how the medical records are transformed for the learning task.

Encoding Procedures

The Google Brain team first converted the ICU data they received from partner hospitals into an open healthcare data standard called FHIR. It would take 10 other blog posts to fully explain FHIR, but its core idea is not hard to grasp. For the purpose of this blog post, we can think of it as JSON data. There is one JSON data schema per healthcare concept — Patient, Observation, Condition, MedicationRequest, and so on are all healthcare concepts. The JSON data schema formalizes the fields of the respective healthcare concept. For example, Patient has a name and a birthday; Observation has a value, etc. The translation from proprietary healthcare data to the JSON data we need is relatively straightforward, since most proprietary healthcare systems also model their data based on the common healthcare concepts.
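To make this concrete, here is a simplified, hypothetical sketch of what an Observation might look like as JSON. Real FHIR resources carry many more fields; the patient reference and timestamp values below are invented for illustration:

```json
{
  "resourceType": "Observation",
  "subject": {"reference": "Patient/123"},
  "code": {"coding": [{"system": "http://snomed.info/sct", "code": "33747003"}]},
  "valueQuantity": {"value": 10.2, "unit": "mmol/L"},
  "effectiveDateTime": "2018-01-15T08:30:00Z"
}
```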

One important thing to note here is that while the structural transformation is attainable, the semantic translation in healthcare is extremely challenging, because healthcare data employs various (sometimes proprietary) coding systems. "Heart failure" may be "123" in one coding system and "1a2b3c" in another. Different coding systems have different granularities and hierarchies. Harmonizing everything into one single coding system is a daunting task. The Google Brain team’s method does not require coding system harmonization, which is a major advantage. As we shall see later, they can do that because they treat data content as tokens and convert the tokens to embeddings. Therefore, as long as a healthcare dataset uses a consistent set of coding systems internally (it doesn’t matter what those are exactly), it should be fine. Let’s look at the concrete steps below.

Step 1: tokenization

The first step is to tokenize all the fields in all healthcare concepts. If it’s a text field, we just split it by whitespace. For example, "high glucose" becomes ["high", "glucose"]. For numeric fields, just treating the plain number as a token is meaningless, so we need to encode more context. We concatenate the name — usually a coded value — the quantity, and the unit together to form a token. For example, {code:"33747003",quantity:10.2,unit:"mmol/L"} becomes "33747003_10.2_mmol/L". See Figure 1 for an illustration. Optionally, we can bucket the quantity into quantiles to reduce the sparsity of the tokens. What we get after step 1 is instances of the healthcare concepts where each instance has all its fields tokenized. We calculate the timestamp of each instance as the delta time in seconds with respect to a predefined prediction time (the dt1 and dt2 in Figure 1). The prediction time is usually 24 hours after admission or at discharge time.
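The tokenization step can be sketched in a few lines of Python. This is a minimal illustration of the scheme described above, not the paper's actual implementation; the function names and the quantile handling are my own:

```python
from typing import Dict, List, Union

def tokenize_text(value: str) -> List[str]:
    """Split a free-text field on whitespace."""
    return value.split()

def tokenize_quantity(field: Dict[str, Union[str, float]]) -> str:
    """Concatenate code, quantity, and unit into one token.

    Optionally, the raw quantity could first be replaced by its quantile
    bucket to reduce token sparsity (bucketing not shown here).
    """
    return f"{field['code']}_{field['quantity']}_{field['unit']}"

# The examples from the post:
tokens = tokenize_text("high glucose")   # ["high", "glucose"]
numeric_token = tokenize_quantity(
    {"code": "33747003", "quantity": 10.2, "unit": "mmol/L"}
)                                        # "33747003_10.2_mmol/L"
```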

Figure 1: tokenization + embedding. Image by author.

Step 2: embedding

For every field in every healthcare concept, we build a vocabulary of a predefined size. We don’t use a global vocabulary because different fields carry distinct healthcare semantics. It does not make sense to mix up all tokens from all concepts, or even all tokens from all fields in one concept, to build the vocabulary. We then train an embedding for each token. The embeddings are learnt jointly with the prediction tasks. Note that we deliberately choose the same embedding dimension for all fields within a given healthcare concept. As we shall see later, this enables easier aggregation.
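Here is a rough sketch of the per-field vocabulary and embedding lookup, with randomly initialized vectors standing in for embeddings that would, in practice, be learnt jointly with the prediction task. The helper names, the out-of-vocabulary convention, and the dimension of 4 are all illustrative assumptions:

```python
import random
from collections import Counter
from typing import Dict, List

EMBED_DIM = 4  # illustrative only; the real dimension is a tuned hyperparameter

def build_vocab(tokens: List[str], max_size: int) -> Dict[str, int]:
    """One vocabulary per field: keep the most frequent tokens; id 0 is reserved
    for out-of-vocabulary tokens."""
    most_common = Counter(tokens).most_common(max_size - 1)
    return {tok: i + 1 for i, (tok, _) in enumerate(most_common)}

def init_embeddings(vocab: Dict[str, int], dim: int = EMBED_DIM) -> List[List[float]]:
    """Random init; during training these vectors are learnt jointly with the task."""
    rng = random.Random(42)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(len(vocab) + 1)]

def embed(token: str, vocab: Dict[str, int], table: List[List[float]]) -> List[float]:
    return table[vocab.get(token, 0)]  # unseen tokens fall back to the OOV row

# Vocabulary for one field (e.g. Observation value), never a global one:
value_vocab = build_vocab(["high", "glucose", "high", "low"], max_size=100)
value_table = init_embeddings(value_vocab)
vec = embed("high", value_vocab, value_table)  # a 4-dimensional vector
```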

After step 2, what we get is similar to Figure 2. It starts to resemble a typical natural language processing input where you have a sequence of embeddings that you can feed to a recurrent neural network (RNN). There are, however, three problems with the current format:

  1. Each instance is one training example. There are hundreds or even thousands of data points in ICU settings, and RNNs do not perform well with long sequences.
  2. The training examples have variable dimensions, determined by the concept in question and the number of tokens in the instance. RNNs can’t handle variable-length inputs.
  3. The timestamps of the instances/training examples are not evenly spaced. When a particular event occurs carries significant clinical meaning, and that’s not captured by the embeddings.

Step 3 will address all 3 problems.

Figure 2: a sequence of embeddings. Image by author.

Step 3: aggregation

The first problem of long sequences is easy to solve. We just need to divide the data points into even time-steps. The time-step may be 1 hour or a few hours, which can be tuned as a hyperparameter. We’ll aggregate all data points in a given time-step to form a single training example.

Remember that all fields in a given healthcare concept share the same embedding dimension. So we can take the average of all field embeddings in an instance to form the aggregated embedding for that instance. If there are multiple instances of a concept, we can further average the instance embeddings to form the embedding for that concept. Since the healthcare concepts are a predefined enum list, we can concatenate the concept embeddings together to form a fixed-size example. If a concept does not appear in the time-step, we just set its embedding to all 0s. The second problem of variable dimensions is gone now. Note that we used averages throughout for aggregation, but you can imagine using other aggregation schemes.

To tackle the final problem of timestamp meaning, we take the average of the timestamps of all instances in the time-step, and append it to the end of the fixed-size embedding we obtained via the fields-to-instance and instances-to-concept aggregation. That way the timestamp signal is encoded in the training example as well.
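The whole of step 3 can be sketched as one function per time-step. The instance representation below (a dict with hypothetical `field_embeddings` and `dt` keys) and the function name are my own simplifications of the scheme described above:

```python
from typing import Dict, List

def mean(vectors: List[List[float]]) -> List[float]:
    """Element-wise average of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def encode_time_step(
    instances: Dict[str, List[dict]],  # concept name -> instances in this time-step
    concepts: List[str],               # predefined enum of concepts, fixes ordering
    dim: int,                          # shared embedding dim within a concept
) -> List[float]:
    example: List[float] = []
    timestamps: List[float] = []
    for concept in concepts:
        concept_instances = instances.get(concept, [])
        if not concept_instances:
            example += [0.0] * dim     # absent concept -> all-zero embedding
            continue
        instance_vecs = []
        for inst in concept_instances:
            instance_vecs.append(mean(inst["field_embeddings"]))  # fields -> instance
            timestamps.append(inst["dt"])
        example += mean(instance_vecs)                            # instances -> concept
    # Append the averaged delta-time so event timing is encoded too.
    example.append(sum(timestamps) / len(timestamps) if timestamps else 0.0)
    return example

step = encode_time_step(
    {"Observation": [{"field_embeddings": [[1.0, 2.0], [3.0, 4.0]], "dt": -3600.0}]},
    concepts=["Observation", "MedicationRequest"],
    dim=2,
)
# Observation averaged to [2.0, 3.0], MedicationRequest zeroed, dt appended:
# [2.0, 3.0, 0.0, 0.0, -3600.0]
```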

See Figure 3 for an illustration of the final encoding result. We now have a not-so-long sequence of fixed-size embeddings that takes into account the event timestamps. We can now feed it to an RNN.

Figure 3: final input to RNN. Image by author.


In this blog post, we went over a step-by-step illustration of how to encode medical records for RNN models. It starts with transforming the proprietary data format received from partner organizations into an open-standard JSON format, which enables us to make structural sense of the source dataset. It then tokenizes the content in each JSON field and builds an embedding for each token. In the end, it aggregates the token embeddings over a reasonable time horizon to create a sequence of manageable length, made of fixed-size embeddings, which is suitable as RNN model input.

The significance of the data processing above is two-fold:

  1. It does not require converging the medical records to a global coding system, which saves a huge amount of manual work.
  2. It encodes all the records in the entire patient history in a homogeneous way, so that all of them can be taken into account during model training.

Appendix: Other Common Non-Deep-Learning ICU Data Analysis Models

Early warning score (EWS) is usually calculated at 24 hours after admission to predict the likelihood of death. It uses respiratory rate, oxygen saturation, temperature, blood pressure, heart rate, and consciousness rating as variables. Each variable has a normal range as established by common medical knowledge. A score is computed based on a lookup table to characterise how far away the variable is from its normal range. If the sum of all scores surpasses a threshold, it means a high likelihood of death.
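The lookup-table scoring idea can be sketched as follows. The bands below are made up purely for demonstration and are not real clinical thresholds:

```python
from typing import List, Tuple

def ews_component(value: float, bands: List[Tuple[float, int]]) -> int:
    """Score one variable. bands: (upper_bound_exclusive, score) pairs, ascending;
    values far from the normal range receive higher scores."""
    for upper, score in bands:
        if value < upper:
            return score
    return bands[-1][1]

# Hypothetical heart-rate bands (beats/min); 60-100 would be the "normal" range:
HEART_RATE_BANDS = [(40, 3), (50, 1), (100, 0), (110, 1), (130, 2), (float("inf"), 3)]

total = ews_component(45, HEART_RATE_BANDS) + ews_component(120, HEART_RATE_BANDS)
# 45 bpm scores 1, 120 bpm scores 2; a sum over all variables above a
# threshold would indicate a high likelihood of death.
```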

Hospital score for readmission is typically calculated at discharge time. It takes into account hemoglobin level, sodium level, type of admission, number of previous admissions, length of stay, whether the hospital stay is cancer-related, and whether medical procedures were performed during the stay. Similar to the EWS above, based on established medical knowledge, the value of each factor is translated to a risk score, and their sum indicates the overall risk of readmission.

Liu score for long length of stay is commonly computed 24 hours after admission. It’s a logistic regression model (with proper regularization techniques) that factors in variables such as age, gender, condition category, diagnosis code, hospital service received, and lab tests of vital signs to produce a probability of a long length of stay.
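A logistic regression prediction of this kind reduces to a sigmoid over a weighted sum of features. The weights below are made up for illustration; a real model would learn them (with regularization) from data:

```python
import math
from typing import Dict

def predict_long_stay(features: Dict[str, float], weights: Dict[str, float], bias: float) -> float:
    """Logistic regression: sigmoid of a weighted sum of numeric features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Made-up weights purely for illustration:
w = {"age": 0.02, "prior_admissions": 0.3}
p = predict_long_stay({"age": 70, "prior_admissions": 2}, w, bias=-2.0)
# p is the predicted probability of a long length of stay, between 0 and 1
```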


