COVID-19 image databases: problems & hints


COVID-19 is still a relatively new disease, and access to high-quality medical imaging databases remains difficult. For this reason, we analysed the available data resources in detail, focusing on the dynamic range of the images, data processing, and preparation for scientific experiments. The results are summarized in Table 1.

Table from paper (Hryniewska, 2021)

What is a DICOM format?

For medical imaging, the standard format for representing measured and/or reconstructed data is Digital Imaging and Communications in Medicine (DICOM). Significant features of this format are the ability to faithfully record the 16-bit dynamic range of grayscale data (CT, radiography), the control of acquisition parameters, and the ability to adapt to presentation conditions at diagnostic stations. Using the full data dynamics, i.e., all information about the imaged objects together with the characteristic properties of the entire measurement and reconstruction process, enables the construction of models based on complete measurement information about the examined object. Unfortunately, the vast majority of COVID-19 resources do not retain this source image information. The data is converted from DICOM to typical multimedia image formats (mainly JPEG, PNG, TIFF), omitting information about the imaging process itself and often losing quality and informative value through compression. The dynamic range is frequently reduced to the 8 most significant bits, or all image information is lossily compressed (JPEG) using standard quantization tables.
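The loss from the bit-depth reduction described above can be made concrete with a short sketch (our own illustration, not code from the paper): keeping only the 8 most significant bits of 16-bit pixel data merges many distinct gray levels into one.

```python
# Illustrative sketch: reducing 16-bit DICOM pixel intensities to 8 bits
# collapses many distinct gray levels into a single level.

def to_8bit(value_16bit: int) -> int:
    """Keep only the 8 most significant bits: 0..65535 -> 0..255."""
    return value_16bit >> 8

# 256 consecutive 16-bit intensities collapse to one 8-bit value:
block = range(117 << 8, 118 << 8)   # 29952..30207, 256 distinct values
print(len(set(block)), "->", len({to_8bit(v) for v in block}))  # 256 -> 1
```

Subtle density differences that a radiologist (or a model) could exploit in the original 16-bit data are therefore simply gone after such a conversion.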

Scarcity of publicly available COVID-19 data sources with images in raw DICOM format. Only 1 out of the 5 repositories in DICOM format presented in Table 1 contains COVID-19 cases. Most COVID-19 image databases use 8-bit JPG or PNG formats. There are concerns that the quality of the shared images is degraded, which may make the trained models less accurate. The quality degradation includes Hounsfield unit values (which describe radiodensity) being inaccurately converted into grayscale image data, as well as reductions in the number of bits per pixel and in the resolution of the images.

An extreme case is the use of digital scans of printed lung images with no standard image size, e.g., images extracted from manuscripts. Comparative statistical analysis based on the systematic measurement errors for COVID-19 data, including raw data and metadata extracted from official reports, showed noticeable and increasing measurement errors (Barata, 2021). This underscores the importance of the accuracy, timeliness, and completeness of COVID-19 datasets for better modeling and proper interpretation.

Too few images with low and moderate severity cases. Most studies are based on data sources publicly available on the Internet through popular sharing platforms such as GitHub and Kaggle. The most commonly used data source with COVID-19 cases was created at the beginning of the epidemic; the first publicly available repository was published on January 15, 2020. In (Tabik, 2020), it is stressed that the available data sources contain too few images with low and moderate severity cases. Moreover, most data sources have only class labels, without fine-grained pixel-level annotations.

Relatively low number of COVID-19 images. The image format is one problem; the amount of available data is another. The median number of COVID-19 images in the considered data resources is 250. With so little data, it is difficult to train a deep neural network (DNN).

The use of imbalanced datasets requires more attention during model training. Unless a greater amount of the less common data can be acquired, either proper data resampling (oversampling, undersampling) should be applied or an appropriate loss function should be chosen. It is also possible to use micro-averaged metrics. However, most ML algorithms do not work well with imbalanced datasets.
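One common loss-based remedy mentioned above is to weight classes by inverse frequency, so errors on the rare class cost more. A minimal sketch, with made-up class counts:

```python
# Hedged sketch: inverse-frequency class weights for a weighted loss.
# The counts below are illustrative, not from any real dataset.

counts = {"covid": 250, "pneumonia": 4000, "normal": 8000}
total = sum(counts.values())
n_classes = len(counts)

# weight_c = total / (n_classes * count_c): rarer classes get larger weights
weights = {c: total / (n_classes * n) for c, n in counts.items()}
print(weights)  # "covid" receives the largest weight
```

These weights can then be passed to a weighted cross-entropy loss, an alternative to physically resampling the dataset.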

The data sources lack descriptions. Data resources 4), 6), 9), 10), 12), 13), and 23) did not include metadata. At a minimum, the description of a dataset should include the following factors. First of all, the total number of images and the number of images in each class should be given. The balance in terms of age and sex is another important factor because of differences in anatomy. Information about smoking or previous lung diseases might also be relevant. For analyzing model responses, information about concurrent diseases, the severity of COVID-19, and the number of days between exposure and the acquisition of the chest image is also useful.
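The minimal per-image metadata listed above can be captured in a small schema; the field names below are our own suggestion, not a standard.

```python
# Sketch of a per-image metadata record covering the factors the text lists;
# field names are illustrative, not from any existing standard.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ImageMetadata:
    label: str                                # e.g. "COVID-19", "normal"
    age: Optional[int] = None
    sex: Optional[str] = None
    smoker: Optional[bool] = None
    prior_lung_disease: Optional[bool] = None
    comorbidities: Optional[str] = None
    severity: Optional[str] = None            # e.g. "mild", "moderate", "severe"
    days_since_exposure: Optional[int] = None

record = ImageMetadata(label="COVID-19", age=54, sex="F", severity="moderate")
print(record)
```

Even an incomplete record of this form makes it possible to check class, age, and sex balance before training.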

A mix of CT and X-ray images. Another problem we found in these datasets is data purity. On closer inspection, it appears that CT and X-ray images are sometimes mixed within the X-ray dataset. These two techniques are so different that networks for CT and X-ray images should be trained separately.

Inappropriate CT windows. For COVID-related lung analysis, it is essential to have the Hounsfield unit equivalent of the “lung” window (width: 1,500, level: -600). Otherwise, the lung structures are obscured or not visible at all. This is a basic but key issue, because we do not want to assess soft tissues or bones. Images exported in other windows have no real diagnostic value for the lungs.
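Applying the lung window is a simple clip-and-rescale of the Hounsfield units. A per-pixel sketch (the window parameters are the ones stated above; the 8-bit output range is our assumption for display):

```python
# Sketch: apply the "lung" window (level -600, width 1500) to Hounsfield
# units, mapping the window to 0..255 for display or 8-bit export.

def window_hu(hu: float, level: float = -600.0, width: float = 1500.0) -> int:
    """Clip HU to [level - width/2, level + width/2], then scale to 0..255."""
    lo, hi = level - width / 2, level + width / 2   # -1350 .. 150 for lungs
    hu = max(lo, min(hi, hu))
    return round(255 * (hu - lo) / (hi - lo))

print(window_hu(-1350))  # 0   (near air density -> black)
print(window_hu(150))    # 255 (soft tissue saturates to white)
```

With a soft-tissue window (e.g. level 50, width 400) the same function would map the entire aerated lung to black, which is exactly why window choice matters.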

“Children are not small adults”. In some databases, e.g. 7), 8), and 9), the X-rays of children and adults are mixed. There are crucial differences between the chest X-rays of children and adults: technical (hands are often located above the head), anatomical (different shapes of the heart, mediastinum, and bone structures), and pathological (different pathologies). It is therefore important to record the age of patients in data resources and to separate children from adults when preparing data for training.

CT and X-ray images are not in color. Nevertheless, some databases, e.g. 5) and 11), include images in RGB color space. This introduces redundant information, because the values in all channels are the same (R=G=B). It triples the number of input neurons in the neural network; the number of parameters rises accordingly, and training may require more data and time, without any extra information being added.
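Before training, such replicated-grayscale images can be collapsed back to a single channel; a tiny sketch over pixel tuples (our own helper, not from the paper):

```python
# Sketch: when R == G == B, a 3-channel image carries no extra information,
# so a single channel preserves the pixels with a third of the input size.

def collapse_gray_rgb(pixels):
    """pixels: list of (r, g, b) tuples; returns single-channel values,
    verifying that the image really is replicated grayscale."""
    out = []
    for r, g, b in pixels:
        assert r == g == b, "not a replicated-grayscale image"
        out.append(r)
    return out

print(collapse_gray_rgb([(12, 12, 12), (200, 200, 200)]))  # [12, 200]
```

The assertion also acts as a cheap sanity check that no genuinely colored (e.g. annotated or colormapped) images have slipped into the dataset.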

Incorrect categorization of pathologies. We noticed that some images are incorrectly categorized into normal or pathologic, e.g. in databases 10) and 13), and also within the pathology class, e.g. in database 14). An additional problem is that, from a medical point of view, some images should be multi-categorized, meaning that there is more than one pathology in one image. For instance, pneumonia (main class) can manifest itself as lung consolidations, which can also appear together with pleural effusion or atelectasis (two additional classes). On the other hand, atelectasis itself, with a mediastinal shift, can be a sign of a different pathology, such as a lung tumor. Thus, databases should be verified by experienced radiologists for the proper categorization of multi-class images. This, however, would be time-consuming and, more importantly, very difficult or impossible with low-quality images or images without appropriate descriptions.
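The multi-categorization described above corresponds to multi-label rather than one-hot encoding: one image can carry several pathologies at once. A minimal sketch (the class list is illustrative):

```python
# Sketch: multi-label encoding, so one image can have several pathologies.
# The class list is illustrative, taken from the examples in the text.

CLASSES = ["pneumonia", "consolidation", "pleural_effusion", "atelectasis"]

def encode(labels):
    """Binary vector with a 1 for each pathology present in the image."""
    return [1 if c in labels else 0 for c in CLASSES]

print(encode({"pneumonia", "consolidation"}))  # [1, 1, 0, 0]
```

A model trained on such targets typically uses per-class sigmoid outputs instead of a single softmax, since the classes are not mutually exclusive.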

Figure from paper (Hryniewska, 2021)

Lack of information about chest projection for X-ray imaging. This problem is present, for example, in 2), 4), and 9). There are two main chest projections, see Figure 3: Posterior-Anterior (PA) and Anterior-Posterior (AP). The first is acquired while the patient is standing; the X-ray beam passes through the patient’s chest from back (posterior) to front (anterior). The second is the opposite: the beam enters through the front (anterior) of the chest and exits out of the back (posterior). The AP examination is mostly conducted in more severe cases, with lying patients, often with comorbidities, frequently in Intensive Care Units. Because the X-ray beam is cone-shaped, the two projections differ in one very important respect: the apparent size of the heart. In PA projection, the heart is close to the detector, so its size on the X-ray is close to reality. In AP projection, the heart is farther from the detector, so it appears enlarged on the X-ray, which can be confused with cardiomegaly. In databases, AP and PA images are often mixed, which can cause bias because AP projections are performed on severely ill patients (Tabik, 2020). From a medical point of view, it is impossible to perform chest X-rays in only one projection, as the choice depends on the patient’s condition. However, the projection should be specified for every X-ray in a dataset, and possible bias in model classification should be evaluated.
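The heart-size effect follows from simple cone-beam geometry: magnification equals the source-to-detector distance divided by the source-to-object distance. A sketch with illustrative (not clinical-standard) distances:

```python
# Hedged sketch of why AP films enlarge the heart. With a point source,
# magnification = source_to_detector / (source_to_detector - object_to_detector).
# The distances in cm below are illustrative, not clinical standards.

def magnification(source_to_detector: float, object_to_detector: float) -> float:
    return source_to_detector / (source_to_detector - object_to_detector)

# PA: long source distance, heart close to the detector -> near-true size
print(round(magnification(180.0, 5.0), 2))
# AP (shorter source distance, heart farther from the detector) -> enlarged
print(round(magnification(100.0, 15.0), 2))
```

The AP value is noticeably larger than the PA one, which is exactly the apparent "cardiomegaly" described above, and one reason mixed AP/PA datasets can bias a classifier.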

