VinDr-CXR: An open dataset and benchmarks for disease classification and abnormality localization on chest radiographs

VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations

Dataset Description

To provide the research community with a large dataset of chest X-ray (CXR) images with high-quality labels, we built the VinDr-CXR dataset from more than 100,000 raw images in DICOM format that were retrospectively collected from Hospital 108 and the Hanoi Medical University Hospital, two of the largest hospitals in Vietnam. The published dataset consists of 18,000 postero-anterior (PA) view CXR scans that come with both the localization of critical findings and the classification of common thoracic diseases. These images were annotated by a group of 17 radiologists, each with at least 8 years of experience, for the presence of 22 critical findings (local labels) and 6 diagnoses (global labels); each finding is localized with a bounding box. The local and global labels correspond to the “Findings” and “Impressions” sections, respectively, of a standard radiology report.

We divide the dataset into two parts: a training set of 15,000 scans and a test set of 3,000 scans. Each image in the training set was independently labeled by 3 radiologists, while each image in the test set was treated with greater care: its annotation was obtained from the consensus of 5 radiologists. The labeling process was performed via our own web-based framework, VinDr Lab, which was built on top of a Picture Archiving and Communication System (PACS). A demonstration of this framework can be found here.

This dataset was used for the VinBigData Chest X-ray Abnormalities Detection Competition hosted on Kaggle.

Examples of CXRs with radiologists’ annotations. Abnormal findings (local labels) marked by radiologists are plotted on the original images for visualization purposes. The global labels are in bold and listed at the bottom of each example. Best viewed on a computer and zoomed in for details.

Dataset Statistics

Note: the numbers of positive labels are reported based on the majority vote of the participating radiologists. (*) The calculations are based only on the CXR scans for which the patient’s sex and age were known. (-) To preserve the integrity of the test set, its labels are not released to the public. Label statistics for the test set are therefore not shown here.
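The majority vote mentioned in the note can be illustrated with a minimal sketch. It assumes annotations are available as (image_id, reader_id, label) records and that a label counts as positive for an image when more than half of its readers marked it. The record layout, the function name `majority_vote_labels`, and the toy values are hypothetical and chosen for illustration only; the exact aggregation procedure used for the reported statistics is not specified here.

```python
from collections import defaultdict


def majority_vote_labels(annotations, num_readers=3):
    """Return {image_id: set of labels marked by a strict majority of readers}.

    `annotations` is an iterable of (image_id, reader_id, label) tuples.
    This record layout is an assumption made for illustration.
    """
    # For each image and label, collect the distinct readers who marked it.
    # Using a set means several boxes from one reader count as one vote.
    readers_per_label = defaultdict(lambda: defaultdict(set))
    for image_id, reader_id, label in annotations:
        readers_per_label[image_id][label].add(reader_id)

    threshold = num_readers // 2 + 1  # strict majority, e.g. 2 of 3 readers
    return {
        image_id: {
            label
            for label, readers in labels.items()
            if len(readers) >= threshold
        }
        for image_id, labels in readers_per_label.items()
    }


# Hypothetical toy records, not taken from the dataset.
toy_annotations = [
    ("img_001", "R1", "Cardiomegaly"),
    ("img_001", "R2", "Cardiomegaly"),
    ("img_001", "R2", "Cardiomegaly"),      # second box from the same reader
    ("img_001", "R3", "Pleural effusion"),  # only one of three readers
    ("img_002", "R1", "Nodule/Mass"),       # only one of three readers
]

result = majority_vote_labels(toy_annotations, num_readers=3)
print(result)
```

In this sketch a label marked by a single reader is dropped, so an image can end up with no positive labels. Counting positives per label across all images would then give statistics of the kind the note describes, under the stated assumptions.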

Distribution of findings and pathologies on the training set of the VinDr-CXR Dataset.

Download

The full version of the VinDr-CXR dataset can be obtained from PhysioNet. Note that only credentialed users who sign the specified data use agreement (DUA) can access the files. In addition, a slightly modified version of the dataset can be downloaded from the webpage of the VinBigData Chest X-ray Abnormalities Detection Competition.

Visualization

The images and annotations of the dataset can be visualized via VinDr Laboratory – our hub for all public datasets.

Citation

For any publication that explores this resource, the authors must cite the original paper as follows:

Ha Q. Nguyen et al. “VinDr-CXR: An open dataset of chest X-rays with radiologist’s annotations” – A preprint is available on arXiv

BibTeX citation:

@misc{nguyen2020vindrcxr,
      title={VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations}, 
      author={Ha Q. Nguyen and Khanh Lam and Linh T. Le and Hieu H. Pham and Dat Q. Tran and Dung B. Nguyen and Dung D. Le and Chi M. Pham and Hang T. T. Tong and Diep H. Dinh and Cuong D. Do and Luu T. Doan and Cuong N. Nguyen and Binh T. Nguyen and Que V. Nguyen and Au D. Hoang and Hien N. Phan and Anh T. Nguyen and Phuong H. Ho and Dat T. Ngo and Nghia T. Nguyen and Nhan T. Nguyen and Minh Dao and Van Vu},
      year={2020},
      eprint={2012.15029},
      archivePrefix={arXiv},
      primaryClass={eess.IV}
}

We also encourage such authors to release their code and models, which will help the community reproduce experiments and advance research in medical imaging.

Contact

Correspondence should be addressed to Ha Nguyen (v.hanq3@vinbigdata.com).