R25 VOICE Section 3 - Datasets

Papers discussed in this Section 3 podcast:

  • Liao, Fangzhou; Liang, Ming; Li, Zhe; Hu, Xiaolin; and Song, Sen. Evaluate the Malignancy of Pulmonary Nodules Using the 3D Deep Leaky Noisy-or Network. eprint arXiv:1711.08324, 2017
  • Pollard, T. J., & Johnson, A. E. W. The MIMIC-III Clinical Database. http://dx.doi.org/10.13026/C2XW26 (2016)
  • Pranav Rajpurkar, Jeremy Irvin, Aarti Bagul, Daisy Ding, Tony Duan, Hershel Mehta, Brandon Yang, Kaylie Zhu, Dillon Laird, Robyn L. Ball, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, and Andrew Ng. MURA Dataset: Towards Radiologist-Level Abnormality Detection in Musculoskeletal Radiographs. arXiv:1712.06957, 2017
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, R. M. Summers. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR (spotlight); arXiv:1705.02315, 2017

Podcast Contents:

  • Why are datasets important?
  • What kinds of datasets are there?
  • What's a gold standard?
  • Best practices in dataset descriptions
    • Sample distribution
    • Metadata
      • Patients
      • Radiologists
      • PACS Systems Used for Annotation
      • Images
  • Strategies for Labeling Data
    • Natural Language Processing
    • Amazon Mechanical Turk
    • Natural Language Processing Validation Sets