COVID-19 ISO Insights

Update: Publicly Available Datasets Combat COVID-19

September 14, 2020

By: Christopher Sirota, CPCU

Back in May 2020, we posted about some preliminary attempts at using artificial intelligence systems (AI) to assist healthcare experts during the pandemic.

Around the same time, the National Institutes of Health (NIH) published a study that discussed some limitations on the use of AI to: 1) forecast the spread of the virus, 2) increase the speed of diagnosis, and 3) assist in the development of a future vaccine.

At the time, the NIH study generally focused on the lack of quality data available for AI systems to use. In this post, we look at the status of a handful of the publicly available data collection initiatives noted in the NIH study.

According to the database search page, the database is updated daily, and as of August 28, 2020, there are over 59,000 articles available in full text.

The GISAID website includes multiple maps to globally visualize the epidemiology of SARS-CoV-2 based in part on submitted genomic information of the virus. Per the website, each submission is classified into groups, also called "clades." Currently, per the website, about 92,000 viral genomic sequences have been submitted to the database.

Kaggle's COVID-19 research competition web page explains that the "[COVID-19 Open Research Dataset] was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, IBM, and the National Library of Medicine - National Institutes of Health, in coordination with The White House Office of Science and Technology Policy."

According to the Allen Institute's Semantic Scholar database, its COVID-19 Open Research Dataset currently contains over 130,000 scholarly articles.

The Kaggle web page lists 17 tasks as part of the competition.

According to the web page, the database currently includes over 41,000 articles, and the page includes several interactive maps, including a worldwide map of institutions conducting COVID-19 research.

The web page displays search results as a dashboard. It currently notes that the latest update was made on May 19, 2020.

  • Researchers at the University of Southern California, Information Sciences Institute have created and continue to update a public COVID-19 Twitter dataset.

According to the dataset web page, the "repository contains an ongoing collection of tweets IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020."

The first paper on this initiative is available here. A more in-depth paper, with a more recent list of monitored keywords and Twitter account names, is available here. The dataset of collected Tweet-IDs is available here.

The dataset contains 479,056,112 Tweet-IDs as of August 21, 2020. Per the initiative's papers, the researchers are seeking to provide this information, in part, to "also help track COVID-19-related misinformation and unverified rumors or enable the understanding of fear and panic […]."

Note that, according to the dataset page, Tweet-IDs need to be "hydrated" to reconnect them to the related tweets. About 66% of the tweets are reportedly in English.

As for the size of the dataset, hydrating the first version reportedly took one researcher about "[…] 25 hours to complete, and […approximately] 6% of the Tweets […had apparently been] deleted at the time of hydration, [and the final file had a] gzipped data size of 6.9 GB."
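To make "hydration" more concrete, the following is a minimal Python sketch of the general idea, not the researchers' actual tooling. It splits a list of Tweet-IDs into fixed-size batches, passes each batch to a lookup function, and counts the IDs that come back empty (for example, because the tweet was deleted). The names `batches`, `hydrate`, and `lookup`, and the batch size of 100, are assumptions made for illustration; a real lookup would call a Twitter API client and would be subject to rate limits, which helps explain the long running time reported above.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Assumption: the lookup service accepts at most 100 IDs per request.
BATCH_SIZE = 100


def batches(ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` Tweet-IDs."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, partially filled batch
        yield batch


def hydrate(
    ids: Iterable[str],
    lookup: Callable[[List[str]], Dict[str, str]],
    size: int = BATCH_SIZE,
) -> Tuple[Dict[str, str], int]:
    """Return (tweets keyed by ID, number of IDs with no tweet returned).

    `lookup` is a hypothetical stand-in for a real API call: it takes a
    batch of IDs and returns a dict of the tweets that still exist.
    """
    tweets: Dict[str, str] = {}
    missing = 0
    for batch in batches(ids, size):
        found = lookup(batch)
        tweets.update(found)
        missing += sum(1 for tweet_id in batch if tweet_id not in found)
    return tweets, missing


if __name__ == "__main__":
    # Fake lookup for demonstration only: pretends every tenth tweet was deleted.
    def fake_lookup(batch: List[str]) -> Dict[str, str]:
        return {i: f"text of tweet {i}" for i in batch if int(i) % 10 != 0}

    sample_ids = [str(n) for n in range(1, 251)]
    hydrated, deleted = hydrate(sample_ids, fake_lookup)
    print(f"hydrated {len(hydrated)} tweets, {deleted} unavailable")
```

Passing the lookup in as a function keeps the sketch independent of any particular Twitter client library; in practice, the researchers' dataset page should be consulted for the hydration tools they recommend.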

Of interest, the dataset web page includes the following two studies that reportedly leveraged the Twitter dataset:

  • COVID-19 Evidence Navigator was developed by the Institute for Technology and Innovation Management, RWTH Aachen University (Gruenwald, E., Antons, D. & Salge, T.O., 2020).

This web-based tool allows spatial, temporal and network searches of medical journals with a metric to adjust for article quality. According to the web page, it was last updated on May 23, 2020.

  • COVID-Net Open Source Initiative is "a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest X-ray (CXR) images that is open source and available to the general public."

According to their dataset web page, as of July 8, 2020, "COVIDNet-CT,[…] was trained and tested on 104,009 CT images from 1,489 patients."
