Datasets

SciMON: Scientific Inspiration Machines Optimized for Novelty
This repository contains the datasets for the SciMON paper. The NLP dataset is based on 67,409 ACL Anthology papers from 1952 to 2022. The biomedical dataset is based on 5,704 papers from PubMed. [Paper] [Code/Dataset] [Slides] [Poster]

Chem-FINESE: Validating Fine-Grained Few-shot Entity Extraction through Text Reconstruction
This repository contains two chemical few-shot fine-grained entity extraction datasets based on ChemNER and CHEMET. For both datasets, we use 6, 9, 12, 15, and 18 as the candidate maximum numbers of entity mentions for the k-shot settings. [Paper] [Code] [Dataset]

Multimedia Goal-oriented Generative Script Learning Dataset
This repository contains 5,652 tasks and 79,089 multimedia steps for the gardening (Training: 20,258/Development: 2,428/Test: 2,684) and crafts (Training: 32,082/Development: 4,064/Test: 3,937) categories. [Paper] [Code] [Dataset] [Bib]

Wikipedia Pre-train Pairs Dataset

This repository contains 542,192 data pairs used for the Wikipedia fine-tuning stage. The data folder contains 166 JSON files with graph-to-text pairs covering the 15 categories that appear in the WebNLG dataset (Astronaut, University, Monument, Building, ComicsCharacter, Food, Airport, SportsTeam, WrittenWork, Athlete, Artist, City, MeanOfTransportation, CelestialBody, Politician). [Paper] [Code] [Dataset] [Slides] [Poster] [Bib]

ReviewRobot Dataset
This dataset contains 8,110 paper-review pairs and a background knowledge graph (KG) built from 174,165 papers. It also contains information extraction (IE) results from SciIE and various knowledge graphs built on the IE results. Detailed information can be found here. [Paper] [Dataset] [Bib]

Covid-KG
This dataset currently gathers knowledge extraction results from 14,229 papers and 6,217 abstracts from Semantic Scholar’s CORD-19 dataset. The work received the Best Demo Award at NAACL-HLT 2021. [Paper] [KG]

PubMed Paper Reading Dataset
This dataset gathers 14,857 entities, 133 relations, and the tokenized text corresponding to the entities, all from PubMed. It contains 875,698 training pairs, 109,462 development pairs, and 109,462 test pairs. [Paper] [Bib] [Dataset]

PubMed Term, Abstract, Conclusion, Title Dataset
This dataset gathers three types of pairs from PubMed: Title-to-Abstract (Training: 22,811/Development: 2,095/Test: 2,095), Abstract-to-Conclusion and Future Work (Training: 22,811/Development: 2,095/Test: 2,095), and Conclusion and Future Work-to-Title (Training: 15,902/Development: 2,095/Test: 2,095). Each pair consists of an input and an output, together with the corresponding terms (from the original KB and link prediction results). [Paper] [Bib] [Dataset]

Wikipedia Person and Animal Dataset
This dataset gathers 428,748 person and 12,236 animal infoboxes with descriptions, based on the Wikipedia dump (2018/04/01) and Wikidata (2018/04/12). [Paper] [Bib] [Dataset]

ACL Title and Abstract Dataset
This dataset gathers 10,874 title and abstract pairs from the ACL Anthology Network (until 2016). [Paper] [Bib] [Dataset]

Patents

Structured Graph-To-Text Generation with Two Step Fine-Tuning
Qingyun Wang, Semih Yavuz, Xi Lin, Nazneen Rajani. US2022050964A1. Issued Feb 17, 2022

Remote control moving table
Qingyun Wang. CN202774939U. Issued Mar 13, 2013

Soap making device using household waste oil
Qingyun Wang. CN201817462U. Issued May 4, 2011