Resources

Subword embeddings trained on scientific texts

subword-vectors is a repository for downloading (or training) subword embeddings built from the arXiv dataset of 1.7M+ scholarly papers.

A Manually Annotated Test Collection for Citation Recommendation

acm-cr is a repository containing a test collection for (context-aware) citation recommendation, constructed from bibliographic records and open-access papers collected from the ACM Digital Library.

A Large-Scale Dataset for Biomedical Keyphrase Generation (kp-biomed)

kp-biomed is a large-scale dataset of 5.6 million PubMed abstracts with author-assigned keyphrases for training and evaluating neural keyphrase generation models in the biomedical domain.

Silver-standard keyphrases from citation contexts for domain adaptation

silk is a dataset of synthetic samples for adapting keyphrase generation models to new domains. It covers three domains (Natural Language Processing, Astrophysics, and Paleontology) and includes human-labeled test sets for evaluating keyphrase generation performance in each of them.