Subword embeddings trained on scientific texts
subword-vectors is a repository for downloading (or training) subword embeddings built from the arXiv dataset of 1.7M+ scholarly papers.
A Manually Annotated Test Collection for Citation Recommendation
acm-cr is a repository that contains a test collection for (context-aware) citation recommendation constructed from bibliographic records and open-access papers collected from the ACM Digital Library.
A Large-Scale Dataset for Biomedical Keyphrase Generation (kp-biomed)
kp-biomed is a large-scale dataset of 5.6 million PubMed abstracts with author-assigned keyphrases for training and evaluating neural keyphrase generation models in the biomedical domain.
Silver-standard keyphrases from citation contexts for domain adaptation
silk is a dataset of synthetic samples for adapting keyphrase generation models to new domains. It covers three domains (Natural Language Processing, Astrophysics, and Paleontology) and includes human-labeled test sets for evaluating keyphrase generation performance in each of them.