Sha256: 863419eaba185d45107ea8019d446bbcec748433f3476f4c91cbeb1baaae0482
Contents?: true
Size: 1.75 KB
Versions: 24
Compression:
Stored size: 1.75 KB
Contents
lemma key collection: 10 random PubMed documents with ASCII text split into sentences and tokens by the MedPost tokenizer Original source sentence.xml source: PubMed date: yyyymmdd. Date documents downloaded from PubMed document: Title and possibly abstract from a PubMed reference id: PubMed id passage: Either title or abstract infon["type"]: "title" or "abstract" offset: The original Unicode byte offsets were not updated after the ASCII conversion. PubMed is extracted from an XML file, so literal offsets would not be useful. Title has an offset of zero, while the abstract is assumed to begin after the title and one space. These offsets at least sequence the abstract after the title. sentence: One sentence of the passage as determined by the MedPost sentence splitter offset: A document offset to where the sentence begins in the passage. Sum of the passage offset and the local offset within the passage. annotation: tokens in the sentence with their part-of-speech. the annotations are of "type" "token" infon["POS"]: The Penn Treebank part of speech tag as determined by the MedPost biomedical part-of-speech tagger infon["lemma"]: The tokens lemma as determined by BioLemmatizer location: offset: A document offset to where the annotated text begins in the sentence. Sum of the sentence offset and the local offset within the sentence. length: The length of the token. text: ASCII text of the token.
Version data entries
24 entries across 24 versions & 1 rubygems