oakx-spacy
Spacy + SciSpacy plugin for OAK.
ALPHA
Usage
Non-developers:
Create a preferred virtual environment (conda
, poetry
, venv
etc.). Install oakx-spacy
using pip install
.
pip install oakx-spacy
Next, desired models (Spacy and/or SciSpacy) need to be downloaded/installed. Following is the list of models available.
Spacy models
English pipelines optimized for CPU.
In order to install any of the below run python -m spacy download en_core_web_xxx
en_core_web_sm
: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.en_core_web_md
: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.en_core_web_lg
: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.en_core_web_trf
: Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.
SciSpacy models
In order to install any of the below use the corresponding line in pyproject.toml
For example, if CRAFT corpus trained model is desired, do the following:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz
Available models:
en_ner_craft_md
: A spaCy NER model trained on the CRAFT corpus.en_ner_jnlpba_md
: A spaCy NER model trained on the JNLPBA corpus.en_ner_bc5cdr_md
: A spaCy NER model trained on the BC5CDR corpus.en_ner_bionlp13cg_md
: A spaCy NER model trained on the BIONLP13CG corpus.en_core_sci_scibert
: A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model.en_core_sci_sm
: A full spaCy pipeline for biomedical data.en_core_sci_md
: A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.en_core_sci_lg
: A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.
SciSpacy linkers
These come preinstalled with scispacy
package itself. Available linkers are:
umls
: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.mesh
: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in Pubmed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.rxnorm
: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.go
: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.hpo
: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.
Developers:
Clone the repository
git clone https://github.com/hrshdhgd/oakx-spacy.git
Install poetry
pip install poetry
SciSpacy models
In pyproject.toml
, uncomment the 2 lines corresponding to the models desired. For example, if the desired model is the CRAFT corpus, uncomment the following:
[tool.poetry.dependencies.en_ner_craft_md]
url = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz"
Install dependencies
poetry install
Spacy models
Instructions similar to non-developers. Just make sure to prepend the command by poetry run
The default model is set to en_ner_craft_md
and default linker to umls
.
How it works
Using an ontology
The input argument can be expressed as spacy:sqlite:obo:name-of-ontology
e.g. spacy:sqlite:obo:bero
.
A
.txt
file [runoak -i spacy:sqlite:obo:bero annotate --text-file tests/input/text.txt
]Words that need to be annotated.[
runoak -i spacy:sqlite:obo:bero annotate Myeloid derived suppressor cells \(MDSC\) are immature myeloid cells with immunosuppressive activity.
] should yield:
info: 'JsonObj(alias_map=JsonObj(**{''rdfs:label'': [''Myeloid-Derived Suppressor
Cell'']}))'
subject_end: 30
subject_label: Myeloid-Derived Suppressor Cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
immunosuppressive activity .
subject_start: 0
subject_text_id: NCIT:C129908
---
info: 'JsonObj(alias_map=JsonObj(**{''rdfs:label'': [''Immature Myeloid Cell'']}))'
subject_end: 64
subject_label: Immature Myeloid Cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
immunosuppressive activity .
subject_start: 43
subject_text_id: NCIT:C113503
Using SciSpacy.
The input argument can be expressed as spacy:linker-name
e.g. spacy:mesh
.
There are two possible inputs to this plugin:
A
.txt
file [runoak -i spacy: annotate --text-file text.txt
]Words that need to be annotated.[
runoak -i spacy: annotate Myeloid derived suppressor cells \(MDSC\) are immature myeloid cells with immunosuppressive activity.
] should yield (shortened):
confidence: 0.9999999403953552
info: JsonObj(aliases=['t cell suppressor', 'suppressor cell', 'T suppressor cell',
'suppressor cells', 'Suppressor cell', 'suppressor T lymphocyte', 'cells suppressor
t', 'Suppressor cells', 'Suppressor cell (cell)'], canonical_name='Suppressor T
Lymphocyte', concept_id='C0038856', definition='subpopulation of CD8+ T-lymphocytes
which suppress antibody production or inhibit cellular immune responses.', types=['T025'])
subject_end: 30
subject_label: suppressor cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
immunosuppressive activity .
subject_start: 15
subject_text_id: C0038856
---
...
---
confidence: 0.8391554355621338
info: JsonObj(aliases=['Myeloid Cell Leukemia Sequence 1', 'Myeloid Cell Leukemia
Sequence 1 Protein', 'Induced Myeloid Leukemia Cell Differentiation Protein Mcl-1',
'Myeloid Cell Factor-1', 'Myeloid Cell Factor 1', 'Induced Myeloid Leukemia Cell
Differentiation Protein Mcl 1', 'Factor-1, Myeloid Cell', 'Cell Factor-1, Myeloid'],
canonical_name='Myeloid Cell Leukemia Sequence 1 Protein', concept_id='C1510444',
definition='A member of the myeloid leukemia factor (MLF) protein family with multiple
alternatively spliced transcript variants encoding different protein isoforms. In
hematopoietic cells, it is located mainly in the nucleus, and in non-hematopoietic
cells, primarily in the cytoplasm with a punctate nuclear localization. MLF1 plays
a role in cell cycle differentiation.', types=['T116', 'T123'])
subject_end: 64
subject_label: myeloid cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
immunosuppressive activity .
subject_start: 52
subject_text_id: C1510444
Acknowledgements
This cookiecutter project was developed from the oakx-plugin-cookiecutter template and will be kept up-to-date using cruft.