oakx-spacy

Spacy + SciSpacy plugin for OAK.

ALPHA

Usage

Non-developers:

Create a preferred virtual environment (conda, poetry, venv etc.). Install oakx-spacy using pip install.

pip install oakx-spacy

Next, desired models (Spacy and/or SciSpacy) need to be downloaded/installed. Following is the list of models available.

Spacy models

English pipelines optimized for CPU. In order to install any of the below run python -m spacy download en_core_web_xxx

  1. en_core_web_sm: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.

  2. en_core_web_md: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.

  3. en_core_web_lg: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.

  4. en_core_web_trf: Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.

SciSpacy models

In order to install any of the below use the corresponding line in pyproject.toml

For example, if CRAFT corpus trained model is desired, do the following:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz

Available models:

  1. en_ner_craft_md: A spaCy NER model trained on the CRAFT corpus.

  2. en_ner_jnlpba_md: A spaCy NER model trained on the JNLPBA corpus.

  3. en_ner_bc5cdr_md: A spaCy NER model trained on the BC5CDR corpus.

  4. en_ner_bionlp13cg_md: A spaCy NER model trained on the BIONLP13CG corpus.

  5. en_core_sci_scibert: A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model.

  6. en_core_sci_sm: A full spaCy pipeline for biomedical data.

  7. en_core_sci_md: A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.

  8. en_core_sci_lg: A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.

SciSpacy linkers

These come preinstalled with scispacy package itself. Available linkers are:

  1. umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.

  2. mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in Pubmed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.

  3. rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.

  4. go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.

  5. hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.

Developers:

Clone the repository

git clone https://github.com/hrshdhgd/oakx-spacy.git

Install poetry

pip install poetry

SciSpacy models

In pyproject.toml, uncomment the 2 lines corresponding to the models desired. For example, if the desired model is the CRAFT corpus, uncomment the following:

[tool.poetry.dependencies.en_ner_craft_md]
url = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz"

Install dependencies

poetry install

Spacy models

Instructions similar to non-developers. Just make sure to prepend the command by poetry run

The default model is set to en_ner_craft_md and default linker to umls.

How it works

Using an ontology

The input argument can be expressed as spacy:sqlite:obo:name-of-ontology e.g. spacy:sqlite:obo:bero.

  1. A .txt file [runoak -i spacy:sqlite:obo:bero annotate --text-file tests/input/text.txt]

  2. Words that need to be annotated.[runoak -i spacy:sqlite:obo:bero annotate Myeloid derived suppressor cells \(MDSC\) are immature myeloid cells with immunosuppressive activity.] should yield:

info: 'JsonObj(alias_map=JsonObj(**{''rdfs:label'': [''Myeloid-Derived Suppressor
  Cell'']}))'
subject_end: 30
subject_label: Myeloid-Derived Suppressor Cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
  immunosuppressive activity .
subject_start: 0
subject_text_id: NCIT:C129908

---
info: 'JsonObj(alias_map=JsonObj(**{''rdfs:label'': [''Immature Myeloid Cell'']}))'
subject_end: 64
subject_label: Immature Myeloid Cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
  immunosuppressive activity .
subject_start: 43
subject_text_id: NCIT:C113503

Using SciSpacy.

The input argument can be expressed as spacy:linker-name e.g. spacy:mesh. There are two possible inputs to this plugin:

  1. A .txt file [runoak -i spacy: annotate --text-file text.txt]

  2. Words that need to be annotated.[runoak -i spacy: annotate Myeloid derived suppressor cells \(MDSC\) are immature myeloid cells with immunosuppressive activity.] should yield (shortened):

confidence: 0.9999999403953552
info: JsonObj(aliases=['t cell suppressor', 'suppressor cell', 'T suppressor cell',
  'suppressor cells', 'Suppressor cell', 'suppressor T lymphocyte', 'cells suppressor
  t', 'Suppressor cells', 'Suppressor cell (cell)'], canonical_name='Suppressor T
  Lymphocyte', concept_id='C0038856', definition='subpopulation of CD8+ T-lymphocytes
  which suppress antibody production or inhibit cellular immune responses.', types=['T025'])
subject_end: 30
subject_label: suppressor cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
  immunosuppressive activity .
subject_start: 15
subject_text_id: C0038856

---

...

---
confidence: 0.8391554355621338
info: JsonObj(aliases=['Myeloid Cell Leukemia Sequence 1', 'Myeloid Cell Leukemia
  Sequence 1 Protein', 'Induced Myeloid Leukemia Cell Differentiation Protein Mcl-1',
  'Myeloid Cell Factor-1', 'Myeloid Cell Factor 1', 'Induced Myeloid Leukemia Cell
  Differentiation Protein Mcl 1', 'Factor-1, Myeloid Cell', 'Cell Factor-1, Myeloid'],
  canonical_name='Myeloid Cell Leukemia Sequence 1 Protein', concept_id='C1510444',
  definition='A member of the myeloid leukemia factor (MLF) protein family with multiple
  alternatively spliced transcript variants encoding different protein isoforms. In
  hematopoietic cells, it is located mainly in the nucleus, and in non-hematopoietic
  cells, primarily in the cytoplasm with a punctate nuclear localization. MLF1 plays
  a role in cell cycle differentiation.', types=['T116', 'T123'])
subject_end: 64
subject_label: myeloid cell
subject_source: myeloid derive suppressor cell ( mdsc ) be immature myeloid cell with
  immunosuppressive activity .
subject_start: 52
subject_text_id: C1510444

Acknowledgements

This cookiecutter project was developed from the oakx-plugin-cookiecutter template and will be kept up-to-date using cruft.