Mapping Assembly Pipeline

The SeMRA assembly pipeline is a declarative way to say which sources should be used, which prior knowledge should be injected into processing, and how entities should be “prioritized” on output.

Assembly and inference over semantic mappings address the situation illustrated in the image below: individually incomplete mappings are combined so that an entity from a target namespace can be prioritized, which yields a prioritization mapping set (i.e., a star graph).

_images/pipeline.svg
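
To make the star graph idea concrete, here is a minimal, self-contained sketch that does not use SeMRA itself. It assumes mappings are undirected pairs of CURIEs, groups them into connected components, and points every entity at the component member whose prefix comes first in a priority list. The function name `prioritize` and the toy identifiers are hypothetical and only serve the illustration; SeMRA's actual processing also handles predicates, evidence, and confidence.

```python
from collections import defaultdict


def prioritize(mappings, priority):
    """Return (entity, hub) pairs forming one star per connected component.

    Toy illustration only, not SeMRA's implementation.
    """
    # Build an undirected adjacency index over the CURIEs
    adjacency = defaultdict(set)
    for left, right in mappings:
        adjacency[left].add(right)
        adjacency[right].add(left)

    # Earlier prefixes in the priority list get a lower (better) rank
    rank = {prefix: index for index, prefix in enumerate(priority)}
    seen = set()
    star = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Collect the connected component containing ``start``
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        # Only entities whose prefix is in the priority list can be the hub
        candidates = [n for n in component if n.split(":", 1)[0] in rank]
        if not candidates:
            continue
        hub = min(candidates, key=lambda n: (rank[n.split(":", 1)[0]], n))
        star.extend((node, hub) for node in sorted(component) if node != hub)
    return star


# Hypothetical identifiers; note that ncit:X1 and mesh:X3 are never mapped directly
toy_mappings = [("ncit:X1", "efo:X2"), ("efo:X2", "mesh:X3"), ("clo:Y1", "bto:Y2")]
toy_priority = ["mesh", "efo", "bto", "clo", "ncit"]
star_edges = prioritize(toy_mappings, toy_priority)
```

In this toy input, `ncit:X1` ends up pointing at `mesh:X3` even though no direct mapping between them exists, which is the kind of gap that combining incomplete mappings is meant to close.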

In the following demo, which closely resembles the configuration in semra.landscape.cell, we show how to fill out a configuration in the Python DSL.

import pystow

from semra import Configuration, Reference
from semra.pipeline import AssembleReturnType, Input, Mutation, assemble
from semra.vocabulary import CHARLIE

configuration = Configuration(
    # the key is a short name for the configuration. this is required
    key="cell",
    # the name is a human-readable representation of the configuration
    name="SeMRA Cell and Cell Line Mappings Database",
    # an (optional) description of the reason the configuration was created
    description="Originally a reproduction of the EFO/Cellosaurus/DepMap/CCLE scenario posed in "
    "the Biomappings paper, this configuration imports several different cell and cell line "
    "resources and identifies mappings between them.",
    # an (optional) list of references for creators
    creators=[CHARLIE],
    # the places where data should be acquired
    inputs=[
        Input(source="biomappings"),
        Input(source="gilda"),
        Input(prefix="cellosaurus", source="pyobo", confidence=0.99),
        Input(prefix="bto", source="bioontologies", confidence=0.99),
        Input(prefix="cl", source="bioontologies", confidence=0.99),
        Input(prefix="clo", source="custom", confidence=0.65),
        Input(prefix="efo", source="pyobo", confidence=0.99),
        Input(
            prefix="depmap",
            source="pyobo",
            confidence=0.99,
            extras={"version": "22Q4", "standardize": True, "license": "CC-BY-4.0"},
        ),
        Input(prefix="ccle", source="pyobo", confidence=0.99, extras={"version": "2019"}),
        Input(prefix="ncit", source="pyobo", confidence=0.99),
        Input(prefix="umls", source="pyobo", confidence=0.99),
    ],
    # configuration for how inputs should be subset. This is a dictionary
    # whose keys are prefixes and whose values are collections of
    # references whose hierarchical descendants get kept. For example, this
    # is useful to take subsets from generic resources like NCIT, MeSH, and
    # UMLS
    subsets={
        "mesh": [Reference.from_curie("mesh:D002477")],
        "efo": [Reference.from_curie("efo:0000324")],
        "ncit": [Reference.from_curie("ncit:C12508")],
        "umls": [Reference.from_curie("sty:T025")],
    },
    # the prioritization of prefixes for creating star graphs. the prefixes
    # appearing earlier in the list are higher priority
    priority=[
        "mesh",
        "efo",
        "cellosaurus",
        "ccle",
        "depmap",
        "bto",
        "cl",
        "clo",
        "ncit",
        "umls",
    ],
    # only prefixes in this list are kept from raw mappings. If there are
    # relevant intermediate prefixes whose mappings you don't want to keep
    # after processing, use ``post_keep_prefixes``. This is often the same
    # as the priority list
    keep_prefixes=[],
    # should mappings in the imprecise mappings list (e.g., dbxrefs, rdfs:seeAlso)
    # be removed during processing? Defaults to True.
    remove_imprecise=False,
    # mutations allow you to specify your prior knowledge, for example, that
    # all dbxrefs in EFO should be upgraded to skos:exactMatch with a confidence
    # of 0.7. Mutations can be configured further to only apply to a subset
    # of targets, to change the source predicate from dbxref to something else,
    # or to change the target predicate from skos:exactMatch to something else
    mutations=[
        Mutation(source="efo", confidence=0.7),
        Mutation(source="bto", confidence=0.7),
        Mutation(source="cl", confidence=0.7),
        Mutation(source="clo", confidence=0.7),
        Mutation(source="depmap", confidence=0.7),
        Mutation(source="ccle", confidence=0.7),
        Mutation(source="cellosaurus", confidence=0.7),
        Mutation(source="ncit", confidence=0.7),
        Mutation(source="umls", confidence=0.7),
    ],
    # Should labels be looked up using PyOBO during SSSOM and Neo4j output?
    # this adds some build time. Defaults to False.
    add_labels=True,
    # If this configuration should get uploaded to Zenodo via the ``zenodo_client``
    # python package, use this record ID
    zenodo_record=...,
    # The directory where the results of build should get output. This is
    # required.
    directory=pystow.module("semra", "pipeline-example").base,
)

# these mappings induce a star graph based on the prioritization
priority_mappings = assemble(configuration)

# raw and processed mappings can be returned as well
mapping_pack = assemble(
    configuration,
    return_type=AssembleReturnType.all,
)

For reference, the semra.landscape module contains several pipeline configurations.
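
As a rough picture of what a mutation such as `Mutation(source="efo", confidence=0.7)` expresses, the following standalone sketch rewrites dbxref mappings from one prefix into exact matches with a new confidence. It is an illustration under stated assumptions, not SeMRA's implementation: `ToyMapping` and `apply_mutation` are hypothetical names, the sketch assumes that `source` refers to the prefix of the mapping's subject, and it simply overwrites the confidence rather than tracking evidence.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyMapping:
    """A hypothetical, simplified mapping record (not a SeMRA class)."""

    subject: str
    predicate: str
    obj: str
    confidence: float


def apply_mutation(
    mappings,
    *,
    source,
    confidence,
    old_predicate="oboInOwl:hasDbXref",
    new_predicate="skos:exactMatch",
):
    """Upgrade matching mappings to a new predicate and confidence.

    Assumption: ``source`` is compared against the subject's prefix.
    """
    mutated = []
    for mapping in mappings:
        subject_prefix = mapping.subject.split(":", 1)[0]
        if subject_prefix == source and mapping.predicate == old_predicate:
            # Swap the predicate and record the curator's stated confidence
            mapping = replace(mapping, predicate=new_predicate, confidence=confidence)
        mutated.append(mapping)
    return mutated


# Toy identifiers, chosen only for illustration
toy = [
    ToyMapping("efo:A", "oboInOwl:hasDbXref", "mesh:B", 0.99),
    ToyMapping("cl:C", "oboInOwl:hasDbXref", "mesh:D", 0.99),
]
mutated = apply_mutation(toy, source="efo", confidence=0.7)
```

Only the mapping whose subject prefix is `efo` changes here; the `cl` mapping passes through untouched, which mirrors why the configuration lists one mutation per source.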

Functions

assemble()

Get prioritized mappings based on an assembly configuration.

get_raw_mappings(configuration[, ...])

Get raw mappings based on the inputs in a configuration.

process_raw_mappings(mappings[, ...])

Run a full deduplication, reasoning, and inference pipeline over a set of mappings.

Classes

AssembleReturnType(*values)

An enumeration for the return values for assemble().

Configuration(*, name, key, description, ...)

Represents the steps taken during mapping assembly.

Input(*, source[, prefix, ...])

Represents the input to a mapping assembly.

Mutation(*, source[, target, confidence, ...])

Represents a mutation operation on a mapping set.