
  • Querying SIB Swiss Institute of Bioinformatics resources with SPARQL

    SWAT4HCLS - Edinburgh (Dec. 2019)


    The SIB Swiss Institute of Bioinformatics has been publishing data using the Resource Description Framework (RDF) since 2007, with the UniProt knowledgebase as the first SIB resource to provide its data on the semantic web. Since then, a growing number of SIB resources have modelled their knowledge in RDF and made it queryable and accessible through their own SPARQL endpoints.


    In this tutorial, we explain how you can use the data from nine independent SIB resources (GlyConnect, UniProt, Rhea, OrthoDB, OMA, Bgee, HAMAP, MetaNetX and neXtProt) to answer interesting biological questions.


    For each resource, we introduce the kind of data that is available, explain how it is modelled, and show how you can query it using SPARQL. We then use SPARQL 1.1 federated queries to show how the connected SIB databases can answer questions that none of them could answer on its own.
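
    As a rough illustration of the federated queries mentioned above, the sketch below builds a SPARQL 1.1 query in which one graph pattern is evaluated by the endpoint that receives the query and another is delegated to a second endpoint through the SERVICE keyword. The endpoint URLs, the prefixes, the property path and the helper names (build_federated_query, build_request_url) are assumptions made for illustration and are not taken from the tutorial material; check them against each resource's documentation before relying on them. The code only constructs the query and the request URL and does not contact any server.

```python
from urllib.parse import urlencode

# Assumed endpoint URLs (not stated in the text above); verify before use.
UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"
RHEA_ENDPOINT = "https://sparql.rhea-db.org/sparql"


def build_federated_query(remote_endpoint: str, limit: int = 5) -> str:
    """Return a SPARQL 1.1 query whose SERVICE block runs on remote_endpoint.

    The prefixes and predicates are illustrative assumptions about the
    UniProt and Rhea RDF models, not an authoritative schema.
    """
    return f"""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rh: <http://rdf.rhea-db.org/>

SELECT ?protein ?reaction ?equation
WHERE {{
  # Evaluated by the endpoint that receives the query.
  ?protein up:annotation/up:catalyticActivity/up:catalyzedReaction ?reaction .

  # Delegated to the remote endpoint by the SERVICE keyword.
  SERVICE <{remote_endpoint}> {{
    ?reaction rh:equation ?equation .
  }}
}}
LIMIT {limit}
""".strip()


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as an HTTP GET request (SPARQL 1.1 Protocol)."""
    return endpoint + "?" + urlencode({"query": query})


query = build_federated_query(RHEA_ENDPOINT)
url = build_request_url(UNIPROT_ENDPOINT, query)
print(query)
```

    The join variable (?reaction here) appears both outside and inside the SERVICE block; that shared variable is what lets the results from two independent databases be combined in a single query.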


    In terms of domain knowledge, the tutorial covers proteins, glycans, reactions of biological interest, orthology, metabolic networks, chemical mapping, and genome/proteome annotations.


    The tutorial starts with a quick introduction to RDF and SPARQL 1.1 in general.


    At the end of the course, participants are expected to have:

    • A basic understanding of SIB resources
    • Some understanding of RDF and SPARQL


    Authors

    • Jerven Bolleman: Introduction to RDF & SPARQL, GlyConnect, UniProt, HAMAP, EBI RDF Ensembl (Elixir friend), DisGeNET (Elixir friend)
    • Dmitry Kuznetsov: OrthoDB
    • Thierry Lombardot: Rhea, IDSM (Elixir friend)
    • Julien Mariethoz: GlyConnect
    • Tarcisio Mendes de Faria: Bgee, OMA browser
    • Anne Morgat: Rhea, IDSM (Elixir friend)
    • Marco Pagni: MetaNetX
    • Monique Zahn: neXtProt


    • Presentation of the SIB Swiss Institute of Bioinformatics + quick introductions to RDF and SPARQL in general.

    • Rhea is a comprehensive expert-curated resource of biochemical transformations, transport reactions, and spontaneous reactions of biological interest.

    • Accompanying Jupyter Notebook (hands-on introduction to querying metabolism-related data across multiple data sources using SPARQL).

    • MetaNetX/MNXref is a resource for systems biology and metabolomics.

    • The neXtProt knowledgebase is an integrative resource providing both data on human proteins and the tools to explore them.

    • In this tutorial, you will learn how to query and retrieve orthology and paralogy information from the OMA database with SPARQL.


    • In this tutorial, you will learn how to query gene expression patterns from the Bgee database with SPARQL.


    • GlyConnect is a platform integrating sources of information to help characterise the molecular components of protein glycosylation.

    • HAMAP is a system for the classification and annotation of protein sequences. It consists of a collection of manually curated family profiles for protein classification, and associated, manually created annotation rules that specify annotations that apply to family members. HAMAP is used to annotate protein records in UniProtKB via UniProt's automatic annotation pipeline.

    • The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation.

      In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.

    • OrthoDB: the hierarchical catalog of orthologs, mapping genomics to functional data.

    • IDSM (Elixir Czech node): Integrated Database of Small Molecules

      Ensembl (EBI RDF platform): Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation.

      DisGeNET: genes and variants associated with human diseases.