"A molecule in a haystack: Find drug targets in cancer omics data"

Background

Twenty-first-century biology has developed large-scale methods for genomics, epigenomics and proteomics that generate vast amounts of data. These data include DNA sequences, gene mutations, epigenetic modifications, gene expression, post-transcriptional regulation, protein levels, drug-protein interactions, clinical parameters and countless other data types. However, while we generate large volumes of data, we lack methods to store, manage and analyze them efficiently.
We need better solutions to robustly combine the knowledge from molecular omics data with clinically relevant covariates. In this challenge we try to identify new drug targets by integrating public data sets from cancer-related omics experiments.

Problem

Research projects such as TCGA and ENCODE produce huge amounts of omics data for all kinds of biological samples and diseases. These data sets are very heterogeneous and span all the different levels of cellular activity. They are generated in different experiments and measure different quantities on different scales. Currently, data integration is tedious and requires a lot of manual work and expert knowledge.
Before we can do anything with the data, we need to get an overview, see the connections and understand the relevant biological questions. For that we have to clean, structure and integrate everything and enrich it with prior knowledge. This enables the first steps in data analysis, such as the identification of relevant genes and disease-specific regulation.
To facilitate this, we will develop new ways to store data and generate NoSQL database models for life science data.
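
To make the idea of a NoSQL model for heterogeneous omics data more concrete, here is a minimal sketch of one possible shape: a property graph in which samples, genes and drugs are nodes and each measurement or annotation is a typed relationship carrying its own properties. This is an assumption about how such a model could look, not a prescribed solution. The classes Node and Graph, the relationship names (HAS_MUTATION_IN, EXPRESSES, TARGETS) and all identifiers and values are hypothetical placeholders, and the code is a plain in-memory Python stand-in rather than the API of any database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str   # node type, e.g. "Gene", "Sample", "Drug"
    key: str     # identifier unique within the label
    props: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # (label, key) -> Node
    edges: list = field(default_factory=list)  # (source id, relation, target id, props)

    def add_node(self, label, key, **props):
        # Reuse the node if it already exists, then update its properties.
        node = self.nodes.setdefault((label, key), Node(label, key))
        node.props.update(props)
        return node

    def add_edge(self, src, rel, dst, **props):
        # Measurements live on the relationship, so each data type can
        # carry its own value and unit.
        self.edges.append(((src.label, src.key), rel, (dst.label, dst.key), props))

    def neighbors(self, node, rel):
        # Follow outgoing edges of one relation type.
        nid = (node.label, node.key)
        return [self.nodes[d] for s, r, d, _ in self.edges if s == nid and r == rel]


# Placeholder records, not real data.
g = Graph()
sample = g.add_node("Sample", "S1", cancer_type="TYPE_A")
gene = g.add_node("Gene", "GENE_1")
drug = g.add_node("Drug", "DRUG_X")
g.add_edge(sample, "HAS_MUTATION_IN", gene, variant_class="missense")
g.add_edge(sample, "EXPRESSES", gene, value=7.2, unit="log2 TPM")
g.add_edge(drug, "TARGETS", gene)
```

Keeping the measured value and its unit on the relationship is one way to let data from different experiments and scales sit side by side without forcing them into a single table schema.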

Actual Challenge

1. Develop a data model
   a. Develop a data model that can represent all the different data types
   b. Get data sets for a couple of samples from TCGA/ENCODE and drug targeting data
2. Store data in a database
3. Analyze data
   a. Find genes that are relevant in specific cancer types
   b. Identify molecules targeting these genes
4. Data access
   a. Develop an API that allows flexible access to the data sets, e.g. for predictive analytics and machine learning
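
As a rough illustration of steps 3a and 3b, the sketch below treats a gene as "relevant" for a cancer type when it is mutated in at least a given fraction of that type's samples, and then looks up molecules annotated as targeting those genes. Mutation frequency is only one assumed notion of relevance; the challenge does not define one. The function names recurrent_genes and drugs_for, the min_fraction threshold and all records are hypothetical placeholders, not real TCGA, ENCODE or drug data.

```python
from collections import defaultdict

# Toy records with placeholder names (not real data).
mutations = [
    # (sample id, cancer type, mutated gene)
    ("S1", "TYPE_A", "GENE_1"),
    ("S2", "TYPE_A", "GENE_1"),
    ("S2", "TYPE_A", "GENE_2"),
    ("S3", "TYPE_A", "GENE_3"),
    ("S4", "TYPE_B", "GENE_2"),
]
drug_targets = [("DRUG_X", "GENE_1"), ("DRUG_Y", "GENE_2"), ("DRUG_Z", "GENE_1")]


def recurrent_genes(mutations, cancer_type, min_fraction=0.5):
    """Genes mutated in at least min_fraction of the samples of one cancer type."""
    samples = set()
    samples_by_gene = defaultdict(set)
    for sample, ctype, gene in mutations:
        if ctype == cancer_type:
            samples.add(sample)
            samples_by_gene[gene].add(sample)
    if not samples:
        return {}
    fractions = {g: len(s) / len(samples) for g, s in samples_by_gene.items()}
    return {g: f for g, f in fractions.items() if f >= min_fraction}


def drugs_for(genes, drug_targets):
    """Map each selected gene to the sorted molecules annotated as targeting it."""
    by_gene = defaultdict(list)
    for drug, gene in drug_targets:
        if gene in genes:
            by_gene[gene].append(drug)
    return {g: sorted(d) for g, d in by_gene.items()}


hits = recurrent_genes(mutations, "TYPE_A")   # step 3a
candidates = drugs_for(hits, drug_targets)    # step 3b
print(hits)
print(candidates)
```

In a real solution the two lists would be replaced by queries against the database from step 2, and the two functions could serve as a starting point for the access layer in step 4.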
Can you find a solution to this challenge?
Challenge owner: Martin Preusse (Neo4j)