Functional Genomics Platform - overview
The IBM Functional Genomics Platform (formerly named OMXWare) is a relational database linking genotype to phenotype for over 300M biological sequences extracted from microbial genomes. This cloud-based platform is continuously updated with hundreds of thousands of genomes from NCBI GenBank, the Sequence Read Archive (SRA), and other sources. Raw bacterial sequence datasets are self-consistently assembled and curated for quality, yielding whole bacterial genomes. The complete collection of assembled genomes is then annotated to identify every gene and protein it contains. Another set of cloud processes discovers all of the domains within each protein.
Protein domains are fundamental objects of biochemistry that deliver the biological activity of the microorganism. They can evolve, function, and exist independently of the larger protein chain in which they are found. Domains are also assigned standardized codes representing the molecular function, cellular component, and biological process with which they are associated. All of these biological entities are then linked in the Functional Genomics Platform. Linking genomes to protein domains is key to understanding phenotype in biology.
The IBM Functional Genomics Platform provides a developer toolkit consisting of REST services, a Python SDK, and a Docker container to help researchers analyze this vast data repository at scale. The toolkit also allows developers to interact with the platform in their own compute environment and integrate it with their existing workflows. Important applications can be built on top of the Functional Genomics Platform, including services that annotate biological function in the microbiome, predict antimicrobial resistance (AMR), develop molecular targets for health interventions, or expand our fundamental knowledge of microbial life.
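To make the idea of linked entities concrete, the relationships described above (genomes contain genes, genes encode proteins, proteins contain domains, and domains carry functional codes) can be pictured as a small relational schema. The sketch below uses an in-memory SQLite database from the Python standard library. Every table name, column name, identifier, and code value in it is a hypothetical placeholder chosen for illustration; it is not the platform's actual schema, REST API, or SDK. It also assumes that a unique gene can appear in more than one genome.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT the platform's real tables.
SCHEMA = """
CREATE TABLE genome  (genome_id TEXT PRIMARY KEY, organism TEXT);
CREATE TABLE gene    (gene_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE protein (protein_id TEXT PRIMARY KEY,
                      gene_id TEXT REFERENCES gene(gene_id));
CREATE TABLE domain  (domain_id TEXT PRIMARY KEY,
                      description TEXT,
                      function_code TEXT);
-- Link tables: a unique gene may occur in many genomes (an assumption),
-- and a protein may contain several domains.
CREATE TABLE genome_gene    (genome_id TEXT REFERENCES genome(genome_id),
                             gene_id TEXT REFERENCES gene(gene_id));
CREATE TABLE protein_domain (protein_id TEXT REFERENCES protein(protein_id),
                             domain_id TEXT REFERENCES domain(domain_id));
"""


def build_demo_db():
    """Create an in-memory database filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO genome VALUES (?, ?)",
                     [("G1", "organism A"), ("G2", "organism B")])
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [("GENE1", "gene one"), ("GENE2", "gene two")])
    conn.executemany("INSERT INTO protein VALUES (?, ?)",
                     [("P1", "GENE1"), ("P2", "GENE2")])
    conn.executemany("INSERT INTO domain VALUES (?, ?, ?)",
                     [("D1", "example domain one", "CODE:0001"),
                      ("D2", "example domain two", "CODE:0002")])
    conn.executemany("INSERT INTO genome_gene VALUES (?, ?)",
                     [("G1", "GENE1"), ("G1", "GENE2"), ("G2", "GENE2")])
    conn.executemany("INSERT INTO protein_domain VALUES (?, ?)",
                     [("P1", "D1"), ("P1", "D2"), ("P2", "D2")])
    conn.commit()
    return conn


def domains_for_genome(conn, genome_id):
    """Follow genome -> gene -> protein -> domain and return distinct domains."""
    query = """
        SELECT DISTINCT d.domain_id, d.function_code
        FROM genome_gene gg
        JOIN protein p         ON p.gene_id = gg.gene_id
        JOIN protein_domain pd ON pd.protein_id = p.protein_id
        JOIN domain d          ON d.domain_id = pd.domain_id
        WHERE gg.genome_id = ?
        ORDER BY d.domain_id
    """
    return conn.execute(query, (genome_id,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for gid in ("G1", "G2"):
        print(gid, domains_for_genome(demo, gid))
```

The single join in `domains_for_genome` is the point of the sketch: once the entities are linked, a question about a genome's functional content becomes one query that walks from the genome to its domains.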
Today, the IBM Functional Genomics Platform contains approximately 220,000 high-quality bacterial and viral genomes, 64 million unique genes, 50 million unique proteins, and over 220 million unique protein domains. The repository also includes public metadata such as geography, food source (for foodborne isolates), and AMR assay data. The collection continues to grow as new genomes are automatically detected and downloaded from public sources.