Boag: Boa for Genomics

Boa is a domain-specific language initially designed for mining software repositories. Boag is a domain-specific language designed for analyzing biological data.

Source code and Documentation

GitHub Repository

Background

Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult because of significant barriers to organizing, extracting, and analyzing relevant data. Shared data science infrastructures like Boag are needed to efficiently parse and process the data contained in large data repositories. The main features of Boag are inspired by existing languages for data-intensive computing, and Boag can easily integrate data from biological data repositories.

Boag Results and Dataset

Boa for genomics (Boag) has been implemented to analyze the metadata of RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) files. Boag offers a substantial improvement over existing solutions such as Python and MongoDB by using a domain-specific language that runs on Hadoop infrastructure, which gives a smaller storage footprint, scales well, and requires fewer lines of code.
Boag databases significantly reduce the storage required for the raw data and significantly speed up queries over large datasets, because the Hadoop infrastructure automatically parallelizes and distributes the computations.

Boag illustration

The examples on this website are a few illustrations of query results obtained with Boag. A web-based interface is also provided for writing and submitting more complex queries to our infrastructure. Please see our GitHub Repository for more information.

Tree of Life provides summary statistics for each node in the tree of life.

Summary Statistics displays summary statistics of phylogenetic trees based on the NCBI taxonomy.

The Assembly Programs example provides insight into the programs that have been used for genome assembly, along with statistics on contigs, scaffolds, etc.
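
To give a feel for the kind of aggregation behind the assembly programs example, here is a minimal sketch in plain Python. It is not Boag code and not the project's implementation. The record layout, the field names (`program`, `contigs`, `scaffolds`), the assembler names, and the numbers are all hypothetical placeholders, and the choice of per-program counts and means is an assumption about which statistics are of interest.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metadata records; names and values are made up for illustration
# and do not come from RefSeq or from Boag.
records = [
    {"program": "AssemblerA", "contigs": 120, "scaffolds": 30},
    {"program": "AssemblerB", "contigs": 45, "scaffolds": 12},
    {"program": "AssemblerA", "contigs": 80, "scaffolds": 20},
]


def summarize_by_program(records):
    """Group records by assembly program and compute simple per-program statistics."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["program"]].append(rec)

    summary = {}
    for program, recs in grouped.items():
        summary[program] = {
            "assemblies": len(recs),  # how many assemblies used this program
            "mean_contigs": mean(r["contigs"] for r in recs),
            "mean_scaffolds": mean(r["scaffolds"] for r in recs),
        }
    return summary


summary = summarize_by_program(records)
for program, stats in sorted(summary.items()):
    print(program, stats)
```

In a Hadoop-backed system, the grouping and the per-group statistics would be distributed across the cluster rather than computed in a single in-memory loop as in this sketch.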