Project AcidMiner

Project Leads:

Lev Yampolsky, Department of Biological Sciences, East Tennessee State University

Michael (Misha) Bouzinier, Object Development, InterSystems Corporation

News


2010/09/21
AcidMiner is presented at JavaOne 2010 in San Francisco. See our presentation.
2010/11/10
Relevant information from Gene Ontology and FlyBase is imported directly into AcidMiner.
2010/12/02
AcidMiner results are published in BMC Genomics. See the Web Version or PDF.
2010/12/05
AcidMiner Database recalculated using updated technology.

Project Goals

Evolution of new gene functions is one of the persisting mysteries of modern biology. New genes and new functions can arise, for example, by gene duplication, with one copy retaining the old function and the other picking up a new one. Models show, however, that a more likely fate of duplicated genes is balanced degradation: the situation in which both copies sustain harmful mutations and lose part of their activity, so that both copies are now needed to maintain the pre-duplication level of the old function.

To get a genome-wide signature of balanced degradation and neofunctionalization, one can measure how frequent and how radical amino acid substitutions are in duplicated genes compared with single-copy genes, and whether these patterns differ between the two branches of the phylogenetic tree generated by the duplication.

At the current stage we have used AcidMiner to examine the topology of 15,000 gene trees encompassing 12 species of Drosophila with completely sequenced genomes. We analyzed more than 8 million amino acid substitutions, placing each of them on a specific branch whenever possible, polarizing by the outgroup, and grouping by the molecular function of the gene.
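The idea of polarizing a substitution by the outgroup can be illustrated with a deliberately simplified sketch. It is not AcidMiner's actual algorithm (which works on whole gene trees); it only shows the logic for a single site with two sister lineages and one outgroup. The class and method names are hypothetical.

```java
/** Simplified illustration: polarize one site using two sister lineages and an outgroup. */
final class OutgroupPolarizer {

    /** Where the substitution is placed, if it can be placed at all. */
    enum Placement { NO_CHANGE, ON_FIRST_BRANCH, ON_SECOND_BRANCH, AMBIGUOUS }

    /**
     * @param first    residue at this site in the first sister lineage
     * @param second   residue at this site in the second sister lineage
     * @param outgroup residue at the same site in the outgroup
     */
    static Placement polarize(char first, char second, char outgroup) {
        if (first == second) {
            return Placement.NO_CHANGE;          // the sisters do not differ at this site
        }
        if (outgroup == first) {
            // The outgroup agrees with the first lineage, so that state is taken
            // as ancestral and the change is placed on the second branch.
            return Placement.ON_SECOND_BRANCH;
        }
        if (outgroup == second) {
            return Placement.ON_FIRST_BRANCH;    // mirror case
        }
        return Placement.AMBIGUOUS;              // three different residues: cannot polarize
    }

    public static void main(String[] args) {
        System.out.println(polarize('A', 'V', 'A')); // ON_SECOND_BRANCH
        System.out.println(polarize('A', 'V', 'V')); // ON_FIRST_BRANCH
        System.out.println(polarize('A', 'V', 'G')); // AMBIGUOUS
    }
}
```

The AMBIGUOUS case corresponds to the "whenever possible" qualification: when the outgroup matches neither lineage, a single outgroup cannot assign the substitution to a branch.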

Future projects (once we have the tool in place) will include the same analysis of mammalian genomes (human/chimp/orang/baboon/tupaia/mouse/rat/dog/pig/cattle/opossum, perhaps with chicken as the outgroup); analysis of amino acid substitutions within species (polymorphisms) using 200 Drosophila melanogaster genomes currently in the making; and analysis of self-incompatibility genes in plants (known to evolve very fast in terms of frequency of substitutions, but not necessarily in terms of how radical these substitutions are).

Current Results

Our current results are published in BMC Genomics:

http://www.biomedcentral.com/1471-2164/11/S4/S10

Technology

Most of the original data we use for our calculations are taken from the Indiana University database Drosophila Family Data: FRB Dataset. We are grateful to M. Hahn for providing alignments, phylogeny and useful discussion. We use Java to program most of the algorithms and Cache ObjectScript for those algorithms that deal with data already in the database.

We use JPA to define RDBMS mapping.

At the moment InterSystems Cache is used as the RDBMS and InterSystems Jalapeño as the DBMS driver. However, all DBMS-specific code is placed in a single small class (DAO.java) and can be adapted for use with other RDBMSs that support JDBC. Jalapeño technology allows a complex object model to be mapped transparently onto a relational database.

For example, take family 1218 in the Dfam database (signal transduction). It contains 48 proteins, and its alignment contains about 5500 sites. That yields 523450 tree nodes and branches in the phylogeny across all sites. Using parsimony, we review 3000 possible substitutions. All this information for the whole family is structured, yet "flattened" to fit relational requirements, and Jalapeño stores it transparently in a single call.
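One way to see where a number of this size comes from is the small calculation below. It rests on assumptions not stated in the text: that the family tree is rooted and fully bifurcating (so 48 tips give 95 nodes), that one record is stored per node per alignment site, and that the exact site count is 5510 (the text says only "about 5500"; 5510 is chosen because it reproduces 523450 exactly). The class and method names are hypothetical.

```java
/** Back-of-the-envelope size of the per-site tree data for one gene family. */
final class FamilyTreeSize {

    /** Nodes (tips plus internal nodes, root included) in a rooted, fully bifurcating tree. */
    static long nodesInRootedBinaryTree(int tips) {
        return 2L * tips - 1;
    }

    /** One copy of the tree per alignment site: nodes per tree times number of sites. */
    static long nodeRecords(int tips, int alignmentSites) {
        return nodesInRootedBinaryTree(tips) * alignmentSites;
    }

    public static void main(String[] args) {
        // 48 proteins -> 95 nodes per tree; 95 * 5510 assumed sites = 523450 records
        System.out.println(nodeRecords(48, 5510));
    }
}
```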

We are interested in adding support for MySQL.

We are creating a virtual appliance, built with VirtualBox and VMware Workstation, whose DBMS contains all the data.

For further discussion of the technology, please see the presentation we gave at the JavaOne conference in September 2010.

Using the Project

Configuring the environment for AcidMiner is not trivial, so we release a virtual appliance built with VirtualBox. It runs SUSE Linux and has the DBMS (InterSystems Cache) installed with the current data. Installing the VM on your system requires about 30 GB of space on your hard drive (or an external hard drive) and 384 MB of RAM. The amount of RAM can be decreased in the virtual machine settings if 384 MB is a problem, but going below 256 MB will probably affect performance noticeably.
The latest version of the virtual machine can be downloaded from SourceForge or from a backup location.
You can view all current data and run any SQL queries using any SQL tool that supports connections over JDBC. Our favorite is DbVisualizer, but any other will do. You can obtain the Cache JDBC driver by downloading Cache from the InterSystems website and selecting the client install. Use the connection settings as described here. Below you will find a brief description of the database structure.
Some of the SQL queries that can be run against the database are shown below. The full SQL for the sample queries, and for others that we ran most often, is included in the SVN repository (see below).

If you prefer an ODBC connection, you will need to set up an ODBC data source, which is possible but tiresome.

If you would like to modify any of our algorithms, download the source code from:

SVN Repositories:

Browse Source Code

Or check it out:

svn co https://acidminer.svn.sourceforge.net/svnroot/acidminer acidminer

Source code includes an IntelliJ IDEA Project.

Once you have the Java source code, you will need to adjust the database connection settings in the class com.intersys.bio.paralogs.db.DAO.

Setting Up

To set up the Java development environment, follow these steps:
  1. Download the source code and project files from the source repository.
  2. The project is an IntelliJ IDEA project. Open it in IDEA. If you prefer not to use IDEA, you can use the Ant build scripts.
  3. In IDEA's plugin manager, install the JalapenoDev plugin (for DBMS integration). Configure the connection to the DBMS in the VM using the settings below. Alternatively, you can follow the prompt to download and install InterSystems Cache on your own system.
  4. For more information on the Jalapeño plugin for IDEA and on Jalapeño technology, see the JalapeñoDev Group.
  5. In the DAO.java class, configure the database connection (host, port, namespace, etc.).
  6. The "Jalapeño Tools" menu of IDEA has a Help submenu that links to documentation and a QuickStart Guide, where you can find instructions on how to build the database schema. In short, open Jalapeño Preferences in the IDEA preferences and change the host, port and namespace, then select "Build Database Schema" from the Build menu.
  7. Most of the options are available through the Toolkit.java class. There are also several csh scripts for batch computations.

Configuration Settings for Connection to Virtual Machine DBMS:

IP (may be different depending on VirtualBox configuration): 192.168.56.101
Namespace: ACIDMINER
Port: 1972
Username: _SYSTEM
Password: SYS
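A minimal sketch of using these settings from plain JDBC follows. It assumes that the Cache JDBC driver jar is on the classpath and that the driver accepts URLs of the form jdbc:Cache://host:port/namespace; check the driver documentation for your Cache version. The class and method names are hypothetical and are not part of the AcidMiner source.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Opens a JDBC connection to the DBMS inside the virtual machine. */
final class AcidMinerConnection {

    // Values from the settings above; the IP depends on your VirtualBox configuration.
    static final String HOST = "192.168.56.101";
    static final int PORT = 1972;
    static final String NAMESPACE = "ACIDMINER";
    static final String USER = "_SYSTEM";
    static final String PASSWORD = "SYS";

    /** Builds the connection URL (assumed format: jdbc:Cache://host:port/namespace). */
    static String jdbcUrl(String host, int port, String namespace) {
        return "jdbc:Cache://" + host + ":" + port + "/" + namespace;
    }

    /** Opens a connection; fails with SQLException if the driver or the VM is unavailable. */
    static Connection open() throws SQLException {
        return DriverManager.getConnection(jdbcUrl(HOST, PORT, NAMESPACE), USER, PASSWORD);
    }

    public static void main(String[] args) {
        System.out.println(jdbcUrl(HOST, PORT, NAMESPACE));
        try (Connection connection = open()) {
            System.out.println("Connected to " + connection.getMetaData().getDatabaseProductName());
        } catch (SQLException e) {
            // Typical causes: driver jar missing from the classpath, or the VM is not running.
            System.out.println("Could not connect: " + e.getMessage());
        }
    }
}
```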


Database Structure

The database consists of 14 tables defining the base data model. Two additional tables contain preloaded data for gene ontology and radicality of substitutions including DPolarity and Expressibility. The main tables include:

Besides the identification columns (included in the primary key), each table contains columns with navigation data that make it easy to reach related records, plus some additional information used mostly to speed up calculations inside SQL queries. Examples of such additional data include branch lengths, protein lengths and some aggregates for subtrees.

Standard SQL and some DBMS-specific SQL extensions can be used to query by any valid criteria. Any SQL tool capable of working with a JDBC data source can be used; we used DbVisualizer. There are no restrictions on the types of joins that can be included in a query.

Sample Queries

A query to calculate radicality data by molecular function (actual SQL is available in the source repository):

    select MolecularFunction, ambiguity, <all radicality aggregates>, <all radicality variances>
      from SubstitutionsTable join OntologyTable join RadicalityTable
      group by MolecularFunction, ambiguity

A query to calculate the distribution of substitutions by source and destination amino acid (actual SQL is available in the source repository):

    select ambiguity, SourceAminoAcid, DestinationAminoAcid, count(*)
      from SubstitutionsTable
      group by ambiguity, SourceAminoAcid, DestinationAminoAcid
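What the grouping in this last query computes can be mirrored in plain Java on a handful of in-memory rows. This is only an illustration of the aggregation, not code from the project. The Row class, its field names and the integer type of the ambiguity column are hypothetical stand-ins for the real schema.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** In-memory equivalent of grouping substitutions and counting rows per group. */
final class SubstitutionTally {

    /** One substitution row; fields are hypothetical stand-ins for the real columns. */
    static final class Row {
        final int ambiguity;
        final char source;
        final char destination;

        Row(int ambiguity, char source, char destination) {
            this.ambiguity = ambiguity;
            this.source = source;
            this.destination = destination;
        }

        /** Grouping key built from the three group-by columns, e.g. "0:A>V". */
        String key() {
            return ambiguity + ":" + source + ">" + destination;
        }
    }

    /** Counts rows per (ambiguity, source, destination) group, sorted by key. */
    static Map<String, Long> tally(List<Row> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(Row::key, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(0, 'A', 'V'),
                new Row(0, 'A', 'V'),
                new Row(0, 'V', 'A'),
                new Row(1, 'A', 'V'));
        System.out.println(tally(rows)); // {0:A>V=2, 0:V>A=1, 1:A>V=1}
    }
}
```

Note that A>V and V>A are counted separately, which is why the substitutions need to be polarized before this distribution is meaningful.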