Biological Sciences East Tennessee State University
Michael (Misha) Bouzinier, Object Development, InterSystems Corporation
- AcidMiner is presented at JavaOne 2010 in San Francisco. See our presentation.
- Relevant information from Gene Ontology and FlyBase is imported directly into AcidMiner.
- AcidMiner results are published in BMC Genomics. See the Web Version or PDF.
- AcidMiner Database recalculated using updated technology.
Evolution of new gene functions is one of the
mysteries of modern biology. New genes and new functions can arise, for
example, by gene duplication, with one copy retaining the old function and the
other picking up a new one. Models show, however, that a more likely fate of
duplicated genes is balanced degradation: the situation in which both copies
sustain harmful mutations and lose part of the activity, so that both
are now needed to maintain the pre-duplication level of the old function.
To get a genome-wide signature of balanced degradation versus
neofunctionalization one can measure how frequent and how radical amino
acid substitutions are in duplicated genes as compared to single-copy genes, and
whether or not these patterns are different in the two branches of the
phylogenetic tree generated by the duplication.
At the current stage we have
used AcidMiner to examine the topology of
15,000 gene trees encompassing 12 species of Drosophila with completely
sequenced genomes, analyzing more than 8 million amino-acid substitutions and
placing, whenever possible, each of them on specific branches,
the outgroup and grouping by gene molecular function.
Future projects (once we have the tool in place) will include the same analysis
of mammalian genomes
(perhaps with chicken as the outgroup); analysis of amino acid
substitutions within species (polymorphisms) using 200 Drosophila
melanogaster genomes currently in the making; analysis of
self-incompatibility genes in plants (known to evolve very fast in
terms of frequency of substitutions, but not necessarily in terms of
how radical these substitutions are).
Our current results are published in BMC Genomics:
Most of the original data we use for our calculations are taken from
the Indiana University database Drosophila Family
Data: FRB Dataset. We are grateful to M. Hahn for providing
alignments, phylogeny and useful discussion. We are using Java to
program most of the algorithms and Cache ObjectScript
for those algorithms that deal with the data already in the database.
We use JPA to define the RDBMS mapping.
At the moment InterSystems Cache is used as the RDBMS and InterSystems
Jalapeño as the DBMS driver; however, all DBMS-specific code is
placed in a
single small class (DAO.java) and can be adapted for use with other
RDBMSs that support JDBC. Jalapeño technology allows a complex object model
to be mapped transparently into a relational database. For
example, let us take family
1218 in the Dfam database (signal
transduction). It contains 48 proteins, and its alignment contains
about 5500 sites. That yields 523450 tree nodes and branches in the
phylogeny for all sites. Using parsimony we are reviewing 3000
substitution possibilities. Jalapeño transparently stores all this
information for the whole family, structured yet "flattened" to fit
relational requirements, in a single call.
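The node count quoted above can be checked by hand under one assumption: that each alignment site gets its own rooted binary tree over the 48 proteins. Such a tree has 2 × 48 − 1 = 95 nodes, and 523450 / 95 = 5510 sites, which is consistent with "about 5500". The sketch below only reproduces this arithmetic. The class and method names are hypothetical, and the one-tree-per-site reading is an assumption, not something taken from the AcidMiner source.

```java
// Hypothetical helper, not part of AcidMiner: reproduces the node-count arithmetic.
final class TreeSizeSketch {

    /** Number of nodes (leaves plus internal nodes) in a rooted binary tree. */
    static long nodesPerTree(int leaves) {
        if (leaves < 1) {
            throw new IllegalArgumentException("leaves must be >= 1");
        }
        return 2L * leaves - 1;
    }

    /** Total nodes when every alignment site has its own tree (assumption). */
    static long totalNodes(int leaves, int sites) {
        return nodesPerTree(leaves) * sites;
    }

    public static void main(String[] args) {
        // 48 proteins, 5510 sites: 95 nodes per tree times 5510 sites.
        System.out.println(totalNodes(48, 5510));
    }
}
```

Each non-root node can be stored together with the branch that terminates in it, which is one way to read "tree nodes and branches" as a single count.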
We will be interested in adding support for
We are creating a Virtual Appliance with the DBMS containing all the data,
using VirtualBox and VMware.
For further discussion of the technology please see the presentation we made at
the JavaOne conference in September 2010.
Using the Project
Configuring the environment for AcidMiner is not trivial, so
we release a Virtual Appliance built with VirtualBox. It runs
SUSE Linux and has the DBMS (InterSystems Cache) installed with the current
data. Installing the VM on your system requires about
30 GB of space on your hard drive (or an external hard drive) and 384 MB of
RAM. The amount of RAM can be decreased in the virtual machine settings if
384 MB is a problem, but decreasing it below 256 MB will probably
noticeably affect performance.
The latest version of the virtual machine can be downloaded from:
or from a
You can view all current data and run any SQL queries using any SQL
tool that supports connections over JDBC. Our favorite is DbVisualizer, but any other will
do. You can obtain the Cache JDBC driver by downloading Cache from the
InterSystems website and selecting the client install. Use the connection
settings as described here.
Below you will find brief description of the
Some of the SQL queries that can be performed with the database are
The full SQL for the sample queries and others that we ran most often is
included in the SVN repository (see below).
If you prefer an ODBC connection you will need to set up an ODBC
data source, which is possible but tiresome.
If you would like to modify any of our algorithms, download source code
Or check it out:
svn co https://acidminer.svn.sourceforge.net/svnroot/acidminer
Source code includes an IntelliJ IDEA Project.
Once you have the Java source code you will need to adjust the database
connection settings in class com.intersys.bio.paralogs.db.DAO.
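As a rough illustration of the kind of settings involved, the sketch below builds a JDBC URL from host, port and namespace and opens a connection with the standard java.sql API. It is not the actual DAO class. The class name, the example IP address and the namespace are hypothetical, and the URL form jdbc:Cache://host:port/namespace and port 1972 are assumptions based on common Cache defaults, so check them against the driver documentation and your VM settings.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical settings holder; the real settings live in DAO.java.
final class ConnectionSettingsSketch {
    final String host;
    final int port;
    final String namespace;

    ConnectionSettingsSketch(String host, int port, String namespace) {
        this.host = host;
        this.port = port;
        this.namespace = namespace;
    }

    /** Assumed URL form for the Cache JDBC driver. */
    String jdbcUrl() {
        return "jdbc:Cache://" + host + ":" + port + "/" + namespace;
    }

    /** Opens a connection; requires the Cache JDBC driver on the classpath. */
    Connection open(String user, String password) throws SQLException {
        return DriverManager.getConnection(jdbcUrl(), user, password);
    }

    public static void main(String[] args) {
        // Example values only: replace with the settings of your own VM.
        ConnectionSettingsSketch s =
                new ConnectionSettingsSketch("192.168.56.101", 1972, "ACIDMINER");
        System.out.println(s.jdbcUrl());
    }
}
```

Keeping these values in one place mirrors the idea that all DBMS-specific code sits in a single small class.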
To set up the Java development environment, follow these steps:
- Download source code and project files from the source
- The project is an IntelliJ IDEA Project. Open it in IDEA. If you
prefer not to use IDEA you can use ANT build scripts.
- In IDEA plugin management, install the JalapenoDev Plugin (for DBMS
integration). Configure the connection to the DBMS in the VM using the settings below.
Alternatively, you can follow the prompt to download and install
Cache on your system.
- For more information on using Jalapeño Plugin for IDEA and
Jalapeño technology see JalapeñoDev
- In DAO.java class configure database connections (host, port,
- In the "Jalapeño Tools" menu of IDEA there is a Help submenu leading
to some documentation and a QuickStart Guide. There you can find
instructions on how to build the database schema. Basically you need to
open Jalapeño Preferences in IDEA preferences and change host,
namespace. Then select "Build Database Schema" from Build menu.
- Most of the options are available through the Toolkit.java class. There
are also several CSH scripts for batch computations.
to Virtual Machine DBMS:
IP (may be different depending
on VirtualBox configuration)
The database consists of 14 tables defining the base data model.
Two additional tables contain preloaded data for gene ontology and
of substitutions including DPolarity and Expressibility. The main
tables are:
- Families table
- Tree Structure tables for protein and amino-acid trees with a
separate record for each tree node and the branch terminating in this
node.
- Substitutions table with a record for each unambiguous and
ambiguous substitution, including a reference to the branch where it
occurred (or might have occurred, for ambiguous substitutions), the source
and destination amino acids, and various additional data.
- Duplications table that includes detailed information about each
duplication and its clades.
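To illustrate how unambiguous and ambiguous substitutions can arise when states are reconstructed by parsimony, here is a minimal sketch of the bottom-up pass of Fitch's small-parsimony algorithm for a single alignment site. The source only says that parsimony is used, so Fitch's algorithm is a stand-in chosen for illustration and may differ from what AcidMiner does. All class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: Fitch bottom-up pass for one site on a rooted binary tree.
final class FitchSketch {

    static final class Node {
        final Node left;
        final Node right;
        final Set<Character> states = new HashSet<>();

        /** Leaf holding the observed amino acid at this site. */
        Node(char residue) {
            this.left = null;
            this.right = null;
            states.add(residue);
        }

        /** Internal node with two children. */
        Node(Node left, Node right) {
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    /** Fills the state sets and returns the minimum number of substitutions. */
    static int fitch(Node n) {
        if (n.isLeaf()) {
            return 0;
        }
        int cost = fitch(n.left) + fitch(n.right);
        n.states.clear();
        n.states.addAll(n.left.states);
        n.states.retainAll(n.right.states);      // intersection of child sets
        if (n.states.isEmpty()) {                // disjoint: take the union,
            n.states.addAll(n.left.states);      // which costs one substitution
            n.states.addAll(n.right.states);
            cost++;
        }
        return cost;
    }

    /** At the root, more than one state means several equally good choices. */
    static boolean isAmbiguous(Node root) {
        return root.states.size() > 1;
    }

    public static void main(String[] args) {
        Node tree = new Node(new Node(new Node('A'), new Node('S')),
                             new Node(new Node('T'), new Node('T')));
        System.out.println(fitch(tree) + " substitutions, ambiguous root: "
                + isAmbiguous(tree));
    }
}
```

For internal nodes other than the root, a complete reconstruction would add a top-down pass before deciding which substitutions are ambiguous; the sketch stops at the bottom-up pass.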
Besides the identification columns (included in the primary
key), each table contains columns with navigation data that make it easy to reach
related records, and some additional information used mostly to speed up
various calculations inside SQL queries. Examples of such additional
data include branch lengths, protein lengths and some aggregates for
Standard SQL and some DBMS-specific SQL extensions can be used to
perform queries by any valid criteria. Any SQL tool capable of working
with a JDBC data source can be used, though we used DbVisualizer. There are
no restrictions on the type of joins that can be included in the query.
A query to calculate radicality data by molecular function
(actual SQL is available in the source repository):
select MolecularFunction, ambiguity, <all radicality aggregates>,
<all radicality variances>
from SubstitutionsTable join
OntologyTable join RadicalityTable
group by MolecularFunction, ambiguity
A query to calculate the distribution of substitutions by source and
destination amino acid (actual SQL is available in the source repository):
select ambiguity, SourceAminoAcid, DestinationAminoAcid, count(*)
from SubstitutionsTable
group by ambiguity, SourceAminoAcid, DestinationAminoAcid
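For readers more comfortable with Java than SQL, the grouping performed by the second query can be sketched in memory as follows. This only illustrates what a count grouped by ambiguity, source and destination amino acid computes. The Substitution class, its fields and the sample rows are hypothetical and do not reflect the actual table layout.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative in-memory equivalent of the grouping; not AcidMiner code.
final class SubstitutionCountsSketch {

    static final class Substitution {
        final boolean ambiguous;
        final char source;
        final char destination;

        Substitution(boolean ambiguous, char source, char destination) {
            this.ambiguous = ambiguous;
            this.source = source;
            this.destination = destination;
        }

        /** Key combining the three grouping columns of the query. */
        String groupKey() {
            return (ambiguous ? "ambiguous" : "unambiguous")
                    + " " + source + ">" + destination;
        }
    }

    /** Counts rows per (ambiguity, source, destination) group, sorted by key. */
    static Map<String, Long> countByGroup(List<Substitution> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                Substitution::groupKey, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample rows for demonstration.
        List<Substitution> rows = Arrays.asList(
                new Substitution(false, 'A', 'S'),
                new Substitution(false, 'A', 'S'),
                new Substitution(true, 'A', 'S'),
                new Substitution(false, 'S', 'T'));
        countByGroup(rows).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```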