Computational Bioinformatics & Bio-imaging Laboratory (CBIL)


 
 

Internal Use

caBIG: VIsual and Statistical Data Analyzer - VISDA (caBIG-ICR-100501)
 
 
 

INTRODUCTION

PROJECT MANAGEMENT AND DELIVERABLES

SOFTWARE DEVELOPMENT METHODOLOGY AND DESIGN

 

 

 

INTRODUCTION

The primary objective of this proposed project is to adapt the VIsual and Statistical Data Analyzer (VISDA) software to the cancer Biomedical Informatics Grid (caBIG) architecture, so that users across the cancer research community can analyze their own molecular expression data as well as "grid" data using this tool for cluster modeling, discovery, and visualization. The resulting software will be open source, written in C++/Java, and compatible at the caBIG Silver maturity level. As part of the Integrative Cancer Research (ICR) Workspace, the project will demonstrate how a shared informatics platform and shared expertise can make a comprehensive, federated grid of informatics tools available to the cancer research community. The proposed development of an open-source VISDA will build upon the current computational/statistical bioinformatics expertise at the Computational Bioinformatics and Bioimaging Laboratory (CBIL) (http://www.cbil.ece.vt.edu). VISDA will support the mission of caBIG-ICR by delivering a well-documented and validated toolset for use throughout the cancer research community.

The main application of VISDA is multivariate cluster modeling, discovery, and visualization, particularly for data sets in high-dimensional spaces such as microarray gene expression profiles [Wang et al. 2000, 2004]. Many biomedical research efforts aim, in one way or another, to explore the hidden clustered structure of data. Applications can be found in microarray data analysis, proteomics data analysis, and clinical data analysis. Examples include defining new cancer subtypes based on their gene expression patterns, constructing hierarchical trees of multiclass cancer phenotypic composites, and discovering correlations between cancer statistics and risk factors.

Multivariate data modeling and visualization have proven to be powerful and critical tools for the analysis and interpretation of complex data. To reveal the interesting patterns within a data set, we have developed the VISDA algorithm and software for cluster modeling, discovery, and visualization. The model-supported exploration of high-dimensional data space is achieved through two complementary schemes: dimensionality reduction by discriminatory component analysis and cluster formation by soft data clustering, whose parameters are estimated using the weighted Fisher criterion and the expectation-maximization algorithm, respectively. VISDA uses an adaptive boosting of discriminatory subspaces involving hierarchical mixture modeling of the data set. The hierarchical mixture model, selected optimally by the minimum description length criterion, allows the complete data set to be visualized at the top level and then partitioned, with clusters and subclusters of data points visualized at deeper levels. Each subspace model is linear, while the complete hierarchy maintains overall nonlinearity.
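To make the soft data clustering step concrete, the following is a minimal sketch of expectation-maximization for a two-component normal mixture in one dimension. It is written in Python for brevity and is not the VISDA implementation: the function names, the initialization at the data extremes, the fixed iteration count, and the variance floor are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_clusters(data, n_iter=50):
    """Fit a two-component 1-D normal mixture with EM.

    Returns (weights, means, variances, resp), where resp[i][k] is the
    posterior probability that point i belongs to cluster k, i.e. its
    "soft" cluster membership.
    """
    # Assumed initialization: start the two means at the data extremes.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior membership of each point in each cluster.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k])
                     for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the soft memberships.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
    return weights, means, variances, resp
```

Calling `em_two_clusters` on a small bimodal sample such as `[0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]` separates the two groups and returns, for each point, a membership probability per cluster rather than a hard label. The sketch assumes every point has nonzero density under at least one component; a production implementation would work in log space and handle the multivariate case.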

VISDA can navigate a high-dimensional data set to discover its hidden clustered structure, and then model and visualize what it finds. It is particularly effective on highly complex data sets compared with existing methods. To reveal hidden clusters, our exploration of high-dimensional data space is both statistically principled and visually insightful. The method combines the power of statistical methods with the human gift for pattern recognition, and can progressively capture the interesting aspects of the data set. To the best of our knowledge, it represents the state of the art in visual statistical data analysis and exploration. VISDA draws on advanced theories, methods, and algorithms in statistical learning, and works in both unsupervised and supervised scenarios.

As part of the caBIG-ICR workspace, VISDA will be 1) deployed to the Adopter site(s) for analysis; 2) adapted to the caBIG architecture; and 3) enhanced with a graphical user interface to improve usability. The specific measurable objectives are:

· Develop a Functional Requirements and Design Specification document, in collaboration with the Adopter Center(s) and the Architecture Workspace
· Create a Risk Management Matrix for the project
· Document a Test Approach that ensures requirements are met
· Develop C++/Java code for the following functionality:
- Cluster modeling using hierarchical standard finite normal mixture (HSFNM) models
- Dimension reduction by principal component analysis (PCA), discriminatory component analysis (DCA), and projection pursuit method (PPM)
- Cluster formation by soft data clustering using the expectation-maximization (EM) algorithm
- Cluster validation by the minimum description length (MDL) criterion
- Cluster visualization by hierarchical cluster display
- Graphical user interface (GUI) for VISDA set-up, data input-output, data analysis, and data/results visualization
· Execute the Test Approach
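As an illustration of the cluster validation item in the list above, the sketch below scores candidate mixture models with a minimum description length criterion and prefers the model with the lower score. The exact MDL formula used in VISDA is not stated here; the sketch assumes the common two-part form, negative log-likelihood plus (k/2) log N, where k is the number of free parameters and N is the sample size. All function names are hypothetical, and Python is used only for brevity.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_log_likelihood(data, weights, means, variances):
    """Log-likelihood of 1-D data under a normal mixture model."""
    total = 0.0
    for x in data:
        density = sum(w * normal_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances))
        total += math.log(density)
    return total


def free_params_1d_mixture(n_clusters):
    """K means + K variances + (K - 1) independent mixing weights."""
    return 3 * n_clusters - 1


def mdl_score(log_likelihood, n_free_params, n_samples):
    """Assumed two-part MDL: data cost plus model cost; lower is better."""
    return -log_likelihood + 0.5 * n_free_params * math.log(n_samples)
```

In use, each candidate number of clusters would be fitted (for example by EM), scored with `mdl_score`, and the candidate with the smallest score retained. The penalty term grows with the number of parameters, so an extra cluster is accepted only when it improves the likelihood enough to pay for its added description cost.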

 

 

PROJECT MANAGEMENT AND DELIVERABLES

The VISDA Team and Project Organization

CBIL, a partnership among Virginia Tech, Georgetown University, and Catholic University of America, has been continuously funded by various NIH and DOD grants for over 8 years. This proposed project will leverage CBIL's extensive bioinformatics expertise in data modeling and analysis; its experience in broad, systematic engineering research and development using machine learning and pattern recognition, statistical analysis, intelligent computing, and computer graphics and visualization; and its close interactions with the scientific community (e.g., partnerships with Children's National Medical Center, the Food and Drug Administration, and Johns Hopkins Medical Institutions).


Project Management

The VISDA team has the capability and experience to successfully perform the VISDA project as outlined in the Statement of Work. Furthermore, having participated in the caBIG meetings and been involved in the process of caBIG development, we have a thorough understanding of caBIG philosophy, as well as the NCI computing environment and technical approach. We will perform the specific activities as detailed in the Statement of Work, caBIG-ICR-10-05-01, and present deliverables on the target dates. We will ensure our support meets the following objectives:

· Develop a Functional Requirements and Design Specification document, in collaboration with the Adopter Center(s) and the Architecture Workspace
· Create a Risk Management Matrix for the project
· Document a Test Approach that ensures requirements are met
· Implement all VISDA modules and functionalities
· Execute the Test Approach

The project manager will manage the Cancer Center level project activities using the General Contractor-provided online tool for tracking project deliverables, i.e., the cancer Management Portal (caMP). Regular and ad hoc communications will be scheduled to share project information, including face-to-face meetings, teleconferences, videoconferences, and use of the caBIG website and forums. The faculty lead will generate monthly status reports and submit them to the General Contractor.

The VISDA Team will develop a Risk Management Matrix to identify the potential risks in the project and document the plan for managing these risks. For version control and code archiving, we plan to use CVS on the CBIL Server for day-to-day work, and we will submit monthly progress to the caBIG CVS. For each deliverable, we will ensure accuracy, clarity, consistency with requirements, proper file editing and format, timeliness, user acceptance, and no infringement of copyright.


 

 

SOFTWARE DEVELOPMENT METHODOLOGY AND DESIGN

Our VISDA team specializes in the design and development of standalone systems using the Java™ and C/C++ programming languages. Our development approach favors open-source technologies for their control, extensibility, and cost benefits. Most of our development work will be done using the open-source development tool Eclipse, and Java's cross-platform compatibility allows us to support a full range of deployment platforms. The Eclipse Platform is designed for building integrated development environments (IDEs) that can be used to create applications as diverse as web sites, embedded Java™ programs, C++ programs, and Enterprise JavaBeans™.

The proposed project will follow the standard software development life cycle (sometimes called the "waterfall model"). In the design phase, the objectives are low complexity, modularity, and maintainability. In the implementation phase, developers will collaborate within a CVS environment for version control, and Ant will be used for software release and deployment. The overall architecture of our proposed system is shown in Figure 1.


 

 
 
 
 
 

Copyright ©2004, Computational Bioinformatics and Bioimaging Laboratory (CBIL), Advanced Research Institute, Virginia Tech.

Last Updated: 03/03/2009.