caBIG: VIsual and Statistical Data Analyzer - VISDA (caBIG-ICR-100501)
INTRODUCTION
PROJECT MANAGEMENT AND DELIVERABLES
SOFTWARE DEVELOPMENT METHODOLOGY AND DESIGN
INTRODUCTION
The primary objective of this proposed project is to adapt and convert
the VIsual and Statistical Data Analyzer (VISDA) software to the
cancer Biomedical Informatics Grid (caBIG) architecture, so that
users across the cancer research community can analyze their own
molecular expression data, as well as "grid" data, using this
powerful tool for cluster modeling, discovery, and visualization.
The resulting software will be open source, written in C++/Java, and
compatible with caBIG at the Silver maturity level. As part of the
Integrative Cancer Research (ICR) Workspace, the project will
demonstrate how a shared informatics platform and shared expertise
can make a comprehensive, federated grid of informatics tools
available to the cancer research community. The proposed development
of an open-source VISDA will build upon the current computational and
statistical bioinformatics expertise at the Computational
Bioinformatics and Bioimaging Laboratory (CBIL)
(http://www.cbil.ece.vt.edu). VISDA will support the mission of
caBIG-ICR by delivering a well-documented and validated toolset for
use throughout the cancer research community.
The main application of VISDA is multivariate cluster modeling,
discovery, and visualization, particularly for data sets living in
high-dimensional space, such as microarray gene expression profiles
[Wang et al. 2000, 2004]. Many biomedical research efforts aim, in
one way or another, to explore the hidden cluster structure of their
data. Applications can be found in microarray data analysis,
proteomics data analysis, and clinical data analysis. Examples
include defining new cancer subtypes based on their gene expression
patterns, constructing hierarchical trees of multiclass cancer
phenotypic composites, and discovering correlations between cancer
statistics and risk factors.
Multivariate data modeling and visualization have proven to be
powerful and critical tools for the analysis and interpretation of
complex data. To reveal the interesting patterns within a data set,
we have developed the VISDA algorithm and software for cluster
modeling, discovery, and visualization. The model-supported
exploration of high-dimensional data space is achieved through two
complementary schemes: dimensionality reduction by discriminatory
component analysis, and cluster formation by soft data clustering.
Their parameters are estimated using the weighted Fisher criterion
and the expectation-maximization algorithm, respectively. VISDA
adaptively boosts discriminatory subspaces through hierarchical
mixture modeling of the data set. The hierarchical mixture model,
selected optimally by the minimum description length criterion,
allows the complete data set to be visualized at the top level and
then partitions the data set, with clusters and subclusters of data
points visualized at deeper levels. Each subspace model is linear,
while the complete hierarchy maintains overall nonlinearity.
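The soft data clustering step described above can be illustrated with
a minimal sketch. The Java class below is not VISDA code. It assumes a
one-dimensional Gaussian mixture, and the names EmSoftClusteringSketch
and fit are hypothetical. It only shows how expectation-maximization
alternates between computing soft cluster memberships (E-step) and
re-estimating the cluster parameters from them (M-step).

```java
import java.util.Arrays;

/** Illustrative EM for a 1-D Gaussian mixture; not the VISDA implementation. */
class EmSoftClusteringSketch {

    /** Gaussian density N(x; mean, var). */
    static double gaussian(double x, double mean, double var) {
        double d = x - mean;
        return Math.exp(-d * d / (2.0 * var)) / Math.sqrt(2.0 * Math.PI * var);
    }

    /**
     * Runs a fixed number of EM iterations. The arrays means, vars, and
     * weights hold the initial guess and are updated in place. Returns the
     * soft memberships z[i][j]: posterior probability of point i in cluster j.
     */
    static double[][] fit(double[] x, double[] means, double[] vars,
                          double[] weights, int iterations) {
        int n = x.length;
        int k = means.length;
        double[][] z = new double[n][k];
        for (int it = 0; it < iterations; it++) {
            // E-step: posterior membership of every point in every cluster.
            for (int i = 0; i < n; i++) {
                double total = 0.0;
                for (int j = 0; j < k; j++) {
                    z[i][j] = weights[j] * gaussian(x[i], means[j], vars[j]);
                    total += z[i][j];
                }
                for (int j = 0; j < k; j++) {
                    z[i][j] /= total;
                }
            }
            // M-step: membership-weighted updates of mean, variance, weight.
            for (int j = 0; j < k; j++) {
                double nj = 0.0;
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    nj += z[i][j];
                    sum += z[i][j] * x[i];
                }
                double mean = sum / nj;
                double sq = 0.0;
                for (int i = 0; i < n; i++) {
                    double d = x[i] - mean;
                    sq += z[i][j] * d * d;
                }
                means[j] = mean;
                vars[j] = Math.max(sq / nj, 1e-6); // floor avoids a zero variance
                weights[j] = nj / n;
            }
        }
        return z;
    }

    public static void main(String[] args) {
        // Toy data, made up for illustration: two well-separated groups.
        double[] x = {0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3};
        double[] means = {0.0, 6.0};
        double[] vars = {1.0, 1.0};
        double[] weights = {0.5, 0.5};
        double[][] z = fit(x, means, vars, weights, 50);
        System.out.println("means   = " + Arrays.toString(means));
        System.out.println("weights = " + Arrays.toString(weights));
        System.out.println("z[0]    = " + Arrays.toString(z[0]));
    }
}
```

Because every point receives a membership probability for every
cluster instead of a hard label, points near a cluster boundary
contribute partially to both neighboring clusters when the parameters
are re-estimated.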
VISDA can navigate a high-dimensional data set to discover its hidden
cluster structure, and then model and visualize what it finds. It is
particularly effective, compared with existing methods, on highly
complex data sets. To reveal the hidden clusters, our exploration of
high-dimensional data space is both statistically principled and
visually insightful. The method combines the power of statistical
methods with the human gift for pattern recognition, and can
progressively capture the interesting aspects of the data set. To the
best of our knowledge, it represents the state of the art in visual
statistical data analysis and exploration. VISDA incorporates advanced
theories, methods, and algorithms from statistical learning, and
works in both unsupervised and supervised scenarios.
As part of the caBIG-ICR workspace, VISDA will be 1) deployed to
the Adopter site(s) for analysis; 2) adapted to the caBIG architecture;
and 3) enhanced with a graphical user interface to improve usability.
The specific measurable objectives are:
· Develop a Functional Requirements and Design Specification document,
in collaboration with the Adopter Center(s) and the Architecture
Workspace
· Create a Risk Management Matrix for the project
· Document a Test Approach that ensures requirements are met
· Develop C++/Java code for the following functionality:
- Cluster modeling using hierarchical standard finite normal mixture
(HSFNM) models
- Dimension reduction by principal component analysis (PCA), discriminatory
component analysis (DCA), and projection pursuit method (PPM)
- Cluster formation by soft data clustering using the expectation-maximization
(EM) algorithm
- Cluster validation by the minimum description length (MDL) criterion
- Cluster visualization by hierarchical cluster display
- Graphical user interface (GUI) for VISDA set-up, data input-output,
data analysis, and data/results visualization
· Execute the Test Approach
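As an illustration of the cluster validation item in the list above,
the following Java sketch scores candidate cluster counts with a
description length. It is not VISDA code. It assumes the commonly used
form MDL(K) = -log L + (K_a / 2) log N, where K_a is the number of free
parameters of a K-component Gaussian mixture with full covariance
matrices; the class and method names are hypothetical, and the
log-likelihood values in main are made up for illustration.

```java
/** Illustrative MDL-based choice of the number of clusters; not VISDA code. */
class MdlSelectionSketch {

    /**
     * Free parameters of a k-component, d-dimensional Gaussian mixture with
     * full covariances: (k - 1) mixing weights, k * d mean entries, and
     * k * d * (d + 1) / 2 covariance entries.
     */
    static int freeParameters(int k, int d) {
        return (k - 1) + k * d + k * d * (d + 1) / 2;
    }

    /** Description length: negative log-likelihood plus a complexity penalty. */
    static double mdl(double logLikelihood, int k, int d, int n) {
        return -logLikelihood + 0.5 * freeParameters(k, d) * Math.log(n);
    }

    /** Returns the candidate cluster count with the smallest description length. */
    static int selectK(double[] logLikelihoods, int[] ks, int d, int n) {
        int best = 0;
        for (int i = 1; i < ks.length; i++) {
            double candidate = mdl(logLikelihoods[i], ks[i], d, n);
            double incumbent = mdl(logLikelihoods[best], ks[best], d, n);
            if (candidate < incumbent) {
                best = i;
            }
        }
        return ks[best];
    }

    public static void main(String[] args) {
        // Made-up maximized log-likelihoods for K = 1..4, with n = 200 and d = 2.
        int[] ks = {1, 2, 3, 4};
        double[] logL = {-900.0, -780.0, -772.0, -769.0};
        for (int i = 0; i < ks.length; i++) {
            System.out.println("K=" + ks[i] + "  MDL=" + mdl(logL[i], ks[i], 2, 200));
        }
        System.out.println("selected K = " + selectK(logL, ks, 2, 200));
    }
}
```

The penalty term grows with the number of clusters, so a larger model
is preferred only when it improves the likelihood by more than the
cost of describing its extra parameters.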
PROJECT MANAGEMENT AND DELIVERABLES
The VISDA Team and Project Organization
CBIL, a partnership among Virginia Tech, Georgetown University, and
Catholic University of America, has been continuously funded by
various NIH and DOD grants for over 8 years. This proposed project
will leverage CBIL's extensive bioinformatics expertise in data
modeling and analysis; its experience in broad, systematic engineering
research and development using machine learning and pattern
recognition, statistical analysis, intelligent computing, and computer
graphics and visualization; and its close interactions with the
scientific community (e.g., partnerships with Children's National
Medical Center, the Food and Drug Administration, and Johns Hopkins
Medical Institutions).
Project Management
The VISDA team has the capability and experience to successfully perform
the VISDA project as outlined in the Statement of Work. Furthermore,
having participated in the caBIG meetings and been involved in
the process of caBIG development, we have a thorough understanding
of the caBIG philosophy, as well as the NCI computing environment
and technical approach. We will perform the specific activities
detailed in the Statement of Work, caBIG-ICR-10-05-01, and
present deliverables on the target dates. We will ensure our support
meets the following objectives:
· Develop a Functional Requirements and Design Specification document,
in collaboration with the Adopter Center(s) and the Architecture
Workspace
· Create a Risk Management Matrix for the project
· Document a Test Approach that ensures requirements are met
· Implement all VISDA modules and functionalities
· Execute the Test Approach
The project manager will manage the Cancer Center level project
activities using the General Contractor-provided online tools for
tracking project deliverables, i.e., the cancer Management Portal
(caMP). Regular and ad hoc communications will be scheduled to share
project information, including face-to-face meetings, teleconferences,
videoconferences, and use of the caBIG website and forums. The faculty
lead will generate monthly status reports and submit them to the
General Contractor.
The VISDA Team will develop a Risk Management Matrix to identify the
potential risks in the project and document the plan for managing
these risks. For version control and code archiving, we plan to use
CVS on the CBIL Server for day-to-day work, and we will submit monthly
progress to the caBIG CVS. For each deliverable, we will ensure
accuracy, clarity, and consistency with respect to requirements,
file editing, format, timeliness, user acceptance, and non-infringement
of copyright.
SOFTWARE DEVELOPMENT METHODOLOGY AND DESIGN
Our VISDA team specializes in the design and development of standalone
systems using the Java and C/C++ programming languages. Our
development approach favors open-source technologies for their control,
extensibility, and cost benefits. Most of our development work
will be done using the open-source Eclipse development environment,
while Java's cross-platform compatibility allows us to support a full
range of deployment platforms. The Eclipse Platform is designed for
building integrated development environments (IDEs) that can be
used to create applications as diverse as web sites, embedded
Java programs, C++ programs, and Enterprise JavaBeans.
The proposed project will be developed using the standard software
development life cycle (sometimes called the "waterfall model"). In
the design phase, the objectives are low complexity, modularity, and
maintainability. In the implementation phase, developers will
collaborate within a CVS environment for version control. In addition,
Ant will be used for software release and deployment. The overall
architecture of our proposed system is shown in Figure 1.
Copyright ©2004, Computational Bioinformatics and Bioimaging Laboratory
(CBIL), Alexandria Research Institute, Virginia Tech. Jointly
with The Catholic University of America.
Last Updated: 03/22/2004.