Data Core

Documentation and distribution of protocols, analysis pipelines, and workflows are essential to a well-coordinated and maximally productive research team. Their availability minimizes duplication of effort and maximizes the efficiency of new analyses. Availability of workflows also ensures that efforts by other groups to reproduce our work are productive and constructive. The overarching goal of the Data Management and Resource Dissemination Core is to ensure the seamless integration of existing and newly generated data, workflows, results, analyses, and additional resources, making them readily available to other groups and, eventually, to the broader research community.

The Host-Pathogen Network Portal (http://baliga.systemsbiology.net/portal/; Figure 1), hosted by the Institute for Systems Biology, serves as the gateway for our research on Transcriptional Regulatory Network (TRN) inference in host-pathogen associations. Previous efforts presented on the site include the preliminary TRN for Mycobacterium tuberculosis and activated macrophages (see Phase 0 Model in the modeling core). The site introduces each project with an accessible description of the organism, and provides comprehensive summaries of the network analyses, as well as detailed results on biclusters, identified regulatory motifs, etc. Associated with each project are links to access result files for further analysis, as well as a summary of the analysis methodology employed. Integration with other resources is achieved via links to relevant external sites.

Access to network inference tools used in our research, such as cMonkey (Reiss et al., 2006) and Inferelator (Bonneau et al., 2006), is also provided via the Host-Pathogen Network Portal. Lastly, the portal provides access to Gaggle (Shannon et al., 2006; Figure 2), a technology framework that enables seamless integration of diverse third-party software and databases. Gaggle is now a widely accepted technology for data integration and exploration in systems biology. Developers from the Microarray Experiment Viewer at Dana-Farber; the Bioconductor group at the Fred Hutchinson Cancer Research Center; the Systems Biology Workbench (SBW) and Systems Biology Markup Language (SBML) at Caltech and the University of Washington; the Human Proteome Folding (HPF) at NYU; and Cytoscape at UCSD, Sloan-Kettering, and the ISB, among others, have participated actively in the development of Gaggle, within which their tools are cleanly integrated.

The Data Dissemination Core will be responsible for maintaining a centralized data and results repository, and for collecting and curating analysis pipelines and workflows. The Data Dissemination Core will also facilitate the processing and analysis of data generated by the Omics Core, using computational infrastructure at Seattle BioMed and the Institute for Systems Biology, as well as computational cloud resources.

Data storage and integration will occur via web portals and shared-access in-house servers or cloud space, which will make our project-generated data available to all team members and, in due time, to the broader research community. Data will reside locally on shared-access servers; once released, the project-generated data will also be deposited in public data repositories (see C.3).

Analysis pipelines and workflows will also be made available to all working members of the project via secured sections of the Host-Pathogen Network Portal (HPNP). Once curated, they will also be made publicly available to ensure replicability of our results by outside groups.

The Data Dissemination Core will integrate the Host-Pathogen Network Portal (HPNP), the systemsimmunology.org website, and the TB Database to provide harmonized, mutually complementary gateways for the outputs from this project. HPNP will serve as the primary portal for the output of this project. We have opted to expand the existing resources rather than create separate portals, as a means to complement previous and ongoing efforts. The integrated websites will be the platforms through which information about program-developed resources is shared with the scientific community.