Using Cloud Computing Infrastructure with CloudBioLinux, CloudMan, and Galaxy

Enis Afgan1, Brad Chapman2, Margita Jadan3, Vedran Franke4, James Taylor1

1 Department of Biology and Department of Mathematics and Computer Science, Emory University, Atlanta, Georgia, 2 Harvard School of Public Health, Boston, Massachusetts, 3 Division of Materials Chemistry, Laboratory for Ichthyopathology–Biological Materials, Ruđer Bošković Institute (RBI), Zagreb, Croatia, 4 Department of Biology, University of Zagreb, Zagreb, Croatia
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 11.9
DOI:  10.1002/0471250953.bi1109s38
Online Posting Date:  June, 2012
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


Cloud computing has revolutionized availability and access to computing and storage resources, making it possible to provision a large computational infrastructure with only a few clicks in a Web browser. However, those resources are typically provided in the form of low‐level infrastructure components that need to be procured and configured before use. In this unit, we demonstrate how to utilize cloud computing resources to perform open‐ended bioinformatic analyses, with fully automated management of the underlying cloud infrastructure. By combining three projects, CloudBioLinux, CloudMan, and Galaxy, into a cohesive unit, we have enabled researchers to gain access to more than 100 preconfigured bioinformatics tools and gigabytes of reference genomes on top of the flexible cloud computing infrastructure. The protocol demonstrates how to set up the available infrastructure and how to use the tools via a graphical desktop interface, a parallel command‐line interface, and the Web‐based Galaxy interface. Curr. Protoc. Bioinform. 38:11.9.1‐11.9.20. © 2012 by John Wiley & Sons, Inc.

Keywords: accessible cloud computing; enabling bioinformatics analyses; turnkey computing system


Table of Contents

  • Introduction
  • Basic Protocol 1: An Introduction to Cloud Computing and Access to Cloud Resources via CloudBioLinux and CloudMan
  • Support Protocol 1: Access Your CloudBioLinux Instance Using Graphical Desktop Interface
  • Support Protocol 2: Access Your CloudBioLinux Instance Using the Command‐Line Method
  • Basic Protocol 2: Performing Visual Analysis with the CloudBioLinux Graphical User Interface
  • Basic Protocol 3: Using a CloudMan Cluster to Perform a Parallel Analysis
  • Basic Protocol 4: Using a Private, Scalable Galaxy Analysis Environment on Top of CloudMan
  • Commentary
  • Literature Cited
  • Figures


Basic Protocol 1: An Introduction to Cloud Computing and Access to Cloud Resources via CloudBioLinux and CloudMan

  Necessary Resources
  • Computer with Internet access and any up‐to‐date Web browser (Firefox, Safari, Opera, Chrome, Internet Explorer)
  • An AWS account with the Elastic Compute Cloud (EC2) and Simple Storage Service (S3) services enabled. To sign up for an account, visit and click the Sign Up Now link. Basic background information on the EC2 and S3 services can also be found at this page.


  • Figure 11.9.1 A snapshot of the BioCloudCentral portal showing all the form fields that are required to instantiate a CloudBioLinux and CloudMan instance.
  • Figure 11.9.2 BioCloudCentral monitor page showing the details about the started instance. This page provides a direct link to the new instance as well as an option to download user data. These user data can be used to restart this same instance from the AWS console by uploading it in the instance wizard request form.
  • Figure 11.9.3 The CloudMan Web console used to manage the cluster.
  • Figure 11.9.4 The initial CloudMan cluster configuration box. Here, it is possible to choose from the different cluster types supported by CloudMan. Depending on the cluster type, input may be required.
  • Figure 11.9.5 The main CloudMan interface used to control and manage the cloud cluster. Through this interface, it is possible to add and remove nodes from the cluster, monitor the status of cluster services, and manage cluster features such as auto‐scaling and instance sharing.
  • Figure 11.9.6 The NX client properties box specifying the IP address of the instance and the choice of GNOME desktop; both are required to establish a successful connection.
  • Figure 11.9.7 The remote CloudBioLinux graphical interface. Via this interface, it is possible to interact with the system as if it were a local workstation; standard Ubuntu menus and tools are available via the point‐and‐click interface.
  • Figure 11.9.8 ClustalX application on the remote instance with the sample dataset loaded.
  • Figure 11.9.9 A snapshot of the MrBayes block (file SP1_file2.nxs) that needs to be manually adjusted with the results from step 5 of . Append the edited block to the end of file SP1_file1.nxs and save the resulting file as SP1_file3.nxs.
  • Figure 11.9.10 Galaxy history view with the two RNA datasets transferred from modENCODE.
  • Figure 11.9.11 The Cuffcompare tool interface within Galaxy with all the options described in , step 7, selected.
  • Figure 11.9.12 List of Gene Ontology terms found over‐represented in the submitted dataset, ordered by their corresponding p values, as returned by the DAVID tool.
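
The file-assembly step described in the caption of Figure 11.9.9 (appending the edited block to SP1_file1.nxs and saving the result as SP1_file3.nxs) could be scripted roughly as follows. This is a minimal sketch, assuming the three file names from the caption sit in the working directory and that SP1_file2.nxs has already been edited by hand. The helper name `append_block` is hypothetical and is not part of the protocol, which describes doing this step manually.

```python
from pathlib import Path


def append_block(alignment_path, block_path, output_path):
    """Write the alignment file followed by the edited block to output_path.

    Hypothetical helper for illustration; the inputs are left unmodified.
    """
    alignment = Path(alignment_path).read_text()
    block = Path(block_path).read_text()
    # Make sure the block starts on its own line even if the
    # alignment file does not end with a newline.
    if alignment and not alignment.endswith("\n"):
        alignment += "\n"
    output = Path(output_path)
    output.write_text(alignment + block)
    return output


if __name__ == "__main__":
    # File names taken from the Figure 11.9.9 caption.
    if Path("SP1_file1.nxs").exists() and Path("SP1_file2.nxs").exists():
        append_block("SP1_file1.nxs", "SP1_file2.nxs", "SP1_file3.nxs")
```

Writing to a third file rather than editing SP1_file1.nxs in place keeps the original alignment intact, which matches the caption's instruction to save the combined result under a new name.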


Literature Cited

   Afgan, E., Baker, D., Coraor, N., Chapman, B., Nekrutenko, A., and Taylor, J. 2010. Galaxy CloudMan: Delivering cloud compute clusters. BMC Bioinformatics 11:S4.
   Afgan, E., Baker, D., Nekrutenko, A., and Taylor, J. 2011a. A Reference Model for Deploying Applications in Virtualized Environments. Concurrency and Computation: Practice and Experience. John Wiley & Sons, Hoboken, New Jersey.
   Afgan, E., Goecks, J., Baker, D., Coraor, N., Nekrutenko, A., and Taylor, J. 2011b. Galaxy: A gateway to tools in e‐Science. In Guide to e‐Science: Next Generation Scientific Research and Discovery (X. Yang, L. Wang, and W. Xie, eds.) pp. 145‐177. Springer, New York.
   DePristo, M., Banks, E., Poplin, R., Garimella, K., Maguire, J., Hartl, C., Philippakis, A.A., del Angel, G., Rivas, M.A., Hanna, M., McKenna, A., Fennell, T.J., Kernytsky, A.M., Sivachenko, A.Y., Cibulskis, K., Gabriel, S.B., Altshuler, D., and Daly, M.J. 2011. A framework for variation discovery and genotyping using next‐generation DNA sequencing data. Nat. Genet. 43:491‐498.
   Field, D., Tiwari, B., Booth, T., Houten, S., Swan, D., Bertrand, N., and Thurston, M. 2006. Open software for biologists: From famine to feast. Nat. Biotechnol. 24:801‐803.
   Goecks, J., Nekrutenko, A., and Taylor, J. 2010. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11:R86.
   Guindon, S. and Gascuel, O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696‐704.
   Huang, D.W., Sherman, B.T., and Lempicki, R.A. 2009a. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37:1‐13.
   Huang, D.W., Sherman, B.T., and Lempicki, R.A. 2009b. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat. Protoc. 4:44‐57.
   Huelsenbeck, J. and Ronquist, F. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754‐755.
   Larkin, M., Blackshields, G., Brown, N., Chenna, R., McGettigan, P., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., and Higgins, D.G. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23:2947‐2948.
   Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with Burrows‐Wheeler transform. Bioinformatics 25:1754‐1760.
   Posada, D. 2008. jModelTest: Phylogenetic model averaging. Mol. Biol. Evol. 25:1253‐1256.
   Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., and Pachter, L. 2010. Transcript assembly and quantification by RNA‐Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28:511‐515.
Key Reference
   Afgan, E., Baker, D., Coraor, N., Goto, H., Paul, I.M., Makova, K.D., Nekrutenko, A., and Taylor, J. 2011. Harnessing cloud computing with Galaxy cloud. Nat. Biotechnol. 29:972‐974.
  This article gives more detailed background on, and a description of, the features and functionality used in this unit.