Using Galaxy to Perform Large‐Scale Interactive Data Analyses

Jennifer Hillman‐Jackson (1), Dave Clements (2), Daniel Blankenberg (1), James Taylor (2), Anton Nekrutenko (1), Galaxy Team (2)

(1) Penn State University, University Park, Pennsylvania; (2) Emory University, Atlanta, Georgia
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 10.5
DOI:  10.1002/0471250953.bi1005s38
Online Posting Date:  June, 2012
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


Innovations in biomedical research technologies continue to provide experimental biologists with novel and increasingly large genomic and high‐throughput data resources to be analyzed. As creating and obtaining data has become easier, the key decision faced by many researchers is a practical one: where and how should an analysis be performed? Datasets are large, and analysis tool set‐up and use are riddled with complexities outside the scope of core research activities. The authors believe that Galaxy provides a powerful solution that simplifies data acquisition and analysis in an intuitive Web application, granting all researchers access to key informatics tools previously available only to computational specialists working in Unix‐based environments. We will demonstrate through a series of biomedically relevant protocols how Galaxy specifically brings together (1) data retrieval from public and private sources, for example, UCSC's Eukaryote and Microbial Genome Browsers, and (2) custom tools (wrapped Unix functions, format standardization/conversions, interval operations) and third‐party analysis tools. Curr. Protoc. Bioinform. 38:10.5.1‐10.5.47. © 2012 by John Wiley & Sons, Inc.

Keywords: Galaxy; comparative genomics; genomic alignments; Web application; genome variation

Table of Contents

  • Introduction
  • Basic Protocol 1: Finding Human Coding Exons with Highest SNP Density
  • Basic Protocol 2: Loading Data and Understanding Datatypes
  • Basic Protocol 3: Calling Peaks for ChIP‐seq Data
  • Basic Protocol 4: Compare Datasets Using Genomic Coordinates
  • Basic Protocol 5: Working with Multiple Sequence Alignments
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures