Using Galaxy to Perform Large‐Scale Interactive Data Analyses

Jennifer Hillman‐Jackson (1), Dave Clements (2), Daniel Blankenberg (1), James Taylor (2), Anton Nekrutenko (1), Galaxy Team (2)

(1) Penn State University, University Park, Pennsylvania; (2) Emory University, Atlanta, Georgia
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 10.5
DOI:  10.1002/0471250953.bi1005s38
Online Posting Date:  June, 2012
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library

Abstract

Innovations in biomedical research technologies continue to provide experimental biologists with novel and increasingly large genomic and high‐throughput data resources to be analyzed. As creating and obtaining data has become easier, the key decision faced by many researchers is a practical one: where and how should an analysis be performed? Datasets are large, and the set‐up and use of analysis tools are riddled with complexities outside the scope of core research activities. The authors believe that Galaxy provides a powerful solution that simplifies data acquisition and analysis in an intuitive Web application, granting all researchers access to key informatics tools previously available only to computational specialists working in Unix‐based environments. We will demonstrate through a series of biomedically relevant protocols how Galaxy specifically brings together (1) data retrieval from public and private sources, for example, UCSC's Eukaryote and Microbial Genome Browsers, (2) custom tools (wrapped Unix functions, format standardization/conversions, interval operations), and (3) third‐party analysis tools. Curr. Protoc. Bioinform. 38:10.5.1‐10.5.47. © 2012 by John Wiley & Sons, Inc.

Keywords: Galaxy; comparative genomics; genomic alignments; Web application; genome variation

     
 
Table of Contents

  • Introduction
  • Basic Protocol 1: Finding Human Coding Exons with Highest SNP Density
  • Basic Protocol 2: Loading Data and Understanding Datatypes
  • Basic Protocol 3: Calling Peaks for ChIP‐seq Data
  • Basic Protocol 4: Compare Datasets Using Genomic Coordinates
  • Basic Protocol 5: Working with Multiple Sequence Alignments
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
     
 
Figures

  • Figure 10.5.1 Galaxy interface contains four areas: the top bar, Tools panel (left column), detail panel (middle column), and History panel (right column). The top bar contains user account controls as well as help and contact links. The left panel lists the analysis tools and data sources available to the user. The middle panel displays interfaces for tools selected by the user. The right panel (the History panel) shows datasets and the results of analyses performed by the user. Pictured here are four history items in two different stages of completion: The two “FASTQ Groomer” items are yellow, meaning they are in progress, while the two “ungroomed” items are shown in green, meaning they have completed successfully. Every action by the user generates one or more new history items, which can then be used in subsequent analyses, downloaded, or visualized.
  • Figure 10.5.2 Uploading a list of protein‐coding exons (in BED format) of known human genes from the UCSC Table browser involves two steps (A and B) described in the text.
  • Figure 10.5.3 When a job is queued, a history item is initially gray. When a job is running, a history item is yellow. When a job is complete, a history item is green (successful) or red (error).
  • Figure 10.5.4 Close‐up of a Galaxy history item. Clicking on links and icons triggers the following events: eye = show a preview of the dataset in Galaxy's middle panel; pencil = open the metadata editor, an interface in the middle panel of the Galaxy screen for editing the attributes of the current history item (for example, to give the history item a more descriptive name or to change column assignments; see ); × = delete the item from the history (to undelete or permanently delete, use the history's Options menu and select “View deleted datasets”); “save” = copy the dataset to your computer; “i” = view details about this dataset in the center panel, including the dataset(s), if any, that it was generated from; “rerun” = display this tool in the center panel with the same settings it was run with, allowing this step to be rerun exactly or modified and rerun; “tags” = add free‐text tags to this dataset; “sticky note” = add a free‐text annotation. Finally, if the dataset can be visualized in a browser, links to the Galaxy Track Browser (stacked bars icon) and to UCSC, GeneTrack, Ensembl, and others will also be displayed.
  • Figure 10.5.5 The “Edit Attributes” form in the center panel. Each attribute can be modified and saved. In this figure, the system‐generated name has been copied to the Info field, and a short descriptive name entered in the Name field.
  • Figure 10.5.6 Data manipulation tools: Join (A), Count (B), Sort (C), Select first lines (D), and Compare two datasets (E).
  • Figure 10.5.7 Result of joining two interval datasets, highlighting a single exon that contains (overlaps with) 4 SNPs.
  • Figure 10.5.8 The data library ChIP‐Seq Mouse Example is imported from a library into a history.
  • Figure 10.5.9 Filezilla (http://filezilla‐project.org) is one example of a desktop FTP client that works well with Galaxy.
  • Figure 10.5.10 Get Data: Upload File tool. After a file has been uploaded using FTP, it appears in the “Files uploaded via FTP” section.
  • Figure 10.5.11 The Cut tool form and parameter options to select a single column (number 2, or “c2”) from a tab‐delimited dataset.
  • Figure 10.5.12 Edit Attributes form in center panel, showing default metadata attributes assigned for the Interval format dataset.
  • Figure 10.5.13 Diagram of the columns “Cut” from the Interval formatted dataset to create a BED formatted dataset. The result “BED6” format contains the six fields: chromosome, start (0‐based), end, name, score, and strand.
  • Figure 10.5.14 The Copy History form. The Source History on the left side of the center panel is the prior history from . The Destination History on the right side of the center panel is the new history for .
  • Figure 10.5.15 The FASTQ Groomer tool form in the center panel with input‐data specific quality score type option selected.
  • Figure 10.5.16 The Bowtie tool form in the center panel with appropriate options selected. The highlighted parameters are those that are configured differently than the tool's default options.
  • Figure 10.5.17 View of MACS tool form in the center panel with the appropriate options selected. The highlighted parameters are those that are configured differently than the tool's default options.
  • Figure 10.5.18 History result datasets and HTML report detail produced by the MACS run.
  • Figure 10.5.19 Graphical explanation showing input and output datasets for several interval operations, including (A) Overlapping intervals, (B) Overlapping pieces of intervals, (C) Intervals with no overlap, (D) Non‐overlapping pieces of intervals, (E) Concatenated intervals, (F) Merge.
  • Figure 10.5.20 Examples highlighting the functionality of coverage tools.
  • Figure 10.5.21 Graphical explanation of the (A) Complement, (B) Find clusters, and (C) Merge clusters interval tools.
  • Figure 10.5.22 Graphical explanation of genomic interval Join operations in Galaxy. (A) Only records that are joined, (B) All records of the first dataset, (C) Only records of second dataset, and (D) All records of both datasets. (E) Shows how all 4 variations are implemented on two small datasets.
  • Figure 10.5.23 Extract MAF blocks tool form highlighting a subset of the tool options.
  • Figure 10.5.24 Result file produced by the Extract MAF blocks tool. Data are the MAF alignment blocks corresponding to the query interval ranges.
  • Figure 10.5.25 MAF Coverage Stats tool form highlighting the tool options.
  • Figure 10.5.26 Result file produced by the MAF Coverage Stats tool using the option Coverage by Region. Data are counts for covered and not covered query bases that represent predicted evidence of conservation between the two species.
  • Figure 10.5.27 Result file produced by the MAF Coverage Stats tool using the option Summarize Coverage. The data have three columns: species, nucleotides, and coverage, where coverage is defined as the number of nucleotides divided by the total length of the provided intervals.
  • Figure 10.5.28 Result file produced by the Fetch Alignments: Stitch Gene blocks tool. Gapped bases are represented by the symbol “‐”. It is expected that some MAF blocks will contain results with sequence, sequence plus gaps, or gaps only. Large gaps in the query or target genome may be interpreted as a region that is not well conserved. Input type should be carefully evaluated when choosing a MAF (or any) tool. The complete absence of sequence in the input query (as in the case of a non‐coding RefSeq Gene, represented in the second block of this example) produces no results (sequence or gaps) in the output. As the Stitch Gene blocks tool is specifically designed to extract and stitch coding regions from the query input BED file, this is the correct result. To perform a function similar to Stitch Gene blocks for non‐coding genes, the tool Stitch MAF blocks would be a better choice.
  • Figure 10.5.29 Shared Data: Published Workflows on the Main Galaxy instance at usegalaxy.org with the features for an individual workflow highlighted: Name (of workflow), Annotation (free text), Owner (Galaxy user name), Community Rating, Community tags (searchable keywords), Last Updated.
  • Figure 10.5.30 Detailed view of an individual workflow's steps with the “Import workflow” link highlighted.
  • Figure 10.5.31 Your workflows page listing the newly imported workflow with the action menu highlighted. Menu selections: Edit, Run, Share or Publish, Download or Export, Clone, Rename, and Delete.
  • Figure 10.5.32 A workflow that is selected to Run is displayed as a form in the center panel. User‐specified input selections from the current history are made by using a step's pull‐down menu, as highlighted.
  • Figure 10.5.33 Confirmation display when a workflow is executed (started) successfully. As the workflow is run, individual datasets produced by the workflow steps/jobs will be independently colored as gray (waiting to run), yellow (running), green (successful), and red (error). Note that all steps in the workflow are listed, including steps that produce hidden datasets.
  • Figure 10.5.34 Tools can sometimes produce datasets that no longer should be assigned to the current (or any single) reference genome. Use the Edit Attributes form to assign/reassign a new reference genome (see Figure ) or to unassign a reference genome (as shown) by selecting the menu title (interpreted as a “null” database) from the list.
  • Figure 10.5.35 Filter tool form showing options, with the filter expression box highlighted containing a free text string. This specific filter string is designed to remove species rows that have no conserved genome sequence in the output of the Fetch Alignments: Stitch Gene blocks tool.
  • Figure 10.5.36 Select tool form showing options, with the select expression box highlighted containing a free text string. This specific select string is designed to extract lines from a file that start with “rheMac.”.
  • Figure 10.5.37 Tools can sometimes produce datasets that no longer should be assigned to the current (or any single) reference genome. Use the Edit Attributes form to assign/reassign a reference genome (as shown, in this case rheMac2) or to unassign a reference genome (see Figure ).
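
The interval join summarized in Figure 10.5.7 (an exon overlapping several SNPs) and the 0‐based BED start noted in Figure 10.5.13 can be illustrated with a minimal Python sketch. It is not how Galaxy's Join, Count, or Sort tools are implemented; it assumes BED‐style half‐open coordinates (0‐based start, exclusive end) and exons of non‐zero length. The names Interval, overlaps, snp_counts, and rank_by_density, and all coordinates in the toy data, are hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical record type: BED-style, 0-based start, exclusive end (assumption).
Interval = namedtuple("Interval", "chrom start end name")


def overlaps(a, b):
    """True when two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def snp_counts(exons, snps):
    """Count, for every exon, how many SNP intervals overlap it."""
    counts = {exon.name: 0 for exon in exons}
    for exon in exons:
        for snp in snps:
            if overlaps(exon, snp):
                counts[exon.name] += 1
    return counts


def rank_by_density(exons, snps):
    """Return (name, SNP count, SNPs per base) tuples, densest exon first."""
    counts = snp_counts(exons, snps)
    ranked = [
        (e.name, counts[e.name], counts[e.name] / (e.end - e.start))
        for e in exons
    ]
    return sorted(ranked, key=lambda row: row[2], reverse=True)


# Toy data, invented for this sketch.
exons = [
    Interval("chr1", 100, 200, "exonA"),
    Interval("chr1", 300, 350, "exonB"),
    Interval("chr2", 100, 200, "exonC"),
]
snps = [
    Interval("chr1", 110, 111, "rs1"),
    Interval("chr1", 150, 151, "rs2"),
    Interval("chr1", 199, 200, "rs3"),  # last base of exonA: counted
    Interval("chr1", 200, 201, "rs4"),  # first base after exonA: not counted
    Interval("chr1", 310, 311, "rs5"),
    Interval("chr1", 349, 350, "rs6"),
    Interval("chr2", 250, 251, "rs7"),  # outside exonC
]

for name, count, density in rank_by_density(exons, snps):
    print(name, count, round(density, 3))
```

The nested loop is quadratic and is meant only to make the overlap rule explicit; for genome‐scale datasets an indexed or sorted sweep would be needed.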

Literature Cited
   Birney, E., Andrews, D., Bevan, P., Caccamo, M., Cameron, G., Chen, Y., Clarke, L., Coates, G., Cox, T., Cuff, J., Curwen, V., Cutts, T., Down, T., Durbin, R., Eyras, E., Fernandez‐Suarez, X.M., Gane, P., Gibbins, B., Gilbert, J., Hammond, M., Hotz, H., Iyer, V., Kahari, A., Jekosch, K., Kasprzyk, A., Keefe, D., Keenan, S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E., Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Ureta‐Vidal, A., Woodwark, C., Clamp, M., and Hubbard, T. 2004. Ensembl 2004. Nucleic Acids Res. 32:D468‐D470.
   Blankenberg, D., Taylor, J., Schenck, I., He, J., Zhang, Y., Ghent, M., Veeraraghavan, N., Albert, I., Miller, W., Makova, K.D., Hardison, R.C., and Nekrutenko, A. 2007. A framework for collaborative analysis of ENCODE data: Making large‐scale analyses biologist‐friendly. Genome Res. 17:960‐964.
   Blankenberg, D., Gordon, A., Von Kuster, G., Coraor, N., Taylor, J., Nekrutenko, A.; Galaxy Team. 2010. Manipulation of FASTQ data with Galaxy. Bioinformatics 26:1783‐1785.
   Blankenberg, D., Taylor, J., Nekrutenko, A.; Galaxy Team. 2011. Making whole genome multiple alignments usable for biologists. Bioinformatics 27:2426‐2428.
   Fujita, P.A., Rhead, B., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Cline, M.S., Goldman, M., Barber, G.P., Clawson, H., Coelho, A., Diekhans, M., Dreszer, T.R., Giardine, B.M., Harte, R.A., Hillman‐Jackson, J., Hsu, F., Kirkup, V., Kuhn, R.M., Learned, K., Li, C.H., Meyer, L.R., Pohl, A., Raney, B.J., Rosenbloom, K.R., Smith, K.E., Haussler, D., and Kent, W.J. 2011. The UCSC Genome Browser database: Update 2011. Nucleic Acids Res. 39:D876‐D882.
   Giardine, B., Riemer, C., Hardison, R.C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J., Miller, W., Kent, W.J., and Nekrutenko, A. 2005. Galaxy: A platform for interactive large‐scale genome analysis. Genome Res. 15:1451‐1455.
   Goecks, J., Nekrutenko, A., Taylor, J.; Galaxy Team. 2010. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11:R86.
   Gupta, R., Bhattacharyya, A., Agosto‐Perez, F.J., Wickramasinghe, P., and Davuluri, R.V. 2011. MPromDb update 2010: An integrated resource for annotation and visualization of mammalian gene promoters and ChIP‐seq experimental data. Nucleic Acids Res. 39:D92‐D97.
   Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., Weber, R.J., Haussler, D., Kent, W.J.; University of California Santa Cruz. 2003. The UCSC Genome Browser Database. Nucleic Acids Res. 31:51‐54.
   Karolchik, D., Hinrichs, A.S., Furey, T.S., Roskin, K.M., Sugnet, C.W., Haussler, D., and Kent, W.J. 2004. The UCSC table browser data retrieval tool. Nucleic Acids Res. 32:D493‐D496.
   Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. 2009. Ultrafast and memory‐efficient alignment of short DNA sequences to the human genome. Genome Biol. 10:R25.
   Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup. 2009. The sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25:2078‐2079.
   Maglott, D., Ostell, J., Pruitt, K.D., and Tatusova, T. 2005. Entrez gene: Gene‐centered information at NCBI. Nucleic Acids Res. 33:D54‐D58.
   Park, P.J. 2009. ChIP‐seq: Advantages and challenges of a maturing technology. Nat. Rev. Genet. 10:669‐680.
   Pepke, S., Wold, B., and Mortazavi, A. 2009. Computation for ChIP‐seq and RNA‐seq studies. Nat. Methods 6:S22‐S32.
   Phillips, J.E. and Corces, V.G. 2009. CTCF: Master weaver of the genome. Cell 137:1194‐1211.
   Pruitt, K.D., Tatusova, T., and Maglott, D.R. 2005. NCBI Reference Sequence (RefSeq): A curated non‐redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33:D501‐D504.
   Raney, B.J., Cline, M.S., Rosenbloom, K.R., Dreszer, T.R., Learned, K., Barber, G.P., Meyer, L.R., Sloan, C.A., Malladi, V.S., Roskin, K.M., Suh, B.B., Hinrichs, A.S., Clawson, H., Zweig, A.S., Kirkup, V., Fujita, P.A., Rhead, B., Smith, K.E., Pohl, A., Kuhn, R.M., Karolchik, D., Haussler, D., and Kent, W.J. 2011. ENCODE whole‐genome data in the UCSC genome browser (2011 update). Nucleic Acids Res. 39:D871‐D875.
   Rosenbloom, K.R., Dreszer, T.R., Pheasant, M., Barber, G.P., Meyer, L.R., Pohl, A., Raney, B.J., Wang, T., Hinrichs, A.S., Zweig, A.S., Fujita, P.A., Learned, K., Rhead, B., Smith, K.E., Kuhn, R.M., Karolchik, D., Haussler, D., and Kent, W.J. 2009. ENCODE whole‐genome data in the UCSC Genome Browser. Nucleic Acids Res. 38:D620‐D625.
   Schneider, K.L., Pollard, K.S., Baertsch, R., Pohl, A., and Lowe, T.M. 2006. The UCSC Archaeal Genome Browser. Nucleic Acids Res. 34:D407‐D410.
   Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and Sirotkin, K. 2001. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 29:308‐311.
   Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown, M., Li, W., and Liu, X.S. 2008. Model‐based analysis of ChIP‐Seq (MACS). Genome Biol. 9:R137.