Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration

Helga Thorvaldsdóttir; James T. Robinson; Jill P. Mesirov

doi:10.1093/bib/bbs017

Date: 2012-04-19

Data visualization is an essential component of genomic data analysis. However, the size and diversity of the data sets produced by today’s sequencing and array-based profiling methods present major challenges to visualization tools. The Integrative Genomics Viewer (IGV) is a high-performance viewer that efficiently handles large heterogeneous data sets, while providing a smooth and intuitive user experience at all levels of genome resolution. A key characteristic of IGV is its focus on the integrative nature of genomic studies, with support for both array-based and next-generation sequencing data, and the integration of clinical and phenotypic data. Although IGV is often used to view genomic data from public sources, its primary emphasis is to support researchers who wish to visualize and explore their own data sets or those from colleagues. To that end, IGV supports flexible loading of local and remote data sets, and is optimized to provide high-performance data visualization and exploration on standard desktop systems. IGV is freely available for download from http://www.broadinstitute.org/igv, under a GNU LGPL open-source license.

Keywords: visualization; next-generation sequencing; NGS; genome viewer; IGV

Corresponding author. James T. Robinson, Broad Institute, 7 Cambridge Center (301B-5057), Cambridge, MA 02142, USA. Tel.: +617-714-7491; Fax: +617-714-8991; E-mail: [email protected]

Helga Thorvaldsdóttir is a senior software project manager in the Cancer Program at the Broad Institute of MIT and Harvard.

James T. Robinson is a principal software engineer in the Cancer Program at the Broad Institute of MIT and Harvard, where he has worked on omics visualization software since 2006.

Jill P. Mesirov is Chief Informatics Officer of the Broad Institute of MIT and Harvard, where she directs the Computational Biology and Bioinformatics Organization, and a member of the Koch Institute for Integrative Cancer Research at MIT.

© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION

Next-generation sequencing (NGS) and array-based profiling methods now generate large quantities of diverse types of genomic data and are enabling researchers to study the genome at unprecedented resolution. Analysis of these large, diverse data sets has become the rate-limiting step in many studies. Although much of the analysis can be automated, human interpretation and judgment, supported by rapid and intuitive visualization, is essential for gaining insight and elucidating complex biological relationships. We describe the Integrative Genomics Viewer (IGV) [1], a high-performance desktop tool for interactive visual exploration of diverse genomic data. Even for very large data sets, IGV supports real-time interaction at all scales of genome resolution, from whole genome to base pairs. IGV is designed to be accessible to a wide range of users, including bench biologists and bioinformaticians. While new users appreciate the user-friendly and intuitive interface, more experienced users can also take advantage of the many advanced features and preferences.

There are a number of other desktop applications available for visualization of genomic data, particularly NGS data, including Tablet [2], BamView [3], Savant [4] and Artemis [5]. In comparison to these tools, a notable characteristic of IGV is its breadth. IGV was developed to support a diverse range of data types, including NGS and array-based platforms, such as expression and copy-number arrays. These different data types can be flexibly integrated, and combined with clinical and other sample metadata to dynamically group, sort and filter data sets. Another distinguishing feature of IGV is the ability to view data in multiple genomic regions simultaneously in adjacent panels, for example to view correlated events in distal regions.

HISTORICAL BACKGROUND

IGV’s direction and focus are driven by our collaborations with investigators from a wide variety of large and small biomedical research projects. IGV development started in 2007 in response to a need by The Cancer Genome Atlas (TCGA) [6] project to visualize integrated copy number, expression, mutation and clinical data. The size of these data sets posed a challenge to desktop visualization tools available at the time. To handle these large data sets in IGV, we introduced a data-loading scheme that includes indexing of data files as well as on-demand loading. This strategy enabled viewing and interactive exploration of up to several hundred samples from the largest copy-number array available at the time, comprising approximately half a million probes.

The next major application of IGV, visualization of ChIP-seq data from whole-genome sequences to find de novo long intergenic noncoding RNAs (lincRNAs) [7], provided motivation to extend the data size limit even further. We developed a binary multiresolution tiled data format to support data sets of up to hundreds of gigabytes in size. At this time, we also added support for viewing genome annotations. In August 2008, we deployed the first public release of the IGV software and web site (www.broadinstitute.org/igv).

Support for visualization of short-read sequence alignments followed in May 2009. In this effort, we collaborated with members of the 1000 Genomes Project [8] involved in development of the SAM/BAM alignment format [9]. Another collaboration with the 1000 Genomes Project led to IGV support for visualizing genome variation data in the VCF format [10] in the IGV 2.0 release of May 2011.

In IGV 2.0, we also introduced a flexible ‘multilocus’ mode for viewing multiple genomic regions side by side, and thereby eliminated the restriction of viewing only a single contiguous genomic region at a time. This new view mode is particularly useful when investigators seek to test hypotheses and draw inferences based on related events that are widely separated in genomic coordinates. IGV’s multilocus views are described further in the ‘Features’ section below.

METHODS AND TECHNOLOGIES

IGV is a desktop application written in the Java programming language and runs on all major platforms (Windows, Mac and Linux). Below, we describe in more detail some components of the IGV implementation, including our data-tiling approach for supporting large data sets and IGV’s support for different categories of file formats. We also provide a high-level overview of IGV’s software architecture.

Data tiling

A primary design goal of IGV is to support interactive exploration of large-scale genomic data sets on standard desktop computers. This poses a difficult challenge as NGS and recent array-based technologies can generate data sets from gigabytes to terabytes in size. Simply loading these entire data sets into memory is not a viable option. In addition, researchers search for meaningful events at many different genomic resolution scales, from whole genome to individual base pairs. The problem is analogous to that faced by interactive geographical mapping tools, which provide views of very large geographical databases over many resolution scales. Tools such as Google Maps solve this problem by precomputing images representing sections of maps at multiple resolution scales, and providing fast access to the images as needed to construct a view. We considered such an approach for IGV, based on precomputed images of genomic data. However, millions of images would be required to support all resolution scales for a large genome, thus making image management difficult without introducing the requirement of a database. Furthermore, the representation of the data would be fixed when the images are computed, making it difficult to provide interactive graphing options. Consequently, we adopted a different approach that is based on precomputing summarizations of data at multiple resolution scales, with rendering of the data deferred to runtime. We refer to this as ‘data tiling’, to distinguish it from ‘image tiling’.

IGV’s data tiling implementation is built on a pyramidal data structure that can be described as follows. For each resolution scale (‘zoom level’), the genome is divided into tiles that correspond to the region viewable on the screen of a typical user display. The first zoom level consists of a single tile that covers the entire genome. The next zoom level contains a single tile for each chromosome. The number of tiles then increases by a factor of 2 for each level, so the next zoom level consists of two tiles per chromosome, then four, etc. Each tile is subdivided into ‘bins’, with the width of a bin chosen to correspond to the approximate genomic width represented by a screen pixel at that resolution scale. The value of each bin is calculated from the underlying genomic data with a summary statistic, such as ‘mean’, ‘median’ or ‘maximum’. By organizing data in this way, tile sizes for each zoom level are constant and small, containing only the data needed to render the view at the resolution supported by the screen display. Hence, a single tile at the lowest resolution, which spans the entire genome, has the same memory footprint as a tile at a high-resolution zoom level, which might span only a few kilobases. As the user moves across the genome and through zoom levels, IGV only retrieves the tiles required to support the current view and discards tiles no longer in view to free memory. This method supports browsing very large data sets at all resolution scales with minimal memory requirements.

For large genomes, precomputing tiles for all zoom levels would be inordinately expensive with respect to disk space. For example, the human genome requires approximately 23 zoom levels, or on the order of 2^23 tiles, to cover the whole genome to base pair resolution.

In practice, IGV uses a hybrid approach, combining precomputed lower-level zoom levels with high-resolution tiles computed on the fly. This is possible as the high-resolution tiles cover relatively small portions of the genome. The number of precomputed zoom levels required to achieve good performance varies by data density and genome size. In our experience, seven levels give acceptable performance for even the highest density human genome data.

File formats

To support the multiresolution data model described earlier, we developed a corresponding file format. The ‘tiled data format’, or TDF, stores the pyramidal data tile structure and provides fast access to individual tiles. TDF files can be created using the auxiliary package ‘igvtools’. We note however that IGV does not require conversion to TDF before data can be loaded. In fact, IGV supports a variety of genomic file formats, which can be divided into three categories: (i) nonindexed, (ii) indexed and (iii) multiresolution formats:

(i) Nonindexed formats include flat file formats such as GFF [11], BED [12] and WIG [13]. Files in these formats must be read in their entirety and are only suitable for relatively small data sets.
(ii) Indexed formats include BAM and Goby [14] for sequence alignments. Additionally, many tab-delimited feature formats can be converted to an indexed file using Tabix [15] or ‘igvtools’. Indexed formats provide rapid and efficient access to subsets of the data for display, but only when zoomed in to a sufficiently small genomic region. Zooming out requires ever-larger portions of the file to be loaded. Thus, indexed formats can efficiently support views only for a limited range of resolution scales. This range depends on the genomic density of the underlying data and can span tens of kilobases for NGS alignments, hundreds of megabases for typical variant (SNP) files, or whole chromosomes for sparse feature files. IGV uses heuristics to determine a suitable upper limit on the genomic range that can be loaded quickly with a reasonable memory footprint. If zoomed out beyond this limit, the data are not loaded.
(iii) Multiresolution formats, such as our TDF described earlier and the bigWig and bigBed formats [16], include both an index for the raw data, and precomputed indexed summary data for lower resolution (zoomed out) scales. Multiresolution formats can efficiently support views at any resolution scale.

Software architecture

The IGV software structure is designed around a core set of interfaces and extendable classes. These components can be separated into three conceptual layers as illustrated in Figure 1: (i) a top-level application layer, (ii) a data layer and (iii) a stream layer. These are described in more detail below:

(i) The application layer includes the main IGV window and user interface elements, along with controllers for user interaction events. It also contains representations of genomic features and data. IGV displays these in horizontal rows known as ‘tracks’. Tracks are displayed in a data panel, which is implemented as a class derived from Java Swing components. The data panel is responsible for the coordination of track layout and rendering, and managing mouse events. It handles certain globally shared mouse actions, such as zooming and panning, and delegates other events to the objects representing tracks. Tracks are responsible for handling these events, as well as requesting features as needed from the data layer and drawing these features on the data panel. Most track implementations delegate the drawing task to a renderer object. Renderers are designed to be pluggable, and can be swapped at runtime, for example to switch graph types in response to a menu action.
(ii) The data layer reads and parses the different genomic file formats and supplies the application layer with data tiles on demand. It also implements caching of tiles, for improved efficiency if a previously visited genomic region is requested again.
(iii) The stream layer is responsible for supporting random access to sections of files accessed by any of the protocols supported by IGV, i.e. local file, HTTP, HTTPS or FTP. Random file access is necessary to take advantage of the indexed and multiresolution file formats. For local files, this is straightforward using Java’s RandomAccessFile class, or alternatively positionable file channels. Remote files presented a challenge, as there are no Java built-in functions or libraries that support this access pattern. Initially, we solved this problem using a web service. However, this approach was not ideal, as users who wished to host IGV files were also required to install and run the web service on their systems. Consequently, we designed and implemented a set of classes to provide a uniform interface for random file access for all the protocols. IGV’s implementation for the HTTP protocol uses byte range requests from the standard HTTP specification. For the FTP protocol, IGV uses the mechanism for restarting downloads that is supported by most FTP servers via the ‘REST’ command.

Figure 1: IGV class diagram, illustrating the IGV software structure.

FEATURES

IGV is a desktop application for the visualization and interactive exploration of genomic data in the context of a reference genome. A key characteristic of IGV is its focus on the integrative nature of genomic studies. It allows investigators to flexibly visualize many different types of data together and, importantly, also integrate these data with the display of sample attribute information such as clinical and phenotypic information. To support interactive exploration of data, IGV provides direct manipulation navigation in the style of Google Maps. For instance, you click and drag to pan the view across the genome and double-click on a region to zoom in for a more detailed view. It supports real-time interaction at all scales of genome resolution, from whole genome to base pairs, even for very large data sets. The Broad IGV data server hosts many genome annotation files and data sets from a variety of public sources (including TCGA, the 1000 Genomes Project, the ENCODE Project [17] and others). However, the primary emphasis is on supporting biomedical researchers who wish to load, visualize and explore their own data sets aligned to the selected reference genome. Researchers can also make their data sets available to others for viewing in IGV, sharing them with colleagues or the community at large.

Launching IGV

IGV is available on all platforms that support Java. Installation and launching are accomplished with a click of a button on the IGV web site at www.broadinstitute.org/igv. Alternatively, users can download a ZIP archive to install the application locally. IGV can also be launched from links embedded in web pages or other documents.

The IGV window

The IGV window is divided into a number of controls and panels as illustrated in Figure 2. At the top is a command bar with controls for selecting a reference genome, navigating and defining regions of interest. Just below the command bar is a header panel with an ideogram representation of the currently viewed chromosome, along with a genome coordinate ruler that indicates the size of the region in view. The ideogram also displays a red rectangle that outlines the region in view. The remainder of the window is divided into one or more data panels and an attribute panel. Data are mapped to the genomic coordinates of the reference genome and are displayed in the data panels as horizontal rows called ‘tracks’. Each track typically represents one sample, experiment or genomic annotation. If any sample or track attributes have been loaded, they are displayed as a color-coded matrix in the attribute panel as illustrated in Figure 3. Each column in the matrix corresponds to an attribute, and a track’s attribute values are displayed as a row of colored cells adjacent to the track.

Figure 2: The IGV application window.

Reference genome

A reference genome must be selected before loading data. IGV provides dozens of hosted reference genomes to choose from, but also provides the option of importing others from sequence data. The minimal requirement for importing a genome is a FASTA file containing chromosome or contig sequences. Other genome information is optional, including: (i) cytoband information for the chromosome ideogram in the IGV window, (ii) annotations defining the features displayed in the gene track for the genome and (iii) chromosome alias information that defines synonyms for the sequence names defined in the FASTA file, e.g. ‘1’ for ‘chr1’. We note that this option is intended primarily for finished assemblies. IGV is not designed for visualization of unassembled genomes.

When the view is sufficiently zoomed in, IGV displays the reference genome sequence as a separate track in the data panel. Depending on the zoom level, the nucleotides are represented as colored bars or letters. By default, the forward strand is displayed. Clicking on a strand indicator for the track toggles the strand direction. Another option enables the display of three-frame translation of codons for the current strand.

Loading data

IGV was designed to accommodate any data that can be mapped to genomic coordinates. It currently supports more than 30 different file formats, including many of the common formats for genome annotations, sequence alignments, variant calls and microarray data. Importantly, users can also load metadata, such as clinical, phenotypic or other attribute information, to annotate the genomic data. Data files are loaded into IGV by: (i) using the built-in file browser to select a file on the local file system, (ii) entering the URL of a file accessible over a network via HTTP or FTP, (iii) entering the URL of a Distributed Annotation System (DAS) feature source [18] or (iv) selecting entries from the ‘File > Load from Server’ menu. By default, the menu provides access to data and annotation files that are hosted at the Broad Institute for viewing in IGV. This can easily be changed to point to any set of web-accessible files. For example, the menu could provide access to shared files on a research project’s central server.

Viewing data

IGV supports simultaneous viewing of multiple data sets, with the same or different types of data. A track’s default appearance and available view options will vary depending on the data type. The following sections describe some of the commonly viewed types.
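
The local-file and HTTP loading routes listed under ‘Loading data’ both rely on the random access that the stream layer is described as providing. The following is a minimal sketch of that idea in Python, not IGV’s Java implementation: the names `range_header` and `read_range` are hypothetical, and the FTP ‘REST’ route is omitted for brevity. It assumes the remote server honours byte range requests.

```python
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_range(location: str, offset: int, length: int) -> bytes:
    """Read a byte range from a local path or an HTTP(S) URL (hypothetical helper)."""
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        request = urllib.request.Request(
            location, headers=range_header(offset, length)
        )
        with urllib.request.urlopen(request) as response:
            # 206 Partial Content means the server honoured the range.
            if response.status != 206:
                raise IOError("server did not honour the Range header")
            return response.read()
    # Anything else is treated as a local file path: seek, then read.
    with open(location, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


# Usage: write a small local file and read 3 bytes starting at offset 2.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
demo_bytes = read_range(tmp.name, 2, 3)
os.remove(tmp.name)
print(demo_bytes)  # prints b'234'
```

Callers see one function regardless of where the file lives, which is the property that lets an indexed or multiresolution reader fetch only the bytes its index points to.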
Figure 3: The attribute panel displays a color-coded matrix of phenotypic and clinical data. Clicking on a column header will sort the tracks by the corresponding attribute.

NGS data

IGV includes a large number of specialized features for exploring NGS read alignments, including features tailored for variant visualization and validation, splicing of RNA transcripts and methylation from bisulfite sequencing. IGV supports several read alignment file formats, including SAM, BAM and Goby. When more than one file is loaded into IGV, it can display the reads from each file in a separate panel or merge them together as if they came from the same file.

Due to the magnitude of the data stored in NGS alignment files, IGV displays varying levels of detail depending on the zoom level. This is necessary both for application performance, as described in the ‘Methods and Technologies’ section above, and to help investigators make sense of the massive amount of data. When viewing a whole chromosome it is not useful to display all the reads, as individual reads are not distinguishable in the view at this level of zoom. Therefore, when zoomed all the way out, only a bar chart of the read coverage is displayed. IGV provides tools to precompute this coverage. When zoomed in past a user-settable visibility threshold (by default, 30 kb), the individual read alignments come into view and are displayed as horizontal bars. At this zoom level, IGV dynamically computes the read coverage in the viewed region. Zooming further reveals the individual read bases.

IGV uses color and transparency to highlight interesting events in the alignment data and to visually deemphasize others: in the coverage chart, at the read level and for individual bases. Various properties can be used to change the read color scheme on the fly, and to interactively group, sort and filter the reads. These features can be used alone or in combination, using one or more properties. The properties include sample identifier, strand, read group, mapping quality, base call at a particular position, pair distance and orientation, custom tags and more.

Zooming in past the alignment visibility threshold will also add color bars to the gray coverage track at locations where a large number of read bases mismatch the reference, which is helpful in identifying putative SNPs. The relative size and color of these bars indicate the allele frequency of each base at that location. An example of this can be seen in Figure 4A. In Figure 4B, the view has been zoomed in further to show the reads and individual bases in the region of one of the putative SNPs identified in the coverage track. Individual read bases that match the reference genome are displayed in the same color as the read, while mismatches are color-coded by the called base and are assigned a transparency value proportional to the base call quality (phred) score. This is the default coloring scheme for read bases, and has the effect of emphasizing high-quality mismatches. In this example, the read alignments have been colored and sorted by read strand. Visual inspection quickly reveals a number of factors that indicate this is not a true SNP. First, the reads harboring the putative SNP clearly have a large number of additional mismatched bases. Also, it is suspicious that all mismatches occur on the negative read strand, and that the mismatches tend to occur towards the end of the read.

Figure 4: Read alignment views at 20 kb and base pair resolution. IGV displays varying levels of detail depending on the zoom level, and uses color and transparency to highlight interesting events in the data. (A) Reads are summarized as a coverage plot. Positions with a significant number of mismatches with respect to the reference are highlighted with color bars indicative of both the presence of mismatches and the allele frequency. (B) Individual base mismatches are displayed with alpha transparency proportional to quality. In this example, the reads have been sorted and colored by strand.

A number of options are available for paired-end alignments to help elucidate structural variants, such as deletions, insertions and rearrangements. To highlight potential inter-chromosomal rearrangements, alignments whose mate falls on a different chromosome are assigned a color indicating the mate chromosome. This makes it easy to distinguish potential rearrangements from noise caused by misalignments, as rearrangements will appear as a pileup of reads that are consistent in color and orientation. Intra-chromosomal events, such as insertions, deletions, inversions and duplications, can affect both the genomic distance between the mate alignments and their orientation. To highlight these events, IGV samples the alignment file to dynamically determine the expected distance and orientations, and then uses color to flag aberrant pairs.

Another coloring scheme is used to view bisulfite sequencing data. In this mode, the rules for what constitutes a mismatch to the reference genome are adjusted to account for the expected cytosine to uracil conversions. Figure 5 illustrates a view of Whole-Genome Bisulfite Sequencing (WGBS) data from a colorectal tumor and a matched normal sample [19]. Red indicates hypermethylated sites, and blue indicates hypomethylation.

Figure 5: IGV bisulfite sequencing view. (A) Two views of the IGF2/H19 Imprinting Control Region (ICR), illustrating allele-specific methylation of CTCF binding sites. The top view shows a 13-kb region of ChIP-seq histone marks from the ENCODE normal epithelial tissue (HMEC) cell line. The second view shows WGBS read alignments from normal colonic mucosa [19], zoomed in to 75 bp. CpG dinucleotides are shown as blue (unmethylated) and red (methylated) squares. A heterozygous C/T SNP is also apparent, and the T allele is overwhelmingly associated with reads that have methylated CpGs (from the paternal chromosome). (B) The enhancer region surrounding exons 2 and 3 of the B3GNTL1 gene is apparent from the ENCODE tracks showing characteristic enhancer histone marks in a normal epithelial (HMEC) cell line. The bisulfite sequencing view of the read alignments shows that this enhancer is methylated (red, lighter) in normal colon mucosa, but almost completely unmethylated (blue, darker) in the matched colon tumor sample [19]. The cancer-specific de-methylation of this enhancer is consistent with the upregulation of the B3GNTL1 transcript in the tumor.

Figure 6 shows how IGV displays RNA-seq read alignments, connecting segments of reads that are split across exons with thin horizontal lines. This example demonstrates several RNA-seq data tracks for normal tissue samples from two different organs, including tracks for coverage, junctions, transcripts and the read alignments. The junction tracks highlight alternative splicing, also visible in the alignments themselves.

Figure 6: Visualization of RNA-seq data from heart and liver tissue samples. Each panel includes tracks for total coverage, junction coverage, predicted transcripts and read alignments. Reads that span junctions are connected with thin blue lines. In the junction track, the height of each arc is proportional to the total number of reads spanning the junction. There is clear evidence of alternative splicing between the two tissues.

Variant calls

IGV provides extended support for viewing variants stored in the VCF format. This format allows for the encoding of variant calls (SNPs, indels and genomic rearrangements) as well as the supporting genotype information for individual samples. Samples can also be annotated with attribute information, including pedigree and family information. IGV uses these annotations to group, sort and filter samples, for example, to group samples by pedigree or population group.

Copy number and expression data

Copy number and expression data can be loaded from a variety of file formats. These data types are displayed as heatmaps by default. Heatmaps are very space efficient in comparison to bar charts and other graph types, as the height of each track can be reduced to a single pixel. This is important for these data types as experiments are often performed on high-throughput arrays and it is not uncommon to view hundreds or thousands of samples simultaneously.

Expression data require special treatment as the expression values are usually not specified in genomic coordinates, but rather are associated with gene names or chip probe identifiers. These data must be mapped to genomic locations prior to display and IGV provides several options for this step:

(i) Automatically map data values associated with gene names, based on information in the gene track of the reference genome.
(ii) Automatically map data values that are associated with probes and probe-sets for many common platforms and chips, such as those from Affymetrix, Agilent and Illumina, based on files published by the vendors. By default, IGV maps the data values to the loci of individual probe sets, which typically cover a small portion of a gene, but users can choose to have values mapped to the entire gene locus instead.
(iii) Perform mappings provided by the user in the input gene expression file.

Genomic annotations

IGV supports a number of formats for genomic annotations, including BED, GFF, GTF2 [20] and PSL [21]. Visual representation of annotations follows many of the conventions introduced by the UCSC Genome Browser. For example, gene exonic regions are displayed as solid blocks connected by thin lines representing introns. By default, annotations are drawn on a single row in ‘collapsed’ mode. Tracks that contain overlapping features, such as multiple isoforms for a gene, can be expanded to reveal all features.

Sample attributes

Tracks can be annotated with metadata by loading a tab-delimited sample information file. The metadata might include, for example, clinical, experimental or computational data such as patient identifier, pedigree, phenotype, outcome, cluster membership, etc. Metadata is displayed as a color-coded matrix in the attribute panel. Each column in the matrix corresponds to a specific attribute, and colors are used to distinguish different values of that attribute. Colors are assigned explicitly by the user or chosen automatically by IGV. Numeric attributes can have a one- or two-color continuous color scale in lieu of specific colors for each value.

Interacting with the view

IGV provides a variety of ways to interact with views of the data. To select the genomic region to view and the zoom level, an investigator can enter genomic coordinates, search for genomic features by name, zoom in or out by clicking on the railroad track control in the upper right of the window or by using mouse shortcuts, and pan by click-dragging in the data panel. Tracks can be grouped and filtered, based on one or more sample attribute values. They also can be sorted based on attribute values or on data values in a genomic subregion. IGV selects default display parameters, such as graph type and colors, for each track, based on the data type and any preferences the user may have set. A right click on any track will bring up a menu with display options specific to the type of data displayed in the track. A preferences window also provides many data type-specific options.

Viewing multiple genomic regions

With the IGV 2.0 release, we introduced a flexible ‘multilocus’ mode to support viewing multiple genomic regions side by side. In this view, the data panel in the IGV window is subdivided into a series of vertical panels, one for each region, all displaying the same set of tracks. All track manipulation features, e.g. grouping, overlaying and sorting functions, are available and applied simultaneously to all panels. Panning and zooming within each panel are supported and the user can change the order of the panels by dragging them to different locations in the window. Currently, IGV supports two different types of multilocus views.

The first type of multilocus view is invoked by specifying regions, by genomic locus or gene name, either by entering them in the search box or using the ‘Gene Lists’ option from the IGV menu. IGV provides a number of predefined gene lists representing pathways from public databases, and users can also create and save their own lists. This is illustrated in Figure 7, with a view of copy number, mutation and clinical data from 202 glioblastoma samples from the TCGA project [22]. The window has been split into panels corresponding to four genes from the p53 signaling pathway.

Figure 7: Gene-list view of copy number, mutation and clinical data from 202 glioblastoma samples from the TCGA project. The IGV window has been split into panels corresponding to four genes from the p53 signaling pathway. Copy number is indicated by color, with blue denoting deletion and red amplification. Mutations are overlaid as small black rectangles. The samples have been sorted by copy number of CDKN2A. In this view it is apparent that deletion of CDKN2A and mutation of TP53 tend to be mutually exclusive.

The second type of multilocus view is useful for viewing paired-end sequence read alignments when the mates are aligned to distant regions of the genome. Selecting a read and choosing ‘Show Mate Region’ from the options menu will split the view and display the genomic regions of both mates. Figure 8 illustrates viewing both sides of a balanced translocation in a glioblastoma tumor sample.

Figure 8: Split-screen view of read alignments from a glioblastoma multiforme tumor sample and matched normal, displaying regions of chromosomes 1 and 6. In this example, alignments whose mate pairs are mapped to unexpected locations are color-coded by the chromosome of the mate; other alignments are displayed in light gray. The brown alignments on the left panel and purple alignments on the right are mate pairs, indicating a fusion between these loci. There is no evidence of this rearrangement in the matched normal.

Saving and sharing sessions

Users can save the current state of an IGV session to a file. This file stores information on which data sets are loaded (the data sets themselves are not stored), how they have been grouped and sorted, and the current view and zoom level. Saved session files can be sent to collaborators to replicate the same view, as long as they have access to the same data files. It is also possible to share sessions with remote users by putting the session file, along with the associated data files, in a web-accessible location. Others can then reproduce the session in IGV by using an HTML link of the form: http://www.broadinstitute.org/igv/projects/current/igv.php?sessionURL=URL&locus=locus.

... and image generation. This enables visiting a large number of genomic sites quickly and producing image snapshots that can later be viewed offline.
Often, this is used to visually validate a For example, the following link opens IGV on the large number of variant sites and flag those ‘gbm_subtypes_session.xml’ session file hosted at the needed for follow-up inspection. Broad Institute and goes to the specified locus (ii) The port interface can support similar use cases, on chromosome 7. http://www.broadinstitute.org/ but also makes it possible to control IGV from igv/projects/current/igv.php?sessionURL¼http:// any language or tool that can write to a socket. www.broadinstitute.org/igvdata/tcga/gbmsubtypes/ For example, MATLAB users have used this gbm_subtypes_session.xml&locus¼chr7:55054218-5 capability to tie IGV to interactive analyses, loading files and jumping to loci in response to commands from MATLAB functions. Controlling IGV (iii) The HTTP interface supports creating links to As a desktop application, the most common mode of launch IGV on a specific data set or send data to interaction with IGV is through a graphical user an already running IGV. Users can easily embed interface. However, external programs can also con- these links in their own pages and documents. trol IGV using the following interfaces: (i) a batch This feature has been used to launch IGV from script interpreter, (ii) a socket port interface and (iii) Excel spreadsheets, Word documents and to an HTTP interface: view data presented by web portals, including the Tumorscape Portal [23] and the cBio (i) Scripting allows many IGV actions to be auto- mated, such as loading data, navigation, sorting Cancer Genomics Portal [24]. Integrative Genomics Viewer 191 Acknowledgements Utilities We thank the following collaborators for their contributions The ‘igvtools’ provide a set of utilities for preprocess- to components described in this manuscript: Damon May, ing data files. 
These include utilities for: (i) convert- Fred Hutchinson Cancer Research Center, for the RNA-seq ing files to the Binary Tiled Data (TDF) format for splice junction viewer; Fabien Campagne, Campagne faster loading and retrieval of large data sets, (ii) com- Laboratory, Institute for Computational Biomedicine, Weill Cornell Medical College, for the Goby alignment format mod- puting read alignment coverage, (iii) computing fea- ules and Benjamin Berman of the USC Epigenome Center ture density and (iv) creating an index file for feature for the bisulfite sequencing components. files and text SAM alignment files. The ‘igvtools’ utilities can be run from the IGV user interface, or downloaded as a separate package from the web site and run from the command line. FUNDING National Institute of General Medical Sciences (R01GM074024); National Cancer Institute (R21CA135827); National Human Genome Re- FUTURE DIRECTIONS search Institute (U54HG003067) and Starr Cancer While originally developed for use in cancer genome Consortium (I5-A500). characterization studies, IGV is now used extensively in a broad range of basic biology and biomedical studies, and will continue to evolve with the changing needs of the biomedical research community. Here, References we name a few of the opportunities and challenges on 1. Robinson JT, Thorvaldsdottir H, Winckler W, et al. the immediate horizon. First, the increasing scale of Integrative genomics viewer. Nat Biotechnol 2011;29:24–6. NGS data sets will continue to challenge the capabil- 2. Milne I, Bayer M, Cardle L, et al. Tablet—next gener- ities of existing visualization tools, including the IGV. ation sequence assembly visualization. Bioinformatics 2010; 26:401–2. Even now, large studies can comprise hundreds to 3. Carver T, Bohme U, Otto TD, et al. BamView: viewing thousands of whole-genome and whole-exome mapped read alignment data in the context of the reference sequencing experiments. Visualization of these data sequence. 
Bioinformatics 2010;26:676–7. will require new approaches for aggregating data in- 4. Fiume M, Williams V, Brook A, et al. Savant: genome telligently to reveal trends, while continuing to pro- browser for high-throughput sequencing data. Bioinformatics 2010;26:1938–44. vide access to lower-level details on demand. We also 5. Rutherford K, Parkhill J, Crook J, et al. Artemis: sequence anticipate the need for augmenting IGV with auxil- visualization and annotation. Bioinformatics 2000;16:944–5. iary nongenomic views, such as network views to 6. The Cancer Genome Atlas Research Network. highlight functional relationships in a pathway. Comprehensive genomic characterization defines human Finally, we plan to couple the IGV with external glioblastoma genes and core pathways. Nature 2008;455: tools to enable intelligent data-driven search and 1061–8. navigation. 7. Guttman M, Amit I, Garber M, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 2009;458:223–7. 8. The 1000 Genomes Project Consortium. A map of human Key Points genome variation from population-scale sequencing. Nature The IGV is a high-performance desktop viewer that efficiently 2010;467:1061–73. handles large heterogeneous data sets, while providing a 9. Li H, Handsaker B, Wysoker A, et al. The Sequence smooth and intuitive user experience at all levels of genome Alignment/Map format and SAMtools. Bioinformatics 2009; resolution. 25:2078–9. IGV allows researchers to visualize many different types of gen- 10. Danecek P, Auton A, Abecasis G, et al. The variant call omic data together, including NGS data, variantcalls, microarray format and VCFtools. Bioinformatics 2011;27:2156–8. data and genome annotations. Importantly, it also supports inte- grating metadata, such as clinical, phenotypic and other attri- 11. Sanger Institute. GFF: an Exchange Format for Feature bute information. Description. 
http://www.sanger.ac.uk/resources/software/ IGV provides flexible and fast loading of local and remote data gff/ (21 December 2011, date last accessed). sets. For indexed files, IGV loads data as needed for regions in 12. UCSC Genome Bioinformatics. BED Format. http:// view, thereby minimizing memory usage and data transfer of genome.ucsc.edu/FAQ/FAQformat.html#format1 (21 remote files. December 2011, date last accessed). IGV has a flexible ‘multilocus’ mode that supports viewing mul- 13. UCSC Genome Bioinformatics. Wiggle Track Format tiple genomic regions, side by side. (WIG). http://genome.ucsc.edu/goldenPath/help/wiggle IGV is freely available at http://www.broadinstitute.org/igv. (21 December 2011, date last accessed). ¤ 192 Thorvaldsdottir et al. 14. Campagne Laboratory, Institute for Computational Biology, 20. Brent Laboratory, Washington University in St. Louis. Weill Cornell Medical School. Goby. http://campagnelab. GTF2 Format (Revised Ensembl GTF). http://mblab.wustl. org/software/goby/ (21 December 2011, date last accessed). edu/GTF2.html (21 December 2011, date last accessed). 15. Li H. Tabix: fast retrieval of sequence features from generic 21. UCSC Genome Bioinformatics. PSL Format. http:// TAB-delimited files. Bioinformatics 2011;27:718–9. genome.ucsc.edu/FAQ/FAQformat.html#format2 (21 December 2011, date last accessed). 16. Kent WJ, Zweig AS, Barber G, et al. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 22. Verhaak RG, Hoadley KA, Purdom E, et al. Integrated 2010;26:2204–7. genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in 17. The ENCODE Project Consortium. The ENCODE PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 2010; (ENCyclopedia Of DNA Elements) Project. Science 2004; 17:98–110. 306:636–40. 23. Beroukhim R, Mermel CH, Porter D, et al. The landscape 18. Dowell RD, Jokerst RM, Day A, et al. 
The distributed an- of somatic copy-number alteration across human cancers. notation system. BMC Bioinformatics 2001;2:7. Nature 2010;463:899–905. 19. Berman BP, Weisenberger DJ, Aman JF, et al. Regions of 24. Memorial Sloan-Kettering Cancer Center. cBio Cancer focal DNA hypermethylation and long-range hypomethy- Genomics Portal. http://www.cbioportal.org/ (21 lation in colorectal cancer coincide with nuclear December 2011, date last accessed). lamina-associated domains. Nat Genet 2011;44:40–6. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Briefings in Bioinformatics Pubmed Central http://www.deepdyve.com/lp/pubmed-central/integrative-genomics-viewer-igv-high-performance-genomics-data-HSq0zg6qPY
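As an illustration of the socket port interface described under 'Controlling IGV', the following Python sketch builds a short list of batch commands and sends them, one per line, to a running IGV. It is a minimal sketch under stated assumptions and not code from the article: the function names are hypothetical, the command names (`load`, `snapshotDirectory`, `goto`, `snapshot`) and the port number 60151 are taken from IGV's batch-command documentation and should be checked against the installed version, and IGV is assumed to answer each command with one status line.

```python
import socket


def build_snapshot_commands(loci, snapshot_dir, session=None):
    """Return batch commands that visit each locus and save one image per locus.

    Hypothetical helper: the command names are assumed from IGV's batch
    documentation, not from the article text.
    """
    commands = []
    if session is not None:
        commands.append(f"load {session}")
    commands.append(f"snapshotDirectory {snapshot_dir}")
    for locus in loci:
        commands.append(f"goto {locus}")
        # Name each image after its locus; ':' is replaced for file-system safety.
        commands.append(f"snapshot {locus.replace(':', '_')}.png")
    return commands


def send_commands(commands, host="127.0.0.1", port=60151, timeout=30.0):
    """Send commands line by line to a running IGV and collect its replies.

    Assumes IGV's port is enabled and that it writes one reply line per command.
    """
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("r", encoding="utf-8", newline="\n")
        for command in commands:
            sock.sendall((command + "\n").encode("utf-8"))
            replies.append(reader.readline().strip())
    return replies
```

Keeping command construction separate from the socket code means the same list can also be written to a text file and run through the batch script interpreter, which is the variant-review use case the article describes.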



Publisher: Pubmed Central
Copyright: © The Author(s) 2012. Published by Oxford University Press.
ISSN: 1467-5463
eISSN: 1477-4054
DOI: 10.1093/bib/bbs017

Abstract

Data visualization is an essential component of genomic data analysis. However, the size and diversity of the data sets produced by today’s sequencing and array-based profiling methods present major challenges to visualization tools. The Integrative Genomics Viewer (IGV) is a high-performance viewer that efficiently handles large heteroge- neous data sets, while providing a smooth and intuitive user experience at all levels of genome resolution. A key characteristic of IGV is its focus on the integrative nature of genomic studies, with support for both array-based and next-generation sequencing data, and the integration of clinical and phenotypic data. Although IGV is often used to view genomic data from public sources, its primary emphasis is to support researchers who wish to visualize and explore their own data sets or those from colleagues. To that end, IGV supports flexible loading of local and remote data sets, and is optimized to provide high-performance data visualization and exploration on standard desk- top systems. IGV is freely available for download from http://www.broadinstitute.org/igv, under a GNU LGPL open-source license. Keywords: visualization; next-generation sequencing; NGS; genome viewer; IGV resolution, from whole genome to base pairs. IGV INTRODUCTION Next-generation sequencing (NGS) and array-based is designed to be accessible to a wide range of users, including bench biologists and bioinformaticians. profiling methods now generate large quantities of While new users appreciate the user-friendly and in- diverse types of genomic data and are enabling re- searchers to study the genome at unprecedented tuitive interface, more experienced users can also resolution. Analysis of these large, diverse, data sets take advantage of the many advanced features and has become the rate-limiting step in many studies. preferences. 
There are a number of other desktop applications Although much of the analysis can be automated, available for visualization of genomic data, particu- human interpretation and judgment, supported by larly NGS data, including Tablet [2], BamView [3], rapid and intuitive visualization, is essential for gain- ing insight and elucidating complex biological rela- Savant [4] and Artemis [5]. In comparison to these tionships. We describe the Integrative Genomics tools, a notable characteristic of IGV is its breadth. IGV was developed to support a diverse range of data Viewer (IGV) [1], a high-performance desktop tool types, including NGS and array-based platforms, for interactive visual exploration of diverse genomic such as expression and copy-number arrays. These data. Even for very large data sets, IGV supports real-time interaction at all scales of genome different data types can be flexibly integrated, and Corresponding author. James T. Robinson, Broad Institute, 7 Cambridge Center (301B-5057), Cambridge, MA 02142, USA. Tel.: þ617-714-7491; Fax: þ617-714-8991; E-mail: [email protected] HelgaThorvaldsdo¤ ttir is a senior software project manager in the Cancer Program at the Broad Institute of MIT and Harvard. JamesT. Robinson is a principle software engineer in the Cancer Program at the Broad Institute of MIT and Harvard, where he has worked on omics visualization software since 2006. Jill P. Mesirov is Chief Informatics Officer of the Broad Institute of MIT and Harvard, where she directs the Computational Biology and Bioinformatics Organization, and a member of the Koch Institute for Integrative Cancer Research at MIT. The Author(s) 2012. Published by Oxford University Press. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Integrative Genomics Viewer 179 combined with clinical and other sample metadata to draw inferences based on related events that are dynamically group, sort and filter data sets. Another widely separated in genomic coordinates. IGVs mul- distinguishing feature of IGV is the ability to view tilocus views are described further in the ‘Features’ section below. data in multiple genomic regions simultaneously in adjacent panels, for example to view correlated events in distal regions. METHODS AND TECHNOLOGIES IGV is a desktop application written in the Java pro- gramming language and runs on all major platforms HISTORICAL BACKGROUND (Windows, Mac and Linux). Below, we describe in IGVs direction and focus are driven by our collab- more detail some components of the IGV imple- orations with investigators from a wide variety of mentation, including our data-tiling approach for large and small biomedical research projects. IGV supporting large data sets and IGVs support for dif- development started in 2007 in response to a need ferent categories of file formats. We also provide a by The Cancer Genome Atlas (TCGA) [6] project to high-level overview of IGVs software architecture. visualize integrated copy number data, expression, mutation and clinical data. The size of these data sets posed a challenge to desktop visualization tools Data tiling available at the time. To handle these large data sets A primary design goal of IGV is to support inter- in IGV, we introduced a data-loading scheme that active exploration of large-scale genomic data sets on includes indexing of data files as well as on-demand standard desktop computers. This poses a difficult loading. 
This strategy enabled viewing and inter- challenge as NGS and recent array-based technolo- active exploration of up to several hundred samples gies can generate data sets from gigabytes to terabytes from the largest copy-number array available at the in size. Simply loading these entire data sets into time, comprising approximately half a million memory is not a viable option. In addition, re- probes. searchers search for meaningful events at many dif- The next major application of IGV, visualization ferent genomic resolution scales, from whole of ChIP-seq data from whole-genome sequences to genome to individual base pairs. The problem is find de novo long intergenic noncoding RNAs analogous to that faced by interactive geographical (lincRNAs) [7], provided motivation to extend the mapping tools, which provide views of very large data size limit even further. We developed a binary geographical databases over many resolution scales. multiresolution tiled data format to support data sets Tools such as Google Maps solve this problem by of up to hundreds of gigabytes in size. At this time, precomputing images representing sections of maps we also added support for viewing genome annota- at multiple resolution scales, and providing fast access tions. In August 2008, we deployed the first public to the images as needed to construct a view. We release of the IGV software and web site (www. considered such an approach for IGV, based on broadinstitute.org/igv). precomputed images of genomic data. However, Support for visualization of short-read sequence millions of images would be required to support all alignments followed in May 2009. In this effort, resolution scales for a large genome, thus making we collaborated with members of the 1000 image management difficult without introducing Genomes Project [8] involved in development of the requirement of a database. Furthermore, the rep- the SAM/BAM alignment format [9]. 
Another collaboration with the 1000 Genomes Project led to IGV support for visualizing genome variation data in the VCF format [10] in the IGV 2.0 release of May 2011.

In IGV 2.0, we also introduced a flexible, ‘multilocus’ mode for viewing multiple genomic regions, side by side and thereby eliminated the restriction of viewing only a single contiguous genomic region at a time. This new view mode is particularly useful when investigators seek to test hypotheses and

resentation of the data would be fixed when the images are computed, making it difficult to provide interactive graphing options. Consequently, we adopted a different approach that is based on precomputing summarizations of data at multiple resolution scales, with rendering of the data deferred to runtime. We refer to this as ‘data tiling’, to distinguish it from ‘image tiling’.

IGV’s data tiling implementation is built on a pyramidal data structure that can be described as follows. For each resolution scale (‘zoom level’), the genome is divided into tiles that correspond to the region viewable on the screen of a typical user display. The first zoom level consists of a single tile that covers the entire genome. The next zoom level contains a single tile for each chromosome. The number of tiles then increases by a factor of 2 for each level, so the next zoom level consists of two tiles per chromosome, then four, etc. Each tile is subdivided into ‘bins’, with the width of a bin chosen to correspond to the approximate genomic width represented by a screen pixel at that resolution scale. The value of each bin is calculated from the underlying genomic data with a summary statistic, such as ‘mean’, ‘median’ or ‘maximum’. By organizing data in this way, tile sizes for each zoom level are constant and small, containing only the data needed to render the view at the resolution supported by the screen display. Hence, a single tile at the lowest resolution, which spans the entire genome, has the same memory footprint as a tile at a high-resolution zoom level, which might span only a few kilobases. As the user moves across the genome and through zoom levels, IGV only retrieves the tiles required to support the current view and discards tiles no longer in view to free memory. This method supports browsing very large data sets at all resolution scales with minimal memory requirements.

For large genomes, precomputing tiles for all zoom levels would be inordinately expensive with respect to disk space. For example, the human genome requires approximately 23 zoom levels, or on the order of 2^23 tiles, to cover the whole genome to base pair resolution. In practice, IGV uses a hybrid approach, combining precomputed lower-level zoom levels with high-resolution tiles computed on the fly. This is possible as the high-resolution tiles cover relatively small portions of the genome. The number of precomputed zoom levels required to achieve good performance varies by data density and genome size. In our experience, seven levels give acceptable performance for even the highest density human genome data.

File formats

To support the multiresolution data model described earlier, we developed a corresponding file format. The ‘tiled data format’, or TDF, stores the pyramidal data tile structure and provides fast access to individual tiles. TDF files can be created using the auxiliary package ‘igvtools’. We note, however, that IGV does not require conversion to TDF before data can be loaded. In fact, IGV supports a variety of genomic file formats, which can be divided into three categories: (i) nonindexed, (ii) indexed and (iii) multiresolution formats:

(i) Nonindexed formats include flat file formats such as GFF [11], BED [12] and WIG [13]. Files in these formats must be read in their entirety and are only suitable for relatively small data sets.

(ii) Indexed formats include BAM and Goby [14] for sequence alignments. Additionally, many tab-delimited feature formats can be converted to an indexed file using Tabix [15] or ‘igvtools’. Indexed formats provide rapid and efficient access to subsets of the data for display, but only when zoomed in to a sufficiently small genomic region. Zooming out requires ever-larger portions of the file to be loaded. Thus, indexed formats can efficiently support views only for a limited range of resolution scales. This range depends on the genomic density of the underlying data and can span tens of kilobases for NGS alignments, hundreds of megabases for typical variant (SNP) files, or whole chromosomes for sparse feature files. IGV uses heuristics to determine a suitable upper limit on the genomic range that can be loaded quickly with a reasonable memory footprint. If zoomed out beyond this limit, the data are not loaded.

(iii) Multiresolution formats, such as our TDF described earlier and the bigWig and bigBed formats [16], include both an index for the raw data, and precomputed indexed summary data for lower resolution (zoomed out) scales. Multiresolution formats can efficiently support views at any resolution scale.

Software architecture

The IGV software structure is designed around a core set of interfaces and extendable classes. These components can be separated into three conceptual layers as illustrated in Figure 1: (i) a top-level application layer, (ii) a data layer and (iii) a stream layer. These are described in more detail below:

Figure 1: IGV class diagram, illustrating the IGV software structure.

(i) The application layer includes the main IGV window and user interface elements, along with controllers for user interaction events. It
also contains representations of genomic features and data. IGV displays these in horizontal rows known as ‘tracks’. Tracks are displayed in a data panel, which is implemented as a class derived from Java Swing components. The data panel is responsible for the coordination of track layout and rendering, and managing mouse events. It handles certain globally shared mouse actions, such as zooming and panning, and delegates other events to the objects representing tracks. Tracks are responsible for handling these events, as well as requesting features as needed from the data layer and drawing these features on the panel. Most track implementations delegate the drawing task to a renderer object. Renderers are designed to be pluggable, and can be swapped at runtime, for example to switch graph types in response to a menu action.

(ii) The data layer reads and parses the different genomic file formats and supplies the application layer with data tiles on demand. It also implements caching of tiles, for improved efficiency if a previously visited genomic region is requested again.

(iii) The stream layer is responsible for supporting random access to sections of files accessed by any of the protocols supported by IGV, i.e. local file, HTTP, HTTPS or FTP. Random file access is necessary to take advantage of the indexed and multiresolution file formats. For local files, this is straightforward using Java’s RandomAccessFile class, or alternatively positionable file channels. Remote files presented a challenge, as there are no Java built-in functions or libraries that support this access pattern. Initially, we solved this problem using a web service. However, this approach was not ideal, as users who wished to host IGV files were also required to install and run the web service on their systems. Consequently, we designed and implemented a set of classes to provide a uniform interface for random file access for all the protocols. IGV’s implementation for the HTTP protocol uses byte range requests from the standard HTTP specification. For the FTP protocol, IGV uses the mechanism for restarting downloads that is supported by most FTP servers via the ‘REST’ command.

FEATURES

IGV is a desktop application for the visualization and interactive exploration of genomic data in the context of a reference genome. A key characteristic of IGV is its focus on the integrative nature of genomic studies. It allows investigators to flexibly visualize many different types of data together and, importantly, also integrate these data with the display of sample attribute information such as clinical and phenotypic information. To support interactive exploration of data, IGV provides direct manipulation navigation in the style of Google Maps. For instance, you click and drag to pan the view across the genome and double-click on a region to zoom in for a more detailed view. It supports real-time interaction at all scales of genome resolution, from whole genome to base pairs, even for very large data sets. The Broad IGV data server hosts many genome annotation files and data sets from a variety of public sources (including from TCGA, 1000 Genomes Project, ENCODE Project [17] and others). However, the primary emphasis is on supporting biomedical researchers who wish to load, visualize and explore their own data sets aligned to the selected reference genome. Researchers can also make their data sets available to others for view in IGV, sharing them with colleagues or the community at large.

Launching IGV

IGV is available on all platforms that support Java. Installation and launching are accomplished with a click of a button on the IGV web site at www.broadinstitute.org/igv. Alternatively, users can download a ZIP archive to install the application locally. IGV can also be launched from links embedded in web pages or other documents.

The IGV window

The IGV window is divided into a number of controls and panels as illustrated in Figure 2. At the top is a command bar with controls for selecting a reference genome, navigating and defining regions of interest. Just below the command bar is a header panel with an ideogram representation of the currently viewed chromosome, along with a genome coordinate ruler that indicates the size of the region in view. The ideogram also displays a red rectangle that outlines the region in view. The remainder of the window is divided into one or more data panels and an attribute panel. Data are mapped to the genomic coordinates of the reference genome and are displayed in the data panels as horizontal rows called ‘tracks’. Each track typically represents one sample, experiment or genomic annotation. If any sample or track attributes have been loaded, they are displayed as a color-coded matrix in the attribute panel as illustrated in Figure 3. Each column in the matrix corresponds to an attribute, and a track’s attribute values are displayed as a row of colored cells adjacent to the track.

Figure 2: The IGV application window.

Reference genome

A reference genome must be selected before loading data. IGV provides dozens of hosted reference genomes to choose from, but also provides the option of importing others from the sequence data. The minimal requirement for importing a genome is a FASTA file containing chromosome or contig sequences. Other genome information is optional, including: (i) cytoband information for the chromosome ideogram in the IGV window, (ii) annotations defining the features displayed in the gene track for the genome and (iii) chromosome alias information that defines synonyms for the sequence names defined in the FASTA file, e.g. ‘1’ for ‘chr1’. We note that this option is intended primarily for finished assemblies. IGV is not designed for visualization of unassembled genomes.

When the view is sufficiently zoomed in, IGV displays the reference genome sequence as a separate track in the data panel. Depending on the zoom level, the nucleotides are represented as colored bars or letters. By default, the forward strand is displayed. Clicking on a strand indicator for the track toggles the strand direction. Another option enables the display of three-frame translation of codons for the current strand.

Loading data

IGV was designed to accommodate any data that can be mapped to genomic coordinates. It currently supports more than 30 different file formats, including many of the common formats for genome annotations, sequence alignments, variant calls and microarray data. Importantly, users can also load metadata, such as clinical, phenotypic or other attribute information, to annotate the genomic data. Data files are loaded into IGV by: (i) using the built-in file browser to select a file on the local file system, (ii) entering the URL of a file accessible over a network via HTTP or FTP, (iii) entering the URL of a Distributed Annotation System (DAS) feature source [18] or (iv) selecting entries from the ‘File > Load from Server’ menu. By default, the menu provides access to data and annotation files that are hosted at the Broad Institute for viewing in IGV. This can easily be changed to point to any set of web-accessible files. For example, the menu could provide access to shared files on a research project’s central server.

Viewing data

IGV supports simultaneous viewing of multiple data sets, with the same or different types of data. A track’s default appearance and available view options will vary depending on the data type. The following sections describe some of the commonly viewed types.
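
As a rough illustration of the data tiling scheme described in the methods discussion earlier in this article (one whole-genome tile, then one tile per chromosome, then a doubling at each further zoom level, with each tile summarized into pixel-width bins), the arithmetic can be sketched as follows. This is a minimal sketch under stated assumptions and is not IGV code: the function names `tiles_at_zoom`, `tile_index` and `summarize_bins`, the 0-based coordinates and the equal-width split of a chromosome into tiles are all hypothetical choices made for the example.

```python
from statistics import mean


def tiles_at_zoom(zoom, n_chromosomes):
    """Number of tiles at a zoom level under the scheme described in the text.

    Level 0 is a single tile covering the whole genome, level 1 has one tile
    per chromosome, and each further level doubles the tiles per chromosome.
    """
    if zoom < 0:
        raise ValueError("zoom must be non-negative")
    if zoom == 0:
        return 1
    return n_chromosomes * 2 ** (zoom - 1)


def tile_index(position, chrom_length, zoom):
    """Index of the tile containing a 0-based position on one chromosome.

    Assumes (for illustration only) that a chromosome is split into
    equal-width tiles at zoom levels >= 1.
    """
    if zoom < 1:
        raise ValueError("per-chromosome tiles start at zoom level 1")
    if not 0 <= position < chrom_length:
        raise ValueError("position outside chromosome")
    n_tiles = 2 ** (zoom - 1)
    return position * n_tiles // chrom_length


def summarize_bins(values, n_bins, statistic=mean):
    """Reduce per-base values for one tile to n_bins summary values.

    The statistic can be any callable on a list, e.g. mean, median or max,
    mirroring the summary statistics named in the text.
    """
    if not 0 < n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and len(values)")
    # Integer edges keep every bin non-empty when n_bins <= len(values).
    edges = [i * len(values) // n_bins for i in range(n_bins + 1)]
    return [statistic(values[a:b]) for a, b in zip(edges, edges[1:])]
```

For example, with a hypothetical genome of 24 chromosomes, `tiles_at_zoom(3, 24)` gives 96 tiles, and `summarize_bins([1, 2, 3, 4], 2)` reduces four per-base values to two bin means. Because each tile holds the same number of bins regardless of how much genome it spans, the memory cost per tile stays constant across zoom levels, which is the property the text relies on.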
Figure 3: The attribute panel displays a color-coded matrix of phenotypic and clinical data. Clicking on a column header will sort the tracks by the corresponding attribute.

NGS data

IGV includes a large number of specialized features for exploring NGS read alignments, including features tailored for variant visualization and validation, splicing of RNA transcripts and methylation from bisulfite sequencing. IGV supports several read alignment file formats, including SAM, BAM and Goby. When more than one file is loaded into IGV, it can display the reads from each file in a separate panel or merge them together as if they came from the same file.

Due to the magnitude of the data stored in NGS alignment files, IGV displays varying level of data detail depending on the zoom level. This is necessary for both application performance, as described in the ‘Methods and Technologies’ section above, and to help investigators make sense of the massive amount of data. When viewing a whole chromosome it is not useful to display all the reads, as individual reads are not distinguishable in the view at this level of zoom. Therefore, when zoomed all the way out, only a bar chart of the read coverage is displayed. IGV provides tools to precompute this coverage. When zoomed in past a user-settable visibility threshold (by default, 30 kb), the individual read alignments come into view and are displayed as horizontal bars. At this zoom level, IGV dynamically computes the read coverage in the viewed region. Zooming further reveals the individual read bases. IGV uses color and transparency to highlight interesting events in the alignment data and to visually deemphasize others: in the coverage chart, at the read level and for individual bases. Various properties can be used to change the read color scheme on the fly, and to interactively group, sort and filter the reads. These features can be used alone or in combination, using one or more properties. The properties include sample identifier, strand, read group, mapping quality, base call at a particular position, pair distance and orientation, custom tags and more.

Zooming in past the alignment visibility threshold will also add color bars to the gray coverage track at locations where a large number of read bases mismatch the reference, which is helpful in identifying putative SNPs. The relative size and color of these bars indicate the allele frequency of each base at that location. An example of this can be seen in Figure 4A. In Figure 4B, the view has been zoomed in further to show the reads and individual bases in the region of one of the putative SNPs identified in the coverage track. Individual read bases that match the reference genome are displayed in the same color as the read, while mismatches are color-coded by the called base and are assigned a transparency value proportional to the base call quality (phred) score. This is the default coloring scheme for read bases, and has the effect of emphasizing high-quality mismatches. In this example, the read alignments have been colored and sorted by read strand. Visual inspection quickly reveals a number of factors that indicate this is not a true SNP. First, the reads harboring the putative SNP clearly have a large number of additional mismatched bases. Also, it is suspicious that all mismatches occur on the negative read strand, and that the mismatches tend to occur towards the end of the read.

Figure 4: Read alignment views at 20 kb and base pair resolution. IGV displays varying level of data detail depending on the zoom level, and uses color and transparency to highlight interesting events in the data. (A) Reads are summarized as a coverage plot. Positions with a significant number of mismatches with respect to the reference are highlighted with color bars indicative of both the presence of mismatches and the allele frequency. (B) Individual base mismatches are displayed with alpha transparency proportional to quality. In this example, the reads have been sorted and colored by strand.

A number of options are available for paired-end alignments to help elucidate structural variants, such as deletions, insertions and rearrangements. To highlight potential inter-chromosomal rearrangements, alignments whose mate falls on a different chromosome are assigned a color indicating the mate chromosome. This makes it easy to distinguish potential rearrangements from noise caused by misalignments, as rearrangements will appear as a pileup of reads that are consistent in color and orientation. Intra-chromosomal events, such as insertions, deletions, inversions and duplications, can affect both the genomic distance between the mate alignments, as well as their orientation. To highlight these events, IGV samples the alignment file to dynamically determine the expected distance and orientations, and then uses color to flag aberrant pairs.

Another coloring scheme is used to view bisulfite sequencing data. In this mode, the rules for what constitutes a mismatch to the reference genome are adjusted to account for the expected cytosine to uracil conversions. Figure 5 illustrates a view of Whole-Genome Bisulfite Sequencing (WGBS) data from a colorectal tumor and a matched normal sample [19]. Red indicates hypermethylated sites, and blue indicates hypomethylation.

Figure 5: IGV bisulfite sequencing view. (A) Two views of the IGF2/H19 Imprinting Control Region (ICR), illustrating allele-specific methylation of CTCF binding sites. The top view shows a 13-kb region of ChIP-seq histone marks from the ENCODE normal epithelial tissue (HMEC) cell line. The second view shows WGBS read alignments from normal colonic mucosa [19], zoomed in to 75 bp. CpG dinucleotides are shown as blue (unmethylated) and red (methylated) squares. A heterozygous C/T SNP is also apparent, and the T allele is overwhelmingly associated with reads that have methylated CpGs (from the paternal chromosome). (B) The enhancer region surrounding exons 2 and 3 of the B3GNTL1 gene is apparent from the ENCODE tracks showing characteristic enhancer histone marks in a normal epithelial (HMEC) cell line. The bisulfite sequencing view of the read alignments shows that this enhancer is methylated (red, lighter) in normal colon mucosa, but almost completely unmethylated (blue, darker) in the matched colon tumor sample [19]. The cancer-specific de-methylation of this enhancer is consistent with the upregulation of the B3GNTL1 transcript in the tumor.

Figure 6 shows how IGV displays RNA-seq read alignments, connecting segments of reads that are split across exons with thin horizontal lines. This example demonstrates several RNA-seq data tracks for normal tissue samples from two different organs, including tracks for coverage, junctions, transcripts and the read alignments. The junction tracks highlight alternative splicing, also visible in the alignments themselves.

Figure 6: Visualization of RNA-seq data from heart and liver tissue samples. Each panel includes tracks for total coverage, junction coverage, predicted transcripts and read alignments. Reads that span junctions are connected with thin blue lines. In the junction track, the height of each arc is proportional to the total number of reads spanning the junction. There is clear evidence of alternative splicing between the two tissues.

Variant calls

IGV provides extended support for viewing variants stored in the VCF format. This format allows for the encoding of variant calls (SNPs, indels and genomic rearrangements) as well as the supporting genotype information for individual samples. Samples can also be annotated with attribute information, including pedigree and family information. IGV uses these annotations to group, sort and filter samples, for example, to group samples by pedigree or population group.

Copy number and expression data

Copy number and expression data can be loaded from a variety of file formats. These data types are displayed as heatmaps by default. Heatmaps are very space efficient in comparison to bar charts and other graph types, as the height of each track can be reduced to a single pixel. This is important for these data types as experiments are often performed on high-throughput arrays and it is not uncommon to view hundreds or thousands of samples simultaneously.

Expression data require special treatment as the expression values are usually not specified in genomic coordinates, but rather are associated with gene names or chip probe identifiers. These data must be mapped to genomic locations prior to display and IGV provides several options for this step: (i) Automatically map data values associated with gene names, based on information in the gene track of the reference genome. (ii) Automatically map data values that are associated with probes and probe-sets for many common platforms and chips, such as those from Affymetrix, Agilent and Illumina, based on files published by the vendors. By default, IGV maps the data values to the loci of individual probe sets, which typically cover a small portion of a gene, but users can choose to have values mapped to the entire gene locus instead. (iii) Perform mappings provided by the user in the input gene expression file.

Genomic annotations

IGV supports a number of formats for genomic annotations, including BED, GFF, GTF2 [20] and PSL [21]. Visual representation of annotations follows many of the conventions introduced by the UCSC Genome Browser. For example, gene exonic regions are displayed as solid blocks connected by thin lines representing introns. By default, annotations are drawn on a single row in ‘collapsed’ mode. Tracks that contain overlapping features, such as multiple isoforms for a gene, can be expanded to reveal all features.

Sample attributes

Tracks can be annotated with metadata by loading a tab-delimited sample information file. The metadata might include, for example, clinical, experimental or computational data such as patient identifier, pedigree, phenotype, outcome, cluster membership, etc. Metadata is displayed as a color-coded matrix in the attribute panel. Each column in the matrix corresponds to a specific attribute, and colors are used to distinguish different values of that attribute. Colors are assigned explicitly by the user or chosen automatically by IGV. Numeric attributes can have a one- or two-color continuous color scale in lieu of specific colors for each value.

Interacting with the view

IGV provides a variety of ways to interact with views of the data. To select the genomic region to view and the zoom level, an investigator can enter genomic coordinates, search for genomic features by name, zoom in or out by clicking on the railroad track control in the upper right of the window or by using mouse shortcuts, and pan by click-dragging in the data panel. Tracks can be grouped and filtered, based on one or more sample attribute values. They also can be sorted based on attribute values or on data values in a genomic subregion. IGV selects default display parameters, such as graph type and colors, for each track, based on the data type and any preferences the user may have set. A right click on any track will bring up a menu with display options specific to the type of data displayed in the track. A preferences window also provides many data type-specific options.

Viewing multiple genomic regions

With the IGV 2.0 release, we introduced a flexible, ‘multilocus’ mode to support viewing multiple genomic regions, side by side. In this view, the data panel in the IGV window is subdivided into a series of vertical panels, one for each region, all displaying the same set of tracks. All track manipulation features, e.g. grouping, overlaying and sorting functions, are available and applied simultaneously to all panels. Panning and zooming within each panel are supported and the user can change the order of the panels by dragging them to different locations in the window. Currently, IGV supports two different types of multilocus views.

The first type of multilocus view is invoked by specifying regions, by genomic locus or gene name, either by entering them in the search box or using the ‘Gene Lists’ option from the IGV menu. IGV provides a number of predefined gene lists representing pathways from public databases, and users can also create and save their own lists. This is illustrated in Figure 7, with a view of copy number, mutation and clinical data from 202 glioblastoma samples from the TCGA project [22]. The window has been split into panels corresponding to four genes from the p53 signaling pathway.

Figure 7: Gene-list view of copy number, mutation and clinical data from 202 glioblastoma samples from the TCGA project. The IGV window has been split into panels corresponding to four genes from the p53 signaling pathway. Copy number is indicated by color, with blue denoting deletion and red amplification. Mutations are overlaid as small black rectangles. The samples have been sorted by copy number of CDKN2A. In this view it is apparent that deletion of CDKN2A and mutation of TP53 tend to be mutually exclusive.

The second type of multilocus view is useful for viewing paired-end sequence read alignments when the mates are aligned to distant regions of the genome. Selecting a read and choosing ‘Show Mate Region’ from the options menu will split the view and display the genomic regions of both mates. Figure 8 illustrates viewing both sides of a balanced translocation in a glioblastoma tumor sample.

Figure 8: Split-screen view of read alignments from a glioblastoma multiforme tumor sample and matched normal, displaying regions of chromosomes 1 and 6. In this example, alignments whose mate pairs are mapped to unexpected locations are color-coded by the chromosome of the mate; other alignments are displayed in light gray. The brown alignments on the left panel and purple alignments on the right are mate pairs, indicating a fusion between these loci. There is no evidence of this rearrangement in the matched normal.

Saving and sharing sessions

Users can save the current state of an IGV session to a file. This file stores information on which data sets are loaded (the data sets themselves are not stored), how they have been grouped and sorted, and the current view and zoom level. Saved session files can be sent to collaborators to replicate the same view, as long as they have access to the same data files. It is also possible to share sessions with remote users by putting the session file, along with the associated data files, in a web-accessible location. Others can then reproduce the session in IGV by using an HTML link of the form:

http://www.broadinstitute.org/igv/projects/current/igv.php?sessionURL=URL&locus=locus

For example, the following link opens IGV on the ‘gbm_subtypes_session.xml’ session file hosted at the Broad Institute and goes to the specified locus on chromosome 7.

http://www.broadinstitute.org/igv/projects/current/igv.php?sessionURL=http://www.broadinstitute.org/igvdata/tcga/gbmsubtypes/gbm_subtypes_session.xml&locus=chr7:55054218-5

Controlling IGV

As a desktop application, the most common mode of interaction with IGV is through a graphical user interface. However, external programs can also control IGV using the following interfaces: (i) a batch script interpreter, (ii) a socket port interface and (iii) an HTTP interface:

(i) Scripting allows many IGV actions to be automated, such as loading data, navigation, sorting and image generation. This enables visiting a large number of genomic sites quickly and producing image snapshots that can later be viewed offline. Often, this is used to visually validate a large number of variant sites and flag those needed for follow-up inspection.

(ii) The port interface can support similar use cases, but also makes it possible to control IGV from any language or tool that can write to a socket. For example, MATLAB users have used this capability to tie IGV to interactive analyses, loading files and jumping to loci in response to commands from MATLAB functions.

(iii) The HTTP interface supports creating links to launch IGV on a specific data set or send data to an already running IGV. Users can easily embed these links in their own pages and documents. This feature has been used to launch IGV from Excel spreadsheets, Word documents and to view data presented by web portals, including the Tumorscape Portal [23] and the cBio Cancer Genomics Portal [24].

Utilities

The ‘igvtools’ provide a set of utilities for preprocessing data files. These include utilities for: (i) converting files to the Binary Tiled Data (TDF) format for faster loading and retrieval of large data sets, (ii) computing read alignment coverage, (iii) computing feature density and (iv) creating an index file for feature files and text SAM alignment files. The ‘igvtools’ utilities can be run from the IGV user interface, or downloaded as a separate package from the web site and run from the command line.

FUTURE DIRECTIONS

While originally developed for use in cancer genome characterization studies, IGV is now used extensively in a broad range of basic biology and biomedical studies, and will continue to evolve with the changing needs of the biomedical research community. Here, we name a few of the opportunities and challenges on the immediate horizon. First, the increasing scale of NGS data sets will continue to challenge the capabilities of existing visualization tools, including the IGV. Even now, large studies can comprise hundreds to thousands of whole-genome and whole-exome sequencing experiments. Visualization of these data will require new approaches for aggregating data intelligently to reveal trends, while continuing to provide access to lower-level details on demand. We also anticipate the need for augmenting IGV with auxiliary nongenomic views, such as network views to highlight functional relationships in a pathway. Finally, we plan to couple the IGV with external tools to enable intelligent data-driven search and navigation.

Key Points

The IGV is a high-performance desktop viewer that efficiently handles large heterogeneous data sets, while providing a smooth and intuitive user experience at all levels of genome resolution.

IGV allows researchers to visualize many different types of genomic data together, including NGS data, variant calls, microarray data and genome annotations. Importantly, it also supports integrating metadata, such as clinical, phenotypic and other attribute information.

IGV provides flexible and fast loading of local and remote data sets. For indexed files, IGV loads data as needed for regions in view, thereby minimizing memory usage and data transfer of remote files.

IGV has a flexible ‘multilocus’ mode that supports viewing multiple genomic regions, side by side.

IGV is freely available at http://www.broadinstitute.org/igv.

Acknowledgements

We thank the following collaborators for their contributions to components described in this manuscript: Damon May, Fred Hutchinson Cancer Research Center, for the RNA-seq splice junction viewer; Fabien Campagne, Campagne Laboratory, Institute for Computational Biomedicine, Weill Cornell Medical College, for the Goby alignment format modules and Benjamin Berman of the USC Epigenome Center for the bisulfite sequencing components.

FUNDING

National Institute of General Medical Sciences (R01GM074024); National Cancer Institute (R21CA135827); National Human Genome Research Institute (U54HG003067) and Starr Cancer Consortium (I5-A500).

References

1. Robinson JT, Thorvaldsdottir H, Winckler W, et al. Integrative genomics viewer. Nat Biotechnol 2011;29:24–6.
2. Milne I, Bayer M, Cardle L, et al. Tablet—next generation sequence assembly visualization. Bioinformatics 2010;26:401–2.
3. Carver T, Bohme U, Otto TD, et al. BamView: viewing mapped read alignment data in the context of the reference sequence. Bioinformatics 2010;26:676–7.
4. Fiume M, Williams V, Brook A, et al. Savant: genome browser for high-throughput sequencing data. Bioinformatics 2010;26:1938–44.
5. Rutherford K, Parkhill J, Crook J, et al. Artemis: sequence visualization and annotation. Bioinformatics 2000;16:944–5.
6. The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455:1061–8.
7. Guttman M, Amit I, Garber M, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 2009;458:223–7.
8. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 2010;467:1061–73.
9. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009;25:2078–9.
10. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics 2011;27:2156–8.
11. Sanger Institute. GFF: an Exchange Format for Feature Description. http://www.sanger.ac.uk/resources/software/gff/ (21 December 2011, date last accessed).
12. UCSC Genome Bioinformatics. BED Format. http://genome.ucsc.edu/FAQ/FAQformat.html#format1 (21 December 2011, date last accessed).
13. UCSC Genome Bioinformatics. Wiggle Track Format (WIG). http://genome.ucsc.edu/goldenPath/help/wiggle (21 December 2011, date last accessed).
14. Campagne Laboratory, Institute for Computational Biology, Weill Cornell Medical School. Goby. http://campagnelab.org/software/goby/ (21 December 2011, date last accessed).
15. Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 2011;27:718–9.
16. Kent WJ, Zweig AS, Barber G, et al. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 2010;26:2204–7.
17. The ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004;306:636–40.
18. Dowell RD, Jokerst RM, Day A, et al. The distributed annotation system. BMC Bioinformatics 2001;2:7.
19. Berman BP, Weisenberger DJ, Aman JF, et al. Regions of focal DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear lamina-associated domains. Nat Genet 2011;44:40–6.
20. Brent Laboratory, Washington University in St. Louis. GTF2 Format (Revised Ensembl GTF). http://mblab.wustl.edu/GTF2.html (21 December 2011, date last accessed).
21. UCSC Genome Bioinformatics. PSL Format. http://genome.ucsc.edu/FAQ/FAQformat.html#format2 (21 December 2011, date last accessed).
22. Verhaak RG, Hoadley KA, Purdom E, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 2010;17:98–110.
23. Beroukhim R, Mermel CH, Porter D, et al. The landscape of somatic copy-number alteration across human cancers. Nature 2010;463:899–905.
24. Memorial Sloan-Kettering Cancer Center. cBio Cancer Genomics Portal. http://www.cbioportal.org/ (21 December 2011, date last accessed).

Journal

Briefings in Bioinformatics – Pubmed Central

Published: Apr 19, 2012
