AGB Manual

Assembly Genome Browser (AGB) is a tool providing interactive visualization of assembly graphs, a wide range of tuning parameters, and various options for modifying/simplifying the graph.
AGB uses d3-graphviz, GfaPy, NetworkX-METIS, and QUAST-LG.

Contents

  1. Installation
  2. Running AGB
    1. For impatient people
    2. Input data
    3. Command line options
  3. AGB output
  4. Citation
  5. References

1. Installation

AGB can be run on Linux or macOS (OS X). Install conda if you don't have it yet:

        wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
        bash miniconda.sh -b -p ./miniconda
        unset PYTHONPATH
        export PATH=$(pwd)/miniconda/bin:$PATH
    
Create a new conda environment and install AGB into it:
        conda create -c almiheenko -c bioconda -n AGB agb
    
Activate the environment:
        conda activate AGB
    

To compile AGB by yourself you will need the following libraries to be pre-installed:

If you meet these requirements, you can download the AGB source code:

        git clone https://github.com/almiheenko/AGB.git
        cd AGB
        ./setup.py install

2. Running AGB

2.1. For impatient people


To visualize an assembly graph, run:
    agb.py --graph <GFA(1,2)/FASTG/Graphviz file> -a <assembler_name>
To visualize the output of a supported assembler (Canu, Flye, SPAdes), run:
    agb.py -i <assembler_output_dir> -a <assembler_name>
The assembly graph viewer will be saved to agb_output/viewer.html.

2.2. Input data


AGB takes as input either the output folder of a supported assembler or an assembly graph file, optionally accompanied by edge sequences and a reference genome.

Assembly output folder
Path to the assembly output folder. To date, the supported assemblers are Canu, Flye, and SPAdes.

Assembly graph
Assembly graph in GFA1/GFA2/FASTG/Graphviz format. Most popular assemblers are supported.

Sequences
Sequences of graph edges in FASTA format.

Reference
If you provide the reference genome in FASTA format, the genome browser visualization will be available.

2.3. Command line options


AGB runs from a command line as follows:
    agb.py --graph <file> -a <assembler_name> [--fasta <file>] [-r <file>] [-o <output_dir>]
Alternatively, if you have output generated by one of the supported assemblers (Canu, Flye, SPAdes), you can run AGB as follows:
    agb.py -i <input_dir> -a <assembler_name> [-r <file>] [-o <output_dir>]
Options:
-a <assembler_name>
Name of the assembler that produced the input (case-insensitive). Required.
-i <input_dir>
Assembler output directory. AGB will attempt to automatically parse output files of supported assemblers (Canu, Flye, SPAdes).
--graph <path>
File with assembly graph in GFA/FASTG/Graphviz format.
--fasta <path>
File with graph edge sequences in FASTA format. Optional.
-o <output_dir>
Output directory. The default output directory is agb_output. Optional.
-r <path>
Reference genome file. Optional.
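
As an illustration of how the options above combine, here is a minimal Python sketch that assembles an agb.py argument list for either usage form. The helper name build_agb_command is hypothetical and not part of AGB. The checks it performs (exactly one of --graph or -i, and --fasta only together with --graph) simply mirror the two usage lines shown above and are an assumption of this sketch, not documented AGB behavior.

```python
from typing import List, Optional


def build_agb_command(
    assembler: str,
    graph: Optional[str] = None,
    input_dir: Optional[str] = None,
    fasta: Optional[str] = None,
    reference: Optional[str] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Build an argv list for agb.py (hypothetical helper, not shipped with AGB)."""
    # Assumption: the two usage forms are alternatives, so exactly one
    # of --graph and -i is expected.
    if (graph is None) == (input_dir is None):
        raise ValueError("provide exactly one of graph or input_dir")
    # Assumption: --fasta appears only in the --graph usage line.
    if input_dir is not None and fasta is not None:
        raise ValueError("fasta is only listed for the --graph usage form")

    cmd = ["agb.py", "-a", assembler]
    if graph is not None:
        cmd += ["--graph", graph]
        if fasta is not None:
            cmd += ["--fasta", fasta]
    else:
        cmd += ["-i", input_dir]
    if reference is not None:
        cmd += ["-r", reference]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) once AGB is installed and agb.py is on the PATH.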

3. AGB output

If an output path is not specified manually (with -o), AGB writes its output to the agb_output directory.

AGB output contains:

Open viewer.html to see the visualization.

3.1. Assembly Graph Browser



AGB visualizes the assembly graph produced by an assembler, where edges represent genome segments (each genome segment is represented by a forward edge and a reverse-complement edge). The top panel contains control buttons for iterating over connected components and buttons for exporting the graph in SVG and DOT formats. It also contains a switch for choosing between the default, repeat-focused, reference-based, and contig-based modes.

Each edge is labeled with its identifier, length, and read coverage. Unique edges are shown as thin black lines, while repetitive edges are shown as thick colored lines (edge width depends on coverage). All edges within a mosaic repeat are highlighted with the same color. Nodes with zero indegree or outdegree are shown as black circles. Unbalanced nodes, whose incoming and outgoing edges differ in coverage, are highlighted in red.
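
The node styling described above can be sketched in Python as follows. This is only an illustration: the function name classify_nodes, the edge tuple layout, and the 25% relative tolerance for calling a node unbalanced are all assumptions of this sketch, since the exact rule AGB applies is not stated here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str, float]  # (start node, end node, read coverage)


def classify_nodes(edges: Iterable[Edge], tolerance: float = 0.25) -> Dict[str, str]:
    """Label each node as 'terminal', 'unbalanced', or 'balanced'.

    tolerance is a hypothetical relative threshold, not an AGB parameter.
    """
    in_cov = defaultdict(float)
    out_cov = defaultdict(float)
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    nodes = set()
    for start, end, cov in edges:
        nodes.update((start, end))
        out_cov[start] += cov
        out_deg[start] += 1
        in_cov[end] += cov
        in_deg[end] += 1

    labels = {}
    for node in nodes:
        if in_deg[node] == 0 or out_deg[node] == 0:
            # Zero indegree or outdegree: drawn as a black circle.
            labels[node] = "terminal"
            continue
        larger = max(in_cov[node], out_cov[node])
        diff = abs(in_cov[node] - out_cov[node])
        if larger > 0 and diff / larger > tolerance:
            # Incoming and outgoing coverage disagree: drawn in red.
            labels[node] = "unbalanced"
        else:
            labels[node] = "balanced"
    return labels
```

For example, with edges a->b (coverage 30), b->c (30), and c->d (90), nodes a and d are terminal, b is balanced, and c is unbalanced.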

The graph representation can be further modified using Additional options.

3.1.1. Default mode

This mode is useful for assessing the contiguity and complexity of the graph and for finding problematic parts of the assembly.

3.1.2. Repeat-focused mode

This mode is designed for analyzing complex repeat structures. AGB removes all unique edges from the assembly graph, so each remaining connected component forms a mosaic repeat. By default, all edges of a mosaic repeat share one color, and unique edges are colored black. Light green nodes represent the hidden parts of the graph.
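
The idea of dropping unique edges and treating each remaining connected component as a mosaic repeat can be sketched in Python with a small union-find. The function name mosaic_repeats and the edge tuple layout are hypothetical; this is an illustration of the idea under those assumptions, not AGB's own implementation.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (edge id, start node, end node, is_repetitive); this layout is an assumption.
Edge = Tuple[str, str, str, bool]


def mosaic_repeats(edges: Iterable[Edge]) -> List[Set[str]]:
    """Group repetitive edges into connected components (mosaic repeats)."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Remove unique edges; only repetitive edges connect nodes from here on.
    repeat_edges = [e for e in edges if e[3]]
    for _, start, end, _ in repeat_edges:
        parent[find(start)] = find(end)

    components: Dict[str, Set[str]] = {}
    for edge_id, start, _, _ in repeat_edges:
        components.setdefault(find(start), set()).add(edge_id)
    # Sort components by their edge ids for a deterministic order.
    return sorted(components.values(), key=lambda comp: sorted(comp))
```

Each returned set holds the edge identifiers of one mosaic repeat, so all edges in a set would share a color in the viewer.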

Some assemblers (e.g., Flye and Canu) provide information on whether an edge is repetitive. If such information is not available, AGB attempts to classify edges in the assembly graph as unique or repetitive using the following simple criteria. For each edge, AGB estimates its multiplicity by dividing the read coverage of the edge by the median coverage. The multiplicity is set to 1 if this ratio is less than 1.75. An edge is classified as unique if its multiplicity equals 1 and as repetitive otherwise.
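
The coverage-based criterion above can be sketched in Python as follows. Two details are assumptions of this sketch rather than statements about AGB: the median is taken over per-edge coverage values without length weighting, and ratios of 1.75 or more are rounded to the nearest integer to obtain a multiplicity (the text only specifies the case below 1.75). The function name classify_edges is hypothetical.

```python
from statistics import median
from typing import Dict, Tuple


def classify_edges(
    coverage: Dict[str, float], threshold: float = 1.75
) -> Dict[str, Tuple[int, str]]:
    """Map each edge id to (estimated multiplicity, 'unique' or 'repetitive')."""
    # Assumption: plain median over edges, not weighted by edge length.
    med = median(coverage.values())
    if med <= 0:
        raise ValueError("median coverage must be positive")

    result = {}
    for edge_id, cov in coverage.items():
        ratio = cov / med
        if ratio < threshold:
            multiplicity = 1
        else:
            # Assumption: round the ratio; the source does not say how
            # multiplicities above 1 are derived.
            multiplicity = round(ratio)
        label = "unique" if multiplicity == 1 else "repetitive"
        result[edge_id] = (multiplicity, label)
    return result
```

For instance, with coverages of 30, 28, 32, 61, and 95 the median is 32, so the edge with coverage 61 (ratio about 1.91) is classified as repetitive while the first three edges are unique.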

3.1.3. Reference-based mode

If a reference genome is available, AGB runs QUAST-LG to align graph edges and contigs (scaffolds) produced by assemblers to the reference genome and detect assembly errors. This mode provides two additional options for edge coloring: either according to their mappings to the reference (same colors represent same chromosomes), or based on the presence of assembly errors. A subset of edges mapped to each chromosome can be visualized on a separate page. When edges are colored according to the presence of assembly errors detected by QUAST-LG, green edges do not contain errors, red edges belong to the misassembled contigs (but correspond to correct genomic sequences), and dark red edges represented by parallel lines are erroneous themselves.

The alignments of edges to the selected chromosome are displayed at the top. Red blocks contain detected assembly errors, while green blocks were aligned correctly. The alignment of the selected edge is highlighted in dark green. Hovering over an alignment displays brief information about it.

3.1.4. Contig-based mode

If an assembler provides paths in the assembly graph corresponding to the assembled contigs/scaffolds, AGB displays each path separately. Given the reference genome, AGB also shows the number of assembly errors per contig.

3.2. Left panel

The left panel includes the menu with various options, the search bar, and the tables describing various graph elements.

3.2.1. Additional options

3.2.2. Search bar

The search bar lets you search all graph edges, contigs, and reference chromosomes by name and display them.

3.2.3. Tables

AGB displays interactive sortable tables containing information about edges, vertices, contigs, reference chromosomes, and connected components. All tables are affected by the additional options: only edges satisfying the filtering criteria (read length, coverage, or uniqueness) are taken into account.

4. Citation


Alla Mikheenko, Mikhail Kolmogorov,
Assembly Graph Browser: interactive visualization of assembly graphs,
Bioinformatics. doi: 10.1093/bioinformatics/btz072

5. References

  1. https://github.com/magjac/d3-graphviz
  2. Giorgio Gonnella and Stefan Kurtz "GfaPy: a flexible and extensible software library for handling sequence graphs in Python", Bioinformatics (2017) btx398
  3. https://github.com/networkx/networkx-metis
  4. Gansner, E. R. and North, S. C. (2000). An open graph visualization system and its applications to software engineering. Softw., Pract. Exper., 30(11), 1203–1233.
  5. Mikheenko, A. et al. (2018). Versatile genome assembly evaluation with QUAST-LG. Bioinformatics, 34(13), i142–i150.
  6. Karypis, G. and Kumar, V. (1998). Multilevel algorithms for multi-constraint graph partitioning. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing, page 28. IEEE Computer Society.