exonize_analysis module tutorial

This guide will walk you through the basics of using the exonize_analysis module for parsing and analyzing the output data generated by exonize.

Example: Human Y chromosome

Data representation

Step 1: Import the exonize_analysis module

from exonize.exonize_analysis import ExpansionsContainer
Step 2: Set the path to the exonize results database

For this example, we’ll use the results database generated for the Y chromosome in the usage example.

db_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_results.db'
Step 3: Create an ExpansionsContainer object

Initialize the ExpansionsContainer object using the specified database. This class is particularly useful for performing specific gene lookups and accessing their expansions. Its attributes consist of Gene objects composed by Expansion graphs.

ychrom_expansions = ExpansionsContainer(exonize_db_path=db_path)

Now let’s check the number of genes identified with duplication events:

len(ychrom_expansions)
15

Step 4: List genes with exon duplications

Display some of the gene IDs where exon duplications were found:

print(ychrom_expansions.genes[:4])
['gene:ENSG00000292357', 'gene:ENSG00000165246', 'gene:ENSG00000012817', 'gene:ENSG00000244395']

Step 5: Examine the DAZ1 gene

Let’s take a closer look at the DAZ1 gene (Ensembl ID: gene:ENSG00000205916).

ychrom_expansions['gene:ENSG00000205916']
<Gene gene:ENSG00000205916 with 6 expansions (iterable of expansion graphs)>
Note that the Gene object is an iterable of Expansion objects, where each Expansion is a NetworkX graph structure encoding the expansions. You can find more details about the attributes of these graph structures in the NetworkX documentation.

DAZ1 has 6 expansions:

len(ychrom_expansions['gene:ENSG00000205916'])
6

Step 6: Inspect the largest expansion

expansion_id = max(ychrom_expansions['gene:ENSG00000205916'], key=len).id

Check the size:

len(ychrom_expansions['gene:ENSG00000205916'][expansion_id])
18

This output indicates that DAZ1 contains 18 copies of an exon. Now, let’s look at the mode of these duplications:

mode = [attrib.get('mode') for node, attrib in ychrom_expansions['gene:ENSG00000205916'][expansion_id].nodes(data=True)]
mode_counts = {t: mode.count(t) for t in set(mode)}
mode_counts
{'FULL': 15, 'INTRONIC': 3}
This tells us that there are matches between 15 exons and 3 intronic regions.

Extracting sequence data

Step 1: Extract the sequence of the largest expansion

If you haven't already, we need to parse the gene hierarchy dictionary which can be found in the Exonize output directory.

ychrom_expansions.parse_gene_hierarchy_dictionary(
    gene_hierarchy_dictionary_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_gene_hierarchy.pkl'
)
Now we need to parse the sequence data used for running exonize.

ychrom_expansions.read_genome(
    file_path='Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz'
)
Use the write_expansion_sequences method of the Gene object to write event sequences in FASTA format for a specified expansion. Set the full_expansion parameter to True if you wish to include only full events. If the output_path parameter is not specified, the file will be saved in the current directory with the filename <gene_id>_expansion_<expansion_id>.fa.

ychrom_expansions['gene:ENSG00000205916'].write_expansion_sequences(
    expansion_id=expansion_id,
    full_expansion=True
)

The head of the file should look something like this.

>(24886148,24886220)
GCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24859665,24859737)
CCATTTCCTGCTTATCCAAGATCACCATTTCAGGTCACTGCTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24895687,24895759)
GCATTTCCTGCTTATCCAAATTCAGCAGTTCAGGTCACCACTGGATATCAGTTCCATGTA
TACAATTACCAG
>(24898107,24898176)
TATCCTACTTATCCAAATTCACCAGGTCAGGTCACCACTGGGTGTCAGTTGCCTGTATGT
AATTATCAG
>(24873975,24874047)
GCATTTCCTGCTTATCCAAATTCACCAGTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TACAATTATCAG
>(24893302,24893374)
GCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24864445,24864517)
CCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACTGCTGGATATCAGTTGCCTGTA
TATAATTATCAG

Plotting

Step 1: Visualize the expansion

Now, let’s visualize the expansion graph for the DAZ1 gene. Nodes represent coordinates and edges indicate matches found between them:

ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(expansion_id=expansion_id)

Step 2: Visualize the gene structure

The draw_gene_structure method uses the dna_features_viewer library to illustrate the positions of expansion events within the gene, along with the coding exons it comprises.

First, we need to parse the gene hierarchy dictionary, which can be found in the Exonize output directory. This step is necessary, since information about the arrangement of the exons within the genes is not stored in the output database.

ychrom_expansions.parse_gene_hierarchy_dictionary(
    gene_hierarchy_dictionary_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_gene_hierarchy.pkl'
)
Let's now plot the gene's features:

ychrom_expansions['gene:ENSG00000205916'].draw_gene_structure(expansion_id=expansion_id)

The dark bars indicate the locations of the gene's coding sequences, while the light bars highlight the locations of the expansion events, colored according to the event mode.

Step 3: Visualize the full expansion

To visualize the full expansion, and matches between tandem exon pairs set the full_expansion and color_tandem_pair_edges parameters to True.

ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(
    expansion_id=expansion_id,
    full_expansion=True,
    color_tandem_pair_edges=True
)

We can see that the expansion graph is composed of 15 tandemly duplicated exons.

Step 4: Visualize the match graph for DAZ1

You can also display all expansions associated with DAZ1 by ommiting the expansion_id parameter:

ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(
    full_expansion=True,
    color_tandem_pair_edges=True
)

We can see that all 6 events found in step 5 refer to full exon duplications, where only the first expansion is found to be composed by tandem pairs.