exonize_analysis module tutorial¶
This guide will walk you through the basics of using the exonize_analysis module for parsing and analyzing the output data generated by exonize.
Example: Human Y chromosome¶
Data representation¶
Step 1: Import the exonize_analysis module
from exonize.exonize_analysis import ExpansionsContainer
For this example, we’ll use the results database generated for the Y chromosome in the usage example.
db_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_results.db'
ExpansionsContainer object
Initialize the ExpansionsContainer object using the specified database. This class is particularly useful for performing specific gene lookups and accessing their expansions. Its attributes consist of Gene objects composed by Expansion graphs.
ychrom_expansions = ExpansionsContainer(exonize_db_path=db_path)
Now let’s check the number of genes identified with duplication events:
len(ychrom_expansions)
15
Step 4: List genes with exon duplications
Display some of the gene IDs where exon duplications were found:
print(ychrom_expansions.genes[:4])
['gene:ENSG00000292357', 'gene:ENSG00000165246', 'gene:ENSG00000012817', 'gene:ENSG00000244395']
Step 5: Examine the DAZ1 gene
Let’s take a closer look at the DAZ1 gene (Ensembl ID: gene:ENSG00000205916).
ychrom_expansions['gene:ENSG00000205916']
<Gene gene:ENSG00000205916 with 6 expansions (iterable of expansion graphs)>
Gene object is an iterable of Expansion objects, where each Expansion is a NetworkX graph structure encoding the expansions. You can find more details about the attributes of these graph structures in the NetworkX documentation.
DAZ1 has 6 expansions:
len(ychrom_expansions['gene:ENSG00000205916'])
6
Step 6: Inspect the largest expansion
expansion_id = max(ychrom_expansions['gene:ENSG00000205916'], key=len).id
Check the size:
len(ychrom_expansions['gene:ENSG00000205916'][expansion_id])
18
This output indicates that DAZ1 contains 18 copies of an exon. Now, let’s look at the mode of these duplications:
mode = [attrib.get('mode') for node, attrib in ychrom_expansions['gene:ENSG00000205916'][expansion_id].nodes(data=True)]
mode_counts = {t: mode.count(t) for t in set(mode)}
mode_counts
{'FULL': 15, 'INTRONIC': 3}
Extracting sequence data¶
Step 1: Extract the sequence of the largest expansion
If you haven't already, we need to parse the gene hierarchy dictionary which can be found in the Exonize output directory.
ychrom_expansions.parse_gene_hierarchy_dictionary(
gene_hierarchy_dictionary_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_gene_hierarchy.pkl'
)
ychrom_expansions.read_genome(
file_path='Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz'
)
write_expansion_sequences method of the Gene object to write event sequences in FASTA format for a specified expansion. Set the full_expansion parameter to True if you wish to include only full events. If the output_path parameter is not specified, the file will be saved in the current directory with the filename <gene_id>_expansion_<expansion_id>.fa.
ychrom_expansions['gene:ENSG00000205916'].write_expansion_sequences(
expansion_id=expansion_id,
full_expansion=True
)
The head of the file should look something like this.
>(24886148,24886220)
GCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24859665,24859737)
CCATTTCCTGCTTATCCAAGATCACCATTTCAGGTCACTGCTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24895687,24895759)
GCATTTCCTGCTTATCCAAATTCAGCAGTTCAGGTCACCACTGGATATCAGTTCCATGTA
TACAATTACCAG
>(24898107,24898176)
TATCCTACTTATCCAAATTCACCAGGTCAGGTCACCACTGGGTGTCAGTTGCCTGTATGT
AATTATCAG
>(24873975,24874047)
GCATTTCCTGCTTATCCAAATTCACCAGTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TACAATTATCAG
>(24893302,24893374)
GCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24864445,24864517)
CCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACTGCTGGATATCAGTTGCCTGTA
TATAATTATCAG
Plotting¶
Step 1: Visualize the expansion
Now, let’s visualize the expansion graph for the DAZ1 gene. Nodes represent coordinates and edges indicate matches found between them:
ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(expansion_id=expansion_id)
Step 2: Visualize the gene structure
The draw_gene_structure method uses the dna_features_viewer library to illustrate the positions of expansion events within the gene, along with the coding exons it comprises.
First, we need to parse the gene hierarchy dictionary, which can be found in the Exonize output directory. This step is necessary, since information about the arrangement of the exons within the genes is not stored in the output database.
ychrom_expansions.parse_gene_hierarchy_dictionary(
gene_hierarchy_dictionary_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_gene_hierarchy.pkl'
)
ychrom_expansions['gene:ENSG00000205916'].draw_gene_structure(expansion_id=expansion_id)
The dark bars indicate the locations of the gene's coding sequences, while the light bars highlight the locations of the expansion events, colored according to the event mode.
Step 3: Visualize the full expansion
To visualize the full expansion, and matches between tandem exon pairs set the full_expansion and color_tandem_pair_edges parameters to True.
ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(
expansion_id=expansion_id,
full_expansion=True,
color_tandem_pair_edges=True
)
We can see that the expansion graph is composed of 15 tandemly duplicated exons.
Step 4: Visualize the match graph for DAZ1
You can also display all expansions associated with DAZ1 by ommiting the expansion_id parameter:
ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(
full_expansion=True,
color_tandem_pair_edges=True
)
We can see that all 6 events found in step 5 refer to full exon duplications, where only the first expansion is found to be composed by tandem pairs.