exonize_analysis
module tutorial¶
This guide will walk you through the basics of using the exonize_analysis
module for parsing and analyzing the output data generated by exonize
.
Example: Human Y chromosome¶
Data representation¶
Step 1: Import the exonize_analysis
module
from exonize.exonize_analysis import ExpansionsContainer
For this example, we’ll use the results database generated for the Y chromosome in the usage example.
db_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_results.db'
ExpansionsContainer
object
Initialize the ExpansionsContainer
object using the specified database. This class is particularly useful for performing specific gene lookups and accessing their expansions. Its attributes consist of Gene
objects composed by Expansion
graphs.
ychrom_expansions = ExpansionsContainer(exonize_db_path=db_path)
Now let’s check the number of genes identified with duplication events:
len(ychrom_expansions)
15
Step 4: List genes with exon duplications
Display some of the gene IDs where exon duplications were found:
print(ychrom_expansions.genes[:4])
['gene:ENSG00000292357', 'gene:ENSG00000165246', 'gene:ENSG00000012817', 'gene:ENSG00000244395']
Step 5: Examine the DAZ1 gene
Let’s take a closer look at the DAZ1 gene (Ensembl ID: gene:ENSG00000205916).
ychrom_expansions['gene:ENSG00000205916']
<Gene gene:ENSG00000205916 with 6 expansions (iterable of expansion graphs)>
Gene
object is an iterable of Expansion
objects, where each Expansion
is a NetworkX graph structure encoding the expansions. You can find more details about the attributes of these graph structures in the NetworkX documentation.
DAZ1 has 6 expansions:
len(ychrom_expansions['gene:ENSG00000205916'])
6
Step 6: Inspect the largest expansion
expansion_id = max(ychrom_expansions['gene:ENSG00000205916'], key=len).id
Check the size:
len(ychrom_expansions['gene:ENSG00000205916'][expansion_id])
18
This output indicates that DAZ1 contains 18 copies of an exon. Now, let’s look at the mode of these duplications:
mode = [attrib.get('mode') for node, attrib in ychrom_expansions['gene:ENSG00000205916'][expansion_id].nodes(data=True)]
mode_counts = {t: mode.count(t) for t in set(mode)}
mode_counts
{'FULL': 15, 'INTRONIC': 3}
Extracting sequence data¶
Step 1: Extract the sequence of the largest expansion
If you haven't already, we need to parse the gene hierarchy dictionary which can be found in the Exonize output directory.
ychrom_expansions.parse_gene_hierarchy_dictionary(
gene_hierarchy_dictionary_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_gene_hierarchy.pkl'
)
ychrom_expansions.read_genome(
file_path='Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz'
)
write_expansion_sequences
method of the Gene
object to write event sequences in FASTA format for a specified expansion. Set the full_expansion
parameter to True
if you wish to include only full events. If the output_path
parameter is not specified, the file will be saved in the current directory with the filename <gene_id>_expansion_<expansion_id>.fa
.
ychrom_expansions['gene:ENSG00000205916'].write_expansion_sequences(
expansion_id=expansion_id,
full_expansion=True
)
The head of the file should look something like this.
>(24886148,24886220)
GCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24859665,24859737)
CCATTTCCTGCTTATCCAAGATCACCATTTCAGGTCACTGCTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24895687,24895759)
GCATTTCCTGCTTATCCAAATTCAGCAGTTCAGGTCACCACTGGATATCAGTTCCATGTA
TACAATTACCAG
>(24898107,24898176)
TATCCTACTTATCCAAATTCACCAGGTCAGGTCACCACTGGGTGTCAGTTGCCTGTATGT
AATTATCAG
>(24873975,24874047)
GCATTTCCTGCTTATCCAAATTCACCAGTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TACAATTATCAG
>(24893302,24893374)
GCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACCACTGGATATCAGTTGCCTGTA
TATAATTATCAG
>(24864445,24864517)
CCATTTCCTGCTTATCCAAGTTCACCATTTCAGGTCACTGCTGGATATCAGTTGCCTGTA
TATAATTATCAG
Plotting¶
Step 1: Visualize the expansion
Now, let’s visualize the expansion graph for the DAZ1 gene. Nodes represent coordinates and edges indicate matches found between them:
ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(expansion_id=expansion_id)
Step 2: Visualize the gene structure
The draw_gene_structure
method uses the dna_features_viewer library to illustrate the positions of expansion events within the gene, along with the coding exons it comprises.
First, we need to parse the gene hierarchy dictionary, which can be found in the Exonize output directory. This step is necessary, since information about the arrangement of the exons within the genes is not stored in the output database.
ychrom_expansions.parse_gene_hierarchy_dictionary(
gene_hierarchy_dictionary_path='Homo_sapiens_chrom_Y_exonize/Homo_sapiens_chrom_Y_gene_hierarchy.pkl'
)
ychrom_expansions['gene:ENSG00000205916'].draw_gene_structure(expansion_id=expansion_id)
The dark bars indicate the locations of the gene's coding sequences, while the light bars highlight the locations of the expansion events, colored according to the event mode.
Step 3: Visualize the full expansion
To visualize the full expansion, and matches between tandem exon pairs set the full_expansion
and color_tandem_pair_edges
parameters to True
.
ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(
expansion_id=expansion_id,
full_expansion=True,
color_tandem_pair_edges=True
)
We can see that the expansion graph is composed of 15 tandemly duplicated exons.
Step 4: Visualize the match graph for DAZ1
You can also display all expansions associated with DAZ1 by ommiting the expansion_id
parameter:
ychrom_expansions['gene:ENSG00000205916'].draw_expansions_multigraph(
full_expansion=True,
color_tandem_pair_edges=True
)
We can see that all 6 events found in step 5 refer to full exon duplications, where only the first expansion is found to be composed by tandem pairs.