HiReNET wiki
Before starting this pipeline, whole-genome sequencing short reads should be analyzed with RepeatExplorer TAREAN to generate consensus repeats and repeat variants for each repeat type. RepeatExplorer is available through the public Galaxy server at https://repeatexplorer-elixir.cerit-sc.cz/. Using the outputs from RepeatExplorer TAREAN, you can then select the repeat of interest and generate profile hidden Markov models (pHMMs).
RepeatExplorer TAREAN outputs the sequences of the repeat variants of interest, which are required to generate the pHMMs.
HiReNET getphmm -i data/AthCEN178_consensus_variant.fasta -o phmm -p test
Repeat monomers can be identified using profile hidden Markov models (pHMMs) and then organized into customized bins. A bin size of 10 kb is typically a good starting point.
Non-repeat sequences, such as transposable elements (TEs), are typically sparse within the genome. To minimize noise from these regions, HiReNET focuses exclusively on repeat-enriched intervals (repeat arrays) for downstream analysis. The directory PREFIX_arrayout includes repeat array sequences (FASTA files) and coordinates (BED files).
HiReNET arrayfind -g data/test.fasta -c data/AthCEN178_consensus.fasta -o test_arrayout -p test
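As a quick sanity check on the repeat arrays, the BED coordinates can be summarized with awk. This is a minimal sketch, assuming the BED files are tab-delimited with chromosome, start, and end in the first three columns (the standard BED layout). The function name `summarize_bed` is hypothetical and not part of HiReNET.

```shell
# Hypothetical helper (not part of HiReNET): report the number of intervals
# and the total number of bases covered in a BED file.
# Assumes columns 1-3 are chrom, start, end (0-based, half-open).
summarize_bed() {
    awk 'BEGIN { FS = "\t"; n = 0; total = 0 }
         !/^#/ && NF >= 3 { n++; total += $3 - $2 }
         END { printf "%d arrays, %d bp\n", n, total }' "$1"
}
```

Call it with the path of one of the BED files written under the array output directory.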
The default minimum monomer length is 120 bp. You can change this threshold with the --min-monomer-len option. Use the --chr flag to restrict the analysis to specific chromosomes (e.g., --chr chr1,chr3,chr5). All monomer sequences are written to PREFIX_monomerout/PREFIX_monomers. Each monomer file is named after its coordinates, e.g., chr1_14389767_14389941.fa.
HiReNET monomerfind \
--arrays-dir test_arrayout \
--chrom-dir test_arrayout/split_seq \
--outdir test_monomerout \
--prefix test \
--hmm test_phmm/test.hmm \
--chr chr1,chr3

HiReNET monomerfind \
--arrays-dir test_arrayout \
--chrom-dir test_arrayout/split_seq \
--outdir test_monomerout_chr1 \
--prefix test \
--hmm test_phmm/test.hmm \
--chr chr1

HiReNET monomerfind \
--arrays-dir test_arrayout \
--chrom-dir test_arrayout/split_seq \
--outdir test_monomerout_chr1_2 \
--prefix test \
--hmm test_phmm/test.hmm \
--chr chr1 \
--min-monomer-len 150
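Because each monomer file is named after its coordinates, the name can be split back into chromosome, start, and end. The sketch below is an illustration only: the helper name `parse_monomer_name` is hypothetical, and the length is computed as end minus start, which assumes BED-style half-open coordinates.

```shell
# Hypothetical helper (not part of HiReNET): split a monomer file name such as
# chr1_14389767_14389941.fa into chromosome, start, end and length.
# Fields are taken from the right so chromosome names containing "_" survive.
parse_monomer_name() {
    name=$(basename "$1" .fa)
    end=${name##*_}      # last field
    rest=${name%_*}      # everything before the last "_"
    start=${rest##*_}    # second-to-last field
    chrom=${rest%_*}     # remaining prefix is the chromosome name
    printf '%s\t%s\t%s\t%s\n' "$chrom" "$start" "$end" "$((end - start))"
}
```

For the example name above, this prints chr1, 14389767, 14389941, and a length of 174, tab-separated.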
The default bin size is 10 kb. HiReNET performs network analysis of monomers within these defined bins. Changing the bin size may alter which monomers are grouped together, potentially affecting the detected HOR patterns. It is therefore recommended to keep the default bin size unless there is a specific reason to modify it. Monomer sequences (FASTA) in each bin are located under PREFIX_arrangemonomer_10kb/PREFIX_bin_monomers, named according to their coordinates, e.g., chr1_14389697_14926924_14389767_14399767.fa.
HiReNET arrangemonomer \
--arrays-dir test_arrayout \
--genomic-bed-dir test_monomerout \
--monomer-dir test_monomerout/test_monomers \
--outdir test_arrangemonomer_10kb \
--prefix test \
--bin 10000 \
--chr chr1,chr3

HiReNET arrangemonomer \
--arrays-dir test_arrayout \
--genomic-bed-dir test_monomerout \
--monomer-dir test_monomerout/test_monomers \
--outdir test_arrangemonomer_10kb_chr3 \
--prefix test \
--bin 10000 \
--chr chr3

HiReNET arrangemonomer \
--arrays-dir test_arrayout \
--genomic-bed-dir test_monomerout_chr1 \
--monomer-dir test_monomerout_chr1/test_monomers \
--outdir test_arrangemonomer_10kb_chr1 \
--prefix test \
--bin 10000 \
--chr chr1
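To see how a given bin size distributes monomers, it can help to count the FASTA records in each bin file. The following is a minimal sketch, assuming each bin file has a .fa extension and holds one monomer per FASTA record; the helper name `count_bin_monomers` is hypothetical.

```shell
# Hypothetical helper (not part of HiReNET): for every .fa file in a directory,
# print "<file name><TAB><number of FASTA records>".
count_bin_monomers() {
    for fa in "$1"/*.fa; do
        [ -e "$fa" ] || continue   # skip the literal pattern if no file matches
        printf '%s\t%s\n' "$(basename "$fa")" "$(grep -c '^>' "$fa")"
    done
}
```

For the example above, the directory to pass would be test_arrangemonomer_10kb/test_bin_monomers.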
Part 3: Classify repeat bins into three classes (Order, HOR, Disorder) using the pre-trained LDA model.
Within each bin, monomers are compared in an all-to-all manner using BLAT. The resulting output is processed to calculate the Jaccard index score, which is used to construct a network. The network structure and monomer information are then combined into a feature table, which serves as input for a pre-trained LDA model. Each bin is classified with the LDA model, after which adjacent bins sharing the same class and threshold are merged, and their monomers are rearranged into the merged bins.
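The merging of adjacent bins can be pictured with a short awk sketch. It is not HiReNET's implementation: it assumes a hypothetical tab-delimited table with columns chrom, start, end, class, and threshold, sorted by chromosome and start, and it treats two bins as adjacent when one starts exactly where the previous one ends.

```shell
# Hypothetical helper (not part of HiReNET): merge consecutive bins that share
# chromosome, class and threshold and that touch end-to-start.
# Input columns (assumed): chrom, start, end, class, threshold.
merge_adjacent_bins() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             if (NR > 1 && $1 == chrom && $2 == end && $4 == cls && $5 == thr) {
                 end = $3                      # extend the current merged bin
             } else {
                 if (NR > 1) print chrom, start, end, cls, thr
                 chrom = $1; start = $2; end = $3; cls = $4; thr = $5
             }
         }
         END { if (NR > 0) print chrom, start, end, cls, thr }' "$1"
}
```

Bins that differ in class or threshold, or that are separated by a gap, are left as separate rows.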
In this step, BLAT performs an all-to-all comparison of monomers. When the --consensus option is provided, each monomer is additionally compared against the consensus repeat sequence, allowing assessment of sequence similarity at the chromosome or genome level. The blat_output and blat_con_output outputs contain the raw BLAT results, while blat_output_sub and blat_con_output_sub contain processed results restricted to the information needed to calculate the Jaccard index score.
HiReNET comparemonomer \
--bins-dir test_arrangemonomer_10kb/test_bin_monomers \
--outdir test_comparemonomers \
--consensus data/AthCEN178_consensus.fasta
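To make the Jaccard index concrete, the sketch below derives a Jaccard-style score from raw BLAT PSL output. The formula used here, matches / (qSize + tSize - matches), is an assumption for illustration and may differ from what HiReNET computes internally; the helper name `psl_jaccard` is hypothetical. Column positions follow the standard PSL layout (matches in column 1, qName in 10, qSize in 11, tName in 14, tSize in 15).

```shell
# Hypothetical helper (not part of HiReNET): print query, target and a
# Jaccard-style score for each alignment in a PSL file.
# Assumed formula: matches / (qSize + tSize - matches).
# PSL header lines are skipped because their first field is not an integer.
psl_jaccard() {
    awk 'BEGIN { FS = "\t" }
         $1 ~ /^[0-9]+$/ && NF >= 17 {
             union = $11 + $15 - $1
             if (union > 0) printf "%s\t%s\t%.4f\n", $10, $14, $1 / union
         }' "$1"
}
```

Two 178 bp monomers sharing 170 matching bases would score 170 / 186, about 0.914, under this assumed definition.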
The R packages listed in the Dependencies section must be installed before running this step. Adding the --plot flag will generate a network plot for each HOR bin.
HiReNET classprediction \
--blatsub test_comparemonomers/blat_output_sub \
--outdir test_classpred_out \
--prefix test \
--bin 10000 \
--plot

HiReNET classprediction \
--blatsub test_comparemonomers/blat_output_sub \
--outdir test_classpred_out_noplot \
--prefix test \
--bin 10000
For each merged bin, the all-to-all monomer comparison is repeated and the network is rebuilt using the optimal similarity threshold. Monomers are then grouped based on network communities, and higher-order repeat (HOR) patterns are identified within each merged bin.
The file test_fin_bins_combined.txt contains all information for each merged HOR bin.
HiReNET rearrangemonomers \
--bins test_classpred_out/test_fin_bins_combined.txt \
--class HOR \
--prefix test \
--monomer-dir test_monomerout/test_monomers \
--outdir test_rearrange_monomers_mergebin_chr1 \
--chr chr1
HiReNET comparemonomer \
--bins-dir test_rearrange_monomers_mergebin/re_arrange_monomers \
--outdir test_compare_rearrangemonomers
HiReNET networkHOR \
--blatsub test_compare_rearrangemonomers/blat_output_sub \
--bins test_classpred_out/test_fin_bins_combined.txt \
--coor test_rearrange_monomers_mergebin/test_monomer_bed_inbin.txt \
--outdir test_network_HOR_mergebin
In each merged HOR bin, monomers with the same label are extracted to generate consensus HOR monomers. Consensus HOR monomers that share the same threshold are combined, and these are compared in an all-to-all manner across thresholds ranging from 0.90 to 0.99. Monomers are then relabeled, and shared HOR patterns are identified for each threshold.
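The effect of sweeping the similarity threshold from 0.90 to 0.99 can be illustrated by counting how many pairwise comparisons would remain as network edges at each cutoff. This is only a sketch under stated assumptions: the input is a hypothetical tab-delimited table of nameA, nameB, and score, and the helper name `edges_per_threshold` is not part of HiReNET.

```shell
# Hypothetical helper (not part of HiReNET): for each threshold from 0.90 to
# 0.99, count the pairs whose score is at or above that threshold.
# Input columns (assumed): nameA, nameB, score.
# Thresholds are handled as integer percentages to avoid floating-point drift.
edges_per_threshold() {
    awk 'BEGIN { FS = "\t" }
         { for (t = 90; t <= 99; t++) if ($3 * 100 >= t - 1e-9) count[t]++ }
         END { for (t = 90; t <= 99; t++) printf "0.%d\t%d\n", t, count[t] + 0 }' "$1"
}
```

A steep drop in edge count between two neighbouring thresholds indicates that many pairs have similarity close to that cutoff.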
HiReNET arrangeHORmonomer \
--groupdir test_network_HOR_mergebin \
--monomer-dir test_monomerout/test_monomers \
--outdir test_network_mergebin_consensus
HiReNET consensusHORmonomer \
--outdir test_network_mergebin_consensus \
--threads 10 \
--chroms chr1
HiReNET compareConsensus \
--chr chr1 \
--consensdir test_network_mergebin_consensus/all_recluster_consensus_monomer \
--outdir test_compare_consensusHOR_chr1
HiReNET sharedHOR \
--chr chr1 \
--datadir test_compare_consensusHOR_chr1/blat_sub \
--outdir test_shared_out_chr1 \
--letter test_network_HOR_mergebin/mergebin_string_outputs \
--plotv V2
HiReNET sharedHOR \
--chr chr1 \
--datadir test_compare_consensusHOR_chr1/blat_sub \
--outdir test_shared_out_chr1 \
--letter test_network_HOR_mergebin/mergebin_string_outputs \
--plotv V1
HiReNET sharedHOR \
--chr chr1 \
--datadir test_compare_consensusHOR_chr1/blat_sub \
--outdir test_shared_out_chr1 \
--letter test_network_HOR_mergebin/mergebin_string_outputs \
--plotv V3
# Use loop to build consensus HORs per chromosome
for chr in chr1 chr3; do
HiReNET consensusHORmonomer \
--outdir test_network_mergebin_consensus \
--threads 10 \
--chroms "$chr"
HiReNET compareConsensus \
--chr "$chr" \
--consensdir test_network_mergebin_consensus/all_recluster_consensus_monomer \
--outdir test_compare_consensusHOR_${chr}
HiReNET sharedHOR \
--chr "$chr" \
--datadir test_compare_consensusHOR_${chr}/blat_sub \
--outdir test_shared_out_${chr}_2 \
--letter test_network_HOR_mergebin/mergebin_string_outputs \
--plotv V2
done