Introduction

G-quadruplexes (G4s) are non-canonical secondary structures that usually form in G-rich regions (Varshney et al. 2020). Bioinformatic studies have revealed the preferential enrichment of G4s in human functional regions, including promoters, telomeres, and enhancers (Huppert and Balasubramanian 2007; Phan and Mergny 2002; Zhang et al. 2024). Notably, G4s arise from specific sequence motifs that enable their structure formation. These motifs can be disrupted by single nucleotide variants (SNVs) — the most common genetic variants.

G4SNVHunter is designed to evaluate the impact of SNVs on G4 formation, but it can also assess insertions and deletions (indels) and other short nucleotide variants. This package leverages one of the most popular G4 structure prediction algorithms, G4Hunter (Bedrat, Lacroix, and Mergny 2016), conceptualized by Mergny et al. (Bedrat, Lacroix, and Mergny 2016), to evaluate the propensity of G4 formation before and after the introduction of variants. Users can then design ‘wet’ experiments based on G4SNVHunter output to explore and validate how variants affect biological functions through their impact on G4 structures.

Installation

As the first step, you need to install G4SNVHunter, which can be done with a simple command, please ensure that your network connection is stable.

# Currently, install from github is recommended, 
# as it provides the latest version.

devtools::install_github("rongxinzh/G4SNVHunter", dependencies = TRUE)

NOTE: alternatively, G4SNVHunter can be installed from Bioconductor. However, we currently do not recommend using the Bioconductor release version, as significant updates have been made that will not be included until the second half of 2025.

Then load G4SNVHunter to access its functions.

library(G4SNVHunter)

Load Tutorial Packages

During this tutorial, we need to use a few additional packages. Since we specified dependencies = TRUE when installing G4SNVHunter package, these additional packages have already been installed.

We can load them directly.

library(BSgenome.Hsapiens.UCSC.hg19)

library(GenomicRanges)

library(DT)

library(rtracklayer)

library(dplyr)

Part 1: A Quick Start

The first step is to load your variant file into a GRanges object.

# Example VCF file; both VCF and MAF are supported
vcf_file <- system.file("extdata", "example_variants_chr16.vcf", package = "G4SNVHunter")
# Load VCF; file_type must be specified
variants <- loadVariant(vcf_file, file_type = "vcf")

## VCF loading summary:
##     - Initial variants: 5000
##     - Filtered out by type: 0
##     - Final variants retained: 5000

seqlevels(variants) <- paste0("chr", seqlevels(variants))

Then, load your sequence file into a DNAStringSet object.

# Sequence file; chromosome 16 (hg19) as an example
hg19 <- BSgenome.Hsapiens.UCSC.hg19
chr16_seq <- DNAStringSet(hg19$chr16)
# Chromosome names are needed for analysis
names(chr16_seq) <- "chr16"

Next step is to detect G4s based on your sequence object.

# Predict G4s
G4_detected <- G4HunterDetect(chr16_seq)

## Start processing...

## Now process: chr16.

## Merging predicted G4s...

## Done!

Afterwards, we can calculate the impact of variants on G4 formation.

# Predict variants impact on G4 formation
result <- G4VarImpact(G4 = G4_detected, 
                      variants = variants, 
                      ref_col = "REF",
                      alt_col = "ALT")

## Start processing...

## Processing finished!

Since most of the variants may have limited impact on G4 formation, we should filter the prediction results.

# Filter for those with substantial changes in structural formation 
filtered_mutG4s <- filterVarImpact(result, 
                                   mut_score_threshold = 1.2,
                                   score_diff_threshold = -0.35)
print(filtered_mutG4s)

## GRanges object with 4 ranges and 13 metadata columns:
##       seqnames            ranges strand | G4.info.score G4.info.max_score
##          <Rle>         <IRanges>  <Rle> |     <numeric>         <numeric>
##   [1]    chr16 57500155-57500179      + |          1.56              1.56
##   [2]    chr16 28507628-28507653      - |         -1.58             -1.56
##   [3]    chr16 47215047-47215069      - |         -1.70             -1.52
##   [4]    chr16 55153357-55153382      - |         -1.62             -1.64
##             G4.info.sequence variant.info.ranges.start variant.info.ranges.end
##                  <character>                 <integer>               <integer>
##   [1] GGGTGGAATGTGAGGTTTGG..                  57500178                57500178
##   [2] CCCTTTCCCCAAACAGCCCC..                  28507644                28507644
##   [3] CGCCGCTGCCCCACCGCCCC..                  47215063                47215063
##   [4] CACCCCGCGCTCACACCCCC..                  55153360                55153360
##       variant.info.ranges.width variant.info.ID variant.info.REF variant.info.ALT
##                       <integer>     <character>      <character>      <character>
##   [1]                         1      hsa4564360                G                A
##   [2]                         1      hsa4476984                C                G
##   [3]                         1      hsa4516572                C                G
##   [4]                         1      hsa4551412                C                A
##               mutated.G4.seq    mutated.G4.anno.seq mutated.max_score score.diff
##                  <character>            <character>         <numeric>  <numeric>
##   [1] GGGTGGAATGTGAGGTTTGG.. GGGTGGAATGTGAGGTTTGG..              1.12      -0.44
##   [2] CCCTTTCCCCAAACAGGCCC.. CCCTTTCCCCAAACAG[C>G..             -1.16      -0.40
##   [3] CGCCGCTGCCCCACCGGCCC.. CGCCGCTGCCCCACCG[C>G..             -1.16      -0.36
##   [4] CACACCGCGCTCACACCCCC.. CAC[C>A]CCGCGCTCACAC..             -1.20      -0.44
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Perfectly, we have now obtained the potential variants that can disrupt G4 formation.

Let’s visualize an example. Since this G4 is on the minus strand, we specify the keep_gstrand as TRUE to ensure that the displayed sequence corresponds to the G-rich strand.

plotImpactedG4(filtered_mutG4s[3], keep_gstrand = TRUE)

Part 2: Advanced Tutorial

Input Data

To run G4SNVHunter, you need to provide two inputs:

the genomic sequences (a DNAStringSet object)
the set of variants (a GRanges object)

While the input formats for these data are quite flexible, they must ultimately be converted into the appropriate formats: DNAStringSet for sequences and GRanges for variants.

The initial design of G4SNVHunter only focused on SNVs; however, starting from version 1.1.0, it also supports indels, MNVs, and other simple variants.

Detailed descriptions on how to prepare these inputs are provided below.

Genomic sequences

The genomic sequences refer to the chromosome sequences or fragments where the variants are located. These can be entire chromosomes or large segments extracted from them. G4SNVHunter will predict G4 structures from the provided sequences and assess whether these variants can affect their formation.

Please note that the variant coordinates must be relative to the genomic sequence you provide and should be in 1-based coordinates.

The genomic sequences must be formatted as a DNAStringSet object. For convenience, we provide the built-in loadSequence function to facilitate this process. Alternatively, Alternatively, you may load a custom sequence file and convert it into a DNAStringSet object without using the loadSequence function.

The loadSequence function accepts three input types:

A two-column data.frame, with the first column containing the sequence identifiers and the second column containing their corresponding sequences. This should be specified using the genome_seq parameter.
The path to a stored FASTA file. The FASTA file must have a .fa, .fna, or .fasta extension. This should be specified using the seq_path parameter.
A text file (.txt) that stores the sequence identifiers and their corresponding sequences. The first column should contain the sequence identifiers, and the second column should contain the sequences. This should also be specified using the seq_path parameter.
Note: Please do not include column names in the file!

Here are some examples for you to load sequences into a DNAStringSet object using loadSequence function:

Load from a data.frame object

seq_df <- data.frame(chr = c("seq1", "seq2"),
                     sequence = c(paste0(rep("G", 100), collapse = ""), 
                                  paste0(rep("A", 100), collapse = "")))
seq <- loadSequence(genome_seq = seq_df)

Load from a fasta file

# File path to the sequences in fasta format
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
seq <- loadSequence(seq_path = fa_path)

Load from a txt file

# File path to the sequences in txt format
txt_path <- system.file("extdata", "seq.txt", package = "G4SNVHunter")
seq <- loadSequence(seq_path = txt_path)

We can also retrieve genome sequences from Bioconductor Annotation Packages. While this is convenient, it requires you to install some related packages in advance. For example:

# Load sequence for chromosome 21 (hg19)
hg19 <- BSgenome.Hsapiens.UCSC.hg19
chr21_seq <- DNAStringSet(hg19$chr21)
# Chromosome names are needed for analysis
names(chr21_seq) <- "chr21"

Variant data

The variant data needs to be processed into a GRanges object. The loadVariant function can be used to load the VCF or MAF file into a GRanges object.

# File path to the VCF file; file_type must be specified
vcf_file <- system.file("extdata", "example_variants_chr16.vcf", package = "G4SNVHunter")
# Load variants from the VCF file
var_vcf <- loadVariant(vcf_file, file_type = "vcf")
seqlevels(var_vcf) <- paste0("chr", seqlevels(var_vcf))
print(var_vcf)

## GRanges object with 5000 ranges and 3 metadata columns:
##              seqnames    ranges strand |          ID            REF                ALT
##                 <Rle> <IRanges>  <Rle> | <character> <DNAStringSet> <DNAStringSetList>
##   hsa4580922    chr16  60712602      * |  hsa4580922              T                  C
##   hsa4511691    chr16  35270160      * |  hsa4511691              C                G,A
##   hsa4420479    chr16  15115434      * |  hsa4420479              C                  T
##   hsa4402848    chr16  11905849      * |  hsa4402848              G                  A
##   hsa4567387    chr16  58072153      * |  hsa4567387              C                  T
##          ...      ...       ...    ... .         ...            ...                ...
##   hsa4445679    chr16  21313291      * |  hsa4445679              C                  A
##   hsa4462089    chr16  25367052      * |  hsa4462089              C                  T
##   hsa4679986    chr16  80173446      * |  hsa4679986              C                  A
##   hsa4695174    chr16  82172241      * |  hsa4695174              C                  T
##   hsa4523500    chr16  49187596      * |  hsa4523500              C                  T
##   -------
##   seqinfo: 1 sequence from an unspecified genome

Another example to load a MAF file.

# File path to the MAF file; file_type must be specified
maf_file <- system.file("extdata", "example_variants_chr16.maf", package = "G4SNVHunter")
# Load variants from the MAF file
var_maf <- loadVariant(maf_file, file_type = "maf")
# To simplify the dataset, we can select only the columns we needed for our analysis.
var_maf <- subset(var_maf, select = c("Reference_Allele", "Tumor_Seq_Allele2"))
seqlevels(var_maf) <- paste0("chr", seqlevels(var_maf))
print(var_maf)

## GRanges object with 13 ranges and 2 metadata columns:
##        seqnames    ranges strand | Reference_Allele Tumor_Seq_Allele2
##           <Rle> <IRanges>  <Rle> |      <character>       <character>
##    [1]    chr16   3299712      * |                A                 G
##    [2]    chr16  30004597      * |                C                 T
##    [3]    chr16  31093071      * |                G                 T
##    [4]    chr16  57918254      * |                G                 T
##    [5]    chr16    618163      * |                C                 G
##    ...      ...       ...    ... .              ...               ...
##    [9]    chr16  47732491      * |                G                 A
##   [10]    chr16  83994365      * |                C                 T
##   [11]    chr16   3107541      * |                C                 A
##   [12]    chr16  29818528      * |                C                 A
##   [13]    chr16   1308058      * |                A                 C
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

If you have custom variant data, you can easily convert it to a GRanges object yourself. For example,

# Path to your custom variant file
snp_path <- system.file("extdata", "snp.txt", package = "G4SNVHunter")
# Load your variants into memory
snp <- read.table(snp_path, sep = "\t", header = FALSE)
# Convert them to GRanges
snp <- GRanges(seqnames = snp$V1,
               ranges = IRanges(start = snp$V2, width = 1),
               rsid = snp$V3,
               ref = snp$V4,
               alt = snp$V5)
print(snp)

## GRanges object with 15 ranges and 3 metadata columns:
##        seqnames    ranges strand |         rsid         ref         alt
##           <Rle> <IRanges>  <Rle> |  <character> <character> <character>
##    [1]    chr21  15397335      * | rs1244782663           G           A
##    [2]    chr21  39931215      * |  rs910271214           G           A
##    [3]    chr21  22915361      * | rs1210166884           A           G
##    [4]    chr21  42644592      * |  rs554035836           A           G
##    [5]    chr21  10743898      * |  rs955290928           A           G
##    ...      ...       ...    ... .          ...         ...         ...
##   [11]    chr21  30836340      * | rs1212424980           T           C
##   [12]    chr21  36426864      * | rs1478442431           G           T
##   [13]    chr21  27584238      * | rs1191559536           T           C
##   [14]    chr21  36106373      * |  rs561521189           C           T
##   [15]    chr21  21840835      * | rs1301205088           T           C
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Predict G4s

Leverage G4Hunter for G4 detection

You can directly predict G4s from a DNAStringSet object using the G4HunterDetect function. This function is based on the G4Hunter algorithm.

For example,

# Sequence file in fasta file format 
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
# Load sequences
seq <- loadSequence(seq_path = fa_path)

# Predict G4s
G4_detected <- G4HunterDetect(seq)

## Start processing...

## Processing seq No.4/5 [==================>-----]  80% in  0sProcessing seq No.5/5 [========================] 100% in  0s
## Merging predicted G4s...
## Done!

Then, we can examine the prediction results, which are stored in a GRanges object.

You can use functions like print to directly view the results. However, in this instance, we’ll leverage the datatable function to view the predicted G4s, as this method provides a more user-friendly display interface.

Let’s take a look at the G4s,

datatable(as.data.frame(G4_detected), options = list(scrollX = TRUE))

This function will return a GRanges object containing all potential G4s predicted by the G4Hunter algorithms. Fow each row, it mainly includes:

The sequence identifier where the G4 is located (seqnames),
The position of the G4 relative to the sequence you provided (ranges, 1-based coordinate),
The strand on which the G4 is located (strand), with + indicating the G4 is on the positive strand, and - indicating it is on the negative strand,
The G4Hunter score for the entire G4 sequence (score), this G4 may represent a merged sequence of multiple overlapping G4s.
The maximum G4Hunter score of the windows covering that G4 during the execution of the G4Hunter algorithm (max_score, used to determine if the G4 surpasses the threshold),
The predicted G4 sequence on the positive strand (sequence, with a C-rich sequence means G4 on the reverse strand).

Please note that score represents the overall score of the entire G4 sequence, which may include contributions from multiple overlapping G4s. In contrast, max_score indicates the highest score within this region, calculated by sliding a window across the sequence. Starting with version 1.1.0, we will use the change in max_score to evaluate alterations in G4 formation.

See the illustration below.

Additionally, the prediction parameter settings are stored in the global metadata of the GRanges object, which can be accessed using:

# Predicting parameters
print(metadata(G4_detected))

## $threshold
## [1] 1.5
## 
## $window_size
## [1] 25
## 
## $include_sequences
## [1] TRUE
## 
## $strands
## [1] "b"

Users can customize several parameters for prediction. For example,

# Predict G4s by customizing parameters
G4_detected <- G4HunterDetect(seq, threshold = 1.5, window_size = 20)

## Start processing...

## Merging predicted G4s...

## Done!

threshold: G4Hunter will search for G4s in windows above this threshold (absolute value). Default is 1.5. Unless there are special needs, we do not recommend setting the threshold below 1.2.
window_size: The window size (bp) for G4Hunter prediction. Default is 25. Another commonly used window size is 20. However, 25 is generally preferred.
include_sequences: Whether to include the predicted G4 sequences in the output. Default is TRUE.
strands: Indicate which strand requires prediction, with b for both strands and p for positive strand only. Please note that if your genome is single-stranded, this parameter should be set to p to prevent the program from predicting G4s on a non-existent negative strand.

We generally do not recommend modifying certain parameters, such as window_size and threshold, as their default settings are already optimal.

Export predicted G4s

We provide the exportG4 function for you to easily export the predicted G4s. You don’t need to specify the file type, as our program will determine it automatically; however, only .txt, .csv, and .xlsx formats are supported.

For example, you can export the predicted G4s to a CSV file,

out_csv  <- file.path(tempdir(), "results.csv")
# export as csv format
exportG4(G4_detected, out_csv, revcomp_antisense = FALSE)

or export to other formats as needed,

out_txt  <- file.path(tempdir(), "results.txt")
out_xlsx <- file.path(tempdir(), "results.xlsx")

# export as .txt format
exportG4(G4_detected, out_txt, include_metadata = FALSE)
# export as .xlsx format
exportG4(G4_detected, out_xlsx)

Visualize predicted G4s

You can use plotG4Info function to visualize the statistical information of G4s predicted by the G4HunterDetect function.

plotG4Info(G4_detected)

The left panels (A, C, E) display the overall distributions of the absolute values of max_score (absolute value), score (absolute value), and G4 length, respectively. The right panels (B, D, F) show the distributions of max_score, score, and length separately for G4s on the positive and negative strands. G4s on the positive strand have positive scores, while those on the negative strand have negative scores, resulting in a bimodal distribution for both max_score and score. Please note that the sign of the score simply denotes the G4’s location on either the positive or negative strand; the absolute score value indicates its actual G4-forming potential.

Evaluate variant effects

We provide two modes in G4VarImpact function for users to assess the potential impact of variants on G4 formation:

Single-Variant Mode (s mode) - assess the impact of each variant individually.
Multi-Variant Mode (m mode) - for each sample, if multiple variants are located within the same G4 region, this mode will assess their combined effect on that G4.

In short, the single-variant mode assesses the impact of each variant on the G4 structure individually, whereas the multi-variant mode evaluates the combined effects of multiple variants on a given G4 in a sample-specific manner.

See the illustration below.

We have prepared the example variant data for human chromosome 21 (hg19). You can easily load them,

data(snv_gr)

You can quickly and conveniently predict the G4 sequences on chromosome 21 using the G4HunterDetect function.

# Predict the G4s in human chr 21 (hg19)
chr21_G4 <- G4HunterDetect(chr21_seq)

## Start processing...

## Now process: chr21.

## Merging predicted G4s...

## Done!

Single-Variant mode

The default mode of G4VarImpact is single-variant mode (mode = 's'), which requires only the G4 data in GRanges format (as returned by the G4HunterDetect function), the variant data in GRanges format, and the column names for the reference and alternate bases.

Then, the assessment can be easily done by,

snv_eff_s <- G4VarImpact(chr21_G4, 
                         snv_gr,
                         ref_col = "ref",
                         alt_col = "alt")

Let’s view the first three entries,

datatable(as.data.frame(snv_eff_s[1:3]), options = list(scrollX = TRUE))

This function will return a GRanges object with detailed information, including: the original G4s (G4.info.*), variants (variant.info.*), the mutated sequence (mutated.G4.seq), the annotated mutation sequence (mutated.G4.anno.seq), the new G4Hunter max_score (mutated.max_score), and the score difference (score.diff).

Please be aware that G4s with no overlap to any of the provided variants will not be included in the output.

Multi-Variant mode

You can set the mode parameter in G4VarImpact function to 'm' to enable multi-variant mode. This mode is particularly useful for specific analyses, such as examining variants derived from cancer patients.

When using this mode, you must specify the column name for sample IDs (sampleid_col).

# Column names of the Sample ID and SNV ID must be specified
snv_eff_m <- G4VarImpact(chr21_G4, 
                         snv_gr, 
                         ref_col = "ref",
                         alt_col = "alt", 
                         mode = "m", 
                         sampleid_col = "sampleid")

Let’s view the first three entries,

datatable(as.data.frame(snv_eff_m[1:3]), options = list(scrollX = TRUE))

Under 'm' mode, the G4VarImpact function will return a GRanges object containing: the original G4s (G4.info.*), the (combined) variant information (variant.info.*), the mutated sequence with all variants incorporated (mutated.G4.seq), the annotated mutation sequence (mutated.G4.anno.seq), the new G4Hunter max_score (mutated.max_score), and the score difference (score.diff).

Please note that in multi-variant mode, the combined impact of multiple variants within the same G4 region will be evaluated for each sample.

To illustrate, the following G4 is overlapped by two variants (id_3608 and id_49857) in sample GQ3Y, and consequently, the resulting mutant maximum G4Hunter score is calculated based on their combined effect.

datatable(as.data.frame(snv_eff_m[528]), options = list(scrollX = TRUE))

Alternate alleles

The variant data may contain alternate alleles, In 's' mode, G4SNVHunter will separate these alleles and evaluate each alternate form individually.

For example, let’s create the variants,

variants_aa <- GRanges(
  seqnames = "CHR",
  ranges = IRanges(start = c(2, 5, 13, 45), end = c(2, 5, 14, 47)),
  strand = "*",
  rsid = c("ID1", "ID2", "ID3", "ID4"),
  sid = "samplex",
  ref = c("G", "T", "GG", "GCT"),
  alt = c("C", "C,G", "G", "G")
)

print(variants_aa)

## GRanges object with 4 ranges and 4 metadata columns:
##       seqnames    ranges strand |        rsid         sid         ref         alt
##          <Rle> <IRanges>  <Rle> | <character> <character> <character> <character>
##   [1]      CHR         2      * |         ID1     samplex           G           C
##   [2]      CHR         5      * |         ID2     samplex           T         C,G
##   [3]      CHR     13-14      * |         ID3     samplex          GG           G
##   [4]      CHR     45-47      * |         ID4     samplex         GCT           G
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

also create the G4,

test_seq <- "GGGATGGGATGTGGTAGGGATGCGGGTGACATCAGCTAGCATCAGCTACGA"
test_seq <- DNAStringSet(test_seq)
names(test_seq) <- "CHR"
test_G4 <- G4HunterDetect(test_seq)

## Start processing...

## Now process: CHR.

## Merging predicted G4s...

## Done!

print(test_G4)

## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |     score max_score               sequence
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>            <character>
##   [1]      CHR      1-26      + |      1.58      1.52 GGGATGGGATGTGGTAGGGA..
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Let’s see what happens in 's' mode,

G4_var_impact <- G4VarImpact(test_G4, variants_aa, 
                             ref_col = "ref", alt_col = "alt")

## Start processing...

## Processing finished!

We have three variants overlapping the predicted G4, but obtained four evaluations because variant #ID2 is an alternate allele with two reference-to-alternate substitutions.

datatable(as.data.frame(G4_var_impact), options = list(scrollX = TRUE))

However, in 'm' mode, the situation would become more complex and different. In general, G4SNVHunter will consider all possible combinations within a particular sample.

G4_var_impact_m <- G4VarImpact(test_G4, variants_aa, 
                               ref_col = "ref", alt_col = "alt",
                               mode = "m", sampleid_col = "sid")

## Start processing...

## Processing finished!

We will obtain two evaluations, which are coming from the combinations of mutant alleles:

[G>C]-[T>C](alt1)-[GG>G]
[G>C]-[T>G](alt2)-[GG>G]

Obviously, the combination of [G>C]-[T>C](alt1)-[GG>G] shows higher effect on G4 formation than [G>C]-[T>G](alt2)-[GG>G]

datatable(as.data.frame(G4_var_impact_m), options = list(scrollX = TRUE))

Filtering High-Impact G4s

Given that some variants may have minimal effects on G4 formation, we need to filter out those variants that are worth investigating further for experimental validation or additional analysis.

We provide a convenient function, filterVarImpact, to help with this filtering process.

There are three threshold parameters for users to adjust: raw_score_threshold, mut_score_threshold, and score_diff_threshold.

If raw_score_threshold (a positive number, but no more than 4) is specified, filterVarImpact will filter out entries where the absolute value of the original maximum G4Hunter score is below this threshold.

If mut_score_threshold (a positive number, but no more than 4) is specified, filterVarImpact will retain entries where the maximum G4Hunter score of the mutated G4 sequences does not exceed this threshold.

If score_diff_threshold (a negative number, but no less than -4) is specified, filterVarImpact will retain entries where the decrease in the maximum G4Hunter score after mutation exceeds this threshold. For example, if score_diff_threshold = -0.2, those entries will be retained: \(\left| \text{G4HunteMaxScore}_{\text{mut_seq}} \right| - \left| \text{G4HunterMaxScore}_{\text{raw_seq}} \right| \leq -0.2\)

We recommend setting raw_score_threshold above 1.5 and mut_score_threshold below 1.2 (at least). This is an empirically based guideline, and you are certainly free to adjust them to be more stringent or to use these parameters for flexible customization as needed.

Please note that you must specify at least one threshold parameter, but specifying all of them is not required.

For example, using the results returned in mode m

filtered_snv_eff <- filterVarImpact(snv_eff_m,
                                    mut_score_threshold = 1.2,
                                    score_diff_threshold = -0.35)

We can examine the filtered entries to identify variants that have substantial impact on G4 formation.

datatable(as.data.frame(filtered_snv_eff), options = list(scrollX = TRUE))

Export Mutant G4s

The mutant G4s can be exported using the exportMutG4 function. Similar to the exportG4 function, the file extension can be
.txt, .csv, or .xlsx. Please note that explicit specification of the file extension is not required, as the program will automatically determine it.

For example,

out_csv  <- file.path(tempdir(), "mut_G4.csv")
# export as csv format
exportMutG4(filtered_snv_eff, out_csv)

Visualizing G4Hunter Score Changes

Density plot

You can use the plotVarImpact function to visualize changes in G4Hunter scores. It supports both single-variant and multi-variant mode outputs, including their filtered results.

We can observe a general trend of decreased formation potential for the mutated G4s.

plotVarImpact(snv_eff_m)

Now, let’s take a look at how the G4Hunter scores have changed in the filtered_snv_eff object.

plotVarImpact(filtered_snv_eff)

Seqlogo plot

Finally, you can use plotImpactedG4 to visualize the G4 sequences affected by the variants.

Please note that plotImpactedG4 supports only one entry at a time. To visualize multiple entries, consider using a custom loop (e.g., for or lapply).

plotImpactedG4(filtered_snv_eff[1])

Acknowledgements

The author would like to thank Dr. Wenyong Zhu and Dr. Xiao Sun from Southeast University for their contributions: Dr. Zhu for providing the illustrations and testing the package, and Dr. Sun for reviewing this Vignettes.

Session Info

sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.5
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] tools     stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dplyr_1.1.4                       DT_0.33                          
##  [3] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.74.0                  
##  [5] rtracklayer_1.66.0                BiocIO_1.16.0                    
##  [7] Biostrings_2.74.1                 XVector_0.46.0                   
##  [9] GenomicRanges_1.58.0              GenomeInfoDb_1.42.3              
## [11] IRanges_2.40.1                    S4Vectors_0.44.0                 
## [13] BiocGenerics_0.52.0               G4SNVHunter_1.1.0                
## [15] testthat_3.2.3                   
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.17.1           jsonlite_2.0.0             
##   [4] magrittr_2.0.3              GenomicFeatures_1.58.0      farver_2.1.2               
##   [7] rmarkdown_2.29              fs_1.6.6                    zlibbioc_1.52.0            
##  [10] vctrs_0.6.5                 memoise_2.0.1               Rsamtools_2.22.0           
##  [13] RCurl_1.98-1.17             ggpointdensity_0.1.0        RcppRoll_0.3.1             
##  [16] htmltools_0.5.8.1           S4Arrays_1.6.0              usethis_3.1.0              
##  [19] progress_1.2.3              curl_6.2.2                  SparseArray_1.6.2          
##  [22] sass_0.4.10                 bslib_0.9.0                 htmlwidgets_1.6.4          
##  [25] desc_1.4.3                  cachem_1.1.0                GenomicAlignments_1.42.0   
##  [28] mime_0.13                   lifecycle_1.0.4             pkgconfig_2.0.3            
##  [31] Matrix_1.7-3                R6_2.6.1                    fastmap_1.2.0              
##  [34] GenomeInfoDbData_1.2.13     MatrixGenerics_1.18.1       shiny_1.10.0               
##  [37] digest_0.6.37               AnnotationDbi_1.68.0        rprojroot_2.0.4            
##  [40] pkgload_1.4.0               crosstalk_1.2.1             RSQLite_2.3.11             
##  [43] labeling_0.4.3              httr_1.4.7                  abind_1.4-8                
##  [46] compiler_4.4.2              remotes_2.5.0               bit64_4.6.0-1              
##  [49] withr_3.0.2                 BiocParallel_1.40.2         viridis_0.6.5              
##  [52] DBI_1.2.3                   pkgbuild_1.4.7              MASS_7.3-65                
##  [55] DelayedArray_0.32.0         sessioninfo_1.2.3           rjson_0.2.23               
##  [58] ggdensity_1.0.0             zip_2.3.2                   httpuv_1.6.16              
##  [61] ggseqlogo_0.2               glue_1.8.0                  restfulr_0.0.15            
##  [64] promises_1.3.2              grid_4.4.2                  generics_0.1.4             
##  [67] isoband_0.2.7               gtable_0.3.6                tidyr_1.3.1                
##  [70] data.table_1.17.0           hms_1.1.3                   pillar_1.10.2              
##  [73] later_1.4.2                 lattice_0.22-7              bit_4.6.0                  
##  [76] tidyselect_1.2.1            miniUI_0.1.2                knitr_1.50                 
##  [79] gridExtra_2.3               SummarizedExperiment_1.36.0 xfun_0.52                  
##  [82] Biobase_2.66.0              devtools_2.4.5              brio_1.1.5                 
##  [85] matrixStats_1.5.0           stringi_1.8.7               UCSC.utils_1.2.0           
##  [88] yaml_2.3.10                 evaluate_1.0.3              codetools_0.2-20           
##  [91] tibble_3.2.1                cli_3.6.5                   xtable_1.8-4               
##  [94] jquerylib_0.1.4             dichromat_2.0-0.1           Rcpp_1.0.14                
##  [97] png_0.1-8                   XML_3.99-0.18               parallel_4.4.2             
## [100] ellipsis_0.3.2              ggplot2_3.5.2               blob_1.2.4                 
## [103] prettyunits_1.2.0           profvis_0.4.0               urlchecker_1.0.1           
## [106] bitops_1.0-9                viridisLite_0.4.2           VariantAnnotation_1.52.0   
## [109] scales_1.4.0                openxlsx_4.2.8              purrr_1.0.4                
## [112] crayon_1.5.3                rlang_1.1.6                 cowplot_1.1.3              
## [115] KEGGREST_1.46.0

References

Bedrat, Amina, Laurent Lacroix, and Jean-Louis Mergny. 2016. “Re-Evaluation of g-Quadruplex Propensity with G4Hunter.” Nucleic Acids Research 44 (4): 1746–59.

Huppert, Julian L, and Shankar Balasubramanian. 2007. “G-Quadruplexes in Promoters Throughout the Human Genome.” Nucleic Acids Research 35 (2): 406–13.

Phan, Anh Tuân, and Jean-Louis Mergny. 2002. “Human Telomeric DNA: G-Quadruplex, i-Motif and Watson–Crick Double Helix.” Nucleic Acids Research 30 (21): 4618–25.

Varshney, Dhaval, Jochen Spiegel, Katherine Zyner, David Tannahill, and Shankar Balasubramanian. 2020. “The Regulation and Functions of DNA and RNA g-Quadruplexes.” Nature Reviews Molecular Cell Biology 21 (8): 459–74.

Zhang, Rongxin, Yuqi Wang, Cheng Wang, Xiao Sun, and Jean-Louis Mergny. 2024. “G-Quadruplexes as Pivotal Components of Cis-Regulatory Elements in the Human Genome.” bioRxiv, 2024–01.

Introduction to G4SNVHunter

Rongxin Zhang

2025-05-25