Structural Variation and Cancer
The genomes of two humans are never identical. Each genome has a different set of sequences that makes it unique. These differences can be single or multiple nucleotide variations. When a variation is big enough (typically larger than 1 kb) it is termed a structural variation, allegedly because given its size it can affect the chromatin structure. These structural variants (SV's) do not necessarily have a pathological impact on the host cell. In fact, it has been observed that 5 to 24 megabases of the genome are affected by SV's in healthy humans (Redon et al. 2006). Even significant structural variation among tissues collected from the same human individuals has been reported, as well as structural variation between monozygotic twins, suggesting the possibility of somatic mosaicism during early embryogenesis (De and Babu 2010). However, sometimes SV's do have a pathological impact. In some cancers, for example, SV's are acquired over the course of a lifetime through the iterative processes of DNA damage and repair, as well as replication errors. In time, these aberrations can contribute to the neoplastic development of a tumor.
Both somatic and germline SV's can be very complex and diverse, affecting relatively small portions of the genome or entire chromosome arms. Recently, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, Li and collaborators classified SV's into two groups: simple and complex (Y. Li et al. 2020). In this study, simple SV's were subclassified into deletions, tandem duplications, inversions or translocations. Complex SV's are produced when multiple simple SV's occur at the same locus, or alternatively through chromoplexy or chromothripsis (Figure 1). These aberrations have the power to change the organization of the genome and consequently the genetic program of a cell.
To understand how SV's might disrupt homeostasis we have to appreciate that "the genome is divided into large discrete domains that are highly dynamic, cell type and differentiation state specific and associated with specific transcriptional activity" (De and Babu 2010). These domains are genomic regions that have specific characteristics like epigenetic markers and physical interactions with other parts of the genome or the nuclear lamina. In other words, the specific location of any given gene has a direct influence on how much it gets transcribed. Gierman and collaborators provided supporting evidence of this phenomenon in a human model, integrating identical green fluorescent protein reporter constructs at 90 different chromosomal positions and finding that the integrated reporter genes typically reached expression levels similar to those of their neighboring genes (Gierman et al. 2007). If a gene inserted into a genomic region adopts the transcriptional activity of the neighboring genes, we could expect that changing the characteristics of a genomic region would change the transcriptional activity of the genes it encompasses. Such a change in the characteristics of a genomic region could be caused by an SV.
There are multiple regulatory mechanisms that an SV could throw off balance. An SV might push a gene and its enhancer or promoter together or apart, causing a change in expression. It has been observed that SV's alter the genomic architecture, disrupting topologically associated domains (TAD's) and creating new ones (Dixon et al. 2018). Another mechanism through which an SV might alter transcription is by depleting or augmenting gene dosage. Whether it is a regulatory element like an enhancer, a promoter, an epigenetic marker, a transcriptional factor, a degradation complex, cohesin, CTCF, a tumor suppressor gene or an oncogene, the degree to which a genetic sequence is present in the genome can have an impact on expression levels (Bradner, Hnisz, and Young 2017). However, the impact of gene dosage on differential expression seems to be limited, at least in known cancer driver genes. In other words, not all genes that are copy number mutated change their expression accordingly (Roszik et al. 2016). Likewise, Ghavi-Helm et al. (2019) found in a Drosophila model that although SV's do affect chromatin structure, gene expression (measured as differential expression) remains largely unaltered, most likely because of other regulatory mechanisms at play. It is also possible that repositioning a locus in a different chromatin context alters its expression (Danieli and Papantonis 2020). Most likely a single SV event can affect several of these regulatory mechanisms at once, making it hard to study how a specific SV is altering cell homeostasis.
The degree to which the genomic environment is reshaped by these SV's varies according to their recurrence and magnitude. In Y. Li et al. (2020), 91% of 2,658 samples (representing 38 tumor types) were found to have at least one high confidence SV, but the number of SV's found in each sample varied widely even within the same tumor type (Figure 2).
Figure 2 shows that some types of cancer, like central nervous system pilocytic astrocytoma, have very few SV's, if any. On the other hand, in breast adenocarcinoma at least half of the samples exhibit hundreds of structural mutations, with some samples having almost 1,000.
Structural Variation and Copy Number Variation
Copy number variation is a different way of describing SV's in which there is an increase or decrease in the amount of DNA in the genome. Tandem duplications, deletions, and unbalanced translocations (translocations in which a portion of the transferred sequence is lost, or alternatively a sequence is gained and attached to the translocation) are examples of SV's which produce a change in copy number. Inversions and balanced translocations do not cause a change in copy number but can have an impact on regulation. Studying copy number variants (CNV's) has been a popular approach to describing SV's because it can be done using cheaper technologies than genome sequencing. The most popular methodology for doing so has been microarrays. The downside to microarrays is that they necessarily rely on mapping to a reference genome, whereas cancer genomes will likely have a structure that differs from that of the reference. As we explained in the previous paragraphs, the structure of the genome and the location of any given gene are key to its regulation. Therefore this approach leaves out valuable information about the genome architecture and the possible impact of SV's on the regulatory program. Furthermore, there are several other inconvenient aspects of describing SV's with CNV's. To illustrate this point we present a small overview of two popular CNV calling algorithms: ASCAT and GISTIC.
ASCAT
CNV calling algorithms mainly rely on two parameters to call a CNV: Log "R" ratio (Log RR) and B allele frequency (BAF). Log RR is equal to: \[\log_{2}\left(\frac{R_{observed}}{R_{expected}}\right)\]
Where \(R_{observed}\) is the sum of the normalized intensities of both allele probe sets for a specific locus: \(R_{observed} = I_{allele A} + I_{allele B}\), and \(R_{expected}\) is the observed intensity of the same locus in a reference sample, ideally of adjacent healthy tissue. Comparing normalized intensities between two different microarray experiments is itself problematic, and certain normalization methods exist to work around this, but discussing them goes beyond the scope of this report. This way Log RR tells us when a higher intensity of a specific locus is detected relative to that of the same locus in a healthy sample. However, ASCAT does not measure \(R_{expected}\) directly from the healthy sample but instead it estimates it through BAF (Van Loo et al. 2010; Peiffer et al. 2006).
BAF estimates the relative abundance of one allele compared to the other allele, calculated as the intensity of the probe set of one allele over the other: \(\frac{I_{allele A}}{I_{allele B}}\) (B. Liu et al. 2013). One important shortcoming of using BAF to estimate copy number change is that it can only detect allelic imbalances at heterozygous loci, which eliminates a lot of valuable data from the start. This way BAF can inform us when an allele has been depleted or amplified (B. Liu et al. 2013). The ASCAT algorithm uses theta, a function of BAF, to estimate \(R_{expected}\); it is equal to: \[\theta = \frac{2}{\pi} \arctan\left(\frac{I_{allele A}}{I_{allele B}}\right)\]
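As a minimal sketch of the two quantities defined so far, the following Python function computes \(R_{observed}\) and \(\theta\) for a single locus. The function name `allele_signal` and its arguments are hypothetical and chosen only for illustration; this is not code from ASCAT. It uses `atan2` rather than a plain division so that the result stays defined when the intensity of allele B is zero.

```python
import math

def allele_signal(i_a, i_b):
    """Return (R_observed, theta) for one locus from its two allele intensities.

    i_a, i_b: normalized, non-negative probe intensities (hypothetical inputs).
    """
    r_observed = i_a + i_b
    # For non-negative intensities atan2(i_a, i_b) equals arctan(i_a / i_b),
    # and it is still defined (pi / 2) when i_b is zero.
    theta = (2.0 / math.pi) * math.atan2(i_a, i_b)
    return r_observed, theta

# Equal intensities for both alleles give theta = 0.5.
print(allele_signal(1.0, 1.0))
```

With this definition \(\theta\) is confined to the interval from 0 to 1, regardless of the absolute intensities.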
ASCAT pools together the normal samples and estimates the average relationship between \(R_{expected}\) and BAF at three points, \(\theta = 1, 0, 0.6\), which correspond to the three expected BAF values \(\frac{I_{allele A}}{I_{allele B}} = 1, 0, 0.5\) for homozygous and heterozygous loci (Peiffer et al. 2006). Finally, \(R_{expected}\) is calculated as the intersection of the observed \(\theta\) with the line formed by the three average points (Figure 3).
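The interpolation step can be sketched as follows. This is an illustration under stated assumptions rather than ASCAT's implementation: the centroid values are invented, the function names `expected_r` and `log_r_ratio` are hypothetical, the centroids are assumed to have distinct \(\theta\) values, and the logarithm is taken in base 2.

```python
import math

def expected_r(theta, centroids):
    """Linearly interpolate R_expected at `theta` from (theta, R) centroids."""
    pts = sorted(centroids)
    # Outside the range of the centroids, fall back to the nearest one.
    if theta <= pts[0][0]:
        return pts[0][1]
    if theta >= pts[-1][0]:
        return pts[-1][1]
    for (t0, r0), (t1, r1) in zip(pts, pts[1:]):
        if t0 <= theta <= t1:
            weight = (theta - t0) / (t1 - t0)
            return r0 + weight * (r1 - r0)

def log_r_ratio(r_observed, r_expected):
    """Log RR, assuming a base-2 logarithm."""
    return math.log2(r_observed / r_expected)

# Hypothetical centroids estimated from a pool of normal samples.
centroids = [(0.0, 1.0), (0.6, 1.2), (1.0, 1.0)]
r_exp = expected_r(0.3, centroids)   # halfway between the first two centroids
print(r_exp, log_r_ratio(2.0 * r_exp, r_exp))
```

A locus whose observed intensity is twice the interpolated expectation gets a Log RR of 1 under the base-2 assumption.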
This approach of pooling together the healthy samples to calculate (Log RR, \(\theta\)) centroids, and then interpolating the \(R_{expected}\) using the observed \(\theta\), might be an especially appealing method when you don't have matched healthy samples for every tumor. However, by comparing an \(R_{observed}\) to the \(R_{expected}\) of a pool of samples, what Log RR is actually telling us is the relative abundance of a locus in a sample compared to the average abundance of the same locus over a set of healthy samples. At first glance this strategy might not seem inconvenient, but it could be problematic if we consider that a significant portion of every healthy genome varies considerably in terms of SV's. Redon and collaborators found in a cohort of 270 healthy individuals that several megabases of their genomes have CNV's and that only half of them were identified in more than one individual (Redon et al. 2006). What this could mean in the context of this methodology is that half of the germline CNV's could be mistakenly detected as somatic copy number alterations (SCNA's), yielding a great number of false positives, that is, SV's which are present in the genome of interest but do not contribute to cancer development.
Once there is a Log RR value for every heterozygous locus, ASCAT estimates genome-wide allele-specific profiles for each sample. In the process of doing so, to overcome the noise in the data, ASCAT averages the Log RR values of adjacent loci with similar values. By default each averaged segment must comprise at least 6 loci. This smoothing method might miss CNV's which are smaller than 6 loci, especially when loci are far away from each other. The bigger the distance between loci, the more likely it is that the region between them will be subject to an SV. Once the genome-wide allele-specific profile is calculated, ASCAT proceeds to estimate tumor purity and ploidy (Van Loo et al. 2010).
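To illustrate why averaging over at least 6 loci can hide a short event, the sketch below applies a plain moving average to an invented Log RR profile. The moving average is only a stand-in used for illustration; it is not the segmentation procedure ASCAT actually applies, and the profile values and the function name `moving_average` are hypothetical.

```python
def moving_average(values, window):
    """Mean of every run of `window` consecutive values (no padding)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical Log RR profile: a gain of amplitude 1.0 spanning only
# 2 loci, surrounded by a flat background.
log_rr = [0.0] * 10 + [1.0, 1.0] + [0.0] * 10

smoothed = moving_average(log_rr, window=6)
# The 2-locus gain is diluted by the 4 unaltered loci in the same window.
print(max(log_rr), max(smoothed))
```

In this toy profile the gain keeps only a third of its original amplitude after smoothing, which makes it much harder to tell apart from noise.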
GISTIC
GISTIC stands for Genomic Identification of Significant Targets in Cancer (Mermel et al. 2011). It is an algorithm that takes in pre-processed microarray data and outputs a set of genes which are possibly promoting oncogenesis, based on the frequency and amplitude of SCNA's across a set of samples. It is important to note that the outputs and objectives of ASCAT and GISTIC are very different. ASCAT estimates the purity, ploidy and genome wide allele specific copy number profile for each sample while GISTIC infers which genes are most likely being targeted by SCNA's given the normalized Log RR profiles of a set of samples.
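The idea of combining frequency and amplitude can be sketched with a toy score. This is a simplified illustration and not GISTIC's implementation: the function name `toy_score`, the threshold of 0.1 and the example values are assumptions, and the sketch leaves out the statistical significance assessment that GISTIC performs.

```python
def toy_score(log_ratios, threshold=0.1):
    """Frequency times mean amplitude of the gains at one locus.

    log_ratios: one Log RR value per sample for the same locus.
    threshold:  hypothetical cutoff above which a sample counts as altered.
    """
    altered = [v for v in log_ratios if v > threshold]
    if not altered:
        return 0.0
    frequency = len(altered) / len(log_ratios)
    mean_amplitude = sum(altered) / len(altered)
    return frequency * mean_amplitude

# Two loci measured in four hypothetical samples.
print(toy_score([0.0, 0.5, 1.0, 0.0]))   # altered in half of the samples
print(toy_score([0.0, 0.0, 0.05, 0.0]))  # never above the threshold
```

A locus scores high either because it is altered in many samples or because the alterations are strong, which is the intuition behind ranking candidate targets across a cohort.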
The pre-processing procedure basically normalizes the microarray intensity data, transforming it into copy number estimates using Log RR, and then removes the germline CNV's using a method termed Tangent Normalization, which makes use of a pool of healthy samples (Tabak et al. 2019). The same critique about pooling together healthy samples and smoothing the signal could be made of GISTIC. Pooling together healthy samples and using them as a reference for the normal abundance of a specific sequence can yield many false positives given the high heterogeneity of germline SV's in humans, and smoothing the Log RR profile makes it difficult to detect small but very plausible CNV's. Olshen and collaborators recognize this shortcoming in their paper presenting Circular Binary Segmentation (GISTIC's smoothing algorithm): "The segmentation procedures have low power to detect a change when the difference in means is small or if the width of the changed segment is small" (Olshen et al. 2004, 565).
Furthermore, unlike ASCAT, GISTIC does not take purity and ploidy into consideration. When a tumor sample has a significant number of infiltrating healthy cells, the SCNA measurements can become diluted by the healthy genomes that are processed along with the cancerous ones. In other words, there could be significant SCNA's in the tumor cells, but when these are processed and mixed with the rest of the cells in the sample, said SCNA's become more difficult to detect. Likewise, ploidy provides useful information about the mutational history of cell populations, and while it is possible to deduce whole chromosome duplications or deletions from microarray data, GISTIC does not estimate the ploidy of samples.
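The dilution effect can be made concrete with a small calculation. The sketch assumes a simple mixture model in which healthy cells carry 2 copies of the locus and the signal is a base-2 log ratio against that diploid baseline; it ignores tumor ploidy. The function name `observed_log_ratio` and its parameters are hypothetical.

```python
import math

def observed_log_ratio(tumor_cn, purity, normal_cn=2):
    """Log2 ratio expected when tumor and healthy cells are mixed.

    tumor_cn: copy number of the locus in tumor cells.
    purity:   fraction of tumor cells in the sample (0 to 1).
    Not defined for a homozygous deletion at purity 1 (log of zero).
    """
    mixed_cn = purity * tumor_cn + (1.0 - purity) * normal_cn
    return math.log2(mixed_cn / normal_cn)

# A gain to 4 copies seen at decreasing tumor purity.
for purity in (1.0, 0.5, 0.2):
    print(purity, round(observed_log_ratio(4, purity), 3))
```

Under these assumptions the same 4-copy gain produces a log ratio of 1 in a pure sample but only about 0.26 when tumor cells make up 20% of the sample.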
Copy Number Variation as an Estimate of Structural Variation
Despite the difficulties of assessing structural variation thoroughly, algorithms like ASCAT and GISTIC can still provide us with useful information. In some studies, deletions and tandem duplications were found to be the most common types of SV's (Y. Li et al. 2020). In other studies, however, translocations were found to be the most recurrent kind of structural variation (Figure 4) (Yang et al. 2013).
Perhaps the reason why translocations are not so widely represented in Y. Li et al. (2020) is that they omit transposable elements (TE). By doing so Li and collaborators are potentially missing out on important SV activity with potential carcinogenic effects. After all, it has been proposed that TE activity may be associated with differentiation in cancer (Lynch-Sutherland et al. 2020). If, upon taking TE into account, tandem duplications and deletions are still the most common type of SV for a specific cancer type, algorithms like ASCAT and GISTIC can provide us with useful information. Namely, the degree to which a certain sample is copy number mutated can inform us of its degree of structural variation in general, making it a powerful tool to detect genomic instability.
Mutual Information Networks and Copy Number Change
One way to capture regulatory relationships between genes is to construct a mutual information (MI) network. Gene expression profiles are constructed from the expression values of a given gene in all the samples. Then the expression profiles of every possible pair of genes are compared to calculate an MI value. The MI value tells us how well the profile of one gene lets us predict the profile of the other. If two expression profiles are highly correlated or anti-correlated the MI value will be high (tending to 1 if MI is normalized to the unit interval), whereas if the two profiles are poorly correlated that pair will get an MI value closer to 0 (Margolin et al. 2006). \[ MI(\{x_{i}\}, \{y_{i}\}) = \frac{1}{M} \sum_{i=1}^{M} \log\left(\frac{f(x_{i}, y_{i})}{f(x_{i})f(y_{i})}\right) \] Here \(x_{i}\) and \(y_{i}\) are the expression values of the two genes in sample \(i\), \(M\) is the number of samples, and \(f\) denotes the estimated joint or marginal density.
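A minimal sketch of an MI estimate between two expression profiles is given below. It swaps in a simple histogram (equal-width binning) estimate of the densities, which is an assumption made for brevity and not necessarily the estimator used by Margolin et al.; the function name `mutual_information` and the number of bins are hypothetical. The result is in nats and is not normalized.

```python
import math
from collections import Counter

def mutual_information(x, y, bins=4):
    """Plug-in MI estimate (in nats) from two equally long profiles."""
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0   # constant profile: single bin
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    m = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Summing over occupied bin pairs is the same as averaging the log
    # ratio over the M samples, with frequencies standing in for f.
    return sum((n / m) * math.log((n / m) / ((px[a] / m) * (py[b] / m)))
               for (a, b), n in joint.items())

profile = [0, 1, 2, 3, 4, 5, 6, 7]           # hypothetical expression values
print(mutual_information(profile, profile))        # identical profiles
print(mutual_information(profile, profile[::-1]))  # anti-correlated profiles
```

Both the identical and the reversed profile give the same value, which reflects that MI captures dependence regardless of its direction.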
Using this expression we can obtain an MI value for every possible pair of genes, and represent this value as an interaction (an edge) in a network. Finally, to construct a network you filter out interactions below a certain threshold, leaving the most significant interactions, i.e. genes that strongly influence each other's expression. This type of analysis can be especially insightful when you have a set of healthy reference samples to build a "control" network and compare it to the network derived from the case samples. Espinal-Enríquez and collaborators used this approach to study TCGA breast cancer samples, reporting a drastic change in topology when comparing the breast cancer and normal tissue networks (Espinal-Enríquez et al. 2017).
One interesting feature of this cancer network is that genes (or nodes) in each of the communities belong almost exclusively to a single chromosome. Further work by the group has described this loss of long range interactions, emphasizing not only that inter-chromosomal interactions are very scarce but that intra-chromosomal interactions occur mainly between neighboring genes (Figure 6) (García-Cortés et al. 2018).
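The thresholding step can be sketched as a simple filter over pairwise MI values. The gene names, MI values and threshold below are invented for illustration, and `build_network` is a hypothetical helper rather than part of any published pipeline.

```python
def build_network(mi_values, threshold):
    """Keep gene pairs whose MI is at or above `threshold` as edges.

    mi_values: dict mapping a (gene, gene) pair to its MI value.
    Returns a sorted list of (gene, gene, mi) tuples.
    """
    return sorted((a, b, mi) for (a, b), mi in mi_values.items()
                  if mi >= threshold)

# Hypothetical pairwise MI values for three genes.
mi_values = {
    ("GENE_A", "GENE_B"): 0.90,
    ("GENE_A", "GENE_C"): 0.05,
    ("GENE_B", "GENE_C"): 0.40,
}

edges = build_network(mi_values, threshold=0.3)
print(edges)
```

Building one such edge list from the case samples and another from the healthy samples, with the same threshold, gives two networks whose topology can then be compared.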
This work builds upon that of García-Cortés and collaborators, who built MI networks for each breast cancer molecular subtype (Basal, Luminal A, Luminal B and HER2). Using these networks and copy number data, the present work is concerned with one question:
- Are genes which are significantly copy number mutated enriched in our cancer networks? If so, this could mean that many genes become important regulatory elements through the process of being affected by structural mutations.
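One common way to frame such an enrichment question is a one-sided hypergeometric test on the overlap between the copy number mutated genes and the genes in the network. The sketch below is offered only as an illustration of that framing, under the assumption that genes can be treated as independent draws; it is not necessarily the test used in this work, and the function name and the counts are hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, altered_genes, network_genes, overlap):
    """P(X >= overlap) under a hypergeometric null.

    total_genes:   size of the gene universe.
    altered_genes: genes in the universe that are copy number mutated.
    network_genes: genes in the universe that appear in the network.
    overlap:       genes that are both altered and in the network.
    """
    upper = min(altered_genes, network_genes)
    favorable = sum(
        comb(altered_genes, k)
        * comb(total_genes - altered_genes, network_genes - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, network_genes)

# Hypothetical counts: 10 genes, 5 altered, 5 in the network, all 5 shared.
print(enrichment_p_value(10, 5, 5, 5))
```

A small p-value would indicate that copy number mutated genes appear in the network more often than expected by chance.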
Moreover some exploratory work was done concerning a second question:
- Is there a relationship between the amount of copy number variation and the number of communities that we observe in the cancer networks?
This question is not answered in the current analysis; we merely present a set of analyses that we think could shed some light on the answer. For it to be significant in the context of breast cancer, additional research must be done to elucidate whether the copy number profile is a good approximation for discerning between samples with high and low numbers of SV's.