SCNAs Project Final Report

Abstract

Structural variants are a pervasive type of mutation across cancers. Plenty of research has explored how these aberrations affect differential expression and genomic architecture. However, little research has addressed the overall effect of this type of mutation on transcriptional regulatory programs. Building on the work of García-Cortés et al. (2018) and Espinal-Enríquez et al. (2017), we ask whether genes that are significantly copy number mutated are enriched in subtype-specific breast cancer mutual information transcriptional networks. We use the TCGA breast cancer data to build the networks with ARACNE and to determine which genes are significantly copy number mutated in each subtype using GISTIC2.0 (Mermel et al. 2011; Margolin et al. 2006). We then perform an enrichment analysis with two-sided Fisher's exact tests to determine whether these genes are over- or under-represented in our cancer transcriptional networks. Furthermore, we explore whether these types of mutations could be shaping the observed network architecture; this question is discussed but not addressed formally. The report starts with a quick overview of structural variation, popular methods and algorithms to study it, and mutual information transcriptional networks. It then elaborates on the aforementioned analyses and ends with a reflection and future perspectives.

Introduction

Structural Variation and Cancer

No two human genomes are identical. Each genome has a different set of sequences that makes it unique. These differences can be single or multiple nucleotide variations. When a variation is big enough (typically larger than 1 kb) it is termed a structural variation, presumably because, given its size, it can affect the chromatin structure. These structural variants (SV's) do not necessarily have a pathological impact on the host cell. In fact it has been observed that 24 to 5 megabases of the genome are affected by SV's in healthy humans (Redon et al. 2006). Significant structural variation has even been reported among tissues collected from the same human individuals, as well as between monozygotic twins, suggesting the possibility of somatic mosaicism during early embryogenesis (De and Babu 2010). However, sometimes SV's do have a pathological impact. In some cancers, for example, SV's are acquired over the course of a lifetime through the iterative processes of DNA damage and repair, as well as replication errors. In time, these aberrations can contribute to the neoplastic development of a tumor.

Both somatic and germline SV's can be very complex and diverse, affecting relatively small portions of the genome or entire chromosome arms. Recently, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, Li and collaborators classified SV's into two groups: simple and complex (Y. Li et al. 2020). In this study simple SV's were subclassified into deletions, tandem duplications, inversions and translocations. Complex SV's are produced when multiple simple SV's occur at the same locus, or alternatively through chromoplexy or chromothripsis (Figure 1). These aberrations have the power to change the organization of the genome and consequently the genetic program of a cell.

Figure 1. Examples of different types of structural variations described in Y. Li et al. (2020)

To understand how SV's might disrupt homeostasis we have to appreciate that "the genome is divided into large discrete domains that are highly dynamic, cell type and differentiation state specific and associated with specific transcriptional activity" (De and Babu 2010). These domains are genomic regions that have specific characteristics such as epigenetic markers and physical interactions with other parts of the genome or the nuclear lamina. In other words, the specific location of any given gene has a direct influence on how much it gets transcribed. Gierman and collaborators provided supporting evidence of this phenomenon in a human model, integrating identical green fluorescent protein reporter constructs at 90 different chromosomal positions and finding that the integrated reporter genes typically reached expression levels similar to those of their neighboring genes (Gierman et al. 2007). If a gene inserted into a genomic region adopts the transcriptional activity of its neighboring genes, we could expect that changing the characteristics of a genomic region would change the transcriptional activity of the genes it encompasses. Such a change in the characteristics of a genomic region could be caused by an SV.

There are multiple regulatory mechanisms that an SV could throw off balance. An SV might push a gene and its enhancer or promoter together or apart, causing a change in expression. It has been observed that SV's alter the genomic architecture, disrupting topologically associating domains (TAD's) and creating new ones (Dixon et al. 2018). Another mechanism through which an SV might alter transcription is by depleting or augmenting gene dosage. Whether it is a regulatory element like an enhancer, a promoter, an epigenetic marker, a transcription factor, a degradation complex, cohesin, CTCF, a tumor suppressor gene or an oncogene, the degree to which a genetic sequence is present in the genome can have an impact on expression levels (Bradner, Hnisz, and Young 2017). However, the impact of gene dosage on differential expression seems to be limited, at least in known cancer driver genes. In other words, not all genes that are copy number mutated change their expression accordingly (Roszik et al. 2016). Likewise, Ghavi-Helm et al. (2019) found in a Drosophila model that although SV's do affect chromatin structure, gene expression (measured as differential expression) remains largely unaltered, most likely because of other regulatory mechanisms at play. It is also possible that repositioning a locus in a different chromatin context alters its expression (Danieli and Papantonis 2020). Most likely a single SV event can affect several of these regulatory mechanisms at once, making it hard to study how a specific SV is altering cell homeostasis.

The degree to which the genomic environment is reshaped by these SV's varies according to their recurrence and magnitude. In Y. Li et al. (2020), 91% of 2,658 samples (representing 38 tumor types) were found to have at least one high confidence SV, but the number of SV's found in each sample varied widely even within the same tumor type (Figure 2).

Figure 2. Frequency of SV's for each tumor type in Y. Li et al. (2020)

Figure 2 shows that some types of cancer, like central nervous system pilocytic astrocytoma, have very few SV's, if any. In breast adenocarcinoma, on the other hand, at least half of the samples exhibit hundreds of structural mutations, with some samples having almost 1,000.

Structural Variation and Copy Number Variation

Copy number variation is a different way of describing SV's in which there is an increase or decrease in the amount of DNA in the genome. Tandem duplications, deletions, and unbalanced translocations (translocations in which a portion of the transferred sequence is lost, or alternatively a sequence is gained and attached to the translocation) are examples of SV's which produce a change in copy number. Inversions and balanced translocations do not cause a change in copy number but can have an impact on regulation. Studying copy number variants (CNV's) has been a popular approach to describing SV's because it can be done using cheaper technologies than genome sequencing. The most popular methodology for doing so has been microarrays. The downside of microarrays is that they necessarily rely on mapping to a reference genome, while cancer genomes will likely have a structure that differs from that of the reference. As explained in the previous paragraphs, the structure of the genome and the location of any given gene are key to its regulation. Therefore this approach leaves out valuable information about the genome architecture and the possible impact of SV's on the regulatory program. Furthermore, describing SV's through CNV's has several other drawbacks. To illustrate this point we present a small overview of two popular CNV calling algorithms: ASCAT and GISTIC.

ASCAT

CNV calling algorithms mainly rely on two parameters to call a CNV: the Log "R" ratio (Log RR) and the B allele frequency (BAF). Log RR is equal to: \[\log\left(\frac{R_{observed}}{R_{expected}}\right)\]

Where $R_{observed}$ is the sum of the normalized intensities of both allele probe sets for a specific locus, $R_{observed} = I_{allele A} + I_{allele B}$, and $R_{expected}$ is the observed intensity of the same locus in a reference sample, ideally of adjacent healthy tissue. Comparing normalized intensities from two different microarray experiments is itself problematic, and certain normalization methods exist to work around this, but discussing them goes beyond the scope of this report. Log RR thus tells us when a higher intensity is detected at a specific locus relative to the same locus in a healthy sample. However, ASCAT does not measure $R_{expected}$ directly from the healthy sample; instead it estimates it through BAF (Van Loo et al. 2010; Peiffer et al. 2006).
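As a minimal sketch of the Log RR definition above, the following Python snippet computes it for a single locus. The function name, the choice of a base 2 logarithm and the intensity values are illustrative assumptions; none of them is taken from ASCAT itself.

```python
import math

def log_r_ratio(intensity_a, intensity_b, r_expected):
    """Log RR for one locus: log2(R_observed / R_expected)."""
    r_observed = intensity_a + intensity_b  # R_observed = I_A + I_B
    return math.log2(r_observed / r_expected)

# Hypothetical normalized intensities for two loci, with R_expected = 1.0.
gain = log_r_ratio(0.9, 0.6, r_expected=1.0)  # more signal than expected
loss = log_r_ratio(0.3, 0.2, r_expected=1.0)  # less signal than expected
print(gain > 0, loss < 0)
```

A positive value points to more signal than in the reference, a negative value to less.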

BAF estimates the relative abundance of one allele compared to the other, calculated as the intensity of the probe set of one allele over that of the other, $\frac{I_{allele A}}{I_{allele B}}$ (B. Liu et al. 2013). BAF can therefore inform us when an allele has been depleted or amplified (B. Liu et al. 2013). One important shortcoming of using BAF to estimate copy number change is that it can only detect allelic imbalances at heterozygous loci, which discards a lot of valuable data from the start. The ASCAT algorithm uses theta, a function of BAF, to estimate $R_{expected}$. Theta is equal to: \[\theta = \frac{2}{\pi} \arctan\left(\frac{I_{allele A}}{I_{allele B}}\right)\]

ASCAT pools together the normal samples and estimates the average relationship between $R_{expected}$ and BAF at three points, $\theta = 1, 0, 0.6$, which represent the three expected BAF's, $\frac{I_{allele A}}{I_{allele B}} = 1, 0, 0.5$, that is, the expected BAF values for homozygous and heterozygous loci (Peiffer et al. 2006). Finally, $R_{expected}$ is calculated as the intersection of the observed $\theta$ with the line formed by the three average points (Figure 3).
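The interpolation step can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the centroid positions reuse the theta values quoted above, the $R$ values attached to them are invented, and "the line formed by the three average points" is read as piecewise linear interpolation between neighboring centroids.

```python
import math

def theta(intensity_a, intensity_b):
    """theta = (2 / pi) * arctan(I_A / I_B), as defined in the text."""
    # atan2 equals arctan(I_A / I_B) for positive intensities and also
    # handles I_B = 0 without dividing by zero.
    return (2.0 / math.pi) * math.atan2(intensity_a, intensity_b)

def expected_r(theta_obs, centroids):
    """Interpolate R_expected linearly between the two nearest centroids.

    centroids: (theta, R) pairs averaged over the pooled normal samples,
    assumed to have distinct theta values.
    """
    points = sorted(centroids)
    # Clamp to the outermost centroids when theta_obs falls outside them.
    if theta_obs <= points[0][0]:
        return points[0][1]
    if theta_obs >= points[-1][0]:
        return points[-1][1]
    for (t0, r0), (t1, r1) in zip(points, points[1:]):
        if t0 <= theta_obs <= t1:
            weight = (theta_obs - t0) / (t1 - t0)
            return r0 + weight * (r1 - r0)

# Hypothetical centroids: theta positions from the text, invented R values.
centroids = [(0.0, 1.0), (0.6, 1.2), (1.0, 1.0)]
print(expected_r(theta(0.4, 0.9), centroids))
```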

Figure 3. Estimation of $R_{expected}$ from the observed theta for a given locus (Peiffer et al. 2006)

This approach of pooling together the healthy samples to calculate (Log RR, $\theta$) centroids, and then interpolating $R_{expected}$ from the observed $\theta$, might be especially appealing when matched healthy samples are not available for every tumor. However, by comparing an $R_{observed}$ to the $R_{expected}$ of a pool of samples, what Log RR actually tells us is the relative abundance of a locus in a sample compared to the average abundance of the same locus over a set of healthy samples. At first glance this strategy might not seem inconvenient, but it could be problematic if we consider that a significant portion of every healthy genome varies in terms of SV's. Redon and collaborators found in a cohort of 270 healthy individuals that several megabases of their genomes have CNV's and that only half of them were identified in more than one individual (Redon et al. 2006). In the context of this methodology, this could mean that half of the germline CNV's could be mistakenly detected as somatic copy number alterations (SCNA's), yielding a large number of false positives, that is, SV's which are present in the genome of interest but do not contribute to cancer development.

Once there is a Log RR value for every heterozygous locus, ASCAT estimates genome wide allele specific profiles for each sample. In the process of doing so, to overcome the noise in the data, ASCAT averages the Log RR values of adjacent loci with similar values. By default each averaged segment must comprise at least 6 loci. This smoothing method might miss CNV's which are smaller than 6 loci, especially when loci are far away from each other: the bigger the distance between loci, the more likely it is that the region between them is subject to an SV. Once the genome wide allele specific profile is calculated, ASCAT proceeds to estimate tumor purity and ploidy (Van Loo et al. 2010).

GISTIC

GISTIC stands for Genomic Identification of Significant Targets in Cancer (Mermel et al. 2011). It is an algorithm that takes in pre-processed microarray data and outputs a set of genes which are possibly promoting oncogenesis, based on the frequency and amplitude of SCNA's across a set of samples. It is important to note that the outputs and objectives of ASCAT and GISTIC are very different. ASCAT estimates the purity, ploidy and genome wide allele specific copy number profile for each sample while GISTIC infers which genes are most likely being targeted by SCNA's given the normalized Log RR profiles of a set of samples.

The pre-processing procedure basically normalizes the microarray intensity data, transforms it into copy number estimates using Log RR, and then removes the germline CNV's using a method termed Tangent Normalization, which makes use of a pool of healthy samples (Tabak et al. 2019). The same critique about pooling together healthy samples and smoothing the signal could be made for GISTIC. Pooling together healthy samples and using them as a reference for the normal abundance of a specific sequence can yield many false positives given the high heterogeneity of germline SV's in humans, and smoothing the Log RR profile makes it difficult to detect small but entirely possible CNV's. Olshen and collaborators recognize this shortcoming in their paper presenting Circular Binary Segmentation (GISTIC's smoothing algorithm): "The segmentation procedures have low power to detect a change when the difference in means is small or if the width of the changed segment is small" (Olshen et al. 2004, 565).

Furthermore, unlike ASCAT, GISTIC does not take purity and ploidy into consideration. When a tumor sample has a significant amount of infiltrated healthy cells, the SCNA measurements can become diluted by the healthy genomes that are processed along with the cancerous ones. In other words, there could be significant SCNA's in the tumor cells, but when these are processed and mixed with the rest of the cells in the sample, said SCNA's become more difficult to detect. Likewise, ploidy provides useful information about the mutational history of cell populations, and while it is possible to deduce whole chromosome duplications or deletions from microarray data, GISTIC does not estimate the ploidy of samples.
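The dilution effect described above can be illustrated with a small worked sketch. It rests on simplifying assumptions that are not part of GISTIC: healthy cells carry two copies of the locus, the copy ratio is expressed relative to those two copies on a base 2 logarithmic scale, and `purity` is the fraction of tumor cells in the sample.

```python
import math

def diluted_log2_ratio(tumor_copies, purity, normal_copies=2):
    """Expected log2 copy ratio for a mixture of tumor and healthy cells."""
    mixed_copies = purity * tumor_copies + (1.0 - purity) * normal_copies
    return math.log2(mixed_copies / normal_copies)

# A single-copy loss (2 -> 1 copies) observed at decreasing tumor purity.
for purity in (1.0, 0.6, 0.3):
    print(purity, round(diluted_log2_ratio(1, purity), 3))
```

The lower the purity, the closer the signal of the same deletion moves to zero, which is why it becomes harder to separate from noise.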

Copy Number Variation as an Estimate of Structural Variation

Despite the difficulty of assessing structural variation thoroughly, algorithms like ASCAT and GISTIC can still provide us with useful information. In some studies deletions and tandem duplications were found to be the most common types of SV's (Y. Li et al. 2020). In other studies, however, translocations were found to be the most recurrent kind of structural variation (Figure 4) (Yang et al. 2013).

Figure 4. SV's frequency in different cancer types (Yang et al. 2013)

Perhaps translocations are not so widely represented in Y. Li et al. (2020) because that study omits transposable elements (TE). By doing so, Li and collaborators are potentially missing important SV activity with potential carcinogenic effects. After all, it has been proposed that TE activity may be associated with differentiation in cancer (Lynch-Sutherland et al. 2020). If, upon taking TE into account, tandem duplications and deletions are still the most common types of SV for a specific cancer type, algorithms like ASCAT and GISTIC can provide us with useful information. Namely, the degree to which a certain sample is copy number mutated can inform us of its degree of structural variation in general, making copy number profiling a powerful tool to detect genomic instability.

Mutual Information Networks and Copy Number Change

One way to capture regulatory relationships between genes is to construct a mutual information (MI) network. A gene expression profile is constructed from the expression values of a given gene across all the samples. Then the expression profiles of every possible pair of genes are compared to calculate an MI value. The MI value tells us how well the profile of one gene lets us predict the profile of the other. If two expression profiles are highly correlated or anti-correlated the MI value will be high, whereas if the two profiles are close to independent the pair will get an MI value closer to 0 (Margolin et al. 2006). \[ MI(\{x_{i}\}, \{y_{i}\}) = \frac{1}{M} \sum_{i=1}^{M} \log\left(\frac{f(x_{i}, y_{i})}{f(x_{i})f(y_{i})}\right) \] Here $M$ is the number of samples, $x_{i}$ and $y_{i}$ are the expression values of the two genes in sample $i$, and $f$ denotes the estimated joint or marginal probability density.
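To make the MI expression above concrete, here is a minimal sketch that evaluates it for two expression profiles. As a deliberate simplification, the densities $f$ are estimated with an equal-width histogram, which is not the density estimator used by ARACNE; the function name and the number of bins are likewise assumptions.

```python
import math
from collections import Counter

def mutual_information(x, y, bins=4):
    """Plug-in MI estimate (in nats) from equal-width binned profiles."""
    def discretize(values):
        low, high = min(values), max(values)
        width = (high - low) / bins or 1.0  # constant profile: single bin
        return [min(int((v - low) / width), bins - 1) for v in values]

    binned_x, binned_y = discretize(x), discretize(y)
    m = len(x)
    joint = Counter(zip(binned_x, binned_y))
    marginal_x, marginal_y = Counter(binned_x), Counter(binned_y)
    # (1/M) * sum_i log(f(x_i, y_i) / (f(x_i) f(y_i))), grouped by bin pair.
    return sum(
        (n / m) * math.log((n / m) / ((marginal_x[a] / m) * (marginal_y[b] / m)))
        for (a, b), n in joint.items()
    )

profile = [0, 1, 2, 3, 4, 5, 6, 7]
print(mutual_information(profile, profile))         # identical profiles
print(mutual_information(profile, profile[::-1]))   # anti-correlated profiles
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1], bins=2))  # independent
```

Both the correlated and the anti-correlated pair receive the same high value, while the independent pair receives zero, which is the behavior the paragraph above describes.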

Using this expression we can obtain an MI value for every possible pair of genes, and represent this value as an interaction (an edge) in a network. Finally, to construct a network we filter out interactions below a certain threshold, leaving the most significant interactions, i.e. genes that strongly influence each other's expression. This type of analysis can be especially insightful when a set of healthy reference samples is available to build a "control" network and compare it to the network derived from the case samples. Espinal-Enríquez and collaborators used this approach to study TCGA breast cancer samples, reporting a drastic change in topology when comparing the breast cancer and normal tissue networks (Espinal-Enríquez et al. 2017).
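The thresholding step can be sketched in a few lines. Here the threshold is expressed as a number of top-ranked edges, one of several possible choices; the gene names, MI values and function names are hypothetical.

```python
def top_edges(mi_values, k):
    """Keep the k gene pairs with the highest MI values.

    mi_values: dict mapping (gene_a, gene_b) -> MI value.
    """
    ranked = sorted(mi_values.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

def network_genes(edges):
    """Set of genes (nodes) that take part in at least one retained edge."""
    return {gene for pair in edges for gene in pair}

# Hypothetical MI values for four gene pairs.
mi_values = {("G1", "G2"): 0.9, ("G1", "G3"): 0.1,
             ("G2", "G4"): 0.7, ("G3", "G4"): 0.2}
network = top_edges(mi_values, k=2)
print(sorted(network), sorted(network_genes(network)))
```

Genes that lose all their edges at a given threshold drop out of the network, so the node set grows as the threshold is relaxed.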

Figure 5. (A) MI network derived from healthy samples. (B) MI network derived from breast cancer samples (Espinal-Enríquez et al. 2017)

One interesting feature of this cancer network is that genes (or nodes) in each of the communities belong almost exclusively to a single chromosome. Further work by the group has described this loss of long range interactions, emphasizing not only that inter-chromosomal interactions are very scarce but also that intra-chromosomal interactions occur mainly between neighboring genes (Figure 6) (García-Cortés et al. 2018).

Figure 6. Circos plot representing the intra-chromosomal interactions (blue) and the inter-chromosomal interactions (orange) in healthy and breast cancer basal networks (A and D respectively). The plots to the right of the circos plots illustrate the relationship between genomic distance and MI values at a genomic and chromosomal level (García-Cortés et al. 2018)

This work builds upon that of García-Cortés et al. (2018), who built MI networks for each breast cancer molecular subtype (Basal, Luminal A, Luminal B and HER2). Using these networks and copy number data, the present work is concerned with one question:

Are genes which are significantly copy number mutated enriched in our cancer networks? If so, this could mean that many genes become important regulatory elements through the process of being affected by structural mutations.

Moreover some exploratory work was done concerning a second question:

Is there a relationship between the amount of copy number variation and the number of communities that we observe in the cancer networks?

This question is not answered in the current analysis; we merely present a set of analyses that we think could shed some light on the answer. For the answer to be meaningful in the context of breast cancer, additional research must be done to elucidate whether the copy number profile is a good approximation for discerning between samples with high and low amounts of SV's.

Main

1. Somatically Copy Number Mutated Genes Enrichment Analysis

Using GISTIC we obtained a list of significantly copy number mutated genes (SCNMG) for each subtype, choosing a significance threshold of $q = 2.8 \times 10^{-4}$ (at this threshold fewer than one false positive is expected in each list). Then we performed an enrichment analysis of each set of genes in its corresponding network (two-sided Fisher's exact test). To make sure that our observations were not just an artifact of the chosen MI threshold, we performed this analysis using 10 different thresholds. As a result we obtained a q value for each thresholded network of each subtype, which represents the probability of obtaining by chance alone the observed number of SCNMG in each network. The same analysis was also conducted with different GISTIC thresholds, yielding similar results (analysis not shown). Additionally, to make sure that the genes were not over-represented in our MI networks just because they happen to be important elements of a regulatory mechanism, we pooled together the SCNMG of all four subtypes and performed the same enrichment test in the healthy network (Figure 7).
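For readers unfamiliar with the test, the enrichment question reduces to a 2x2 contingency table: genes inside or outside the network, crossed with genes that are or are not SCNMG. The sketch below computes a two-sided Fisher's exact p value using only the standard library. It adopts the common convention of summing the probabilities of all tables that are no more likely than the observed one, performs no multiple testing correction, and uses invented counts; it is not the code used to produce Figure 7.

```python
from math import comb

def fisher_two_sided(in_net_scnmg, in_net_other, out_net_scnmg, out_net_other):
    """Two-sided Fisher's exact test p value for a 2x2 contingency table."""
    a, b, c, d = in_net_scnmg, in_net_other, out_net_scnmg, out_net_other
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k SCNMG among the row1 network genes.
        return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)

    p_observed = prob(a)
    lowest, highest = max(0, row1 - (total - col1)), min(row1, col1)
    # Sum over every table that is at most as probable as the observed one;
    # the small tolerance guards against floating point ties.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Hypothetical counts: 8 of 10 network genes are SCNMG, 1 of 6 other genes is.
print(fisher_two_sided(8, 2, 1, 5))
```

Repeating the call for each MI threshold and each subtype would give one p value per thresholded network, which would then need to be corrected for multiple testing to obtain q values.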

Figure 7. Enrichment analysis of significantly copy number mutated genes in mutual information networks for all four breast cancer subtypes and the healthy reference

The fact that q values get smaller as the networks get bigger is expected and indicative of a coherent analysis. As the network gets bigger the number of genes in it also grows, and consequently the intersection between the set of SCNMG and the genes in the network becomes bigger, pulling the q values down. That said, it is surprising to find that one out of the four subtypes, Basal, is not enriched in SCNMG when compared to the enrichment values observed in the healthy network. A possible explanation for this is that even though the Basal samples have the largest set of SCNMG under the chosen threshold, the SV's taking place within these samples do not cause the genes that they affect to strongly entangle their transcription. These hypothetical SV's could very well still be reshaping the transcriptional program, entangling the transcription of genes that are not directly affected by them. For example, a tandem duplication or a deletion could create two new insulated neighborhoods, correlating the transcription of genes within these neighborhoods, while not significantly altering the transcription of the genes that are amplified or deleted. On the other hand, entanglement caused directly by a structural variation affecting a group of genes could be a reason why we observe enrichment in the other subtypes. This enrichment, observable in three out of the four subtypes, stands as supporting evidence of the importance of SV's in the origin and progression of cancer.

It seems reasonable to think that the reason why two or more genes are getting coexpressed is because they got duplicated or deleted together. Following this line of thought it is interesting to compare the distribution of the length in tandem duplications and deletions with the distribution of the distance between interacting genes in the network (Figure 8).

Figure 8. Distribution of the range between interacting genes in all four networks, thresholded to the 11,332 top edges. Range equals the genomic distance between the start of the first gene and the end of the second gene, where the first gene is always located before the second gene.

Figure 9. Length distribution of tandem duplications (TD) and deletions (Del) in Y. Li et al. (2020) breast cancer samples (supplementary Figures 22 and 23).

We can observe that for all four subtypes most of the interactions happen between genes that are between $10^4$ and $10^5$ base pairs apart. The length distribution of tandem duplications (TD) and deletions has considerably more spread, but there is nevertheless a significant overlap between both distributions, which is consistent with the previous explanation. It is important to note that although the TD and deletion distributions appear to be multimodal, "individual patients tend to have a simpler—usually unimodal—distribution" (Y. Li et al. 2020). It is also worth noting that there is a considerable number of TD and deletions with a length around or below $10^4$, which cannot explain the entanglement of most interacting genes because most of these are more than $10^4$ base pairs apart. However, when we look at the networks with the SCNMG colored (red if they are significantly amplified and blue if deleted), we can observe that communities, which are made up of genes that belong to the same chromosome and are in close genomic proximity, have either exclusively deleted or exclusively amplified genes (Figure 10). This observation is consistent with the idea that a single SV is affecting multiple neighboring genes that consequently entangle their transcription.
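The range plotted in Figure 8 follows directly from the definition given in its caption. A minimal sketch, assuming each gene is described by a hypothetical `(chromosome, start, end)` tuple with coordinates in base pairs:

```python
def interaction_range(gene_a, gene_b):
    """Genomic range spanned by two interacting genes.

    Each gene is a (chromosome, start, end) tuple. Returns None for
    inter-chromosomal pairs, where a genomic distance is not defined.
    """
    if gene_a[0] != gene_b[0]:
        return None
    # The "first" gene is the one that starts earlier on the chromosome.
    first, second = sorted((gene_a, gene_b), key=lambda gene: gene[1])
    return second[2] - first[1]  # end of second gene minus start of first

# Hypothetical coordinates for two genes on the same chromosome.
print(interaction_range(("chr1", 1_000, 2_000), ("chr1", 50_000, 60_000)))
```

Applying this function to every retained edge and plotting the result on a logarithmic axis would give a distribution comparable to the SV length distributions in Figure 9.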

Figure 10. Subtype networks with nodes colored blue if the gene was found to be significantly deleted and red if significantly amplified. The 13k most significant interactions are shown. From left to right, the networks belong to the Basal, Her2, LumA and LumB subtypes.

2. Structural Variation and Transcriptional Isolation

With some evidence that SV's play an important role in breast cancer's regulatory program (at least for all subtypes except Basal), we can move on to ask whether SV's themselves are causing the transcriptional isolation phenomenon that we observe in our MI networks (the transformation of an MI network from a giant component with genes from all chromosomes into isolated chromosome communities). Naturally, the most we could aim to show is a relationship between two variables: the amount of SV's (SV degree) and transcriptional isolation. In this report we do not show this relationship; we only outline an analysis that we think would be interesting to carry out.

Interestingly when you look at the SCNA ASCAT data and perform a clustering procedure to group samples according to their copy number profile, samples from different subtypes are clustered together into the same groups (Figure 11). This might seem a little counterintuitive because it has been reported and it is generally understood that samples in the same subtype share similar traits; like genomic instability in the case of Basal and Her2 samples (Roszik et al. 2016, Table 1).

Figure 11. ASCAT copy number heatmap for all the TCGA breast cancer samples, classified into 7 different groups using hierarchical clustering.

Ultimately, it would be interesting to calculate MI networks for the different groups of samples with similar SCNA profiles and measure the number of independent communities they show for a given threshold (e.g. 11330 edges). Then, following our hypothesis, we would expect to see an increase in the number of isolated communities in the network as the degree of structural variation increases (Figures 12 and 13).
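One simple way to quantify isolation, sketched below, is to count the connected components of each thresholded network. Treating a "community" as a connected component is an assumption made here for illustration; a dedicated community detection algorithm could give different counts. The edge list is hypothetical.

```python
def count_components(edges):
    """Number of connected components among the genes touched by `edges`."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for gene_a, gene_b in edges:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components
    return len({find(node) for node in list(parent)})

# Hypothetical network: one three-gene component and one two-gene component.
print(count_components([("G1", "G2"), ("G2", "G3"), ("G4", "G5")]))
```

Computing this count for each copy number group at the same edge threshold would yield the kind of measurement sketched in Figures 12 and 13.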

Figure 12. Sketch graph of the networks and their measurements of number of CNV and the number of communities in each network.

Figure 13. Sketch graph of the relationship between number of CNV's and the number of communities in each network.

If this tendency is in fact observed, further analysis could support the hypothesis that structural variation is reshaping the genetic program by transcriptionally isolating chromosomes. Using the same networks, we could change the layout so that all the nodes of every community form a circle. Then we would take note of the edges, go back to the initial RNASeq data and calculate Pearson's correlation instead of MI, but only for the edges that are present in our network. Depending on whether the correlation turns out to be positive or negative, we could color the edges in the new layout, visualizing whether the interactions between the genes in each community are predominantly positive or negative. Following our hypothesis, we would expect edges in the networks with a high degree of structural variation to be predominantly positive, because these mutations are presumably accentuating the tendency of neighboring genes to be transcribed together (Figures 14 and 15). On the other hand, in the networks with a low or medium amount of CNV we would expect to see a combination of both positive and negative correlations between genes, because gene regulatory networks in healthy cells are supposed to have both activating and silencing mechanisms.
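The proposed edge coloring could be prototyped as follows. This is a sketch of the idea only: the expression profiles and gene names are invented, the function names are hypothetical, and profiles are assumed to be non-constant so that the correlation is defined.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

def count_edge_signs(edges, expression):
    """Count positively and negatively correlated edges of a network.

    edges: iterable of (gene_a, gene_b); expression: dict gene -> profile.
    Edges with a correlation of exactly zero are left uncounted.
    """
    signs = {"positive": 0, "negative": 0}
    for gene_a, gene_b in edges:
        r = pearson(expression[gene_a], expression[gene_b])
        if r > 0:
            signs["positive"] += 1
        elif r < 0:
            signs["negative"] += 1
    return signs

# Hypothetical expression profiles across four samples.
expression = {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [4, 3, 2, 1]}
print(count_edge_signs([("G1", "G2"), ("G1", "G3")], expression))
```

The two counts per network are exactly what a bar graph like the one sketched in Figure 15 would display.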

Figure 14. Sketch graph of the networks in a circular layout with edges colored based on whether the correlation is negative or positive.

Figure 15. Sketch bar graph of the amount of positive and negative edges in each network.

If these graphical predictions turn out to be accurate, our hypothesis that structural variation is causing transcriptional isolation would remain unfalsified. However, if these tendencies are not observed, our hypothesis would be proven incorrect, which would have interesting implications: structural variation alone could not account for this phenomenon, which would instead be caused by the malfunctioning of other regulatory mechanisms (Ghavi-Helm et al. 2019).

Discussion and Conclusions

SCNMG are enriched in the transcriptional regulatory networks of all subtypes except Basal. This finding indicates that for three out of the four molecular subtypes, the transcription of genes targeted by structural mutations is strongly correlated with the transcription of other genes. It is tempting to think that structural variation itself is what causes these genes to become important elements of the regulatory program, but further research is needed to confirm this. It is also tempting to think that we observe such an accentuated pattern of close range correlation because these mutations are usually long enough to encompass several neighboring genes, which in turn become transcriptionally entangled (García-Cortés et al. 2018). The distribution of TD and deletions reported by Y. Li et al. (2020) is compatible with this idea, because most of the interactions are less than $10^5$ base pairs apart and there is a significant portion of TD and deletions which are larger than $10^5$. However, there is also a large number of TD and deletions which are smaller than the average interaction in our network and therefore could not as easily explain the transcriptional entanglement. Further work is needed to elucidate whether structural variation is in fact responsible for the transcriptional entanglement of neighboring genes, and why the Basal subtype departs from the observed pattern of enrichment. One possible explanation, although not a fully satisfying one, is that genes which are targeted by structural mutations in Basal samples simply do not become transcriptionally entangled, although they may very well have other deregulation effects. However, there seems to be no apparent reason for the structural mutations that happen in the Basal samples to be inherently different from those in the other subtypes, that is, for Basal structural mutations not to cause entanglement.
We also found that significantly amplified genes almost exclusively interact with other genes that are also significantly amplified, forming amplified communities of genes that show up in our networks. The same was observed for significantly deleted genes. This observation is compatible with the hypothesis that neighboring genes are being transcriptionally entangled by structural mutations that encompass them; however, further research is needed to test this hypothesis.

Regarding the second question that this report addresses, it is interesting that samples belonging to the same subtype tend not to cluster together based on their copy number profile (Figure 10). This finding suggests that if MI networks were constructed for different groups of samples (grouped by hierarchical clustering or k-means based on their copy number profile), the topology of the resulting networks might be correlated with the degree to which the samples in each group are affected by copy number mutations. This analysis remains to be done.
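
The grouping step of the analysis proposed above could be sketched as follows. This is only an outline under stated assumptions: the input is taken to be a samples-by-genes matrix of log2 copy-number ratios, Ward linkage is one choice among several, and the function name `group_samples_by_cn_profile` and the 0.3 alteration threshold are hypothetical values chosen for illustration rather than parameters used in this report.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def group_samples_by_cn_profile(cn_matrix, n_groups,
                                alteration_threshold=0.3):
    """Cluster samples by copy-number profile and summarize each group.

    cn_matrix: samples x genes array of log2 copy-number ratios
               (assumed input format for this sketch).
    Returns the group label of every sample and, per group, the mean
    fraction of genes whose |log2 ratio| exceeds the threshold.
    """
    cn = np.asarray(cn_matrix, dtype=float)

    # Agglomerative clustering of samples (rows) with Ward linkage.
    tree = linkage(cn, method="ward")
    labels = fcluster(tree, t=n_groups, criterion="maxclust")

    # Per-sample burden: fraction of genes called altered.
    altered_fraction = (np.abs(cn) > alteration_threshold).mean(axis=1)

    burden = {
        int(group): float(altered_fraction[labels == group].mean())
        for group in np.unique(labels)
    }
    return labels, burden
```

An MI network would then be inferred separately for the samples of each group, and a topological summary of each network compared against the group's burden value. With only a handful of groups any such correlation would be weakly supported, so the number of groups and the minimum group size needed for network inference would have to be chosen with care.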

Many studies have shown that structural variation can play an important role in carcinogenesis and neoplastic development, and yet to our knowledge this is the first study that attempts to show the relevance of this type of mutation at a systems level. We find SCNMG to be overrepresented in three of the four subtype transcriptional regulatory networks, and we also show that samples from the same subtype do not necessarily have similar copy number profiles. Additionally, we propose an analysis to interrogate whether the degree of structural variation is associated with transcriptional isolation. Even if our hypothesis is not falsified, further work will be needed to support a causal relationship between structural variation and pervasive transcriptional deregulation.

References

Bradner, James E., Denes Hnisz, and Richard A. Young. 2017. “Transcriptional Addiction in Cancer.” Cell 168 (4). Elsevier: 629–43. doi:10.1016/j.cell.2016.12.013.

Danieli, Adi, and Argyris Papantonis. 2020. “Spatial Genome Architecture and the Emergence of Malignancy.” Human Molecular Genetics, July. doi:10.1093/hmg/ddaa128.

De, Subhajyoti, and M Madan Babu. 2010. “Genomic Neighbourhood and the Regulation of Gene Expression.” Current Opinion in Cell Biology 22 (3): 326–33. doi:10.1016/j.ceb.2010.04.004.

Dixon, Jesse R., Jie Xu, Vishnu Dileep, Ye Zhan, Fan Song, Victoria T. Le, Galip Gürkan Yardimci, et al. 2018. “Integrative Detection and Analysis of Structural Variation in Cancer Genomes.” Nature Genetics 50 (10): 1388–98. doi:10.1038/s41588-018-0195-8.

Espinal-Enríquez, Jesús, Cristóbal Fresno, Guillermo Anda-Jáuregui, and Enrique Hernández-Lemus. 2017. “RNA-Seq Based Genome-Wide Analysis Reveals Loss of Inter-Chromosomal Regulation in Breast Cancer.” Scientific Reports 7 (1): 1760. doi:10.1038/s41598-017-01314-1.

García-Cortés, Diana, Guillermo de Anda-Jáuregui, Cristóbal Fresno, Enrique Hernández-Lemus, and Jesús Espinal-Enríquez. 2018. “Loss of Trans Regulation in Breast Cancer Molecular Subtypes.” bioRxiv, January, 399253. doi:10.1101/399253.

Ghavi-Helm, Yad, Aleksander Jankowski, Sascha Meiers, Rebecca R. Viales, Jan O. Korbel, and Eileen E. M. Furlong. 2019. “Highly Rearranged Chromosomes Reveal Uncoupling Between Genome Topology and Gene Expression.” Nature Genetics 51 (8): 1272–82. doi:10.1038/s41588-019-0462-3.

Gierman, Hinco J., Mireille H. G. Indemans, Jan Koster, Sandra Goetze, Jurgen Seppen, Dirk Geerts, Roel van Driel, and Rogier Versteeg. 2007. “Domain-Wide Regulation of Gene Expression in the Human Genome.” Genome Research 17 (9). Cold Spring Harbor Laboratory Press: 1286–95. doi:10.1101/gr.6276007.

Li, Yilong, Nicola D. Roberts, Jeremiah A. Wala, Ofer Shapira, Steven E. Schumacher, Kiran Kumar, Ekta Khurana, et al. 2020. “Patterns of Somatic Structural Variation in Human Cancer Genomes.” Nature 578 (7793): 112–21. doi:10.1038/s41586-019-1913-9.

Liu, Biao, Carl D Morrison, Candace S Johnson, Donald L Trump, Maochun Qin, Jeffrey C Conroy, Jianmin Wang, and Song Liu. 2013. “Computational Methods for Detecting Copy Number Variations in Cancer Genome Using Next Generation Sequencing: Principles and Challenges.” Oncotarget 4 (11). Impact Journals, LLC: 1868–81. doi:10.18632/oncotarget.1537.

Lynch-Sutherland, Chiemi F., Aniruddha Chatterjee, Peter A. Stockwell, Michael R. Eccles, and Erin C. Macaulay. 2020. “Reawakening the Developmental Origins of Cancer Through Transposable Elements.” Frontiers in Oncology 10: 468. doi:10.3389/fonc.2020.00468.

Margolin, Adam A., Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera, and Andrea Califano. 2006. “ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context.” BMC Bioinformatics 7 (1): S7. doi:10.1186/1471-2105-7-S1-S7.

Mermel, Craig H., Steven E. Schumacher, Barbara Hill, Matthew L. Meyerson, Rameen Beroukhim, and Gad Getz. 2011. “GISTIC2.0 Facilitates Sensitive and Confident Localization of the Targets of Focal Somatic Copy-Number Alteration in Human Cancers.” Genome Biology 12 (4): R41. doi:10.1186/gb-2011-12-4-r41.

Olshen, A. B., E. S. Venkatraman, R. Lucito, and M. Wigler. 2004. “Circular Binary Segmentation for the Analysis of Array-Based DNA Copy Number Data.” Biostatistics 5 (4): 557–72. doi:10.1093/biostatistics/kxh008.

Peiffer, Daniel A., Jennie M. Le, Frank J. Steemers, Weihua Chang, Tony Jenniges, Francisco Garcia, Kirt Haden, et al. 2006. “High-Resolution Genomic Profiling of Chromosomal Aberrations Using Infinium Whole-Genome Genotyping.” Genome Research 16 (9): 1136–48. doi:10.1101/gr.5402306.

Redon, Richard, Shumpei Ishikawa, Karen R. Fitch, Lars Feuk, George H. Perry, T. Daniel Andrews, Heike Fiegler, et al. 2006. “Global Variation in Copy Number in the Human Genome.” Nature 444 (7118): 444–54. doi:10.1038/nature05329.

Roszik, Jason, Chang-Jiun Wu, Alan E. Siroy, Alexander J. Lazar, Michael A. Davies, Scott E. Woodman, and Lawrence N. Kwong. 2016. “Somatic Copy Number Alterations at Oncogenic Loci Show Diverse Correlations with Gene Expression.” Scientific Reports 6 (1): 19649. doi:10.1038/srep19649.

Tabak, Barbara, Gordon Saksena, Coyin Oh, Galen F. Gao, Barbara Hill Meyers, Michael Reich, Steven E. Schumacher, et al. 2019. “The Tangent Copy-Number Inference Pipeline for Cancer Genome Analyses.” bioRxiv. Cold Spring Harbor Laboratory. doi:10.1101/566505.

Van Loo, Peter, Silje H. Nordgard, Ole Christian Lingjærde, Hege G. Russnes, Inga H. Rye, Wei Sun, Victor J. Weigman, et al. 2010. “Allele-Specific Copy Number Analysis of Tumors.” Proceedings of the National Academy of Sciences 107 (39). National Academy of Sciences: 16910–5. doi:10.1073/pnas.1009843107.

Yang, Lixing, Lovelace J. Luquette, Nils Gehlenborg, Ruibin Xi, Psalm S. Haseley, Chih-Heng Hsieh, Chengsheng Zhang, et al. 2013. “Diverse Mechanisms of Somatic Structural Variations in Human Cancer Genomes.” Cell 153 (4). Elsevier: 919–29. doi:10.1016/j.cell.2013.04.010.