An interesting new study by researchers at the University of Portsmouth, UK, describes a mathematical method, based on information theory, for analyzing genome sequences. The method offers an alternative to clinical techniques and allows mutations to be detected and even predicted. In this way, it opens up new research opportunities in bioinformatics and genetics.
Although living organisms contain information encoded in genetic material, whether deoxyribonucleic or ribonucleic acid (DNA and RNA, respectively), there are many other means by which information can be stored and transmitted.
A preprint version of the study is available on the bioRxiv* server while the article undergoes peer review.
Information theory was first developed on a mathematical basis by Claude Shannon more than 70 years ago.
He described a method for measuring the information obtained by observing the occurrence of an event. In fact, this gave rise to modern computing. He also popularized the word “bit” for a unit of information, a term he credited to John W. Tukey.
Aside from information technologies, his theory laid the groundwork for advances in fields as diverse as computer science, cryptography, linguistics, physiology, biology and telecommunications.
Using information entropy
The new paper uses information theory to devise a new method by which mutations in genomic sequences can be traced and predicted. This is far from the first attempt to do so, as DNA sequences have been analyzed using methods based on information theory since the 1970s.
The approach used in this study focuses on the spectrum of information entropy (IE), which is created from genomic sequences, and examination of the dynamics of mutation. It is important to note that this approach is relevant to any sequence of any genome of any size.
The researchers used a program called GENIES (Genetics Information Spectrum Entropy), which was custom designed for this project and is now freely available to other scientists.
The core of the approach is to view the genome as a coding system in which functional regions such as exons, promoters and enhancers have unique patterns of information entropy. Mutations appear as changes in the sequence of these regions and therefore as distinctive alterations in the information entropy pattern.
This correspondence could be used to identify mutations purely from the standpoint of an information storage system, without the need to understand the physical and chemical aspects of these changes.
For example, any DNA sequence is written using the four nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T). Adjacent sequences form a chromosome, with unknown nucleotides represented by N. Each nucleotide can be represented by two bits, i.e., A = 00, C = 01, G = 10 and T = 11.
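The two-bit representation can be sketched in a few lines of Python. This is only an illustration of the encoding described in the text, not code from GENIES; the function names `encode` and `decode` are hypothetical, and skipping N is an assumption, since the article does not say how unknown bases are handled.

```python
# Two-bit codes exactly as listed in the text.
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(sequence: str) -> str:
    """Return the bit string for a DNA sequence.

    Assumption: unknown bases (N) have no two-bit code and are skipped.
    """
    return "".join(TWO_BIT[base] for base in sequence.upper() if base != "N")

def decode(bits: str) -> str:
    """Turn a bit string produced by encode() back into nucleotides."""
    inverse = {code: base for base, code in TWO_BIT.items()}
    return "".join(inverse[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(encode("GATTACA"))  # 10001111000100
```

Because four symbols fit exactly into two bits, a sequence of n known nucleotides occupies 2n bits under this scheme.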
The distribution of individual symbols and its information entropy can be calculated with Shannon's entropy formula. Correlations between symbols are captured by block information entropies, which are computed over groups of adjacent symbols. Block information entropies have been used in many studies of genomic information content, though not to detect mutations.
The researchers used a three-nucleotide codon frame, in which each codon is treated as a single symbol (block size m = 3). The probability of each codon was estimated mathematically for defined stretches of the genomic sequence. This gave the maximum entropy of the studied sequence.
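As a rough sketch of the quantity involved, the Shannon entropy of a stretch of sequence is H = -Σ p(s) log2 p(s), where p(s) is the observed frequency of block s. The Python below estimates it for blocks of m nucleotides. It assumes non-overlapping blocks read from the first position, which the article does not specify, and `block_entropy` is a hypothetical name rather than a GENIES function.

```python
from collections import Counter
from math import log2

def block_entropy(sequence: str, m: int = 3) -> float:
    """Shannon entropy, in bits per block, of a sequence split into
    non-overlapping blocks of m nucleotides (an assumed framing)."""
    blocks = [sequence[i:i + m] for i in range(0, len(sequence) - m + 1, m)]
    total = len(blocks)
    # p * log2(1/p) with p = n / total, summed over the distinct blocks.
    return sum((n / total) * log2(total / n) for n in Counter(blocks).values())

print(block_entropy("ATGATGATG"))     # 0.0, one codon repeated
print(block_entropy("ATGCCCGGGTTT"))  # 2.0, four equally frequent codons
```

With m = 1 the same function gives the single-nucleotide entropy, which is at most 2 bits for a four-letter alphabet.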
Entropy changes with mutation
When a mutation occurs, the entropy value of the affected stretch of sequence changes. A difference between the original and altered entropies that is not zero, or a ratio of original to altered entropy that is not 1, therefore indicates a possible mutation in the genomic subset of interest.
Based on this concept, the researchers applied the IE spectrum method across the whole genome.
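A toy example may make the test concrete. The sketch below compares the block entropy of a short reference stretch with a copy carrying one substitution and reports the difference and the ratio of original to altered entropy. The sequences are invented, the non-overlapping codon framing is an assumption, and the handling of a zero altered entropy is a choice made here for illustration only.

```python
from collections import Counter
from math import log2

def block_entropy(sequence: str, m: int = 3) -> float:
    """Shannon entropy (bits) over non-overlapping blocks of m nucleotides."""
    blocks = [sequence[i:i + m] for i in range(0, len(sequence) - m + 1, m)]
    total = len(blocks)
    return sum((n / total) * log2(total / n) for n in Counter(blocks).values())

def entropy_change(reference: str, variant: str, m: int = 3):
    """Return (difference, ratio) of the entropies of two equal-length stretches."""
    h_ref = block_entropy(reference, m)
    h_var = block_entropy(variant, m)
    if h_var == 0:
        # Assumed convention: both zero counts as "no change" (ratio 1).
        ratio = 1.0 if h_ref == 0 else float("inf")
    else:
        ratio = h_ref / h_var  # original over altered, as in the text
    return h_var - h_ref, ratio

reference = "ATGATGCCCCCC"
variant = "ATGATCCCCCCC"  # invented single substitution, G to C
difference, ratio = entropy_change(reference, variant)
print(difference, ratio)  # difference is 0.5; ratio is about 0.667
```

A non-zero difference flags the stretch, but a zero difference does not by itself rule out a mutation, because a substitution can leave the block frequencies, and hence the entropy, unchanged.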
The genome was first divided into subsets called windows. A window contains a defined number of characters, called its window size, which corresponds to the length of the subset of the genome described above.
Continuing throughout the genome, the window moved from one end to the other one “step” at a time, the size of each step corresponding to a fixed number of characters. The step size is between 1 and the window size.
The genome length, window size and step size together determine the number of windows, which is rounded to the nearest integer.
The IE value is calculated window by window and is represented by location within the genome. This provides the IE spectrum of the genome: “a numerical representation of genetically encoded information within a given genome.”
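The sliding-window procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the GENIES implementation: `ie_spectrum` is a hypothetical name, blocks inside each window are taken as non-overlapping, and only full windows are kept, so the window count here is floor((N - WS) / step) + 1 rather than a rounded value.

```python
from collections import Counter
from math import log2

def block_entropy(sequence: str, m: int = 3) -> float:
    """Shannon entropy (bits) over non-overlapping blocks of m nucleotides."""
    blocks = [sequence[i:i + m] for i in range(0, len(sequence) - m + 1, m)]
    total = len(blocks)
    return sum((n / total) * log2(total / n) for n in Counter(blocks).values())

def ie_spectrum(genome: str, window_size: int, step: int = 1, m: int = 3):
    """Return a list of (window start, IE value) pairs along the genome."""
    if not 1 <= step <= window_size:
        raise ValueError("step size must lie between 1 and the window size")
    starts = range(0, len(genome) - window_size + 1, step)  # full windows only
    return [(s, block_entropy(genome[s:s + window_size], m)) for s in starts]

toy_genome = "ATGATGATGCCCGGGTTT"  # invented 18-nucleotide example
spectrum = ie_spectrum(toy_genome, window_size=9, step=3)
for start, value in spectrum:
    print(start, round(value, 3))
```

Plotting the second element of each pair against the first gives the kind of position-indexed entropy profile the article calls the IE spectrum.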
This algorithm conveniently allows IE spectrum information to be used in other ways. More research is being done to determine the best window and step sizes, which probably vary with the type of information needed. In practice, the method requires GENIES or another fully automated program.
Mutations detected in SARS-CoV-2
By way of illustration, the researchers examined the SARS-CoV-2 reference genome using the IE spectrum method. They found that, with a step size of 1, larger window sizes increased the average IE value of the spectrum. The maximum IE value closely tracks the theoretical maximum up to a window size (WS) of 33.
This point may therefore mark the upper limit of the optimal window size: the window should be large enough for IE changes to yield useful information, but should not exceed 33.
Other researchers have reported an earlier and less detailed version of this method. Even so, they obtained valuable information by detecting repetitive sequences, which helped trace the differences that arose between organisms as they evolved. The current method should add to the utility of this tool.
When the method was applied to the SARS-CoV-2 reference sequence and a randomly chosen variant, the researchers found that the IE spectrum, at various window and step sizes, detected six of the seven mutations identified by direct nucleotide-level comparison.
However, when they tried block sizes m smaller than 3, the number of nucleotides in a codon, they found that m = 2 detected all seven mutations while also capturing possible correlations between nucleotides. In addition, this result was independent of the window and step sizes.
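The effect of block size can be seen on a toy pair of sequences. The sketch below flags every window whose entropy differs between a reference and a variant, which is equivalent to the ratio of the two spectra differing from 1, and runs it with m = 3 and m = 2. The sequences are invented, the function names are hypothetical, and the example does not reproduce the study's SARS-CoV-2 result; it only shows that one substitution can be invisible at one block size and visible at another.

```python
from collections import Counter
from math import isclose, log2

def block_entropy(sequence: str, m: int) -> float:
    """Shannon entropy (bits) over non-overlapping blocks of m nucleotides."""
    blocks = [sequence[i:i + m] for i in range(0, len(sequence) - m + 1, m)]
    total = len(blocks)
    return sum((n / total) * log2(total / n) for n in Counter(blocks).values())

def flagged_windows(reference: str, variant: str, window_size: int, step: int, m: int):
    """Return start positions of windows whose entropy differs between sequences."""
    if len(reference) != len(variant):
        # Substitutions only; the article says indels are left to later work.
        raise ValueError("sequences must have equal length")
    flagged = []
    for start in range(0, len(reference) - window_size + 1, step):
        h_ref = block_entropy(reference[start:start + window_size], m)
        h_var = block_entropy(variant[start:start + window_size], m)
        if not isclose(h_ref, h_var, abs_tol=1e-12):
            flagged.append(start)
    return flagged

reference = "ATGCCCATGCCC"
variant = "ATCCCCATGCCC"  # invented single substitution at index 2
print(flagged_windows(reference, variant, window_size=6, step=6, m=3))  # []
print(flagged_windows(reference, variant, window_size=6, step=6, m=2))  # [0]
```

Here the substitution swaps one codon for another with the same frequency, so the codon-level entropy is unchanged, whereas at m = 2 it merges two distinct blocks into one and lowers the entropy.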
“Our study indicates that the best block size m is 2 and the optimal window size should contain more than 9 and less than 33 nucleotides.”
What are the implications?
The study reports the first program based on information theory that detects single-point mutations using the ratio of IE spectra. Subsequent work should also help identify indel mutations, and other algorithms and equations may further improve mutation detection.
However, the technique may prove most valuable in its inverse application: examining the points where mutations are known to have occurred. This would make it possible to relate special features of the IE spectrum to the location of mutations in the genome and to predict possible future mutations.
*Important notice
bioRxiv publishes preliminary scientific reports that are not peer-reviewed and therefore should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.