Computational biology

From Wikipedia, the free encyclopedia

Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, ecological, behavioral, and social systems.[1] The field is broadly defined and includes foundations in biology, applied mathematics, statistics, biochemistry, chemistry, biophysics, molecular biology, genetics, genomics, computer science, ecology, and evolution,[2] but is most commonly thought of as the intersection of computer science, biology, and big data.

Computational biology is different from biological computing, which is a subfield of computer engineering using bioengineering and biology to build computers.

Introduction

This timeline displays the year-by-year progress of the now famous Human Genome Project in the context of genetics and genomics as a whole since 1865. The project began in 1990, and by 1995 the first bacterial genome, that of H. influenzae, had been mapped. Four years later, in 1999, chromosome 22 became the first human chromosome to be completely sequenced. Three years after that, complete draft genomes of the mouse, rat, and rice were all finished. Finally, in the following year a complete draft of the human genome was completed, satisfying the initial goals of the project, though work continues to this day.

Computational biology, which includes many aspects of bioinformatics and much more, is the science of using biological data to develop algorithms or models in order to understand biological systems and relationships. Until recently, biologists did not have access to very large amounts of data to analyze. This has changed in the past few decades, allowing researchers to develop analytical methods for interpreting massive amounts of biological information and to share them quickly among colleagues.[3] These methods have now become commonplace, particularly in molecular biology and genomics.

Bioinformatics began to develop in the early 1970s. At the time it was regarded as the science of analyzing the informatic processes of biological systems. Research in artificial intelligence was then using network models of the human brain in order to generate new algorithms. This use of biological data to develop other fields pushed biological researchers to revisit the idea of using computers to evaluate and compare large data sets. By 1982, information was being shared among researchers through the use of punch cards, and the amount of data being shared began to grow exponentially by the end of the 1980s. This required the development of new computational methods in order to quickly analyze and interpret relevant information.[3]

Perhaps the best known example of computational biology, the Human Genome Project, began officially in 1990 and was technically complete by 2003.[4] By that time, about 85% of the human genome had been mapped, which satisfied the goals set out at the beginning.[5] Work continued, however, and by May 2021 the "complete genome" level was reached, with only 0.3% of the bases remaining covered by potential issues.[6][7] The missing Y chromosome was added in January 2022.

Since the late 1990s, computational biology has become an important part of developing emerging technologies for the field of biology, leading to the development of numerous subfields.[8] Today, the International Society for Computational Biology (ISCB) recognizes 21 different Communities of Special Interest (COSIs), each of which represents a slice of the larger field of computational biology.[9] In addition to helping sequence the human genome, computational biology has helped and continues to help create accurate models of the human brain, map the 3D structure of genomes, and assist in modeling biological systems.[3]

Applications

Anatomy

Computational anatomy is a discipline focusing on the study of anatomical shape and form at the visible or gross anatomical scale of morphology. It involves the development and application of computational, mathematical and data-analytical methods for modeling and simulation of biological structures. It focuses on the anatomical structures being imaged, rather than the medical imaging devices. Due to the availability of dense 3D measurements via technologies such as magnetic resonance imaging (MRI), computational anatomy has emerged as a subfield of medical imaging and bioengineering for extracting anatomical coordinate systems at the morphome scale in 3D.

The original formulation of computational anatomy is as a generative model of shape and form from exemplars acted upon via transformations.[10] The diffeomorphism group is used to study different coordinate systems via coordinate transformations as generated via the Lagrangian and Eulerian velocities of flow from one anatomical configuration into another. It is related to shape statistics and morphometrics, with the distinction that diffeomorphisms are used to map coordinate systems, whose study is known as diffeomorphometry.

Bioinformatics

Bioinformatics is one of the most common subfields of computational biology. It involves developing databases and other methods of storing, retrieving, and analyzing biological data through various mathematical and computational algorithms. Usually, this process involves genetics and the analysis of genes. Bioinformatics combines mathematics and a variety of computing languages to ease the storage and analysis of biological data. The gathering and analysis of large datasets have made way for growing research fields such as data mining.[11]

Biomodeling

Computational biomodeling is a field concerned with building computer models of biological systems. It aims to develop and use visual simulations in order to assess the complexity of biological systems, accomplished through the use of specialized algorithms and visualization software. These models allow for prediction of how systems will react under different environments, which is useful for determining whether a system is robust. Robust biological systems are those that "maintain their state and functions against external and internal perturbations",[12] which is essential for a biological system to survive. Computational biomodeling generates a large archive of such data, allowing for analysis by multiple users. While current techniques focus on small biological systems, researchers are working on approaches that will allow larger networks to be analyzed and modeled. A majority of researchers believe that this will be essential in developing modern medical approaches to creating new drugs and gene therapy.[12] A useful modelling approach is to use Petri nets via tools such as esyN, as sketched below.[13]
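To illustrate the Petri net approach, the sketch below (in Python) implements the basic token-game mechanics: places hold tokens, and a transition fires when all of its input places are sufficiently marked. The two-step gene → mRNA → protein net is a hypothetical example for illustration only, not a model taken from esyN or the cited sources.

  # Minimal Petri net sketch: each transition maps input-place token requirements
  # to output-place token increments.
  places = {"gene": 1, "mRNA": 0, "protein": 0}
  transitions = {
      "transcription": ({"gene": 1}, {"gene": 1, "mRNA": 1}),   # (inputs, outputs)
      "translation":   ({"mRNA": 1}, {"mRNA": 1, "protein": 1}),
  }

  def enabled(name):
      inputs, _ = transitions[name]
      return all(places[p] >= n for p, n in inputs.items())

  def fire(name):
      inputs, outputs = transitions[name]
      for p, n in inputs.items():
          places[p] -= n
      for p, n in outputs.items():
          places[p] += n

  # Run a few synchronous steps of the token game and print the marking.
  for step in range(3):
      for name in transitions:
          if enabled(name):
              fire(name)
      print(step, places)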

Ecology

Computational methods in ecology have seen increasing interest. Until recent decades, theoretical ecology dealt largely with analytic models that were detached from the statistical models used by empirical ecologists. However, computational methods have aided in developing ecological theory via simulation of ecological systems, in addition to increasing application of methods from computational statistics in ecological analyses.

Evolutionary biology

Computational biology has assisted the field of evolutionary biology in many capacities, including using DNA data to reconstruct the tree of life with computational phylogenetics, fitting population genetics models to DNA data in order to make inferences about demographic or selective history, and building population genetics models of evolutionary systems from first principles in order to predict what is likely to evolve.[14]

Gene Ontology

Understanding how individual genes contribute to the biology of an organism at the molecular, cellular, and organism levels is an important subfield of computational biology. The Gene Ontology (GO) Consortium's mission is to develop an up-to-date, comprehensive, computational model of biological systems, from the molecular level to larger pathways, cellular, and organism-level systems. The Gene Ontology resource provides a computational representation of our current scientific knowledge about the functions of genes (or, more properly, the protein and non-coding RNA molecules produced by genes) from many different organisms, from humans to bacteria.[15]

Genomics

A partially sequenced genome.

Computational genomics is the field within genomics that studies the genomes of cells and organisms. It is sometimes referred to as computational and statistical genetics and encompasses much of bioinformatics. The Human Genome Project is one example of computational genomics. The project aimed to sequence the entire human genome into a set of data, which could ultimately allow doctors to analyze the genome of an individual patient.[16] This opens the possibility of personalized medicine, prescribing treatments based on an individual's pre-existing genetic patterns. The project has inspired many similar efforts; researchers are looking to sequence the genomes of animals, plants, bacteria, and all other types of life.[17]

One of the main ways that genomes are compared is by sequence homology. Homology is the study of biological structures and nucleotide sequences in different organisms that come from a common ancestor. Research suggests that between 80 and 90% of genes in newly sequenced prokaryotic genomes can be identified this way.[17]

This field is still in development. A largely unexplored problem in computational genomics is the analysis of intergenic regions, which studies suggest make up roughly 97% of the human genome.[17] Researchers in computational genomics are working on understanding the functions of non-coding regions of the human genome through the development of computational and statistical methods and via large consortia projects such as ENCODE (The Encyclopedia of DNA Elements) and the Roadmap Epigenomics Project.

3D genomics

Figure 1 - Heat-map of the Jaccard similarity index matrix for two given nuclear profiles.

3D genomics is a subfield of computational biology that focuses on the organization and interaction of genes within a eukaryotic cell. One method used to gather 3D genomic data is Genome Architecture Mapping (GAM). GAM measures 3D distances of chromatin and DNA in the genome by combining cryosectioning, the process of cutting a thin slice through the nucleus, with laser microdissection. A nuclear profile is simply one such slice taken from the nucleus. Each nuclear profile contains genomic windows, which are particular sequences of nucleotides, the base units of DNA. GAM captures a genome-wide network of complex, multi-enhancer chromatin contacts throughout a cell.[18]

Figure 2 - Radar chart comparing percentage of features in each cluster.

Using computational biology and data science techniques, biological information can be gathered, analyzed, and visualized from 3D genomic GAM data. Patterns can be identified in the individual loci that appear in the slices from the nucleus, including compaction and position measurements. Other examples include calculating Jaccard similarity statistics for clustering, comparing features across the clusters, and detecting linkage between genomic windows in the network. Highlighted below are some examples of how computational biology can be used with GAM data, based on data for the Hist1 region of mouse chromosome 13.

One example of integrating computational biology with data from the GAM method is calculating and displaying the genome network's normalized Jaccard similarity matrix. The Jaccard index is a statistic used to measure the similarity between two sets; here it can be used to compare how similar two nuclear profiles are, where two nuclear profiles are similar if they detect similar genomic windows. Using GAM binary data of nuclear profiles and genomic windows, the Jaccard index of every pair of nuclear profiles can be calculated as

J(A, B) = |A ∩ B| / |A ∪ B|,

where A and B are the sets of genomic windows detected in the two profiles.

The Jaccard similarity index can be normalized and then displayed using a symmetrical heat-map. Figure 1 is an example of a heat-map of the Jaccard similarity index matrix from a mouse's genome using data retrieved from the GAM method. Red represents two nuclear profiles that are similar to each other, while dark blue indicates that two nuclear profiles are very different. The heat-map is symmetric along the diagonal. The red diagonal indicates that each nuclear profile compared with itself is identical and has a Jaccard similarity index of 1.
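A minimal sketch of this computation is shown below, assuming only a binary windows-by-profiles detection table; the toy array detections stands in for real GAM data, and the min-max normalization and colour map are illustrative choices rather than the exact ones used to produce Figure 1.

  import numpy as np
  import matplotlib.pyplot as plt

  # Toy detection table: rows = genomic windows, columns = nuclear profiles,
  # True where the window was detected in that profile.
  rng = np.random.default_rng(0)
  detections = rng.integers(0, 2, size=(90, 40)).astype(bool)

  n_profiles = detections.shape[1]
  jaccard = np.zeros((n_profiles, n_profiles))
  for i in range(n_profiles):
      for j in range(n_profiles):
          a, b = detections[:, i], detections[:, j]
          union = np.logical_or(a, b).sum()
          jaccard[i, j] = np.logical_and(a, b).sum() / union if union else 0.0

  # Min-max normalize and draw a symmetric heat-map in the spirit of Figure 1.
  normalized = (jaccard - jaccard.min()) / (jaccard.max() - jaccard.min())
  plt.imshow(normalized, cmap="jet")
  plt.colorbar(label="normalized Jaccard similarity")
  plt.xlabel("nuclear profile")
  plt.ylabel("nuclear profile")
  plt.show()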

The generated Jaccard similarity index matrix of the nuclear profiles can then be used to identify clusters within the dataset. In this case, clusters are defined as distinct groups of nuclear profiles that share many similar genomic windows. One algorithm for clustering the nuclear profiles is k-means clustering, an unsupervised learning algorithm discussed in more detail in the section below. The Jaccard similarity index matrix can be used as a distance measure between the points (nuclear profiles) in the network. Suppose in this example that, using k-means clustering with the Jaccard similarity index matrix, three distinct clusters of nuclear profiles are found.

Figure 3 - Heat-map of the normalized linkage values for two given genomic windows.

One can compute the percentage of genomic windows in a nuclear profile that fall within a certain gene or feature using a feature table. A feature table is a binary table listing the genomic windows that typically occur within a given feature, or gene. Figure 2 is a radar chart showing the average (across all nuclear profiles in a cluster) of the percentage of genomic windows that contain a certain gene. In this radar chart, 15 genes are compared across the three clusters, allowing visualization of the general trends of each cluster. For example, cluster 2 has a high percentage of nuclear profiles with high similarity to LAD and Vmn features, but low similarity to Hist1 and CTCF-7BWU features, whereas cluster 1 has nuclear profiles with low similarity to LAD and Vmn features but high similarity to Hist1 and CTCF-7BWU features. In this example, computational biology provides visualization of, and insight into, the 3D structure of each nuclear profile.
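One plausible reading of this computation is sketched below: for each nuclear profile, the fraction of its detected windows that fall inside each feature's windows is computed and then averaged within each cluster, giving the quantities plotted in the radar chart. The tables, feature names, and cluster labels here are toy stand-ins rather than the actual Hist1 dataset.

  import numpy as np
  import pandas as pd

  # Toy stand-ins: a windows-by-profiles detection table, a windows-by-features
  # "feature table" (1 if a window lies in that feature), and one cluster label
  # per nuclear profile (in practice taken from the k-medoids step).
  rng = np.random.default_rng(1)
  detections = pd.DataFrame(rng.integers(0, 2, size=(90, 40)))
  features = pd.DataFrame(rng.integers(0, 2, size=(90, 3)),
                          columns=["Hist1", "LAD", "Vmn"])
  clusters = pd.Series(rng.integers(0, 3, size=40), name="cluster")

  # For each profile and feature: the fraction of the profile's detected windows
  # that fall inside windows belonging to that feature.
  counts = detections.T.values @ features.values              # profiles x features
  per_profile = pd.DataFrame(counts / detections.sum(axis=0).values[:, None],
                             columns=features.columns)

  # Average within each cluster, as plotted in the radar chart of Figure 2.
  print(per_profile.groupby(clusters).mean())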

Lastly, using computational biology, all of the genomic windows detected within the corresponding region can be compared and visualized using a heat-map. This comparison is known as the linkage between two genomic windows on a DNA strand. The detection frequency of a genomic window (fA) can be computed by dividing the number of nuclear profiles in which A is detected by the total number of nuclear profiles. Likewise, the co-segregation of a pair of genomic windows (fAB) can be computed by dividing the number of nuclear profiles in which both A and B are detected by the total number of nuclear profiles. The linkage is then the co-segregation of A and B minus the product of their individual detection frequencies (Linkage = fAB − fA·fB), from which the normalized linkage can be computed.[19]
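The sketch below computes detection frequencies, co-segregation, and linkage for every pair of windows from a binary windows-by-profiles table. The final normalization by the maximum attainable linkage, by analogy with the D′ statistic used for linkage disequilibrium, is one common choice and is an assumption here rather than a formula quoted from the cited study.

  import numpy as np

  # Toy detection table: rows = genomic windows, columns = nuclear profiles.
  rng = np.random.default_rng(2)
  detections = rng.integers(0, 2, size=(81, 40)).astype(float)

  n_np = detections.shape[1]
  f = detections.sum(axis=1) / n_np            # detection frequency f_A of each window
  co_seg = (detections @ detections.T) / n_np  # co-segregation f_AB for every pair

  # Linkage D = f_AB - f_A * f_B: positive when two windows co-segregate more
  # often than expected by chance.
  linkage = co_seg - np.outer(f, f)

  # Normalize by the maximum |D| attainable given f_A and f_B (analogous to D').
  fa, fb = np.meshgrid(f, f, indexing="ij")
  d_max = np.where(linkage >= 0,
                   np.minimum(fa * (1 - fb), (1 - fa) * fb),
                   np.minimum(fa * fb, (1 - fa) * (1 - fb)))
  with np.errstate(divide="ignore", invalid="ignore"):
      normalized_linkage = np.where(d_max > 0, linkage / d_max, 0.0)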

A heat-map (Figure 3) is used to visualize the normalized linkage between pairs of windows on the DNA strand. The heat-map representation of the normalized linkage table is symmetric along the bolded red diagonal line. The red diagonal indicates that the normalized linkage of a window with itself is 1, the highest possible linkage value, because in that case fAB = fA = fB. Throughout the heat-map there are a few blue squares, indicating that the normalized linkage value is negative for those pairs of windows. There is also one white horizontal and vertical line, indicating that the linkage is 0 at those locations: no nuclear profile detects window #45, so the linkage of any window with window #45 is 0.

Along the bold red diagonal there are many pockets (triangles and squares) of strong red linkage, because those windows are physically located near each other on the chromosome. There are also red pockets away from the diagonal, reflecting the fact that DNA is three-dimensional and can bend: windows within the chromatin can interact with other windows that are far apart from each other on the linear genome. For example, because of looping, the two ends of the chromatin can interact with each other, as shown by the red and pink squares in the top right corner of the heat-map, indicating linkage and chromatin interactions between windows far apart on the DNA strand.

Mathematical biology

Mathematical biology (also known as biomathematics or mathematical and theoretical biology) is a subfield of computational biology that uses mathematical models, analyses, and representations of living organisms to examine the systems that govern structure, development, and behavior within biological systems. It relies on a more theoretical approach and on analysis to solve problems, rather than on experiments to prove theories as in its experimental biology counterpart.[20] Areas of mathematics used in mathematical biology research include discrete mathematics, topology (also useful for computational modeling), Bayesian statistics (such as for biostatistics), linear algebra, logic, Boolean algebra, and other areas of higher mathematics.[21]

Neuropsychiatry

Computational neuropsychiatry is an emerging field that uses mathematical and computer-assisted modeling of the brain mechanisms involved in mental disorders. Several initiatives have demonstrated that computational modeling is an important contribution to understanding the neuronal circuits that could generate mental functions and dysfunctions.[22][23][24]

Neuroscience

Computational neuroscience is the study of brain function in terms of the information processing properties of the structures that make up the nervous system. It is a subset of the field of neuroscience and aims to analyze brain data to create practical applications.[25] It models the brain in order to examine specific aspects of the neurological system. Types of models of the brain include:

  • Realistic Brain Models: These models look to represent every aspect of the brain, including as much detail at the cellular level as possible. Realistic models provide the most information about the brain, but also have the largest margin for error. More variables in a brain model create the possibility for more error to occur. These models do not account for parts of the cellular structure that scientists do not know about. Realistic brain models are the most computationally heavy and the most expensive to implement.[26]
  • Simplifying Brain Models: These models look to limit the scope of a model in order to assess a specific physical property of the neurological system. This allows for the intensive computational problems to be solved, and reduces the amount of potential error from a realistic brain model.[26]

It is the work of computational neuroscientists to improve the algorithms and data structures currently used to increase the speed of such calculations.

Oncology

Computational oncology, sometimes also called cancer computational biology, is a field that aims to determine future mutations in cancer through an algorithmic approach to analyzing data. Research in this field has led to the use of high-throughput measurement, which allows millions of data points to be gathered using robotics and other sensing devices. These data are collected from DNA, RNA, and other biological structures. Areas of focus include determining the characteristics of tumors, analyzing molecules that are deterministic in causing cancer, and understanding how the human genome relates to the causation of tumors and cancer.[27][28]

Pharmacology

Computational pharmacology (from a computational biology perspective) is "the study of the effects of genomic data to find links between specific genotypes and diseases and then screening drug data".[29] The pharmaceutical industry requires a shift in methods to analyze drug data. Pharmacologists were long able to use Microsoft Excel to compare chemical and genomic data related to the effectiveness of drugs, but the industry has reached what is referred to as the Excel barricade, arising from the limited number of cells accessible on a spreadsheet. This limitation drove the need for computational pharmacology: scientists and researchers develop computational methods to analyze these massive data sets, allowing efficient comparison between the notable data points and more accurate drugs to be developed.[30]

Analysts project that if major medications fail due to patent expiration, computational biology will be necessary to replace the current drugs on the market. Doctoral students in computational biology are being encouraged to pursue careers in industry rather than take post-doctoral positions, a direct result of major pharmaceutical companies needing more qualified analysts of the large data sets required for producing new drugs.[30]

Sequence alignment

Sequence alignment is the process of comparing and detecting similarities between biological sequences or genes. Which "similarities" are detected depends on the goals of the particular alignment process; one such goal may be computing the longest common subsequence of two genes (see the sketch below). Sequence alignment is useful in a number of bioinformatics applications, such as comparing variants of certain diseases.[31]
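As a small illustration of the longest-common-subsequence idea, the dynamic-programming sketch below computes the LCS length of two short sequences. Practical alignment tools rely instead on scoring schemes and gap penalties (for example Needleman-Wunsch, Smith-Waterman, or BLAST).

  # Length of the longest common subsequence of two sequences, via the standard
  # dynamic-programming recurrence over a (len(a)+1) x (len(b)+1) table.
  def longest_common_subsequence(a: str, b: str) -> int:
      dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i, x in enumerate(a, start=1):
          for j, y in enumerate(b, start=1):
              dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
      return dp[len(a)][len(b)]

  print(longest_common_subsequence("GATTACA", "GCATGCU"))  # 4, e.g. "GATC"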

Systems biology

Systems biology consists of computing the interactions between various biological systems, ranging from the cellular level to entire populations, with the goal of discovering emergent properties. This process usually involves networks of cell signaling and metabolic pathways. Systems biology often uses computational techniques from biological modeling and graph theory to study these complex interactions at the cellular level.[32]

Techniques, algorithms, and software

Computational biologists use a wide range of software, ranging from command-line programs to graphical and web-based programs.

Unsupervised learning

Unsupervised learning is a type of algorithm that finds patterns in unlabeled data. One example is k-means clustering, which aims to partition n data points into k clusters, in which each data point belongs to the cluster with the nearest mean (cluster center or centroid). A variation of this algorithm is the k-medoids algorithm, which differs from k-means in that the center of each cluster (the medoid) is always one of the actual data points in the set rather than the average of the cluster.

Figure 1: Heat-map of Jaccard Distances of nuclear profiles

The k-medoids algorithm can be described through these steps:

  1. Randomly select k distinct data points. These are the initial cluster medoids.
  2. Measure the distance between each point and each of the k medoids.
  3. Assign each point to the nearest medoid.
  4. Recompute the medoid of each cluster (the member point minimizing total distance to the other members).
  5. Repeat steps 2-4 until the clusters no longer change.
  6. Assess the quality of the clustering by adding up the variation within each cluster.
  7. Repeat the process with different values of k.
  8. Pick the best value of k by finding the "elbow" in the plot of within-cluster variation against k.
    Figure 2: Heat-maps of 3 clusters of nuclear profiles

One example of this in biology is the 3D mapping of a genome. GAM data for the HIST1 region of mouse chromosome 13 record which nuclear profiles show up in which genomic regions. From this information, a normalized distance between all the loci can be calculated using the Jaccard distance. Figure 1 on the right visualizes the loci and their normalized distances to each other. This information can then be clustered using k-medoids clustering: the data points are the loci in the HIST1 region, and the distance between them comes from the Jaccard distance index formula. Different values of k can be tried, repeating the process listed above, to find the best clustering of the data.

The result of using three clusters is shown in Figure 2 on the right. The y-axis shows the index of the loci used, and the x-axis shows the genomic windows, or the regions the loci fall in. Looking at the heat-maps in Figure 2, information can be gathered on the different clusters. For example, in cluster 1 most nuclear profiles have genomic windows that show up on the edges but not in the middle; because more genomic windows show up on the edges, this might indicate that most of these nuclear profiles represent HIST1 genes. Cluster 2 shows a different trend, with more genomic windows showing up in the middle; this more closely resembles LAD features, indicating that the nuclear profiles in cluster 2 correspond to LADs. Cluster 3 shows no clear pattern of nuclear profiles, which makes it hard to draw conclusions. In biology it is not unexpected to see no correlation at times; due to data-gathering methods and noise in the data, the clustering may sometimes not be so clear.
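A minimal k-medoids sketch operating on a precomputed distance matrix is given below. The toy Euclidean distances stand in for the Jaccard distance matrix described above, and the update rule (each medoid becomes the cluster member minimizing total distance to the other members) is the standard one, not code from the study in question.

  import numpy as np

  # Toy distance matrix; in the GAM example this would be 1 - normalized Jaccard
  # similarity between loci or nuclear profiles.
  rng = np.random.default_rng(3)
  points = rng.random((40, 2))
  dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

  def k_medoids(dist, k, n_iter=100, seed=0):
      rng = np.random.default_rng(seed)
      n = dist.shape[0]
      medoids = rng.choice(n, size=k, replace=False)
      for _ in range(n_iter):
          labels = np.argmin(dist[:, medoids], axis=1)   # assign to nearest medoid
          new_medoids = medoids.copy()
          for c in range(k):
              members = np.where(labels == c)[0]
              if len(members) == 0:
                  continue
              # New medoid: the member minimizing total distance to the others.
              within = dist[np.ix_(members, members)].sum(axis=1)
              new_medoids[c] = members[np.argmin(within)]
          if np.array_equal(new_medoids, medoids):
              break
          medoids = new_medoids
      labels = np.argmin(dist[:, medoids], axis=1)
      return labels, medoids

  labels, medoids = k_medoids(dist, k=3)

  # Total within-cluster distance, used to compare different k values ("elbow" plot).
  cost = dist[np.arange(len(labels)), medoids[labels]].sum()
  print(labels, cost)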

Graph analytics

Graph analytics, or network analysis, is the study of graphs that represent connections between different objects. Graphs can represent all kinds of networks in biology, such as protein-protein interaction (PPI) networks, gene regulatory networks (GRNs), and metabolic and biochemical networks. There are many ways to analyze these networks; one is to look at centrality in graphs. Centrality analysis assigns rankings to nodes based on their popularity or importance in the graph, which can be useful for finding the most important nodes. For example, given data on the activity of genes over a given time period, degree centrality can be used to see which genes are most active throughout the network and which interact with others the most, helping to clarify the roles certain genes play in the network.

There are many ways to calculate centrality in graphs, each of which can give different kinds of information. Centrality analysis in biology can be applied in many circumstances, including gene regulatory, protein interaction, and metabolic networks.[33]

One centrality measure is degree centrality, which ranks the nodes by how many other nodes are linked directly to them. The degree can be used to evaluate the likelihood of a node being reached by something flowing through the network, like a virus or information; it can loosely be thought of as the "mode", or most common, of the nodes. A study by Hahn and Kern in 2005 used degree centrality to show that the mean centrality value of essential proteins is significantly higher than that of nonessential proteins.[33] A study of metabolic networks by Fell and Wagner in 2000 discussed the possibility that metabolites with the highest degree (the highest number of connections) may belong to the oldest part of the metabolism. However, the degree of a vertex alone is not a sufficient centrality measure to distinguish lethal proteins clearly from viable ones.[33] Degree centrality is a fundamental way of calculating centrality, but other centrality calculations may be needed for the most accurate results.

Another way to calculate centrality is to rank the nodes by the average length of the shortest paths between that node and every other node in the graph. This is called closeness centrality, and it captures the idea that if a node is central in a graph, it must be close to all the other nodes; loosely, it can be thought of as the "mean" or average of the nodes. Closeness-based centrality has been used in different studies; for example, Wuchty and Stadler (2003) applied it to different biological networks and showed the correspondence with the service facility location problem.[33] Closeness centrality does have drawbacks: because the distance between vertices is only defined for pairwise strongly connected vertices, this centrality can only be applied to strongly connected networks,[33] so closeness centrality alone may not be the best option for networks that are not densely connected.

A third type of centrality is betweenness centrality, which quantifies the number of times a node acts as a bridge along the shortest path between two other nodes; it can loosely be thought of as the median or middle of the ranking of all the nodes. This centrality has been used in different types of biological networks, including protein interaction networks, where one study found that proteins with a high betweenness value but a low degree centrality value are important because they support modularization of the network.[33] Another study of protein interaction networks found that "proteins with high betweenness control the flow of information across a network" and that betweenness is more strongly correlated with evolutionary rate than the other measures of centrality in all three networks examined.[34] Betweenness centrality can thus reveal useful biological information and has much potential in the future.

A fourth measure is eigenvector centrality (also called eigencentrality), which measures the influence of a node in a network; it can be thought of as indicating which nodes are visited most often across many random walks. A study comparing different centralities on a protein interaction network and a transcriptional regulation network found that some nodes appeared highly connected under eigenvector centrality compared with the other centralities on those networks; the other centralities that were applied varied in how highly correlated particular nodes were across the networks.[35]

Overall, there are many ways to calculate centrality, some of which may perform better than others depending on the dataset used and the information sought. The figure at the top right shows the difference between these four centralities on a single graph. The graph on the top left shows degree centrality: the main nodes are clustered around many different areas throughout the graph. The graph on the top right shows closeness centrality: the nodes in the middle are the most central, because the graph is relatively locally connected (nodes connected to each other tend to be close to each other), which makes the middle the most central when moving from a node to every other node in the graph. The graph on the bottom left shows betweenness centrality; there is no clear cluster of highly ranked nodes under this measure. The graph on the bottom right shows eigenvector centrality, with a clear section of the graph indicating which group of nodes is the most influential: if a message were passed through the graph, the nodes in this region would spread it the quickest. Taken together, these comparisons suggest that extracting the most information from a network requires performing several different centrality calculations.
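The sketch below computes the four centrality measures discussed above using the NetworkX library on a small hypothetical interaction network; in a real analysis the edge list would come from a protein-protein interaction or gene regulatory dataset.

  import networkx as nx

  # Hypothetical undirected interaction network.
  edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
           ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G")]
  g = nx.Graph(edges)

  # The four centrality measures discussed above.
  centralities = {
      "degree": nx.degree_centrality(g),
      "closeness": nx.closeness_centrality(g),
      "betweenness": nx.betweenness_centrality(g),
      "eigenvector": nx.eigenvector_centrality(g, max_iter=1000),
  }

  # Report the top-ranked node under each measure.
  for name, scores in centralities.items():
      top = max(scores, key=scores.get)
      print(f"{name:12s} top node: {top} ({scores[top]:.3f})")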

Supervised learning

Supervised learning is a type of algorithm that learns from labeled data and then assigns labels to new, unlabeled data. In biology, supervised learning is helpful when there are data whose categories are already known and the goal is to assign new data to those categories.

A common supervised learning algorithm is the random forest, which uses numerous decision trees to train a model to classify a dataset. Forming the basis of the random forest, a decision tree is a structure that aims to classify, or label, some set of data using certain known features of that data. A practical biological example of this would be taking an individual's genetic data and predicting whether or not that individual is predisposed to develop a certain disease or cancer. At each internal node the algorithm checks the dataset for exactly one feature, a specific gene in the previous example, and then branches left or right based on the result. Then at each leaf node, the decision tree assigns a class label to the dataset. So in practice, the algorithm walks a specific root-to-leaf path based on the input dataset through the decision tree, which results in the classification of that dataset. Commonly, decision trees have target variables that take on discrete values, like yes/no, in which case it is referred to as a classification tree, but if the target variable is continuous then it is called a regression tree. To construct a decision tree, it must first be trained using a training set to identify which features are the best predictors of the target variable.

Diagram showing a simple random forest.

There are different versions of the algorithm, but most commonly the steps to constructing a random forest with a discrete target variable are as follows, given a training set X = x1, ..., xn with responses Y = y1, ..., yn:

For b = 1, ..., B:

  1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb. (This is referred to as bootstrap aggregating.)
  2. Train a classification tree fb on Xb, Yb.

Then, to classify a new sample, evaluate it on all B classification trees and take the majority vote as the final classification, as shown in the figure on the right.

This procedure leads to significantly better model performance than single decision trees because the variance of the model is decreased without increasing the bias. Put more simply, while single decision trees are highly sensitive to noise in their training set, the average of many uncorrelated trees is not. If no random sampling were used and many trees were instead trained on a single training set, the trees would be strongly correlated (or even identical, if the training algorithm is deterministic); bootstrap sampling de-correlates the trees by showing them different training sets.

Random forests may yield far greater accuracy than single decision trees, but they do lack interpretability. Single decision trees are highly interpretable, as following the root-to-leaf path which a decision tree takes on a given dataset is simple. This can allow developers to double check that the information the decision tree has learned is realistic. However, when constructing a random forest of hundreds, or sometimes thousands of decision trees, developers lose the ability to check the model themselves.
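A brief scikit-learn sketch of this workflow is shown below; the genotype matrix and the "predisposition" rule used to generate labels are synthetic stand-ins for curated genotype and phenotype data.

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic data: rows = individuals, columns = binary gene variants, label =
  # disease predisposition generated from an arbitrary rule for illustration.
  rng = np.random.default_rng(4)
  X = rng.integers(0, 2, size=(500, 30))
  y = (X[:, 0] & X[:, 5]) | X[:, 12]

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                      random_state=0)

  # Each tree is trained on a bootstrap sample (bagging) and considers a random
  # subset of features at each split; predictions are majority votes over trees.
  model = RandomForestClassifier(n_estimators=200, random_state=0)
  model.fit(X_train, y_train)
  print("test accuracy:", model.score(X_test, y_test))
  print("most informative variants:", np.argsort(model.feature_importances_)[::-1][:3])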

Open source software

Open source software provides a platform to develop computational biological methods. Specifically, open source means that every person and/or entity can access and benefit from software developed in research. PLOS cites[citation needed] four main reasons for the use of open source software including:

  • Reproducibility: This allows for researchers to use the exact methods used to calculate the relations between biological data.
  • Faster Development: developers and researchers do not have to reinvent existing code for minor tasks. Instead they can use pre-existing programs to save time on the development and implementation of larger projects.
  • Increased quality: Having input from multiple researchers studying the same topic provides a layer of assurance that errors will not be in the code.
  • Long-term availability: Open source programs are not tied to any businesses or patents. This allows for them to be posted to multiple web pages and ensure that they are available in the future.[36]

Conferences

There are several large conferences that are concerned with computational biology. Some notable examples are Intelligent Systems for Molecular Biology (ISMB), European Conference on Computational Biology (ECCB) and Research in Computational Molecular Biology (RECOMB).

Journals

There are numerous journals dedicated to computational biology. Some notable examples include the Journal of Computational Biology and PLOS Computational Biology. PLOS Computational Biology is a peer-reviewed open access journal that publishes many notable research projects in the field, along with reviews of software, tutorials for open source software, and information on upcoming computational biology conferences.[citation needed]

Related fields

Computational biology, bioinformatics and mathematical biology are all interdisciplinary approaches to the life sciences that draw from quantitative disciplines such as mathematics and information science. The NIH describes computational/mathematical biology as the use of computational/mathematical approaches to address theoretical and experimental questions in biology and, by contrast, bioinformatics as the application of information science to understand complex life-sciences data.[1]

Specifically, the NIH defines

Computational biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.[1]

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.[1]

While each field is distinct, there may be significant overlap at their interface,[1] so much so that to many, bioinformatics and computational biology are terms that are used interchangeably.

The terms computational biology and evolutionary computation have similar names but are not to be confused. Unlike computational biology, evolutionary computation is not concerned with modeling and analyzing biological data; rather, it creates algorithms based on the ideas of evolution across species. Sometimes referred to as genetic algorithms, these techniques can be applied to problems in computational biology. While evolutionary computation is not inherently a part of computational biology, computational evolutionary biology is a subfield of it.[37]

See also

References

  1. ^ a b c d e "NIH working definition of bioinformatics and computational biology" (PDF). Biomedical Information Science and Technology Initiative. 17 July 2000. Archived from the original (PDF) on 5 September 2012. Retrieved 18 August 2012.
  2. ^ "About the CCMB". Center for Computational Molecular Biology. Retrieved 18 August 2012.
  3. ^ a b c Hogeweg, Paulien (7 March 2011). "The Roots of Bioinformatics in Theoretical Biology". PLOS Computational Biology. 7 (3): e1002021. Bibcode:2011PLSCB...7E2021H. doi:10.1371/journal.pcbi.1002021. PMC 3068925. PMID 21483479.
  4. ^ "The Human Genome Project". The Human Genome Project. 22 December 2020. Retrieved 13 April 2022.
  5. ^ "Human Genome Project FAQ". Genome.gov. Retrieved 2022-04-20.
  6. ^ "T2T-CHM13v1.1 - Genome - Assembly - NCBI". www.ncbi.nlm.nih.gov. Retrieved 2022-04-20.
  7. ^ "Genome List - Genome - NCBI". www.ncbi.nlm.nih.gov. Retrieved 2022-04-20.
  8. ^ Bourne, Philip (2012). "Rise and Demise of Bioinformatics? Promise and Progress". PLOS Computational Biology. 8 (4): e1002487. Bibcode:2012PLSCB...8E2487O. doi:10.1371/journal.pcbi.1002487. PMC 3343106. PMID 22570600.
  9. ^ "COSI Information". www.iscb.org. Retrieved 2022-04-21.
  10. ^ Grenander, Ulf; Miller, Michael I. (1998-12-01). "Computational Anatomy: An Emerging Discipline". Q. Appl. Math. 56 (4): 617–694. doi:10.1090/qam/1668732.
  11. ^ "The Sub-fields of Computational Biology". Ninh Laboratory of Computational Biology. 2013-02-18. Retrieved 2022-04-18.
  12. ^ a b Kitano, Hiroaki (14 November 2002). "Computational systems biology". Nature. 420 (6912): 206–10. Bibcode:2002Natur.420..206K. doi:10.1038/nature01254. PMID 12432404. S2CID 4401115. ProQuest 204483859.
  13. ^ Favrin, Bean (2 September 2014). "esyN: Network Building, Sharing and Publishing". PLOS ONE. 9 (9): e106035. Bibcode:2014PLoSO...9j6035B. doi:10.1371/journal.pone.0106035. PMC 4152123. PMID 25181461.
  14. ^ Antonio Carvajal-Rodríguez (2012). "Simulation of Genes and Genomes Forward in Time". Current Genomics. 11 (1): 58–61. doi:10.2174/138920210790218007. PMC 2851118. PMID 20808525.
  15. ^ "Gene Ontology Resource". Gene Ontology Resource. Retrieved 2022-04-18.
  16. ^ "Genome Sequencing to the Rest of Us". Scientific American.
  17. ^ a b c Koonin, Eugene (6 March 2001). "Computational Genomics". Curr. Biol. 11 (5): 155–158. doi:10.1016/S0960-9822(01)00081-1. PMID 11267880. S2CID 17202180.
  18. ^ Beagrie, Robert A.; Scialdone, Antonio; Schueler, Markus; Kraemer, Dorothee C. A.; Chotalia, Mita; Xie, Sheila Q.; Barbieri, Mariano; de Santiago, Inês; Lavitas, Liron-Mark; Branco, Miguel R.; Fraser, James (March 2017). "Complex multi-enhancer contacts captured by genome architecture mapping". Nature. 543 (7646): 519–524. Bibcode:2017Natur.543..519B. doi:10.1038/nature21411. ISSN 1476-4687. PMC 5366070. PMID 28273065.
  19. ^ Beagrie, Robert A.; Scialdone, Antonio; Schueler, Markus; Kraemer, Dorothee C.A.; Chotalia, Mita; Xie, Sheila Q.; Barbieri, Mariano; de Santiago, Inês; Lavitas, Liron-Mark; Branco, Miguel R.; Fraser, James (2017-03-23). "Complex multi-enhancer contacts captured by Genome Architecture Mapping (GAM)". Nature. 543 (7646): 519–524. doi:10.1038/nature21411. ISSN 0028-0836. PMC 5366070. PMID 28273065.
  20. ^ "Mathematical Biology | Faculty of Science". www.ualberta.ca. Retrieved 2022-04-18.
  21. ^ "The Sub-fields of Computational Biology". Ninh Laboratory of Computational Biology. 2013-02-18. Retrieved 2022-04-18.
  22. ^ Dauvermann, Maria R.; Whalley, Heather C.; Schmidt, André; Lee, Graham L.; Romaniuk, Liana; Roberts, Neil; Johnstone, Eve C.; Lawrie, Stephen M.; Moorhead, Thomas W. J. (2014). "Computational Neuropsychiatry – Schizophrenia as a Cognitive Brain Network Disorder". Frontiers in Psychiatry. 5: 30. doi:10.3389/fpsyt.2014.00030. PMC 3971172. PMID 24723894.
  23. ^ Tretter, F.; Albus, M. (December 2007). "'Computational Neuropsychiatry' of Working Memory Disorders in Schizophrenia: The Network Connectivity in Prefrontal Cortex - Data and Models". Pharmacopsychiatry. 40 (S 1): S2–S16. doi:10.1055/S-2007-993139. S2CID 18574327.
  24. ^ Marin-Sanguino, A.; Mendoza, E. (2008). "Hybrid Modeling in Computational Neuropsychiatry". Pharmacopsychiatry. 41: S85–S88. doi:10.1055/s-2008-1081464. PMID 18756425.
  25. ^ "Computational Neuroscience | Neuroscience". www.bu.edu.
  26. ^ a b Sejnowski, Terrence; Christof Koch; Patricia S. Churchland (9 September 1988). "Computational Neuroscience". Science. 241 (4871): 1299–306. Bibcode:1988Sci...241.1299S. doi:10.1126/science.3045969. PMID 3045969.
  27. ^ Barbolosi, Dominique; Ciccolini, Joseph; Lacarelle, Bruno; Barlesi, Fabrice; Andre, Nicolas (2016). "Computational oncology--mathematical modelling of drug regimens for precision medicine". Nature Reviews Clinical Oncology. 13 (4): 242–254. doi:10.1038/nrclinonc.2015.204. PMID 26598946. S2CID 22492353.
  28. ^ Yakhini, Zohar (2011). "Cancer Computational Biology". BMC Bioinformatics. 12: 120. doi:10.1186/1471-2105-12-120. PMC 3111371. PMID 21521513.
  29. ^ Price, Michael (2012-04-13). "Computational Biologists: The Next Pharma Scientists?".
  30. ^ a b Jessen, Walter (2012-04-15). "Pharma's shifting strategy means more jobs for computational biologists".
  31. ^ "Sequence Alignment - an overview | ScienceDirect Topics". www.sciencedirect.com. Retrieved 2022-04-18.
  32. ^ "The Sub-fields of Computational Biology". Ninh Laboratory of Computational Biology. 2013-02-18. Retrieved 2022-04-18.
  33. ^ a b c d e Koschützki, Dirk; Schreiber, Falk (2008-05-15). "Centrality Analysis Methods for Biological Networks and Their Application to Gene Regulatory Networks". Gene Regulation and Systems Biology. 2: 193–201. doi:10.4137/grsb.s702. ISSN 1177-6250. PMC 2733090. PMID 19787083.
  34. ^ a b "Validate User". academic.oup.com. Retrieved 2022-04-21.
  35. ^ "Download Limit Exceeded". citeseerx.ist.psu.edu. Retrieved 2022-04-21.
  36. ^ Prlić, Andreas; Lapp, Hilmar (2012). "The PLOS Computational Biology Software Section". PLOS Computational Biology. 8 (11): e1002799. Bibcode:2012PLSCB...8E2799P. doi:10.1371/journal.pcbi.1002799. PMC 3510099.
  37. ^ Foster, James (June 2001). "Evolutionary Computation". Nature Reviews Genetics. 2 (6): 428–436. doi:10.1038/35076523. PMID 11389459. S2CID 205017006.

External links