Summary: This course is a short series of lectures on Statistical Bioinformatics. Topics covered are listed in the Table of Contents. The notes were prepared by Ewa Paszek, Lukasz Wita and Marek Kimmel. The development of this course was supported by NSF grant 0203396.
A central goal of molecular biology is to understand the regulation of protein synthesis and its reactions to external and internal signals. All the cells in an organism carry the same genomic data, yet their protein makeup can differ drastically, both temporally and spatially, due to regulation. Protein synthesis is regulated by many mechanisms at its different stages. These include mechanisms for controlling transcription initiation, RNA splicing, mRNA transport, translation initiation, post-translational modifications, and degradation of mRNA/protein. One of the main junctions at which regulation occurs is mRNA transcription. A major role in this machinery is played by proteins themselves, which bind to regulatory regions along the DNA and greatly affect the transcription of the genes they regulate. In recent years, technical breakthroughs in spotting hybridization probes and advances in genome sequencing efforts have led to the development of DNA microarrays, which consist of many species of probes, either oligonucleotides or cDNA, that are immobilized in a predefined organization on a solid phase. By using DNA microarrays, researchers are now able to measure the abundance of thousands of mRNA targets simultaneously (DeRisi et al., 1997; Lockhart et al., 1996; Wen et al., 1998). Unlike classical experiments, where the expression levels of only a few genes were reported, DNA microarray experiments can measure the expression of all the genes of an organism, providing a “genomic” viewpoint on gene expression. As a consequence, this technology facilitates new experimental approaches for understanding gene expression and regulation (Iyer et al., 1999; Spellman et al., 1998).
A central focus of genomic research concerns understanding the manner in which cells execute and control the enormous number of operations required for their function. Biological systems behave in an exceedingly parallel and extraordinarily integrated fashion. Feedback and damping are routine even for the most common activities. Thus, in this area of genomic biology, single-gene perspectives are becoming increasingly limited for gaining insight into biological processes. Network approaches are becoming increasingly important for understanding how genes and molecules collectively form a biological system, and for harnessing this understanding in educated intervention to correct human diseases. Such approaches inevitably require computational and formal methods to process massive amounts of data, understand the general principles governing the system under study, and make useful predictions about system behavior under known conditions. There is a rather wide spectrum of approaches for modeling gene regulatory networks, each with its own assumptions, data requirements, and goals. The most popular models include Boolean networks, probabilistic Boolean networks, and Bayesian networks.
The Boolean network model, introduced by Kauffman (Kauffman, 1969, 1974; Kauffman and Glass, 1973) and recently developed by Shmulevich (Shmulevich, 2002), has received the most attention, not only from the biology community, but also in physics. In this model, gene expression is quantized to only two levels: ON and OFF. The expression level (state) of each gene is functionally related to the expression states of some other genes, using logical rules. A Boolean network G(V,F) is defined by a set of nodes corresponding to genes V = {x1, . . . , xn} and a list of Boolean functions F = (f1, . . . , fn). The state of a node (gene) at time t+1 is completely determined by the values of some other nodes at time t by means of the underlying Boolean functions. The model is represented in the form of a directed graph. Each xi represents the state (expression) of gene i, where xi=1 means that gene i is expressed and xi=0 means that it is not expressed. The list of Boolean functions F represents the rules of regulatory interactions between genes. That is, any given gene transforms its inputs (regulatory factors that bind to it) into an output, which is the state or expression of the gene itself. If ki denotes the number of input genes of the function fi, the maximum connectivity of a Boolean network is defined as K = max over i of ki. All genes are assumed to update synchronously in accordance with the functions assigned to them, and this process is then repeated. The artificial synchrony simplifies computation while preserving the qualitative, generic properties of global network dynamics (Kauffman, 1993; Huang, 1999; Wuensche, 1998).
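The synchronous update rule described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited papers: the three-gene network, the `Rule` representation, and the function names `step` and `max_connectivity` are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[int, ...]
# A rule pairs the indices of a gene's inputs with its Boolean function.
Rule = Tuple[Sequence[int], Callable[..., int]]

def step(state: State, rules: Sequence[Rule]) -> State:
    """Synchronous update: every gene reads the state at time t,
    and all genes switch to their new values together at time t+1."""
    return tuple(int(f(*(state[j] for j in inputs))) for inputs, f in rules)

def max_connectivity(rules: Sequence[Rule]) -> int:
    """K = the largest number of inputs used by any single function."""
    return max(len(inputs) for inputs, _ in rules)

# Hypothetical three-gene network, for illustration only (indices are 0-based).
rules = [
    ((1, 2), lambda b, c: b and c),  # gene 1 <- gene 2 AND gene 3
    ((0,),   lambda a: not a),       # gene 2 <- NOT gene 1
    ((0, 1), lambda a, b: a or b),   # gene 3 <- gene 1 OR gene 2
]

state = (1, 0, 1)
for t in range(6):
    print(t, state)
    state = step(state, rules)
```

Because `step` builds the whole new tuple from the old one, no gene ever sees a partially updated state, which is what "synchronous" means here.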
An example is presented below. Consider a Boolean network consisting of 5 genes {x1, . . . , x5} with the corresponding Boolean functions given by the truth tables shown in Figure 1. The maximum connectivity is K=3, although we allow some input variables to be duplicated, which effectively reduces the connectivity. The dynamics of this Boolean network are shown in Figure 2. Since there are 5 genes, there are 2^5 = 32 possible states that the network can be in. Each state is represented by a circle, and the arrows between states show the transitions of the network according to the functions in Figure 1. It is easy to see that, because of the inherent deterministic directionality of Boolean networks and the finite number of possible states, every trajectory must eventually return to a state it has already visited and from then on repeat the same cycle of states. Such recurring states are commonly called attractors.
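[Figure 1: Truth tables of the Boolean functions for the five-gene network.]
[Figure 2: State transition diagram of the five-gene Boolean network.]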
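The state-space picture can be explored by brute force, since a five-gene network has only 32 states. The sketch below follows every state until it repeats and records the cycle it falls into; states that are revisited forever are commonly called attractors. The truth tables of Figure 1 are not reproduced in these notes, so the update rules in `step` are hypothetical stand-ins (with at most three inputs per gene), and `find_attractors` is a name chosen for this example.

```python
from itertools import product

def step(s):
    """Hypothetical update rules for 5 genes; NOT the truth tables of Figure 1."""
    x1, x2, x3, x4, x5 = s
    return (
        x2 & x3,        # gene 1 <- gene 2 AND gene 3
        x1 | x4,        # gene 2 <- gene 1 OR gene 4
        1 - x5,         # gene 3 <- NOT gene 5
        x1 & x2 & x3,   # gene 4 <- gene 1 AND gene 2 AND gene 3
        x3 ^ x4,        # gene 5 <- gene 3 XOR gene 4
    )

def find_attractors(step, n):
    """Return {attractor cycle: number of states that end up in it}.
    The count includes the states of the cycle itself."""
    basins = {}
    for s in product((0, 1), repeat=n):
        seen, path = {}, []
        while s not in seen:          # walk until a state repeats
            seen[s] = len(path)
            path.append(s)
            s = step(s)
        cycle = path[seen[s]:]        # the repeating part of the walk
        k = cycle.index(min(cycle))   # rotate to a canonical starting state
        key = tuple(cycle[k:] + cycle[:k])
        basins[key] = basins.get(key, 0) + 1
    return basins

for cycle, size in find_attractors(step, 5).items():
    print("attractor of length", len(cycle), "reached from", size, "states")
```

The walk always terminates because the state space is finite and each state has exactly one successor, which is the argument made in the text.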
In the context of Boolean networks as models of genetic regulatory networks, there is no doubt that the binary approximation of gene expression is an oversimplification (Huang, 1999). However, even though most biological phenomena manifest themselves in the continuous domain, they are often described in a binary logical language such as ‘on and off,’ ‘upregulated and downregulated,’ and ‘responsive and nonresponsive.’ Several examples show that a Boolean formalism is meaningful in biology. Shmulevich and Zhang (2002) reasoned that if genes, when quantized to only two levels (1 or 0), were not informative in separating known sub-classes of tumors, then there would be little hope for Boolean modeling of realistic genetic networks based on gene expression data.
Fortunately, the results were very promising. By using binary gene expression data, generated via cDNA microarrays, and the Hamming distance as a similarity metric, a clear separation between different sub-types of gliomas as well as between different sarcomas was shown. This suggests that a good deal of meaningful biological information, to the extent that it is contained in the measured continuous-domain gene expression data, is retained when the data are binarized.
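To make the idea of comparing binarized profiles concrete, here is a small sketch. The notes do not say how the expression values were quantized, so thresholding each gene at its median across samples is an assumption made for this example, and the expression matrix is made-up toy data, not data from the cited study.

```python
from statistics import median

def binarize(expr):
    """expr: list of samples, each a list of expression values (one per gene).
    Assumed rule: a gene is 1 in a sample if it exceeds that gene's median."""
    n_genes = len(expr[0])
    thresholds = [median(sample[g] for sample in expr) for g in range(n_genes)]
    return [[int(sample[g] > thresholds[g]) for g in range(n_genes)]
            for sample in expr]

def hamming(a, b):
    """Number of genes whose binary state differs between two samples."""
    return sum(x != y for x, y in zip(a, b))

# Toy data: 4 samples x 4 genes (invented values, two obvious groups).
expr = [
    [5.1, 0.2, 3.3, 0.1],
    [4.8, 0.3, 3.0, 0.2],
    [0.4, 2.9, 0.5, 4.0],
    [0.6, 3.1, 0.2, 3.7],
]

bits = binarize(expr)
for i in range(len(bits)):
    print([hamming(bits[i], bits[j]) for j in range(len(bits))])
```

Samples from the same group end up with a small Hamming distance and samples from different groups with a large one, which is the kind of separation the text refers to.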
An example, borrowed from (Shmulevich et al., 2002), is presented below; it shows the logical representation of cell cycle regulation. This process of cellular growth and division is highly regulated. An imbalance in this process results in unregulated cell growth in diseases such as cancer. In order for cells to move from the G1 phase to the S phase, when the genetic material, DNA, is replicated for the daughter cells, a series of molecules such as cyclin E and cyclin dependent kinase 2 (cdk2) work together to phosphorylate the retinoblastoma (Rb) protein and inactivate it, thus releasing cells into the S phase. Cdk2/cyclin E is regulated by two switches: the positive switch complex called cdk activating kinase (CAK) and the negative switch p21/WAF1. The CAK complex can be composed of two gene products: cyclin H and cdk7. When cyclin H and cdk7 are present, the complex can activate cdk2/cyclin E. A negative regulator of cdk2/cyclin E is p21/WAF1, which in turn can be activated by p53. When p21/WAF1 binds to cdk2/cyclin E, the kinase complex is turned off (Gartel and Tyner, 1999). Further, p53 can inhibit cyclin H, a positive regulator of cyclin E/cdk2 (Schneider et al., 1998). This negative regulation is an important defensive system in the cells. For example, when cells are exposed to a mutagen, DNA damage occurs. It is to the benefit of cells to repair the damage before DNA replication so that the damaged genetic material is not passed on to the next generation. An extensive body of work has demonstrated that DNA damage triggers switches that turn on p53, which then turns on p21/WAF1. p21/WAF1 then inhibits cdk2/cyclin E, so Rb becomes activated and DNA synthesis stops. As an extra measure, p53 also inhibits cyclin H, thus turning off the switch that turns on cdk2/cyclin E. Such delicate genetic switch networks in the cells are the basis for cellular homeostasis, the ability of an organism to maintain equilibrium.
For purposes of illustration, let us consider a simplified diagram, shown in Figure 3, illustrating the effects of cdk7/cyclin H, cdk2/cyclin E, and p21/WAF1 on Rb. Thus, p53 and other known regulatory factors are not considered. While this diagram represents the above relationships from a pathway perspective, one may also represent the activity of Rb in terms of the other variables in a logic-based fashion. Figure 4 contains a logic circuit diagram of the activity of Rb (‘on’ or ‘off’) as a Boolean function of four input variables: cdk7, cyclin H, cyclin E, and p21/WAF1. Note that cdk2 is shown to be completely determined by the values of cdk7 and cyclin H using the AND operation; thus, cdk2 is not an independent input variable. Also, in Figure 3, p21/WAF1 is shown to have an inhibitory effect on the cdk2/cyclin E complex, which in turn regulates Rb, while in Figure 4 we see that, from a logic-based perspective, the value of p21/WAF1 works together with cdk2 and cyclin E to determine the value of Rb.
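[Figure 3: Simplified diagram of the effects of cdk7/cyclin H, cdk2/cyclin E, and p21/WAF1 on Rb.]
[Figure 4: Logic circuit diagram of the activity of Rb as a Boolean function of cdk7, cyclin H, cyclin E, and p21/WAF1.]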
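The logic-based view of Rb can be written out as a truth table. The text states only that cdk2 is the AND of cdk7 and cyclin H; the remaining gates in the sketch below are an assumption inferred from the prose (the cdk2/cyclin E complex inactivates Rb unless p21/WAF1 blocks it), since the circuit diagram itself is not reproduced here. The function names are chosen for this example.

```python
from itertools import product

def cdk2(cdk7, cyclin_h):
    """Stated in the text: cdk2 is determined by cdk7 AND cyclin H."""
    return cdk7 and cyclin_h

def rb(cdk7, cyclin_h, cyclin_e, p21):
    """Assumed reading of the circuit: Rb is 'on' (1) unless the
    cdk2/cyclin E kinase is active and not inhibited by p21/WAF1."""
    kinase_active = cdk2(cdk7, cyclin_h) and cyclin_e and not p21
    return int(not kinase_active)

# Print the full truth table over the four independent inputs.
print("cdk7 cycH cycE p21 | Rb")
for cdk7, cyc_h, cyc_e, p21 in product((0, 1), repeat=4):
    print(f"{cdk7:4d} {cyc_h:4d} {cyc_e:4d} {p21:3d} | {rb(cdk7, cyc_h, cyc_e, p21)}")
```

Under this reading, Rb is switched off for exactly one of the sixteen input combinations: cdk7, cyclin H, and cyclin E all present and p21/WAF1 absent.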