APPENDIX to paper:

Extraction of functional binding sites from unique regulatory regions:
The Drosophila early developmental enhancers

Dmitri A. Papatsenko, Vsevolod J. Makeev, Alex P. Lifanov, Mirelle Regnier, Anna G. Nazina and Claude Desplan.


1 Developmental enhancers: maps of binding site distribution.

1.1 Site motifs.

   Motifs for the 11 binding sites are shown. Footprint results were mapped on the DNA sequence for each binding site. Alignments shown on Figures a1-a11 were constructed on the basis of available published data [1-36] and with the help of the ClustalW algorithm (Megalign, DNA Star inc.). The Tailless motif was not used in the study due to the poor alignment; The Paired Homeodomain and Paired Paired domain motifs were not used due to the low number of available sites. The length of each motif was determined on the basis of its informational content.

1.2 Cutoff values.

   Figure a12 shows the behavior of the penalty function (P(cutoff)=Ln((H)*(H)/(H+FN))/Ln(FN+FP) for six binding sites: Bicoid, Krüppel, Hunchback, Knirps, Caudal and Ftz. The X Axis is the cutoff value for the matrix search. Peak values of the function for the six proteins and the corresponding cutoff values are shown in Table a1. These optimal cutoff values were used to build the preliminary consistent maps for the six transcription factors (shown in Tables a2 - a6, a9). Each preliminary map includes the data for just one transcription factor. The final consistent map for each enhancer region (see "refined and consistent maps") was constructed by merging the independently generated data for several transcription factors.
    The relevance of the penalty function used is demonstrated on the Krüppel example: Tables a8 - a10 show the comparison between three close peak values for Krüppel (numbered on Figure app12). Results of the initial search for Krüppel (20 best matches) are shown in Table a7. False-positives are shown in green, false-negatives are shown in blue, hits are shown in red.

1.3 Refined and Consistent maps.

Tables a11-a29 show the comparison between the "refined" and the "consistent " maps for the best known enhancers. Both types of maps represent only distribution of the most robust binding sites. Sites that were not available in sufficient numbers (Paired), produced an ambiguous alignments (Tailless, Even-skipped) or were detected in just one enhancer from our collection were not included in our maps. "Virtual" maps (possible distribution of known motifs) were built for the eve stripe 4+6 and the runt stripe 5 enhancers, for which the footprint data was not available. Sequence of each enhancer was verified with the Drosophila Genome Project; the borders of each region were set in according to published data.

2 Formulation of Scanseq algorithm.

2.1 Motif description.

Consensus: The most general mathematical definition of a "motif" is a set of allowed words. In the simplest case, a motif can be represented by "the consensus", for example in the form of standard IUPAC abbreviations. For instance, the motif containing words AGCTAT and AGCTGT can be described by the consensus AGCTRT. All words that can be derived from the consensus by a set of allowed operation (substitutions, insertions, deletions etc) define a consensus neighborhood. If only substitutions are allowed (the so-called Hamming distance) the consensus neighborhood can be found using counting with a mismatch scheme.

Positional weighted matrix: A more advanced method of motif representation is a positional weighted matrix (PWM). In this case, the chance of a substitution in a given position is weighted. Therefore, in the case of PWM, a threshold is required each time to define a motif. The popular way of defining PWM is a probabilistic model, in which one uses probabilities of observing a specific letter at a specific position of the motif. The likelihood of observing the word in this probabilistic model is compared to the chosen threshold to conclude if the word does or does not belong to the motif.. The most practical way of evaluating potential matches is to calculate the log-odds ratio of the resulting probability to the corresponding probability yielded by a random model [37], in this case called the "standard".

Hidden Markov Model: In the HMM representation [38, 39], the positions are no longer independent. Position probabilities are linked with the Markov model and the transitional matrices are estimated from the word set. In the case where there are no insertions or deletions allowed [40, 41], searching for matches with HMM can be performed with the help of log-odds ratios, essentially in the same way as with PWM. Currently we do not use an HMM representation in our approach.

2.2 Count of motif occurrence.

Counting in double stranded DNA To count motif occurrences in a sequence one needs to count all words in the sequence that belong to the motif. This can be done with several efficient algorithms (see [42] for the review). Not always however it is relevant to count the complementary DNA strands separately. This motivates a counting scheme with the motifs counted in both complementary strands. For example consider the following double stranded DNA sequence:

3'-ACAACACGT
5'-TGTTGTGCA

The number of occurrences of AC is four, three in the forward strand, and one in the reverse (strands are read in opposite directions). This number is equal to the number of occurrences in any of the strands of {AC, GT} motif, where GT is the word complementary to AC. Observe now, that the self-complementary word ACGT is found twice in the double-stranded sequence. In this case, the problem of counting of such self-overlapping motifs in double-stranded DNA can be reformulated as a problem of weighted counting in a single strand. To achieve this, the motif set is modified by addition of all complementary words. Any word H that has its complement in the initial set (particularly any palindromic word) is assigned weight w(H) = 2. Other words are assigned weight w(H) = 1.

Accounting for background distribution. It is insufficient to count simply the number of occurrences of some motif to infer whether it is "too abundant" in the sequence. If the input sequence exhibits strong composition biases the most frequent words found would be the words that contain the most frequent nucleotides. Therefore, the nucleotide frequencies (in the simplest case) must also be taken into account. A popular way to account for sequence composition [43] is to assess how often a motif found would appear in the random sequence with the same nucleotide content. In this case, Z-score (difference between the observed O and the expected E number of the motifs, normalized by the standard deviation V) is used to count the number of motif occurrences in the random sequence:
 
(1)

To calculate the Z-score, one needs to calculate the expected number of motif occurrences in the random sequence with a given nucleotide distribution, and estimate the variance for this quantity.

Counting with overlaps Calculation of the expectation and the variance of the number of motif occurrences in a random sequence is not a trivial problem, because one should take into account possible overlaps between words belonging to the same motif. Consider for instance the simplest case of a sequence with n=3 over a two-letter alphabet {AB} with equal probabilities of the letters. Inspection of the table below demonstrate that words AA and AB have the same expectation (0.5), but their variances are different:

Table a30. Overlap effect on expectation and variance.
 
 
AA
AB
AAA
2
0
AAB
1
1
ABA
0
1
ABB
0
1
BAA
1
0
BAB
0
1
BBA
0
0
BBB
0
0
Expectation
4/8=0.5
4/8=0.5
Variance
6/8-0.25=0.5
4/8-0.25=0.25

Obviously this happens because the motif AA can overlap itself. This is also true in a general case [44]: whereas the expectation of the motif that contains many overlapping words does not change, the variance is changed, sometimes substantially.

Poisson approximation. If the probability of observation for two overlapping motifs in the sequence is much smaller than that for observation of one motif, the Poisson approximation  can be implemented. The applicability of the Poisson approximation depends on the nature of a motif.

Calculation of variance and expectation. It was found that both the variance and the expectation for the number of occurrences of a motif (a set of words) can be exactly calculated analytically in a random Bernoulli type sequence using the generation function technique [45, 46]. Bernoulli type sequence must contain no correlation between probabilities of letter occurrences at neighboring positions. Let H be the motif of m-words . Let S be an n-letter sequence over the four-letter DNA alphabet with the vector of letter probabilities p. Then for each given word H, the probability will be, where i is the ith character of the word. For the entire motif one can calculate: .

The overlapping and correlation sets. It was observed that the variance depends on the structure of the words. Given two strings H and F, the overlapping is the H-suffixes that are F prefixes. Associated F suffixes form the correlation set AH,F. Each word in AH,F, when concatenated to the right of H creates an occurrence of F. For example, let H = AACAA, and F= AAAT, then AH,F = {AT, AAT} and AF,H is the empty set. The autocorrelation set AH,H={,CAA, ACAA} also contains the empty string  and denotes the associated probability.

Variance and expectation for the single strand counting. It was recently demonstrated [45] that the expectation of the number of occurrences of a set of patterns H in a random sequence of length n generated by a Bernoulli model, NH, is:
 
(2)

And the variance is:
 
(3)

Where

and  , but also can be analytically calculated.

Variance and expectation for the double strand counting In the case of counting at double strands, the word weights appear in formulae (2) and (3):

Let is the weighted number of occurrences of the weighted pattern H. Then its expectation writes:
 
(4)

and the linearity constant c1 for the expectation writes:
 
(5)

An example Let H contain all words H that differ from some fixed word h for no more than k mismatches. If h is palindrome, then any word in its k-neighborhood has its complementary word in this k-neighborhood; hence all the words in the k-neighborhood are assigned weight 2, and  while . In the case of a true random sequence, Z-score does not change as compared to the case of single letter counting. However, the Poisson approximation V~E obviously does not satisfy.

3 Parameters and predictions.

3.1 Training procedure.

Tables a31-a33 show the sorted results of individual training. For each combination of motif length and divergence in the range: mmin=7; mmax=16; kmax=7 (a total of 143 combinations were considered), the best statistical values CC and OQ were found and the best coverage c was calculated. . Then, the average of the values for the 10 enhancers for CC, OQ and PQ were calculated for each considered parameter combination. Finally, all parameter combinations (mmin; mmax; kmax) were sorted in descending order by ether average CC (Table a31), average OQ (Table a32) or average PQ (Table a33) values. Only the top 40 combinations are shown for the each statistics. One of the top combinations (mmin=7; mmax=9; kmax=1) that produced the best average scores in all three types of training was used as the best "default" parameters.

3.2 Predicted maps for 12 enhancers.

Z-score profile plots and maps of predictions are shown on Figures a13-a24 for 12 DNA regions. The plots show the maximal observed Z-scores (Y scale) for each position in sequence (X scale) within the selected parameter range (mmin; mmax; kmax; c). Panels A show the results of training on the group of 10. The results of individual training are shown in panels B. The predicted map is shown below each Z-score profile plot. The blue bars represent the most redundant segments, the Scanseq predictions, while the red bars represent the "consistent" maps. "Virtual maps" are shown for even-skipped stripe 4+6 and runt stripe5.

4 Software tools.


5 References.


[1] M. J. Shimell, A. J. Peterson, J. Burr, J. A. Simon, and M. B. O'Connor, "Functional analysis of repressor binding sites in the iab-2 regulatory region of the abdominal-A homeotic gene," Dev Biol, vol. 218, pp. 38-52., 2000.

[2] E. A. Wimmer, M. Simpson-Brose, S. M. Cohen, C. Desplan, and H. Jackle, "Trans- and cis-acting requirements for blastodermal expression of the head gap gene buttonhead," Mech Dev, vol. 53, pp. 235-45., 1995.

[3] E. A. Wimmer, S. M. Cohen, H. Jackle, and C. Desplan, "buttonhead does not contribute to a combinatorial code proposed for Drosophila head development," Development, vol. 124, pp. 1509-17., 1997.

[4] B. Florence, A. Guichet, A. Ephrussi, and A. Laughon, "Ftz-F1 is a cofactor in Ftz activation of the Drosophila engrailed gene," Development, vol. 124, pp. 839-47., 1997.

[5] J. A. Kassis, C. Desplan, D. K. Wright, and P. H. O'Farrell, "Evolutionary conservation of homeodomain-binding sites and other sequences upstream and within the major transcription unit of the Drosophila segmentation gene engrailed," Mol Cell Biol, vol. 9, pp. 4304-11., 1989.

[6] D. Stanojevic, S. Small, and M. Levine, "Regulation of a segmentation stripe by overlapping activators and repressors in the Drosophila embryo," Science, vol. 254, pp. 1385-7., 1991.

[7] S. Small, A. Blair, and M. Levine, "Regulation of even-skipped stripe 2 in the Drosophila embryo," Embo J, vol. 11, pp. 4047-57., 1992.

[8] D. N. Arnosti, S. Barolo, M. Levine, and S. Small, "The eve stripe 2 enhancer employs multiple modes of transcriptional synergy," Development, vol. 122, pp. 205-14., 1996.

[9] S. Small, A. Blair, and M. Levine, "Regulation of two pair-rule stripes by a single enhancer in the Drosophila embryo," Dev Biol, vol. 175, pp. 314-24., 1996.

[10] M. Fujioka, Y. Emi-Sarker, G. L. Yusibova, T. Goto, and J. B. Jaynes, "Analysis of an even-skipped rescue transgene reveals both composite and discrete neuronal and early blastoderm enhancers, and multi-stripe positioning by gap gene repressor gradients," Development, vol. 126, pp. 2527-38., 1999.

[11] Y. Yu, M. Yussa, J. Song, J. Hirsch, and L. Pick, "A double interaction screen identifies positive and negative ftz gene regulators and ftz-interacting proteins," Mech Dev, vol. 83, pp. 95-105., 1999.

[12] W. Han, Y. Yu, K. Su, R. A. Kohanski, and L. Pick, "A binding site for multiple transcriptional activators in the fushi tarazu proximal enhancer is essential for gene expression in vivo," Mol Cell Biol, vol. 18, pp. 3384-94., 1998.

[13] W. Han, Y. Yu, N. Altan, and L. Pick, "Multiple proteins interact with the fushi tarazu proximal enhancer," Mol Cell Biol, vol. 13, pp. 5549-59., 1993.

[14] C. C. Tsai, S. G. Kramer, and J. P. Gergen, "Pair-rule gene runt restricts orthodenticle expression to the presumptive head of the Drosophila embryo," Dev Genet, vol. 23, pp. 35-44., 1998.

[15] J. L. Brown and C. Wu, "Repression of Drosophila pair-rule segmentation genes by ectopic expression of tramtrack," Development, vol. 117, pp. 45-58., 1993.

[16] Y. Yu, W. Li, K. Su, M. Yussa, W. Han, N. Perrimon, and L. Pick, "The nuclear hormone receptor Ftz-F1 is a cofactor for the Drosophila homeodomain protein Ftz," Nature, vol. 385, pp. 552-5., 1997.

[17] X. Li and M. Noll, "Compatibility between enhancers and promoters determines the transcriptional specificity of gooseberry and gooseberry neuro in the Drosophila embryo," Embo J, vol. 13, pp. 400-6., 1994.

[18] X. Li and M. Noll, "Evolution of distinct developmental funct> 

Transfer interrupted!

uisition of different cis-regulatory regions," Nature, vol. 367, pp. 83-7., 1994.

[19] M. Bouchard, J. St-Amand, and S. Cote, "Combinatorial activity of pair-rule proteins on the Drosophila gooseberry early enhancer," Dev Biol, vol. 222, pp. 135-46., 2000.

[20] J. A. Langeland and S. B. Carroll, "Conservation of regulatory elements controlling hairy pair-rule stripe formation," Development, vol. 117, pp. 585-96., 1993.

[21] J. A. Langeland, S. F. Attai, K. Vorwerk, and S. B. Carroll, "Positioning adjacent pair-rule stripes in the posterior Drosophila embryo," Development, vol. 120, pp. 2945-55., 1994.

[22] A. La Rosee, T. Hader, D. Wainwright, F. Sauer, and H. Jackle, "hairy stripe 7 element mediates activation and repression in response to different domains and levels of Kruppel in the Drosophila embryo," Mech Dev, vol. 89, pp. 133-40., 1999.

[23] A. La Rosee, T. Hader, H. Taubert, R. Rivera-Pomar, and H. Jackle, "Mechanism and Bicoid-dependent control of hairy stripe 7 expression in the posterior region of the Drosophila embryo," Embo J, vol. 16, pp. 4403-11., 1997.

[24] M. Hoch, E. Seifert, and H. Jackle, "Gene expression mediated by cis-acting sequences of the Kruppel gene in response to the Drosophila morphogens bicoid and hunchback," Embo J, vol. 10, pp. 2267-78., 1991.

[25] Q. Gao, Y. Wang, and R. Finkelstein, "Orthodenticle regulation during embryonic head development in Drosophila," Mech Dev, vol. 56, pp. 3-15., 1996.

[26] Q. Gao and R. Finkelstein, "Targeting gene expression to the head: the Drosophila orthodenticle gene is a direct target of the Bicoid morphogen," Development, vol. 125, pp. 4185-93., 1998.

[27] M. Klingler and J. P. Gergen, "Regulation of runt transcription by Drosophila segmentation genes," Mech Dev, vol. 43, pp. 3-19., 1993.

[28] R. Barrio, J. F. de Celis, S. Bolshakov, and F. C. Kafatos, "Identification of regulatory regions driving the expression of the Drosophila spalt complex at different developmental stages," Dev Biol, vol. 215, pp. 33-47., 1999.

[29] J. F. de Celis, R. Barrio, and F. C. Kafatos, "Regulation of the spalt/spalt-related gene complex and its function during sensory organ development in the Drosophila thorax," Development, vol. 126, pp. 2653-62., 1999.

[30] R. P. Kuhnlein, G. Bronner, H. Taubert, and R. Schuh, "Regulation of Drosophila spalt gene expression," Mech Dev, vol. 66, pp. 107-18., 1997.

[31] M. Hoch, N. Gerwin, H. Taubert, and H. Jackle, "Competition for overlapping sites in the regulatory region of the Drosophila gene Kruppel," Science, vol. 256, pp. 94-7., 1992.

[32] G. J. Liaw, K. M. Rudolph, J. D. Huang, T. Dubnicoff, A. J. Courey, and J. A. Lengyel, "The torso response element binds GAGA and NTF-1/Elf-1, and regulates tailless by relief of repression," Genes Dev, vol. 9, pp. 3163-76., 1995.

[33] S. Qian, M. Capovilla, and V. Pirrotta, "Molecular mechanisms of pattern formation by the BRE enhancer of the Ubx gene," Embo J, vol. 12, pp. 3865-77., 1993.

[34] S. Qian, M. Capovilla, and V. Pirrotta, "The bx region enhancer, a distant cis-control element of the Drosophila Ubx gene and its regulation by hunchback and other segmentation genes," Embo J, vol. 10, pp. 1415-25., 1991.

[35] V. Pirrotta, C. S. Chan, D. McCabe, and S. Qian, "Distinct parasegmental and imaginal enhancers and the establishment of the expression pattern of the Ubx gene," Genetics, vol. 141, pp. 1439-50., 1995.

[36] R. S. Mann, "Engrailed-mediated repression of Ultrabithorax is necessary for the parasegment 6 identity in Drosophila," Development, vol. 120, pp. 3205-12., 1994.

[37] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Cambridge University press, 1998.

[38] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler, "Hidden Markov models in computational biology. Applications to protein modeling," J Mol Biol, vol. 235, pp. 1501-31, 1994.

[39] A. Krogh, I. S. Mian, and D. Haussler, "A hidden Markov model that finds genes in E. coli DNA," Nucleic Acids Res, vol. 22, pp. 4768-78, 1994.

[40] T. L. Bailey and M. Gribskov, "Methods and statistics for combining motif match scores," J Comput Biol, vol. 5, pp. 211-21, 1998.

[41] T. Yada, Y. Totoki, M. Ishikawa, K. Asai, and K. Nakai, "Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences," Bioinformatics, vol. 14, pp. 317-25, 1998.

[42] D. Gusfeld, Algorithms on Strings, Trees, and Sequences: Cambridge University Press, 1997.

[43] M. Tompa, "An exact method for finding short motifs in sequences with application to the ribosome binding site problem," presented at Intellectual Systems in Molecular Biology, ISMB 99, 1999.

[44] L. J. Guibas and A. M. Odlyzko, "String overlaps, pattern matching, and nontransitive games.," Journal of Combinatorial Theory. Series A., vol. 30, pp. 183-208, 1981.

[45] M. Régnier, "A unified approach to word occurrences probabilities," Discrete Applied Mathematics, vol. 104, pp. 259-280, 2000.

[46] M. Régnier, A. Lifanov, and V. Makeev, "Three variations on word counting," presented at II German Conference on Bioinformatics, Heidelberg, Germany, 2000.


home