APPENDIX to paper:
Extraction of functional binding sites from unique
regulatory regions:
The Drosophila early developmental enhancers
Dmitri A. Papatsenko, Vsevolod J. Makeev, Alex P. Lifanov, Mirelle Regnier,
Anna G. Nazina and Claude Desplan.
1 Developmental enhancers: maps of binding site distribution.
1.1 Site motifs.
Motifs for the 11 binding sites are shown. Footprint results
were mapped on the DNA sequence for each binding site. Alignments shown
on Figures a1-a11 were constructed on the basis of available published
data [1-36] and with the help of the
ClustalW algorithm (Megalign, DNA Star inc.). The Tailless motif was not
used in the study due to the poor alignment; The Paired Homeodomain and
Paired Paired domain motifs were not used due to the low number of available
sites. The length of each motif was determined on the basis of its informational
content.
-
Bicoid: Figure a1, alignment; motif
-
Caudal: Figure a2, alignment; motif
-
FTZ: Figure a3, alignment; motif
-
Giant: Figure a4, alignment; motif
-
Hunchback: Figure a5, alignment;
motif
-
Knirps: Figure a6, alignment; motif
-
Krüppel: Figure a7, alignment;
motif
-
Paired HD: Figure a8, alignment;
motif
-
Paired PD: Figure a9, alignment;
motif
-
Tailless: Figure a10, alignment;
motif
-
Tramtrack: Figure a11, alignment;
motif
1.2 Cutoff values.
Figure a12
shows the behavior of the penalty function (P(cutoff)=Ln((H)*(H)/(H+FN))/Ln(FN+FP)
for six binding sites: Bicoid, Krüppel, Hunchback, Knirps, Caudal
and Ftz. The X Axis is the cutoff value for the matrix search. Peak values
of the function for the six proteins and the corresponding cutoff values
are shown in Table a1. These
optimal cutoff values were used to build the preliminary consistent maps
for the six transcription factors (shown in Tables a2 - a6, a9). Each preliminary
map includes the data for just one transcription factor. The final consistent
map for each enhancer region (see "refined and consistent
maps") was constructed by merging the independently generated data
for several transcription factors.
The relevance of the penalty function used is demonstrated
on the Krüppel example: Tables a8 - a10 show the comparison between
three close peak values for Krüppel (numbered on Figure app12). Results
of the initial search for Krüppel (20 best matches) are shown in Table
a7. False-positives are shown in green, false-negatives are shown in blue,
hits are shown in red.
-
Bicoid maps for optimal cutoff=5.3, Table
a2
-
Caudal maps for optimal cutoff=4.7, Table
a3
-
FTZ maps for optimal cutoff=4.2, Table
a4
-
Hunchback maps for optimal cutoff=6.2, Table
a5
-
Knirps maps for optimal cutoff=3.9, Table
a6
-
Krüppel:
1.3 Refined and Consistent maps.
Tables a11-a29 show the comparison between the "refined" and the "consistent
" maps for the best known enhancers. Both types of maps represent only
distribution of the most robust binding sites. Sites that were not available
in sufficient numbers (Paired), produced an ambiguous alignments (Tailless,
Even-skipped) or were detected in just one enhancer from our collection
were not included in our maps. "Virtual" maps (possible distribution of
known motifs) were built for the eve stripe 4+6 and the runt
stripe 5 enhancers, for which the footprint data was not available. Sequence
of each enhancer was verified with the Drosophila Genome Project;
the borders of each region were set in according to published data.
-
abdominal-A enhancer: Table a11, maps;
sequence [1]
-
buttonhead cis element: Table a12,
maps; sequence [2,
3]
-
engrailed intron: Table a13, maps;
sequence [4, 5]
-
even-skipped stripe2 enhancer: Table
a14, maps; sequence [6-8]
-
even-skipped stripe3+7 enhancer: Table
a15, maps; sequence [9]
-
even-skipped stripe4+6 enhancer: Table
a16, virtual map only; sequence [10]
-
fushi-tarazu proximal enhancer: Table
a17, maps; sequence [11-13]
-
fushi-tarazu zebra region: Table
a18, initial map only; sequence
[14-16]
-
gooseberry enhancer: Table a19, initial
map only; sequence [17-19]
-
hairy stripe5 enhancer: Table a20,
maps; sequence [20]
-
hairy stripe6 enhancer: Table a21,
maps; sequence [21]
-
hairy stripe7 enhancer: Table a22,
maps; sequence [22,
23]
-
kruppel region 730: Table a23, maps;
sequence [24]
-
orthodenticle enhancer: Table a24,
initial map only; sequence [14,
25, 26]
-
runt stripe 5 enhancer: Table a25,
maps; sequence [27]
-
spalt early enhancer: Table a26,
maps; sequence [28-30]
-
tailless enhancer: Table a27, maps;
sequence [31, 32]
-
ultrabithorax BRE enhancer: Table
a28, initial map only; sequence
[33, 34]
-
ultrabithorax PBX enhancer: Table
a29, initial map only; sequence
[35, 36]
2 Formulation of Scanseq algorithm.
2.1 Motif description.
Consensus: The most general mathematical definition of
a "motif" is a set of allowed words. In the simplest case, a motif can
be represented by "the consensus", for example in the form of standard
IUPAC abbreviations. For instance, the motif containing words AGCTAT
and AGCTGT can be described by the consensus
AGCTRT. All words that can be derived from
the consensus by a set of allowed operation (substitutions, insertions,
deletions etc) define a consensus neighborhood. If only substitutions are
allowed (the so-called Hamming distance) the consensus neighborhood can
be found using counting with a mismatch scheme.
Positional weighted matrix: A more advanced method of
motif representation is a positional weighted matrix (PWM). In this case,
the chance of a substitution in a given position is weighted. Therefore,
in the case of PWM, a threshold is required each time to define a motif.
The popular way of defining PWM is a probabilistic model, in which one
uses probabilities of observing a specific letter at a specific position
of the motif. The likelihood of observing the word in this probabilistic
model is compared to the chosen threshold to conclude if the word does
or does not belong to the motif.. The most practical way of evaluating
potential matches is to calculate the log-odds ratio of the resulting probability
to the corresponding probability yielded by a random model [37],
in this case called the "standard".
Hidden Markov Model: In the HMM representation [38,
39],
the positions are no longer independent. Position
probabilities are linked with the Markov model and the transitional matrices
are estimated from the word set. In the case where there are no insertions
or deletions allowed [40, 41],
searching for matches with HMM can be performed with the help of log-odds
ratios, essentially in the same way as with PWM. Currently we do not use
an HMM representation in our approach.
2.2 Count of motif occurrence.
Counting in double stranded DNA To count motif occurrences
in a sequence one needs to count all words in the sequence that belong
to the motif. This can be done with several efficient algorithms (see [42]
for the review). Not always however it is relevant to count the complementary
DNA strands separately. This motivates a counting scheme with the motifs
counted in both complementary strands. For example consider the following
double stranded DNA sequence:
3'-ACAACACGT
5'-TGTTGTGCA
The number of occurrences of AC is four, three in the forward strand,
and one in the reverse (strands are read in opposite directions). This
number is equal to the number of occurrences in any of the strands of {AC,
GT} motif, where GT is the word complementary to AC. Observe now, that
the self-complementary word ACGT is found twice in the double-stranded
sequence. In this case, the problem of counting of such self-overlapping
motifs in double-stranded DNA can be reformulated as a problem of weighted
counting in a single strand. To achieve this, the motif set is modified
by addition of all complementary words. Any word H that has its
complement in the initial set (particularly any palindromic word) is assigned
weight w(H) = 2. Other words are assigned weight w(H)
= 1.
Accounting for background distribution. It is insufficient
to count simply the number of occurrences of some motif to infer whether
it is "too abundant" in the sequence. If the input sequence exhibits strong
composition biases the most frequent words found would be the words that
contain the most frequent nucleotides. Therefore, the nucleotide frequencies
(in the simplest case) must also be taken into account. A popular way to
account for sequence composition [43] is to assess
how often a motif found would appear in the random sequence with the same
nucleotide content. In this case, Z-score (difference between the observed
O and the expected E number of the motifs, normalized by
the standard deviation V) is used to count the number of motif occurrences
in the random sequence:
 |
(1) |
To calculate the Z-score, one needs to calculate the expected number
of motif occurrences in the random sequence with a given nucleotide distribution,
and estimate the variance for this quantity.
Counting with overlaps Calculation of the expectation
and the variance of the number of motif occurrences in a random sequence
is not a trivial problem, because one should take into account possible
overlaps between words belonging to the same motif. Consider for instance
the simplest case of a sequence with n=3 over a two-letter
alphabet {AB} with equal probabilities of the letters. Inspection of the
table below demonstrate that words AA and AB have the same expectation
(0.5), but their variances are different:
Table a30. Overlap effect on expectation and variance.
| |
AA
|
AB
|
|
AAA
|
2
|
0
|
|
AAB
|
1
|
1
|
|
ABA
|
0
|
1
|
|
ABB
|
0
|
1
|
|
BAA
|
1
|
0
|
|
BAB
|
0
|
1
|
|
BBA
|
0
|
0
|
|
BBB
|
0
|
0
|
|
Expectation
|
4/8=0.5
|
4/8=0.5
|
|
Variance
|
6/8-0.25=0.5
|
4/8-0.25=0.25
|
Obviously this happens because the motif AA can overlap itself. This
is also true in a general case [44]: whereas the
expectation of the motif that contains many overlapping words does not
change, the variance is changed, sometimes substantially.
Poisson approximation. If the probability of observation
for two overlapping motifs in the sequence is much smaller than that for
observation of one motif, the Poisson approximation
can be implemented. The applicability of the Poisson approximation depends
on the nature of a motif.
Calculation of variance and expectation. It was found
that both the variance and the expectation for the number of occurrences
of a motif (a set of words) can be exactly calculated analytically
in a random Bernoulli type sequence using the generation function technique
[45, 46]. Bernoulli type sequence
must contain no correlation between probabilities of letter occurrences
at neighboring positions. Let H be the motif of m-words
,
.
Let S be an n-letter sequence over the four-letter DNA alphabet
with the vector of letter probabilities p. Then for each
given word H, the probability will be
,
where i is the ith character of the word. For the entire
motif one can calculate:
.
The overlapping and correlation sets. It was observed
that the variance depends on the structure of the words. Given two strings
H and F, the overlapping is the H-suffixes that are
F prefixes. Associated F suffixes form the correlation set
AH,F. Each word in AH,F,
when concatenated to the right of H creates an occurrence of F.
For example, let H = AACAA, and F= AAAT, then AH,F
= {AT, AAT} and AF,H is the empty set. The autocorrelation
set AH,H={
,CAA,
ACAA} also contains the empty string
and denotes the associated probability
.
Variance and expectation for the single strand counting.
It was recently demonstrated [45] that the expectation of the number of
occurrences of a set of patterns H in a random sequence of length
n generated by a Bernoulli model, NH, is:
 |
(2) |
And the variance is:
 |
(3) |
Where
and
, but also can be
analytically calculated.
Variance and expectation for the double strand counting
In the case of counting at double strands, the word weights appear in formulae
(2) and (3):
Let
is the weighted
number of occurrences of the weighted pattern H. Then its
expectation writes:
 |
(4) |
and the linearity constant c1 for the expectation
writes:
 |
(5) |
An example Let H contain all words H that
differ from some fixed word h for no more than k mismatches.
If h is palindrome, then any word in its k-neighborhood has
its complementary word in this k-neighborhood; hence all the words
in the k-neighborhood are assigned weight 2, and
while
. In the case of
a true random sequence, Z-score does not change as compared to the case
of single letter counting. However, the Poisson approximation V~E obviously
does not satisfy.
3 Parameters and predictions.
3.1 Training procedure.
Tables a31-a33 show the sorted results of individual training. For each
combination of motif length and divergence in the range: mmin=7;
mmax=16; kmax=7 (a total of 143 combinations
were considered), the best statistical values CC and OQ were
found and the best coverage c was calculated. . Then, the average
of the values for the 10 enhancers for CC, OQ and PQ
were calculated for each considered parameter combination. Finally, all
parameter combinations (mmin; mmax;
kmax) were sorted in descending order by ether average
CC (Table a31), average OQ
(Table a32) or average PQ (Table
a33) values. Only the top 40 combinations are shown for the each statistics.
One of the top combinations (mmin=7; mmax=9;
kmax=1) that produced the best average scores in all
three types of training was used as the best "default" parameters.
3.2 Predicted maps for 12 enhancers.
Z-score profile plots and maps of predictions are shown on Figures a13-a24
for 12 DNA regions. The plots show the maximal observed Z-scores (Y
scale) for each position in sequence (X scale) within the selected
parameter range (mmin; mmax; kmax;
c). Panels A show the results of training on the group of 10. The results
of individual training are shown in panels B. The predicted map is shown
below each Z-score profile plot. The blue bars represent the most redundant
segments, the Scanseq predictions, while the red bars represent the "consistent"
maps. "Virtual maps" are shown for even-skipped stripe 4+6 and runt
stripe5.
-
abdominal-A enhancer, Figure a13 A,
B
-
engrailed intron, Figure a14 A,
B
-
even-skipped stripe2 enhancer, Figure a15 A,
B
-
even-skipped stripe3+7 enhancer, Figure a16 A,
B
-
even-skipped stripe4+6 enhancer, Figure a17 A,
B
-
fushi-tarazu proximal enhancer, Figure a18 A,
B
-
hairy stripe6 enhancer, Figure a19 A,
B
-
hairy stripe7 enhancer, Figure a20 A,
B
-
kruppel region 730, Figure a21 A,
B
-
runt stripe 5 enhancer, Figure a22 A,
B
-
spalt early enhancer, Figure a23 A,
B
-
tailless enhancer, Figure a24 A,
B
4 Software tools.
-
Programs to refine motif and to search for binding sites with PWM:
-
MOTIF_CUT: finds the informational
content for the user-defined alignment and generates PWM.
-
MOTIF_MATCH: searches with the user
defined PWM for matches in a given sequence.
-
Programs to search for redundant motifs in a single sequence:
-
SCANSEQ: finds related word families
(motifs) for each word in a given sequence and calculates statistical significance
(Z-score) of the found motifs.
-
MAP_GENERATOR: builds a map of redundant
motifs on the base of Scanseq data.
-
Program to train parameters of the Scanseq:
-
TRAIN_SCAN: finds best parameter
combination (mmax, mmin, kmax,
and c) for defined distribution of binding sites on a given sequence.
-
Required accessory tools:
5 References.
[1] M. J. Shimell, A. J. Peterson, J. Burr, J. A. Simon, and M. B.
O'Connor, "Functional analysis of repressor binding sites in the iab-2
regulatory region of the abdominal-A homeotic gene," Dev Biol, vol. 218,
pp. 38-52., 2000.
[2] E. A. Wimmer, M. Simpson-Brose, S. M. Cohen, C. Desplan, and H.
Jackle, "Trans- and cis-acting requirements for blastodermal expression
of the head gap gene buttonhead," Mech Dev, vol. 53, pp. 235-45., 1995.
[3] E. A. Wimmer, S. M. Cohen, H. Jackle, and C. Desplan, "buttonhead
does not contribute to a combinatorial code proposed for Drosophila head
development," Development, vol. 124, pp. 1509-17., 1997.
[4] B. Florence, A. Guichet, A. Ephrussi, and A. Laughon, "Ftz-F1 is
a cofactor in Ftz activation of the Drosophila engrailed gene," Development,
vol. 124, pp. 839-47., 1997.
[5] J. A. Kassis, C. Desplan, D. K. Wright, and P. H. O'Farrell, "Evolutionary
conservation of homeodomain-binding sites and other sequences upstream
and within the major transcription unit of the Drosophila segmentation
gene engrailed," Mol Cell Biol, vol. 9, pp. 4304-11., 1989.
[6] D. Stanojevic, S. Small, and M. Levine, "Regulation of a segmentation
stripe by overlapping activators and repressors in the Drosophila embryo,"
Science, vol. 254, pp. 1385-7., 1991.
[7] S. Small, A. Blair, and M. Levine, "Regulation of even-skipped
stripe 2 in the Drosophila embryo," Embo J, vol. 11, pp. 4047-57., 1992.
[8] D. N. Arnosti, S. Barolo, M. Levine, and S. Small, "The eve stripe
2 enhancer employs multiple modes of transcriptional synergy," Development,
vol. 122, pp. 205-14., 1996.
[9] S. Small, A. Blair, and M. Levine, "Regulation of two pair-rule
stripes by a single enhancer in the Drosophila embryo," Dev Biol, vol.
175, pp. 314-24., 1996.
[10] M. Fujioka, Y. Emi-Sarker, G. L. Yusibova, T. Goto, and J. B.
Jaynes, "Analysis of an even-skipped rescue transgene reveals both composite
and discrete neuronal and early blastoderm enhancers, and multi-stripe
positioning by gap gene repressor gradients," Development, vol. 126, pp.
2527-38., 1999.
[11] Y. Yu, M. Yussa, J. Song, J. Hirsch, and L. Pick, "A double interaction
screen identifies positive and negative ftz gene regulators and ftz-interacting
proteins," Mech Dev, vol. 83, pp. 95-105., 1999.
[12] W. Han, Y. Yu, K. Su, R. A. Kohanski, and L. Pick, "A binding
site for multiple transcriptional activators in the fushi tarazu proximal
enhancer is essential for gene expression in vivo," Mol Cell Biol, vol.
18, pp. 3384-94., 1998.
[13] W. Han, Y. Yu, N. Altan, and L. Pick, "Multiple proteins interact
with the fushi tarazu proximal enhancer," Mol Cell Biol, vol. 13, pp. 5549-59.,
1993.
[14] C. C. Tsai, S. G. Kramer, and J. P. Gergen, "Pair-rule gene runt
restricts orthodenticle expression to the presumptive head of the Drosophila
embryo," Dev Genet, vol. 23, pp. 35-44., 1998.
[15] J. L. Brown and C. Wu, "Repression of Drosophila pair-rule segmentation
genes by ectopic expression of tramtrack," Development, vol. 117, pp. 45-58.,
1993.
[16] Y. Yu, W. Li, K. Su, M. Yussa, W. Han, N. Perrimon, and L. Pick,
"The nuclear hormone receptor Ftz-F1 is a cofactor for the Drosophila homeodomain
protein Ftz," Nature, vol. 385, pp. 552-5., 1997.
[17] X. Li and M. Noll, "Compatibility between enhancers and promoters
determines the transcriptional specificity of gooseberry and gooseberry
neuro in the Drosophila embryo," Embo J, vol. 13, pp. 400-6., 1994.
[18] X. Li and M. Noll, "Evolution of distinct developmental funct>
Transfer interrupted!
uisition of different cis-regulatory regions," Nature, vol. 367, pp. 83-7.,
1994.
[19] M. Bouchard, J. St-Amand, and S. Cote, "Combinatorial activity
of pair-rule proteins on the Drosophila gooseberry early enhancer," Dev
Biol, vol. 222, pp. 135-46., 2000.
[20] J. A. Langeland and S. B. Carroll, "Conservation of regulatory
elements controlling hairy pair-rule stripe formation," Development, vol.
117, pp. 585-96., 1993.
[21] J. A. Langeland, S. F. Attai, K. Vorwerk, and S. B. Carroll, "Positioning
adjacent pair-rule stripes in the posterior Drosophila embryo," Development,
vol. 120, pp. 2945-55., 1994.
[22] A. La Rosee, T. Hader, D. Wainwright, F. Sauer, and H. Jackle,
"hairy stripe 7 element mediates activation and repression in response
to different domains and levels of Kruppel in the Drosophila embryo," Mech
Dev, vol. 89, pp. 133-40., 1999.
[23] A. La Rosee, T. Hader, H. Taubert, R. Rivera-Pomar, and H. Jackle,
"Mechanism and Bicoid-dependent control of hairy stripe 7 expression in
the posterior region of the Drosophila embryo," Embo J, vol. 16, pp. 4403-11.,
1997.
[24] M. Hoch, E. Seifert, and H. Jackle, "Gene expression mediated
by cis-acting sequences of the Kruppel gene in response to the Drosophila
morphogens bicoid and hunchback," Embo J, vol. 10, pp. 2267-78., 1991.
[25] Q. Gao, Y. Wang, and R. Finkelstein, "Orthodenticle regulation
during embryonic head development in Drosophila," Mech Dev, vol. 56, pp.
3-15., 1996.
[26] Q. Gao and R. Finkelstein, "Targeting gene expression to the head:
the Drosophila orthodenticle gene is a direct target of the Bicoid morphogen,"
Development, vol. 125, pp. 4185-93., 1998.
[27] M. Klingler and J. P. Gergen, "Regulation of runt transcription
by Drosophila segmentation genes," Mech Dev, vol. 43, pp. 3-19., 1993.
[28] R. Barrio, J. F. de Celis, S. Bolshakov, and F. C. Kafatos, "Identification
of regulatory regions driving the expression of the Drosophila spalt complex
at different developmental stages," Dev Biol, vol. 215, pp. 33-47., 1999.
[29] J. F. de Celis, R. Barrio, and F. C. Kafatos, "Regulation of the
spalt/spalt-related gene complex and its function during sensory organ
development in the Drosophila thorax," Development, vol. 126, pp. 2653-62.,
1999.
[30] R. P. Kuhnlein, G. Bronner, H. Taubert, and R. Schuh, "Regulation
of Drosophila spalt gene expression," Mech Dev, vol. 66, pp. 107-18., 1997.
[31] M. Hoch, N. Gerwin, H. Taubert, and H. Jackle, "Competition for
overlapping sites in the regulatory region of the Drosophila gene Kruppel,"
Science, vol. 256, pp. 94-7., 1992.
[32] G. J. Liaw, K. M. Rudolph, J. D. Huang, T. Dubnicoff, A. J. Courey,
and J. A. Lengyel, "The torso response element binds GAGA and NTF-1/Elf-1,
and regulates tailless by relief of repression," Genes Dev, vol. 9, pp.
3163-76., 1995.
[33] S. Qian, M. Capovilla, and V. Pirrotta, "Molecular mechanisms
of pattern formation by the BRE enhancer of the Ubx gene," Embo J, vol.
12, pp. 3865-77., 1993.
[34] S. Qian, M. Capovilla, and V. Pirrotta, "The bx region enhancer,
a distant cis-control element of the Drosophila Ubx gene and its regulation
by hunchback and other segmentation genes," Embo J, vol. 10, pp. 1415-25.,
1991.
[35] V. Pirrotta, C. S. Chan, D. McCabe, and S. Qian, "Distinct parasegmental
and imaginal enhancers and the establishment of the expression pattern
of the Ubx gene," Genetics, vol. 141, pp. 1439-50., 1995.
[36] R. S. Mann, "Engrailed-mediated repression of Ultrabithorax is
necessary for the parasegment 6 identity in Drosophila," Development, vol.
120, pp. 3205-12., 1994.
[37] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence
Analysis: Cambridge University press, 1998.
[38] A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler,
"Hidden Markov models in computational biology. Applications to protein
modeling," J Mol Biol, vol. 235, pp. 1501-31, 1994.
[39] A. Krogh, I. S. Mian, and D. Haussler, "A hidden Markov model
that finds genes in E. coli DNA," Nucleic Acids Res, vol. 22, pp. 4768-78,
1994.
[40] T. L. Bailey and M. Gribskov, "Methods and statistics for combining
motif match scores," J Comput Biol, vol. 5, pp. 211-21, 1998.
[41] T. Yada, Y. Totoki, M. Ishikawa, K. Asai, and K. Nakai, "Automatic
extraction of motifs represented in the hidden Markov model from a number
of DNA sequences," Bioinformatics, vol. 14, pp. 317-25, 1998.
[42] D. Gusfeld, Algorithms on Strings, Trees, and Sequences: Cambridge
University Press, 1997.
[43] M. Tompa, "An exact method for finding short motifs in sequences
with application to the ribosome binding site problem," presented at Intellectual
Systems in Molecular Biology, ISMB 99, 1999.
[44] L. J. Guibas and A. M. Odlyzko, "String overlaps, pattern matching,
and nontransitive games.," Journal of Combinatorial Theory. Series A.,
vol. 30, pp. 183-208, 1981.
[45] M. Régnier, "A unified approach to word occurrences probabilities,"
Discrete Applied Mathematics, vol. 104, pp. 259-280, 2000.
[46] M. Régnier, A. Lifanov, and V. Makeev, "Three variations
on word counting," presented at II German Conference on Bioinformatics,
Heidelberg, Germany, 2000.
home