Counting Runs of Ones with Overlapping Parts in Binary Strings Ordered Linearly and Circularly



Introduction and Preliminaries
Nowadays, the increasing use of computer science in diverse applications, including encoding, compression and transmission of digital information, calls for understanding the distribution of runs of 1s or 0s. For instance, such knowledge helps in analyzing and comparing several techniques used in communication networks (wired or wireless). In such networks binary data, ranging from a few bytes (e.g. e-mails) to many gigabytes of greedy multimedia applications (e.g. video on demand), are heavily processed. For details as well as for additional real-life applications see Sinha (2007), K. Sinha and B. P. Sinha (2009, 2012) and the references therein.
Another area where the study of the distribution of runs of 1s and 0s has become increasingly useful is the field of bioinformatics or computational biology. For instance, in the context of hypothesis testing, molecular biologists design similarity tests between two DNA (DeoxyriboNucleic Acid) sequences, where a 1 is interpreted as a match of the sequences at a given position (Benson, 1999; Lou, 2003; Nuel et al., 2010; Makri & Psillakis, 2011a).
In applications such as those mentioned above, a key point is understanding how 1s and 0s are distributed and combined as elements of a binary sequence (finite or infinite, memoryless or not), eventually forming runs of 1s and 0s which are enumerated according to certain counting schemes. Each counting scheme defines how runs of the same symbol (i.e. 1s or 0s) are formed and consequently counted. A scheme may depend, among other considerations, on whether overlapping counting is allowed and on whether the counting restarts from scratch once a run of a certain size has been enumerated. For extensive reviews of the runs literature we refer to Balakrishnan and Koutras (2002), Fu and Lou (2003), and Koutras (2003). The topic is still active and attractive because of its wide range of applications in many areas of applied probability and engineering, including hypothesis testing, quality control, system reliability, computer science and financial engineering. Some recent contributions on the subject are, among others, the works of Eryilmaz (2011), Makri and Psillakis (2012), Demir Atalay and Zeybek (2013), and Mytalas and Zazanis (2013).
Consider a sequence of $n$ binary (zero-0, one-1) random variables (RVs) with values (i.e. outcomes of binary trials) ordered on a line or on a circle. In the circular case we assume that the first outcome is adjacent to (and follows) the $n$-th outcome.
According to Makri and Psillakis (2013), in order to study formally $\ell$-overlapping counting in any sequence of 0-1 RVs $\{Z_i\}_{i=1}^{n}$, the run statistics $X^{(\alpha)}_{n,k,\ell}$, $0 \le \ell < k \le n$, $\alpha = L, C$, are defined as sums of indicators $I^{(\alpha)}_j$: Equation (1) covers the linear case, for $j = k, k+1, \ldots, n$, with the convention $Z_0 \equiv 0$, and Equation (2) covers the circular case, for $j = 1, 2, \ldots, n$, with the corresponding circular conventions. The supports (range sets) of $X^{(L)}_{n,k,\ell}$ and $X^{(C)}_{n,k,\ell}$ are denoted by $S^{(L)}_{n,k,\ell}$ and $S^{(C)}_{n,k,\ell}$, respectively. In the above formulae, as well as in the whole article, $\lfloor x \rfloor$ stands for the greatest integer less than or equal to a real number $x$, and $\delta_{i,j}$ denotes the Kronecker delta function of the integer arguments $i$ and $j$. Also, we apply the conventions $\sum_{i=r}^{s} a_i = 0$ and $\prod_{i=r}^{s} a_i = 1$ for $r > s$.
The setup of the indicators $I^{(\alpha)}_j$ in (1) and (2) holds true for any 0-1 sequence $\{Z_i\}_{i=1}^{n}$, and it is the main tool for deriving the expected value, $E(X^{(\alpha)}_{n,k,\ell})$, of $X^{(\alpha)}_{n,k,\ell}$, $\alpha = L, C$, for such a sequence. Furthermore, it is useful for determining numeric values of $X^{(\alpha)}_{n,k,\ell}$, and therefore empirical frequencies and moments, in studies concerning applications like the processing of computer files or DNA sequences.
At this point we should mention that the $\ell$-overlapping counting scheme, as a natural generalization of both Feller's and Ling's counting, differs in concept from the other fundamental counting scheme in the runs literature, namely Mood's (1940) counting scheme (see also Koutras, 2003; Makri et al., 2007b; Eryilmaz, 2008; K. Sinha & B. P. Sinha, 2009; Makri & Psillakis, 2011a, 2011b). According to the latter scheme, a run of 1s (1-run) is defined to be a sequence of consecutive 1s preceded and succeeded by 0s or by nothing. The number of 1s in a 1-run is referred to as its length (or size). For integers $n$ and $k$ with $n \ge k \ge 1$, let $Y^{(\beta,\alpha)}_{n,k}$ denote the number of 1-runs of length exactly equal to $k$ ($\beta = E$) or greater than or equal to $k$ ($\beta = G$) in $n$ binary trials $\{Z_i\}_{i=1}^{n}$ ordered on a line ($\alpha = L$) or on a circle ($\alpha = C$). These run statistics can be formulated (see also Makri et al., 2007b) by means of indicators, in terms of which both $Y^{(E,\alpha)}_{n,k}$ and $Y^{(G,\alpha)}_{n,k}$ can be expressed. Readily, a relation among the statistics $X^{(\alpha)}_{n,k,\ell}$ and $Y^{(\beta,\alpha)}_{n,k}$, Equation (7), holds.
Example 2. In order to make the distinction among the run statistics $X^{(\alpha)}_{n,k,\ell}$ and $Y^{(\beta,\alpha)}_{n,k}$ for $\beta = E, G$, $\alpha = L, C$ clear, let us consider the following 0-1 sequence of 21 outcomes, numbered from 1 to 21: 111011001111110101111. For $k = 3$ there is one 1-run of length exactly equal to 3, i.e. the run 1, 2, 3, in the linear case, whereas there are no such 1-runs in the circular case. Consequently, $Y^{(E,L)}_{21,3} = 1$ and $Y^{(E,C)}_{21,3} = 0$. Furthermore, the 1-runs of length greater than or equal to 3 are the runs 1, 2, 3; 9, 10, 11, 12, 13, 14; 18, 19, 20, 21 in the linear case, and the runs 9, 10, 11, 12, 13, 14; 18, 19, 20, 21, 1, 2, 3 in the circular case. Hence, $Y^{(G,L)}_{21,3} = 3$ and $Y^{(G,C)}_{21,3} = 2$. Finally, using $\ell$-overlapping counting we find that $X^{(L)}_{21,3,0} = 4$, $X^{(L)}_{21,3,1} = 4$, $X^{(L)}_{21,3,2} = 7$ and $X^{(C)}_{21,3,0} = 4$, $X^{(C)}_{21,3,1} = 5$, $X^{(C)}_{21,3,2} = 9$.
In real-life applications like the interesting ones proposed by K. Sinha and B. P. Sinha (2009, 2012), which considered the processing of computer files of various formats commonly encountered for computing and communication purposes, the replacement of $Y^{(E,L)}_{n,k}$ with $X^{(L)}_{n,k,\ell}$ might offer a useful alternative. The same plausibly holds for circularly connected networks.
The use of $X^{(\alpha)}_{n,k,\ell}$ instead of $Y^{(\beta,\alpha)}_{n,k}$ possibly offers more informative testing, via (7), whereas the selection of the overlapping parameter $\ell$ provides, in turn, flexibility depending on the testing requirements of such files. Specifically, in stringent testing of binary sequences $\ell$ could be made as big as $k - 1$, and then almost all the binary digits participating in 1-runs of length $k$ are tested more than once. Conversely, $\ell$ could be made as low as 0, and then all the binary digits will be tested for 1-runs only once. The latter choice corresponds to less demanding testing of the binary sequences.
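The counts quoted in Example 2 can be reproduced with a short script. The sketch below is only an illustration and does not implement the indicator-based definitions (1)-(2): it assumes that a maximal 1-run of length $m \ge k$ contributes $\lfloor (m - \ell)/(k - \ell) \rfloor$ $\ell$-overlapping runs, a rule that agrees with every value stated in Example 2. For a circle consisting only of 1s it further assumes the count $\lfloor n/(k - \ell) \rfloor$. All function names are hypothetical.

```python
from itertools import groupby


def linear_run_lengths(bits):
    """Lengths of the maximal 1-runs of a 0-1 string read on a line."""
    return [len(list(group)) for symbol, group in groupby(bits) if symbol == "1"]


def circular_run_lengths(bits):
    """Lengths of the maximal 1-runs when the last symbol precedes the first.

    Requires at least one 0: the circle is cut at a 0, so no run wraps around.
    """
    cut = bits.index("0")
    return linear_run_lengths(bits[cut:] + bits[:cut])


def mood_counts(lengths, k):
    """Return (Y_E, Y_G): runs of length exactly k and of length at least k."""
    return sum(m == k for m in lengths), sum(m >= k for m in lengths)


def overlapping_count(lengths, k, ell):
    """Assumed rule: a run of length m >= k yields floor((m - ell)/(k - ell))."""
    return sum((m - ell) // (k - ell) for m in lengths if m >= k)


def x_linear(bits, k, ell):
    return overlapping_count(linear_run_lengths(bits), k, ell)


def x_circular(bits, k, ell):
    if "0" not in bits:
        # Assumption for the all-ones circle, not taken from definitions (1)-(2).
        return len(bits) // (k - ell)
    return overlapping_count(circular_run_lengths(bits), k, ell)


if __name__ == "__main__":
    seq = "111011001111110101111"  # the sequence of Example 2
    print("Y (linear):  ", mood_counts(linear_run_lengths(seq), 3))
    print("Y (circular):", mood_counts(circular_run_lengths(seq), 3))
    print("X (linear):  ", [x_linear(seq, 3, ell) for ell in range(3)])
    print("X (circular):", [x_circular(seq, 3, ell) for ell in range(3)])
```

Working with run lengths rather than with position-wise indicators keeps the sketch short; the price is that the circular all-ones string needs a separate, assumed, treatment.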
Associated with $Y^{(\beta,\alpha)}_{n,k}$, $\beta = E, G$ and $\alpha = L, C$, are the numbers $Q^{(\beta,\alpha)}_{n,k}$ of all 1-runs (in Mood's sense) of length equal to, or greater than or equal to, $k$ in all $2^n$ possible binary strings (that is, symmetric Bernoulli sequences; $p = 1/2$) of length $n$ ordered on a line or on a circle. Recently, Makri and Psillakis (2011b) and Makri et al. (2012) provided explicit expressions for the numbers $Q^{(\beta,L)}_{n,k}$, $\beta = E, G$. These works simplify, extend and unify the results of K. Sinha and B. P. Sinha (2009, 2012) referring to $Q^{(E,L)}_{n,k}$. In the present paper we consider, in all $2^n$ binary strings of length $n$ ordered on a line or on a circle, first the number $N^{(\alpha)}_{x;n,k,\ell}$ of all binary strings of length $n$ which contain $x$, $x \in S^{(\alpha)}_{n,k,\ell}$, $\ell$-overlapping 1-runs of length $k$, and second the number $R^{(\alpha)}_{n,k,\ell}$ of all $\ell$-overlapping 1-runs of length $k$.
The rest of the paper is organized as follows. In Section 2, we state the problem and develop closed expressions for the numbers $N^{(\alpha)}_{x;n,k,\ell}$ and $R^{(\alpha)}_{n,k,\ell}$ for $\alpha = L, C$. These expressions rely on a new interpretation of recent results for the statistic $X^{(\alpha)}_{n,k,\ell}$ of Makri et al. (2007a) and Makri and Psillakis (2013). However, the introduction and study of the numbers $N^{(\alpha)}_{x;n,k,\ell}$ and $R^{(\alpha)}_{n,k,\ell}$ have not been addressed previously. In Section 3, we summarize our main results and discuss the methodology used to derive them.

Mathematical Formulation
In this Section we present our main results. In Section 2.1 we consider Bernoulli sequences, whereas in Section 2.2 we restrict our study to symmetric Bernoulli sequences.

Bernoulli Sequences
We consider a sequence $\{Z_i\}_{i=1}^{n}$ of length $n$ of independent (i.e. derived by a memoryless source) and identically distributed 0-1 RVs with a common probability of 1s $p$; i.e. $p = P(Z_i = 1) = 1 - P(Z_i = 0) = 1 - q$, $i = 1, 2, \ldots, n$. Such a sequence, called a finite Bernoulli sequence, is of particular importance in studies of applied probability because of its simplicity and its help in understanding the notion of randomness, and also because it may be considered as a special case of a sequence with dependent elements, e.g. a Markovian or an exchangeable one (see, e.g., Makri & Psillakis, 2012). The latter feature also serves as a valuable crosscheck of results referring to several random sources and obtained by various methods.
Since the expected values, $E(I^{(\alpha)}_j)$, of the indicators $I^{(\alpha)}_j$, $\alpha = L, C$, depend on $j$, the sets $A$ and $B$ help in expressing them. Specifically, by the independence of the $Z_i$'s, $E(I^{(\alpha)}_j)$ is written in terms of the indicator functions $I_A(j)$ and $I_B(j)$, where $I_A(j) = 1$ if $j \in A$ and 0 otherwise, and $I_B(j)$ is defined in a similar manner. Substituting $E(I^{(\alpha)}_j)$ in
$$E(X^{(\alpha)}_{n,k,\ell}) = \sum_{j \in J^{(\alpha)}} E(I^{(\alpha)}_j),$$
where $J^{(L)} = \{k, k+1, \ldots, n\}$ and $J^{(C)} = \{1, 2, \ldots, n\}$, we arrive after some algebraic manipulations at Equations (8)-(9), where, in (8), $r = \lfloor (n - \ell)/(k - \ell) \rfloor$ and $s = n - \ell - r(k - \ell)$.
Equations (8)-(9) have been derived by Makri and Psillakis (2013, Corollaries 2.1 and 3.1) using a slightly different formulation. These authors also provide expressions for $E(X^{(\alpha)}_{n,k,\ell})$ for sequences of independent, not identically distributed binary RVs ordered on a line or on a circle. For $0 \le \ell \le k - 1$, Equation (8) improves the expression given by Proposition 2.1 of Makri and Philippou (2005), which provides $E(X^{(L)}_{n,k,\ell}; p)$ by means of a summation. For $\ell = 0$ it provides a formula that is simpler than those of Aki and Hirano (1988) and Antzoulakos and Chadjiconstantinidis (2001). For $0 \le \ell \le k - 1$, Equation (9) recaptures Proposition 3.1 of Makri and Philippou (2005), which for $\ell = 0$ coincides with a result of Charalambides (1994) and Makri and Philippou (1994).
For the particular case $\ell = 0$, an even simpler new expression for $E(X^{(L)}_{n,k,0}; p)$ is obtained by solving the recurrence relation provided by Balakrishnan and Koutras (2002, p. 163); it is given by Equation (11).
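As a numerical cross-check of any closed form for the expected value in the linear case, $E(X^{(L)}_{n,k,\ell}; p)$ can be computed exactly for small $n$ by weighting every string with $p^{\#1\text{s}} q^{\#0\text{s}}$. The sketch below is not Equation (8); it assumes the run-length rule $\lfloor (m - \ell)/(k - \ell) \rfloor$ for a maximal 1-run of length $m \ge k$, and its function names are hypothetical. For $\ell = k - 1$ the statistic counts all windows of $k$ consecutive 1s, so linearity of expectation gives the standard value $(n - k + 1)p^k$, which serves as the check.

```python
from fractions import Fraction
from itertools import groupby, product


def x_linear(bits, k, ell):
    """ell-overlapping count on a line (assumed run-length rule)."""
    total = 0
    for symbol, group in groupby(bits):
        m = len(list(group))
        if symbol == 1 and m >= k:
            total += (m - ell) // (k - ell)
    return total


def expected_x_linear(n, k, ell, p):
    """Exact E(X) for i.i.d. Bernoulli(p) trials, by full enumeration."""
    p = Fraction(p)
    q = 1 - p
    total = Fraction(0)
    for bits in product((0, 1), repeat=n):
        ones = sum(bits)
        total += x_linear(bits, k, ell) * p**ones * q ** (n - ones)
    return total


if __name__ == "__main__":
    p = Fraction(1, 3)
    # ell = k - 1: every window of k ones is counted, so E(X) = (n - k + 1) p^k.
    print(expected_x_linear(6, 3, 2, p), (6 - 3 + 1) * p**3)
```

Exact rational arithmetic avoids rounding questions when the enumerated value is compared with a formula; the enumeration itself costs $2^n$ evaluations and is meant for small $n$ only.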

Symmetric Bernoulli Sequences
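For a Bernoulli sequence of length $n$ with probability of 1s $p$, let $f^{(\alpha)}_{n,k,\ell}(x; p)$ denote the probability mass function (PMF) of the RV $X^{(\alpha)}_{n,k,\ell}$; i.e.
$$f^{(\alpha)}_{n,k,\ell}(x; p) = P(X^{(\alpha)}_{n,k,\ell} = x), \quad x \in S^{(\alpha)}_{n,k,\ell}.$$
Then, the expected value of $X^{(\alpha)}_{n,k,\ell}$ is
$$E(X^{(\alpha)}_{n,k,\ell}; p) = \sum_{x \in S^{(\alpha)}_{n,k,\ell}} x \, f^{(\alpha)}_{n,k,\ell}(x; p).$$
In the sequel, we consider a symmetric ($p = 1/2$) finite Bernoulli sequence (i.e. a finite binary string) of length $n$, for which we obtain our main results. Since the cardinality of a proper sample space is $2^n$ (i.e. there are $2^n$ binary strings that are equally likely to occur), the classical definition of probability implies that
$$f^{(\alpha)}_{n,k,\ell}(x; 1/2) = \frac{N^{(\alpha)}_{x;n,k,\ell}}{2^n}. \qquad (15)$$
In the above formula, $N^{(\alpha)}_{x;n,k,\ell}$ is the number of all binary strings of length $n$ with exactly $x$, $x \in S^{(\alpha)}_{n,k,\ell}$, $\ell$-overlapping 1-runs of length $k$, among all $2^n$ possible binary strings of length $n$, $n \ge k > \ell \ge 0$, ordered linearly ($\alpha = L$) or circularly ($\alpha = C$). In other words, $N^{(\alpha)}_{x;n,k,\ell}$ is the number of 0-1 strings of length $n$ for which $X^{(\alpha)}_{n,k,\ell} = x$. That is, $N^{(\alpha)}_{x;n,k,\ell}$ is the cardinality of the set of such strings (Equation (16)), where the values of $X^{(\alpha)}_{n,k,\ell}$ are determined via (1)-(2).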
Equation (16) implies that if we wish to determine $N^{(\alpha)}_{x;n,k,\ell}$ empirically, we have to generate (by computer) all $2^n$ 0-1 strings of length $n$ and then count those for which $X^{(\alpha)}_{n,k,\ell} = x$. This approach, i.e. first listing and then counting, although possible and useful in several applications like the ones mentioned in the Introduction, does not determine the numbers $N^{(\alpha)}_{x;n,k,\ell}$ theoretically. It is combinatorial analysis that answers such problems, since it counts arrangements of things without listing them.
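For small $n$, the listing-and-counting approach just described takes only a few lines. The sketch below treats the linear case only and assumes, as an illustration rather than as the definition (1), that a maximal 1-run of length $m \ge k$ contributes $\lfloor (m - \ell)/(k - \ell) \rfloor$ $\ell$-overlapping runs; the function names are hypothetical. It tabulates the number of strings with each value $x$ and then sums $x$ times that number to obtain the total number of runs over all $2^n$ strings.

```python
from collections import Counter
from itertools import groupby, product


def x_linear(bits, k, ell):
    """ell-overlapping count on a line (assumed run-length rule)."""
    total = 0
    for symbol, group in groupby(bits):
        m = len(list(group))
        if symbol == 1 and m >= k:
            total += (m - ell) // (k - ell)
    return total


def enumerate_counts(n, k, ell):
    """Return (N, R) obtained by listing all 2**n strings.

    N[x] is the number of strings with exactly x runs;
    R is the total number of runs over all strings.
    """
    N = Counter(x_linear(bits, k, ell) for bits in product((0, 1), repeat=n))
    R = sum(x * count for x, count in N.items())
    return N, R


if __name__ == "__main__":
    N, R = enumerate_counts(4, 2, 1)
    print(dict(sorted(N.items())), R)
```

The cost grows as $2^n$, which is exactly why closed expressions are preferable beyond small $n$; the enumeration remains useful as an independent check of such expressions.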
Next, closed expressions for $N^{(\alpha)}_{x;n,k,\ell}$, $x \in S^{(\alpha)}_{n,k,\ell}$, $\alpha = L, C$, in terms of sums of binomial coefficients, are provided. They are obtained directly from Equation (15) and from Theorems 2.1 and 4.1 of Makri et al. (2007a). The latter Theorems provide the PMF of $X^{(\alpha)}_{n,k,\ell}$ defined on a binary sequence derived by a Polya-Eggenberger urn model, of which a Bernoulli sequence, and in turn a binary string, are particular cases. Their method was based on the solution of a combinatorial problem; specifically, the allocation of indistinguishable balls into distinguishable cells under certain restrictions on the capacity of the cells. Accordingly, we have Equations (17) and (18). It is noticed that the RHS of (17) for $x = 0$ is the same for every $\ell$. The coefficients $C_M$ and $C_R$ are computed, for $\gamma > 0$, by Lemma 2.1 and Corollary 2.1 of Makri et al. (2007a), respectively.
The number $C_M(\gamma, i, r - i, m_1 - 1, m_2 - 1)$ is the number of integer solutions of the equation
$$x_1 + x_2 + \cdots + x_r = \gamma, \quad 0 \le x_j \le m_1 - 1 \ (j = 1, \ldots, i), \quad 0 \le x_j \le m_2 - 1 \ (j = i + 1, \ldots, r).$$
Equivalently, it gives the number of allocations of $\gamma$ indistinguishable balls into $r$ distinguishable cells, of which $i$ specified cells each have a capacity of $m_1 - 1$ and each of the remaining $r - i$ cells has a capacity of $m_2 - 1$. Consequently, $C_R(\gamma, r, m_1 - 1)$ represents the number of allocations of $\gamma$ indistinguishable balls into $r$ distinguishable cells, each cell with capacity $m_1 - 1$. This is so because, in the case of $C_R$, there are no specified cells and all $i + (r - i) = r$ cells have the same capacity $m_1 - 1$.
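The ball-allocation interpretation can be checked numerically. The sketch below counts allocations with a simple dynamic program over the cells and, for the equal-capacity case, compares the result with the standard inclusion-exclusion sum $\sum_{j} (-1)^j \binom{r}{j} \binom{\gamma - j(c+1) + r - 1}{r - 1}$ for capacity $c$. These are generic counting routines under the interpretation stated above, not the expressions of Lemma 2.1 or Corollary 2.1 of Makri et al. (2007a); the function names are hypothetical, and the capacity arguments are passed already reduced (i.e. as $m_1 - 1$ and $m_2 - 1$).

```python
from math import comb


def count_allocations(gamma, capacities):
    """Ways to place gamma indistinguishable balls into distinguishable
    cells, where cell j holds at most capacities[j] balls."""
    ways = [1] + [0] * gamma  # ways[b]: allocations using b balls so far
    for cap in capacities:
        new = [0] * (gamma + 1)
        for filled, w in enumerate(ways):
            if w:
                for extra in range(min(cap, gamma - filled) + 1):
                    new[filled + extra] += w
        ways = new
    return ways[gamma]


def c_m(gamma, i, rest, cap1, cap2):
    """i cells of capacity cap1 and `rest` cells of capacity cap2."""
    return count_allocations(gamma, [cap1] * i + [cap2] * rest)


def c_r(gamma, r, cap):
    """All r cells share the same capacity (inclusion-exclusion)."""
    if r == 0:
        return 1 if gamma == 0 else 0
    return sum(
        (-1) ** j * comb(r, j) * comb(gamma - j * (cap + 1) + r - 1, r - 1)
        for j in range(r + 1)
        if gamma - j * (cap + 1) >= 0
    )


if __name__ == "__main__":
    print(c_m(3, 1, 2, 1, 2))            # one cell of capacity 1, two of capacity 2
    print(c_r(4, 3, 2), c_m(4, 3, 0, 2, 2))  # equal capacities: both routes agree
```

The dynamic program handles arbitrary capacities, whereas the inclusion-exclusion sum applies only when all cells share one capacity, mirroring the relation between $C_M$ and $C_R$ described above.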
By symmetry, the numbers $R^{(\alpha)}_{n,k,\ell}$ also provide the respective numbers associated with $\ell$-overlapping 0-runs of length $k$ in all $2^n$ binary strings of length $n$. The numbers $R^{(\alpha)}_{n,k,\ell}$ follow, in general, the behavior of the statistics $X^{(\alpha)}_{n,k,\ell}$. Tables 3 and 4 present the numbers $R^{(L)}_{n,k,\ell}$ and $R^{(C)}_{n,k,\ell}$ for binary strings of length $n = 2^t$, $t = 1, 2, 3$, and for $1 \le k \le n$, $0 \le \ell \le k - 1$. As an illustration, notice that (see also Example 3) $R^{(L)}_{4,2,1} = 12$, which is directly computed by the third of (21). The entries of Tables 3 and 4, efficiently computed by (21)-(22), also confirm the previously noticed ordering among the depicted numbers. Furthermore, they suggest that the numbers $R^{(L)}_{n,k,\ell}$ and $R^{(C)}_{n,k,\ell}$ decrease exponentially as $k$ increases towards $n$. They eventually reach $R^{(L)}_{n,n,\ell} = 1$, $\ell = 0, 1, \ldots, n - 1$, and $R^{(C)}_{n,n,\ell} = \lfloor n/(n - \ell) \rfloor$, $\ell = 0, 1, \ldots, n - 1$, respectively. Finally, in order to get a sense of the values of the numbers $R^{(\alpha)}_{n,k,\ell}$ even for moderate values of $n$, we compute and present some such numbers in Example 5, which continues Example 4.

Concluding Review
In this paper we considered 0-1 sequences of finite length $n$, ordered on a line or on a circle. For such sequences we studied run statistics, $X^{(\alpha)}_{n,k,\ell}$, $\alpha = L$ (line), $C$ (circle), associated with the counting of 1-runs of a fixed length $k$, assuming that these runs may have overlapping parts of a given length $\ell$, $0 \le \ell < k \le n$. Other schemes, along with the associated statistics $Y^{(\beta,\alpha)}_{n,k}$, counting runs of 1s of length exactly equal to $k$ ($\beta = E$) and greater than or equal to $k$ ($\beta = G$), are also mentioned. Relationships among the statistics $X^{(\alpha)}_{n,k,\ell}$ and $Y^{(\beta,\alpha)}_{n,k}$ are discussed. Both $X^{(\alpha)}_{n,k,\ell}$ and $Y^{(\beta,\alpha)}_{n,k}$ are defined on any binary sequence ordered on a line or on a circle via appropriate indicators. Using these indicators, in Section 2.1 the expected values, $E(X^{(\alpha)}_{n,k,\ell}; p)$, of $X^{(\alpha)}_{n,k,\ell}$ for Bernoulli sequences with probability of 1s $p$ were expressed. The formulae (8)-(9) for $E(X^{(\alpha)}_{n,k,\ell}; p)$, $\alpha = L, C$, are simple, computationally efficient and general. Owing to this generality, some known results for $E(X^{(\alpha)}_{n,k,\ell}; p)$, $\alpha = L, C$ and $\ell = 0, k - 1$, are recaptured as special cases of the unified approach. An alternative expression for $E(X^{(L)}_{n,k,0}; p)$, given by (11), which is not implied for $\ell = 0$ by the general formula (8) for $E(X^{(L)}_{n,k,\ell}; p)$, is obtained.
After the review in Section 2.1 of some results concerning $E(X^{(\alpha)}_{n,k,\ell}; p)$, the numbers $N^{(\alpha)}_{x;n,k,\ell}$ and $R^{(\alpha)}_{n,k,\ell}$, $\alpha = L, C$, are introduced and studied on binary strings (symmetric Bernoulli sequences) in Section 2.2. $R^{(\alpha)}_{n,k,\ell}$ is the number of the $\ell$-overlapping runs of 1s of length $k$ in all $2^n$ binary strings of length $n$ and is associated with $E(X^{(\alpha)}_{n,k,\ell}; 1/2)$. $N^{(\alpha)}_{x;n,k,\ell}$ is the number of those strings, among the $2^n$ strings, which contain $x$ $\ell$-overlapping runs of 1s of length $k$ and is related to the PMF of $X^{(\alpha)}_{n,k,\ell}$. Accordingly, for $N^{(\alpha)}_{x;n,k,\ell}$ we give, by Equations (17)-(18), closed expressions in terms of sums of binomial coefficients, whereas for $R^{(\alpha)}_{n,k,\ell}$ we provide, by Equations (21)-(22), explicit closed expressions in addition to those, i.e. Equations (19), that can be derived using $N^{(\alpha)}_{x;n,k,\ell}$. The expressions so obtained further clarify the interdependencies among the numbers $N^{(\alpha)}_{x;n,k,\ell}$ and $R^{(\alpha)}_{n,k,\ell}$, which might be useful in applications like those mentioned in the Introduction.

Table 3. Number of occurrences of $\ell$-overlapping 1-runs of length $k$, $R^{(L)}_{n,k,\ell}$, in binary strings of length $n$ ordered on a line.