Note on the Rademacher-Walsh Polynomial Basis Functions

Over the years, one of the methods of choice for estimating the probability density function of a random variable defined on a binary input space has been to expand the estimate in Rademacher-Walsh polynomial basis functions. For a set of L features (often considered as an "L-dimensional binary vector"), the Rademacher-Walsh polynomial approach requires $2^L$ basis functions. This quickly becomes computationally complicated and notationally clumsy whenever L is large. In current pattern recognition applications, L is often 100 or more.


Introduction
When x is an "L-dimensional binary vector" whose components take binary values (0 or 1), the probability density function of x, p(x), can be approximated by using a set of basis functions. It is often the case that p(x) is estimated through the expansion in $2^L$ Rademacher-Walsh polynomial basis functions $\phi_i$ (Note 1) (Duda & Hart, 1973; Hand, 1981),

$$p(x) = \sum_{i=0}^{2^L-1} \alpha_i\, \phi_i(x), \tag{1}$$

where

$$\alpha_i = \frac{1}{2^L} \sum_{x \in B} \phi_i(x)\, p(x). \tag{2}$$

The coefficients $\alpha_i$ can be viewed as moments (Duda & Hart, 1973), which can be estimated as

$$\hat{\alpha}_i = \frac{1}{N\, 2^L} \sum_{j=1}^{N} \phi_i(x_j), \tag{3}$$

where N refers to the number of available samples, $\{x_j\}_{j=1}^{N}$, drawn from the underlying probability distribution being estimated. Throughout the paper it is assumed that the L-dimensional random variables reside in a binary input space B with $B = \{0, 1\}^L$, and the descriptions and notations given in (Duda & Hart, 1973) are adopted.
If the available N samples are distinct instances and $N = 2^L$, the estimated coefficients $\hat{\alpha}_i$ are exact (Tou & Gonzalez, 1974). However, exact or not, the expansion in Equation 1 requires $2^L$ Rademacher-Walsh polynomial basis functions, which can make the estimation notationally clumsy and computationally complicated whenever the value of L is large (Duda & Hart, 1973).
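The expansion and its estimated coefficients can be illustrated for a small L with the following minimal Python sketch. The function names (`rw_basis`, `estimate_coefficients`, `expansion`), the ordering of the products within each group, the normalisation $1/(N\,2^L)$ of the coefficient estimate, and the sample data are all assumptions made for illustration; they are not taken from the original text.

```python
from itertools import combinations, product
import math

def rw_basis(x):
    """Return all 2**L Rademacher-Walsh values for a binary tuple x.

    Products of (2*x_l - 1) are formed none at a time, one at a time,
    two at a time, and so on, so the first entry (phi_0) is always 1.
    """
    s = [2 * v - 1 for v in x]            # map {0, 1} -> {-1, +1}
    L = len(s)
    return [math.prod(s[l] for l in subset)
            for r in range(L + 1)
            for subset in combinations(range(L), r)]

def estimate_coefficients(samples):
    """Assumed estimate: alpha_i = (1 / (N * 2**L)) * sum_j phi_i(x_j)."""
    N, L = len(samples), len(samples[0])
    totals = [0] * (2 ** L)
    for xj in samples:
        for i, phi in enumerate(rw_basis(xj)):
            totals[i] += phi
    return [t / (N * 2 ** L) for t in totals]

def expansion(alpha, x):
    """Evaluate sum_i alpha_i * phi_i(x)."""
    return sum(a * phi for a, phi in zip(alpha, rw_basis(x)))

# Hypothetical sample set with L = 3 and N = 4.
samples = [(0, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 0)]
alpha = estimate_coefficients(samples)
estimates = {x: expansion(alpha, x) for x in product((0, 1), repeat=3)}
for x, p in estimates.items():
    print(x, p)
```

Because every call to `rw_basis` enumerates $2^L$ products, the sketch is only practical for small L, which is the difficulty the text describes.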
In passing we point out that one can employ a subset of the $2^L$ Rademacher-Walsh polynomials in the expansion, but this may result in an estimate of p(x), i.e., $\hat{p}(x)$, that can take negative values (Hand, 1981, p. 106).
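A small numerical illustration of this effect is sketched below in Python. It keeps only the constant and first-order terms of the expansion for a hypothetical sample set in which every sample equals (1, 1, 1). The helper names, the choice of which terms to keep, and the normalisation $1/(N\,2^L)$ are assumptions made for illustration only.

```python
from itertools import combinations
import math

def rw_basis(x):
    """All 2**L products of (2*x_l - 1), taken 0, 1, 2, ... at a time."""
    s = [2 * v - 1 for v in x]
    L = len(s)
    return [math.prod(s[l] for l in subset)
            for r in range(L + 1)
            for subset in combinations(range(L), r)]

def truncated_estimate(samples, x, n_terms):
    """Expansion restricted to the first n_terms basis functions."""
    N, L = len(samples), len(x)
    sample_basis = [rw_basis(xj) for xj in samples]
    phi_x = rw_basis(x)
    total = 0.0
    for i in range(n_terms):
        # Assumed coefficient estimate: (1 / (N * 2**L)) * sum_j phi_i(x_j).
        alpha_i = sum(b[i] for b in sample_basis) / (N * 2 ** L)
        total += alpha_i * phi_x[i]
    return total

L = 3
samples = [(1, 1, 1)] * 4                     # hypothetical data
x = (0, 0, 0)
first_order = truncated_estimate(samples, x, n_terms=1 + L)   # subset of terms
full = truncated_estimate(samples, x, n_terms=2 ** L)         # all 2**L terms
print(first_order, full)
```

With this data the truncated value at (0, 0, 0) is negative, whereas the full expansion gives zero there.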
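Putting Equation 3 into Equation 1 yields (Meisel, 1972)

$$\hat{p}(x) = \frac{1}{N} \sum_{j=1}^{N} K(x_j, x), \tag{4}$$

where

$$K(x_j, x) = \frac{1}{2^L} \sum_{i=0}^{2^L-1} \phi_i(x_j)\, \phi_i(x). \tag{5}$$

Clearly, for all practical purposes, $L \ll \infty$; besides, $\phi_i(x_j)$ and $\phi_i(x)$ can only take the values 1 or -1, as illustrated by Note 1. Hence the sum in Equation 5 is finite, with $|K(x_j, x)| \le 1$, which means that $K(x_j, x)$ is a valid positive definite kernel function (Aronszajn, 1950; Shawe-Taylor & Cristianini, 2004) on $B \times B$.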
In Equation 4, the estimate of p(x) at x can be instructively viewed as an average of how similar x is to the given N samples $x_j$, where $K(x_j, x)$ is the similarity function (cf. the popular Parzen Window approach (Parzen, 1962)).
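The "average similarity" reading can be sketched in Python for a small L by summing the basis-function products directly and averaging the resulting kernel values over the samples. The names `kernel` and `kernel_estimate`, the factor $1/2^L$ in the kernel, and the sample data are assumptions for illustration; the relative frequency of x in the sample is printed alongside purely for comparison.

```python
from itertools import combinations, product
import math

def rw_basis(x):
    """All 2**L products of (2*x_l - 1), taken 0, 1, 2, ... at a time."""
    s = [2 * v - 1 for v in x]
    L = len(s)
    return [math.prod(s[l] for l in subset)
            for r in range(L + 1)
            for subset in combinations(range(L), r)]

def kernel(xj, x):
    """Assumed form: K(x_j, x) = 2**(-L) * sum_i phi_i(x_j) * phi_i(x)."""
    return sum(a * b for a, b in zip(rw_basis(xj), rw_basis(x))) / 2 ** len(x)

def kernel_estimate(samples, x):
    """Average similarity of x to the samples: (1/N) * sum_j K(x_j, x)."""
    return sum(kernel(xj, x) for xj in samples) / len(samples)

# Hypothetical sample set with L = 3 and N = 4.
samples = [(0, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 0)]
for x in product((0, 1), repeat=3):
    freq = samples.count(x) / len(samples)
    print(x, kernel_estimate(samples, x), freq)
```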
For the expression in Equation 4 to have any practical use, knowledge of the closed form of the kernel function $K(x_j, x)$ is essential. In the following section we formulate a theorem stating that $K(x_j, x)$ in Equation 5 is a Dirac kernel function (Jacob & Vert, 2008). In that section we also develop the tools necessary for proving this theorem.
The full proof of the theorem is presented in Section 3 followed by our concluding remarks in the final section.
For notational simplicity, in this work x denotes both the random vector and the values it may assume.

Method
This section introduces the theorem that constitutes the core of this paper, namely that $K(x_j, x)$ is a Dirac kernel function. The tools (e.g., definitions, lemmas, propositions and remarks) that are essential for proving the theorem are also presented in this section.
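Theorem 1 If $x$ and $x_j \in B$ with $B = \{0, 1\}^L$, and $\phi_i(\cdot)$ are Rademacher-Walsh polynomial basis functions defined on B, then

$$K(x_j, x) = \frac{1}{2^L} \sum_{i=0}^{2^L-1} \phi_i(x_j)\, \phi_i(x) = \begin{cases} 1, & x_j = x, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

i.e., $K(x_j, x)$ is a Dirac kernel function.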
Here $x_j = x$ means that $x_{j1} = x_1, x_{j2} = x_2, \ldots, x_{jL} = x_L$, with $x_{jl}$ and $x_l$ referring to the binary-valued l-th elements of $x_j$ and $x$, respectively. (Equation 6 can be viewed as showing that the basis functions $\phi_i/\sqrt{2^L}$ satisfy the "orthonormality" relation in the arguments $x_j$ and $x$.)
As described in the Introduction, the set $\{\phi_i(x)\}_{i=0}^{2^L-1}$ is obtained by systematically forming products of $(2x_l - 1)$ none at a time, one at a time, two at a time, three at a time, etc., where l = 1, 2, ..., L. By the same token, the set $\{\phi_i(x_j)\phi_i(x)\}_{i=0}^{2^L-1}$ is obtained by forming products of the distinct terms $(2x_{jl} - 1)(2x_l - 1)$ none at a time, one at a time, two at a time, three at a time, and so on:

$$\begin{aligned}
\phi_0(x_j)\phi_0(x) &= 1, \\
\phi_1(x_j)\phi_1(x) &= (2x_{j1} - 1)(2x_1 - 1), \\
&\;\;\vdots \\
\phi_L(x_j)\phi_L(x) &= (2x_{jL} - 1)(2x_L - 1), \\
\phi_{L+1}(x_j)\phi_{L+1}(x) &= (2x_{j1} - 1)(2x_1 - 1)(2x_{j2} - 1)(2x_2 - 1), \\
&\;\;\vdots \\
\phi_{2^L-1}(x_j)\phi_{2^L-1}(x) &= \prod_{l=1}^{L} (2x_{jl} - 1)(2x_l - 1).
\end{aligned}$$

Remark 1 By definition $\phi_0(x_j)\phi_0(x) = 1$, and self-evidently $\phi_i(x_j)\phi_i(x)$ (where $1 \le i \le L$) can only take the values 1 or -1. This also means that $\phi_i(x_j)\phi_i(x)$ can only take the values 1 or -1 where $L + 1 \le i \le 2^L - 1$.
Before we embark on proving Theorem 1, we show that the following lemma (Lemma 1) holds.
Lemma 1 Let $a_1, a_2, \ldots, a_L$ be L distinguishable real variables which can take the values 1 and -1, and assume that combinatorial compositions can be considered as products, i.e., $a_k a_j$ means $a_k \times a_j$, where k, j = 1, 2, ..., L. The sum of their possible combinatorial compositions $z_i$, with $i = 0, 1, \ldots, 2^L - 1$, then gives:

$$\sum_{i=0}^{2^L-1} z_i = \begin{cases} 2^L, & \text{if } a_k = 1 \text{ for all } k, \\ 0, & \text{otherwise.} \end{cases}$$

Proof. The possible combinations are the L variables chosen: none at a time; one variable, $a_i$, at a time; two variables, $a_i a_j$, at a time; three variables, $a_i a_j a_k$, at a time; ...; or L variables, $a_1 a_2 \cdots a_L$, at a time.
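Scenario (1):
All the L variables take the value of 1, i.e., $a_k = 1$, where k = 1, 2, ..., L.
Let $z_0 = +1$ (when none is chosen). Clearly, the number of times that none of the variables is chosen is $\binom{L}{0}$, which can also be written as $^{L}C_{0}$. The number of combinatorial terms containing one variable is $^{L}C_{1} = \binom{L}{1}$. Similarly, the numbers of combinatorial terms consisting of two, three, four, ..., and L variables are $^{L}C_{2} = \binom{L}{2}$, $^{L}C_{3} = \binom{L}{3}$, $^{L}C_{4} = \binom{L}{4}$, ..., and $^{L}C_{L} = \binom{L}{L}$, respectively. In other words,

$$\sum_{i=0}^{2^L-1} z_i = \sum_{r=0}^{L} {}^{L}C_{r} = \sum_{r=0}^{L} \binom{L}{r}.$$

But from the binomial expansion theorem,

$$\sum_{r=0}^{L} \binom{L}{r} = (1 + 1)^L = 2^L.$$

The use of both $^{L}C_{r}$ and $\binom{L}{r}$ may seem somewhat superfluous, but the reason for this will become clear in the following discussion.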

Scenario (2):
All the L variables take the value of -1, i.e., $a_k = -1$, where k is as defined before.
Without loss of generality, let m and k denote the numbers of variables that take the values -1 and 1, respectively, where L = m + k.
In this scenario one is required to demonstrate that

$$\sum_{r=0}^{m+k} {}^{m+k}C_{r} = 0. \tag{13}$$

Fortunately, Equation 13 can be readily proved by induction, provided $^{m+k}C_{r}$ is expressed in terms of $^{m}C_{r}$'s.
In this case it is germane to recall the following important identities where r, j, n are non-negative integers and r ≤ n (Riley et al., 2007): $^{m+1}C_{r}$, which can be expressed as Making use of Identity I, the $^{m+1}C_{r}$ on the RHS of the equation above becomes $^{m}C_{r} + {}^{m}C_{r-1}$, i.e., the equation can be rewritten as This equation can be modified further by applying Identities III and II to the first and last terms on the RHS of the last equation, respectively, resulting in , evidently the equation above can be written as In Scenario (2) we have demonstrated that in the case that all the variables (denoted here by m) take the value of -1, $^{m}C_{r} = (-1)^{r}\binom{m}{r}$. This means Applying Identity I to $^{m+2}C_{r}$ in the middle term on the RHS of the equation above, we obtain By following the same line of reasoning as employed in the case of k = 1, we can modify the equation above further.
Applying Identities III and II to the first and last terms on the RHS of the equation above, respectively, gives By the virtue of Equation 14 13.
This finalizes the proof of Lemma 1.
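The statement of Lemma 1 can also be checked by brute force for a small L, as in the Python sketch below. The function name `composition_sum` and the choice L = 4 are illustrative assumptions. The check relies on the algebraic fact that the sum of all products of the $a_k$ taken none, one, two, ..., L at a time factorises as $\prod_{k=1}^{L}(1 + a_k)$, which vanishes as soon as any $a_k = -1$; this is a different route from the induction used in the proof above.

```python
from itertools import combinations, product
import math

def composition_sum(a):
    """Sum of z_0, ..., z_{2**L - 1}: products of the a_k taken r at a time."""
    L = len(a)
    return sum(math.prod(a[k] for k in subset)
               for r in range(L + 1)
               for subset in combinations(range(L), r))

L = 4
# Evaluate the sum for every assignment of +1 / -1 to the L variables.
results = {a: composition_sum(a) for a in product((1, -1), repeat=L)}

for a, total in results.items():
    # Compare against the factorised form prod_k (1 + a_k).
    print(a, total, math.prod(1 + ak for ak in a))
```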

Results
In the preceding section, we have attempted to develop the essential tools for proving the proposed theorem, Theorem 1. In this section the full proof of the theorem is given.
Proof of Theorem 1.
k = k, one just needs to repeat the process above k times, which gives the RHS of Equation