nsa4897878.htm

Posted Dec 21, 1999


tags | encryption
SHA-256 | bbc074a09960c678accbb39613e4d2bed8a6361667575eeb2472b5e76c82e60a

<HTML>
<HEAD>
<TITLE>NSA Patent 4,897,878 - 1990</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<P>
29 May 1999<BR>
Source: US Patent Office Online:
<A HREF="http://www.uspto.gov/">http://www.uspto.gov/</A>
<P>
Search "National Security Agency" though none of the patents disclose the
full name.
<P>
For related images see IBM's patent server:
<A HREF="http://www.patents.ibm.com/ibm.html">http://www.patents.ibm.com/ibm.html</A>
<P>
<HR>
<TABLE WIDTH="100%">
<TR>
<TD ALIGN="LEFT" WIDTH="50%"><B>United States Patent </B></TD>
<TD ALIGN="RIGHT" WIDTH="50%"><B> 4,897,878 </B></TD>
</TR>
<TR>
<TD ALIGN="LEFT" WIDTH="50%"><B> Boll , &nbsp; et al.</B></TD>
<TD ALIGN="RIGHT" WIDTH="50%"><B>January 30, 1990 </B></TD>
</TR>
</TABLE>
<P>
<HR>
<FONT size="+1"> Noise compensation in speech recognition apparatus
</FONT><BR>
<BR>
<CENTER>
<B>Abstract</B>
</CENTER>
<P>
A method and apparatus for noise suppression for speech recognition systems
which employs the principle of least mean squares estimation, which is
implemented with conditional expected values. Essentially, according to this
method, one computes a series of optimal estimators which estimators and
their variances are then employed to implement a noise immune metric. This
noise immune metric enables the system to substitute a noisy distance with
an expected value which value is calculated according to combined speech
and noise data which occurs in the bandpass filter domain. Thus the system
can be used with any set of speech parameters and is relatively independent
of a specific speech recognition apparatus structure.
<P>
<HR>
<TABLE WIDTH="100%">
<TR>
<TD VALIGN="TOP" ALIGN="LEFT" WIDTH="10%">Inventors:</TD>
<TD ALIGN="LEFT" WIDTH="90%"><B>Boll; Steven F.</B> (San Diego, CA); <B>Porter;
Jack E.</B> (San Diego, CA)</TD>
</TR>
<TR>
<TD VALIGN="TOP" ALIGN="LEFT" WIDTH="10%">Assignee:</TD>
<TD ALIGN="LEFT" WIDTH="90%"><B>ITT Corporation</B> (New York, NY)</TD>
</TR>
<TR>
<TD VALIGN="TOP" ALIGN="LEFT" WIDTH="10%" NOWRAP>Appl. No.:</TD>
<TD ALIGN="LEFT" WIDTH="90%"><B> 769215</B></TD>
</TR>
<TR>
<TD VALIGN="TOP" ALIGN="LEFT" WIDTH="10%">Filed:</TD>
<TD ALIGN="LEFT" WIDTH="90%"><B>August 26, 1985</B></TD>
</TR>
</TABLE>
<P>
<TABLE WIDTH="100%">
<TR>
<TD VALIGN=TOP ALIGN="LEFT" WIDTH="40%"><B>U.S. Class:</B></TD>
<TD VALIGN=TOP ALIGN="RIGHT" WIDTH="60%"><B>381/43</B>; 381/47</TD>
</TR>
<TR>
<TD VALIGN=TOP ALIGN="LEFT" WIDTH="40%"><B>Intern'l Class: </B></TD>
<TD VALIGN=TOP ALIGN="RIGHT" WIDTH="60%">G10L 007/08</TD>
</TR>
<TR>
<TD VALIGN=TOP ALIGN="LEFT" WIDTH="40%"><B>Field of Search: </B></TD>
<TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="60%">381/41-50 364/513.5</TD>
</TR>
</TABLE>
<P>
<HR>
<CENTER>
<B>References Cited
<A href="/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2Fsearch-adv.htm&r=0&f=S&l=50&d=CR90&Query=ref/4,897,878">[Referenced
By]</A></B>
</CENTER>
<P>
<HR>
<CENTER>
<B>U.S. Patent Documents</B>
</CENTER>
<TABLE WIDTH="100%">
<TR>
<TD WIDTH="25%"><A href="/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN%2F4499594">4499594</A></TD>
<TD WIDTH="25%">Feb., 1985</TD>
<TD WIDTH="25%" ALIGN="LEFT">Lewinter</TD>
<TD WIDTH="25%" ALIGN="RIGHT">381/46.</TD>
</TR>
<TR>
<TD WIDTH="25%"><A href="/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN%2F4567606">4567606</A></TD>
<TD WIDTH="25%">Jan., 1986</TD>
<TD WIDTH="25%" ALIGN="LEFT">Vensko et al.</TD>
<TD WIDTH="25%" ALIGN="RIGHT">381/43.</TD>
</TR>
<TR>
<TD WIDTH="25%"><A href="/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN%2F4624008">4624008</A></TD>
<TD WIDTH="25%">Nov., 1986</TD>
<TD WIDTH="25%" ALIGN="LEFT">Vensko et al.</TD>
<TD WIDTH="25%" ALIGN="RIGHT">381/43.</TD>
</TR>
<TR>
<TD COLSPAN="4" ALIGN="CENTER"><B>Foreign Patent Documents</B></TD>
</TR>
<TR>
<TD WIDTH="25%">0216118</TD>
<TD WIDTH="25%">Apr., 1987</TD>
<TD WIDTH="25%" ALIGN="LEFT">EP</TD>
<TD WIDTH="25%" ALIGN="RIGHT">381/46.</TD>
</TR>
</TABLE>
<P>
<BR>
<TABLE WIDTH="90%">
<TR>
<TD><BR>
<CENTER>
<B>Other References</B>
</CENTER>
</TD>
<TD ALIGN=LEFT><BR>
Boll, Suppression of Acoustic Noise in Speech Using Spectral Subtraction,
IEEE Trans. on ASSP, vol. ASSP-27, No. 3, Apr. 1979, pp. 113-120. <BR>
Tierney, A Study of LPC Analysis of Speech in Additive Noise, IEEE Trans.
on ASSP, vol. ASSP-28, No. 4, Aug. 1980, pp. 389-397. <BR>
F. Itakura, Minimum Prediction Residual Principle Applied to Speech Recognition,
IEEE Trans. on ASSP, vol. ASSP-23, pp. 67-72, Feb. 1975. <BR>
Porter and Boll, "Optimal Estimators for Spectral Restoration of Noisy Speech",
ICASSP, San Diego, CA, Mar. 1984, pp. 18A.2.1-18A.2.4.</TD>
</TR>
</TABLE>
<P>
<BR>
<I>Primary Examiner:</I> Clark; David L. <BR>
<I>Assistant Examiner:</I> Knepper; David D. <BR>
<I>Attorney, Agent or Firm:</I> Twomey; Thomas N. Werner; Mary C. <BR>
<HR>
<CENTER>
<B><I>Government Interests</I></B>
</CENTER>
<P>
<HR>
<BR>
<BR>
The invention was made with Government support under Contract No.
MDA-904-83-C-0475 awarded by the
<A Name=h1 HREF=#h0></A><A HREF=#h2></A><B><I>National Security
Agency</I></B>. The Government has certain rights in this invention.
<HR>
<CENTER>
<B><I>Claims</I></B>
</CENTER>
<P>
<HR>
<BR>
<BR>
1. A method of compensating for noisy input speech in order to improve the
recognition result of a speech recognition apparatus having an input for
unknown speech, converting means for converting the unknown speech into
time-sampled frames of speech signals representing its spectral distribution
over a given range of frequencies, storing means for storing templates of
known speech in the form of speech signals representing its spectral distribution
over the given range of frequencies, computing means for computing the minimum
mean square error of the Euclidean squared distance between the speech signals
of the unknown speech compared with the speech signals of the template speech,
and recognizer means for producing a recognition result based upon the minimum
mean square error computed by the computing means, <BR>
<BR>
wherein said method of compensating for noisy input speech comprises the
following steps for producing an improved minimum mean square error estimate
conditioned by compensatory characteristics of the noisy input speech:
<BR>
<BR>
(a) computing optimal estimated distance values over the given range of
frequencies for noise-free template speech, based upon comparing known speech
segments, which are input in a noise-free environment and converted into
corresponding templates of known speech signals t.sub.s, with unknown speech
segments, which are input in a noise-free environment and converted to unknown
speech signals u.sub.s ; <BR>
<BR>
(b) computing estimated variance values corresponding to the optimal estimated
distance values for a sample population of noise-free speech segments;
<BR>
<BR>
(c) storing said optimal estimated distance values and variance values on
a look-up table associated with the template speech; <BR>
<BR>
(d) computing squared distance values over the given range of frequencies
for input noisy unknown speech signals u.sub.s+n compared with signals t.sub.s+n
representing template speech to which a spectral representation of noise
n in the actual input environment is added; <BR>
<BR>
(e) replacing the computed squared distance values for the unknown speech
signals with conditional expected distance values calculated using the optimal
estimated distance values and variance values obtained from the look-up table,
in order to derive noise-immune metric values for the unknown speech signals;
and <BR>
<BR>
(f) computing the minimum mean square error of the noise-immune metric values
for the unknown speech signals compared with the noise-free template speech
signals, whereby an improved recognition result is obtained. <BR>
<BR>
2. The method according to claim 1, wherein said values are provided at specific
frequencies within the speech band. <BR>
<BR>
3. The method according to claim 2, wherein said frequencies employed are
at 300, 425, 1063, 2129 and 3230 Hz. <BR>
<BR>
4. The method according to claim 3, wherein said values are provided at selected
average signal-to-noise ratios. <BR>
<BR>
5. The method according to claim 4, wherein said average signal-to-noise
ratios are 0 db, 10 db, and 20 db. <BR>
<BR>
6. The method according to claim 2, wherein said values stored are indicative
of said first value at different frequencies within said speech bandwidth.
<BR>
<BR>
7. The method according to claim 4, wherein said values stored are indicative
of said first value at different signal-to-noise ratios. <BR>
<BR>
8. The method according to claim 1, wherein said values are replaced by mean
values to provide a new expected distance equal to: <BR>
<BR>
$d^2 = (t-u)^2 + \sigma_t^2 + \sigma_u^2$ <BR>
<BR>
where $t$ and $u$ are the expected values of the template and unknown and
$\sigma_t^2$ and $\sigma_u^2$ are the variances of the estimates.
<HR>
<CENTER>
<B><I> Description</I></B>
</CENTER>
<P>
<HR>
<BR>
<BR>
BACKGROUND OF THE INVENTION <BR>
<BR>
This invention relates to speech recognition systems and more particularly
to a noise compensation method employed in speech recognition systems.
<BR>
<BR>
Speech recognizers measure the similarity between segments of unknown and
template speech by computing the Euclidean distance between the respective
segment parameters. The squared Euclidean distance, as is known, is the sum of the
squared differences between such parameters. In such systems, adding noise to
the unknown speech, the template speech, or both causes the distance to become either
too large or too small and hence produces undesirable results in regard to
the speech recognizing capability of the system. <BR>
<BR>
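As a minimal sketch of the distance computation just described, the following shows how additive noise on the unknown frame inflates the squared Euclidean distance to a matching template. The frame values and the function name squared_euclidean are illustrative assumptions and do not come from the patent. <BR>

```python
def squared_euclidean(template, unknown):
    """Sum of squared differences between two parameter frames."""
    return sum((t - u) ** 2 for t, u in zip(template, unknown))


# Hypothetical 4-channel spectral frames (arbitrary units, made up for illustration).
template = [1.0, 4.0, 2.0, 0.5]
clean_unknown = [1.0, 4.0, 2.0, 0.5]   # same sound, recorded without noise
noise = [0.8, 0.1, 0.6, 0.9]           # additive noise in each channel
noisy_unknown = [u + n for u, n in zip(clean_unknown, noise)]

clean_distance = squared_euclidean(template, clean_unknown)
noisy_distance = squared_euclidean(template, noisy_unknown)
print(clean_distance, noisy_distance)
```

In this toy case a perfect match scores zero when clean but a clearly nonzero distance once noise is added, which is the kind of distortion the invention is meant to compensate. <BR>
<BR>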
As is known, speech may be represented as a sequence of frequency spectra
having different power levels across the frequency spectrum associated with
speech. In order to recognize speech, spectra from unknown words are compared
with known spectra. The storage of the known spectra occurs in the form of
templates or references. Basically, in such systems, unknown speech is processed
into, for example, digital signals, and these signals are compared with stored
templates indicative of different words. <BR>
<BR>
By comparing the unknown speech with the stored templates, one can recognize
the unknown speech and thereby assign a word to the unknown speech. Speech
recognition systems are being widely investigated and eventually will enable
a user to communicate with a computer or other electronic device by means
of speech. A major problem for speech recognition systems in general
is dealing with interfering noise, such as background noise, as well as
sounds made by a speaker which are not indicative of true words, such
as lip smacking, tongue clicks, and so on. Background noise and other
environmental noises produce
interfering spectra which prevent the recognition system from operating reliably.
<BR>
<BR>
In order to provide recognition with a noise background, the prior art has
attempted to implement various techniques. One technique is referred to as
noise masking. In this technique, one masks those parts of the spectrum which
are due to noise and leaves other parts of the spectrum unchanged. In these
systems both the input and template spectra are masked with respect to a
spectrum made up of maximum values of an input noise spectrum estimate and
a template noise spectrum estimate. In this way, the spectral distance between
input and template may be calculated as though the input and template speech
signals were obtained in the same noise background. Such techniques have
many disadvantages. For example, the presence of high noise level in one
spectrum can be cross coupled to mask speech signals in the other spectrum.
<BR>
<BR>
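The noise masking technique described above might be sketched as follows. The text states only that the mask is the maximum of the input and template noise spectrum estimates; the rule used here for applying it (raising any spectral value below the mask up to the mask level) is an assumption, as are the function name noise_mask and all numeric values. <BR>

```python
def noise_mask(input_spec, template_spec, input_noise, template_noise):
    """Mask both spectra with the channel-wise maximum of the two noise estimates.

    Assumption: masking replaces any value below the mask with the mask value.
    """
    mask = [max(a, b) for a, b in zip(input_noise, template_noise)]
    masked_input = [max(x, m) for x, m in zip(input_spec, mask)]
    masked_template = [max(t, m) for t, m in zip(template_spec, mask)]
    return masked_input, masked_template


# Hypothetical two-channel example: the input has strong noise in channel 0.
input_spec = [6.5, 3.0]
template_spec = [4.0, 1.0]
input_noise = [6.0, 0.5]
template_noise = [0.5, 0.5]

masked_input, masked_template = noise_mask(
    input_spec, template_spec, input_noise, template_noise
)
print(masked_input, masked_template)
```

In this example the template's channel 0 value of 4.0 is overwritten by the input's noise level of 6.0, which illustrates the cross coupling disadvantage mentioned above. <BR>
<BR>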
These systems require extensive mathematical computation and are therefore
extremely expensive, while relatively unreliable. In other techniques proposed
in the prior art, one measures the instantaneous signal-to-noise ratio and
replaces the noisy distance with a predetermined constant. This substitution
has the effect of ignoring information in those frequency intervals where
the signal-to-noise ratio is poor. In any event, this creates other problems
in that the speech recognition system may ignore unknown speech segments
by mistaking them for noise or may match a template to a dissimilar
unknown speech segment. Hence the above noted approach produces many
undesirable errors. <BR>
<BR>
It is, therefore, an object of the present invention to provide an improved
speech recognition system whereby the noisy distance is replaced with its
expected value. <BR>
<BR>
It is a further object to provide a speech recognition system which will
reduce the above noted problems associated with prior art systems. <BR>
<BR>
As will be explained, the system according to this invention replaces the
noisy distance with its expected value. In this manner incorrect low scores
are increased and incorrect high scores are decreased. The procedures according
to this invention require no operator intervention nor empirically determined
thresholds. The system can be used with any set of speech parameters and
is relatively independent of a specific speech recognition apparatus structure.
<BR>
<BR>
BRIEF DESCRIPTION OF THE PREFERRED EMBODIMENT <BR>
<BR>
A method of reducing noise in a speech recognition apparatus by using the
minimum mean square error estimate of the actual squared Euclidean distance
between template speech and unknown speech signals as conditioned by noisy
observations, comprising the steps of providing a first value $t_s$ indicative
of a noise-free template, providing a second value $u_s$ indicative
of the noise-free unknown, providing a third value $t_{s+n}$ indicative of a noisy
template, providing a fourth value $u_{s+n}$ indicative of a noisy unknown speech
signal, providing a fifth value $P_s$ indicative of the average power of said
unknown speech signals, providing a sixth value $P_n$ indicative of the average
power of noise signals, computing a new expected distance between said
template speech and said unknown speech signals according to the following
algorithm: <BR>
<BR>
$d^2 = E[(t_s - u_s)^2 \mid t_{s+n}, u_{s+n}, P_s, P_n]$, <BR>
<BR>
and using said new expected distance as computed to measure the similarity between
said template speech and said unknown speech. <BR>
<BR>
BRIEF DESCRIPTION OF THE FIGURES <BR>
<BR>
Before proceeding with a brief description of the figures, it is noted that
there are seven Appendices which form part of this application, and the Figures
below are referred to in the Appendices as well as in the specification.
<BR>
<BR>
FIG. 1 is a graph depicting the distribution of real speech spectral magnitude
and of a Gaussian distribution. <BR>
<BR>
FIG. 2 is a graph depicting the optimal estimator averaged over all frequencies
versus prior art estimators. <BR>
<BR>
FIGS. 3a-3e are a series of graphs showing optimal estimator mappings for
different functions of frequencies. <BR>
<BR>
FIG. 4 is a graph showing the dependence of optimal estimators on signal-to-noise
ratio. <BR>
<BR>
FIG. 5 is a graph depicting a mean and standard deviation plotted against
noisy versus clean fourth rooted spectral frames. <BR>
<BR>
FIG. 6 is a mean and standard deviation plotted against noisy versus clean
speech and silence frames. <BR>
<BR>
FIGS. 7 and 8 are graphs depicting predicted and observed Pcum for noise
only and noisy speech cases. <BR>
<BR>
FIGS. 9 to 11 are graphs depicting clean distance and alternative metrics
versus noisy unknown parameters for given template frames. <BR>
<BR>
FIGS. 12 to 17 are a series of graphs depicting wordspotting performance
using unnormalized parameters according to this invention. <BR>
<BR>
FIG. 18 is a simple block diagram of a word recognition apparatus which can
be employed with this invention. <BR>
<BR>
FIG. 19 is a flow chart illustrating the method for compensating for noisy
input speech employed in the invention. <BR>
<BR>
DETAILED DESCRIPTION OF THE DRAWINGS <BR>
<BR>
As will be explained, this invention describes a method of noise suppression
for speech recognition systems which method employs the principle of least
mean squares estimation implemented with conditional expected values. The
method can be applied to both restoration of noisy spectra and the implementation
of noise immune metrics. <BR>
<BR>
The method has been employed in two types of recognition systems: a
connected digit recognition system and a wordspotting system. In both
systems the method resulted in improvements in operation.
<BR>
<BR>
Before describing the method, a few general comments concerning the problem
are in order. As indicated, additive noise degrades the performance
of digital voice processors used for speech compression and recognition.
Methods to suppress noise are grouped according to how they obtain information
about the noise and integrate that information into the suppression procedures.
<BR>
<BR>
Spectral subtraction methods use the Short Time Fourier Transform to transform
the speech into magnitude spectrum. The noise bias is subtracted and the
modified magnitude is recombined with the phase and inverse transformed.
These methods vary according to the amount of the noise bias subtracted from
the magnitude spectrum, the type of non-linear mapping applied to the magnitude
spectrum, and the parameters of frequency analysis used. <BR>
<BR>
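A minimal single-frame sketch of the spectral subtraction procedure described above is given here. It uses a plain discrete Fourier transform on one frame rather than a full Short Time Fourier Transform with windowing and overlap, floors the result at zero as one possible non-linear mapping, and exposes a scale factor for the amount of noise bias subtracted. These choices, the function names, and the test frame are assumptions for illustration only. <BR>

```python
import cmath


def dft(frame):
    """Discrete Fourier transform of one frame (direct O(n^2) form)."""
    n = len(frame)
    return [
        sum(frame[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def idft(spectrum):
    """Inverse transform, keeping the real part of each output sample."""
    n = len(spectrum)
    return [
        sum(spectrum[m] * cmath.exp(2j * cmath.pi * m * k / n) for m in range(n)).real / n
        for k in range(n)
    ]


def spectral_subtract(frame, noise_magnitude, scale=1.0):
    """Subtract a noise magnitude estimate per bin and keep the noisy phase."""
    restored = []
    for value, noise in zip(dft(frame), noise_magnitude):
        magnitude = max(abs(value) - scale * noise, 0.0)   # floor at zero (assumed mapping)
        restored.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(restored)


# Hypothetical 8-sample frame.
frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
unchanged = spectral_subtract(frame, [0.0] * 8)     # no noise estimate: frame is returned
silenced = spectral_subtract(frame, [100.0] * 8)    # huge noise estimate: everything floored
print(unchanged, silenced)
```

With a zero noise estimate the frame is recovered, and with an estimate larger than every spectral magnitude the output is silence; practical settings lie between these extremes. <BR>
<BR>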
In any event, based on such considerations, acoustic noise suppression is
treated as a problem of finding the minimum mean square error estimate of
the speech spectrum from a noisy version. This estimate equals an expected
value of its conditional distribution given the noisy spectral value, the
mean noise power and the mean speech power. It has been shown that speech
is not Gaussian. This results in an optimal estimate which is a non-linear
function of the spectral magnitude. Since both speech and Gaussian noise
have a uniform phase distribution, the optimal estimator of the phase equals
the noisy phase. This estimator has been calculated directly from noise-free
speech. See a paper entitled OPTIMAL ESTIMATORS FOR SPECTRAL RESTORATION
OF NOISY SPEECH by J. E. Porter and Steven F. Boll, Proc. Internatl. Conf.
Acoust., Speech, and Sign. Proc., San Diego, CA, Mar., 1984. <BR>
<BR>
The above noted paper describes how to find the optimal estimator for the
complex spectrum, the magnitude, the squared magnitude, the log magnitude
and the root magnitude spectra. According to this invention, the noisy distance
is replaced with its expected value. Hence, as will be explained, the system
and method require two forms of information: the expected values
of the parameters and their variances. The above noted article
describes the techniques for computing minimum mean square
error estimators and gives the optimal estimator necessary
for each criterion function. <BR>
<BR>
The optimal estimators are shown in FIG. 3. Each estimator is a function
of the signal to noise ratio. This is demonstrated by computing the tables
based upon signal-to-noise ratios of 0, 10, and 20 db. <BR>
<BR>
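Because the estimator tables are computed at signal-to-noise ratios of 0, 10 and 20 db, some rule is needed to pick a table for a given utterance. The sketch below computes the average SNR from the speech and noise powers and selects the nearest tabulated value. The nearest-table rule and the function names are assumptions; the text does not say how the patent selects or interpolates between tables. <BR>

```python
import math

# SNR values (in db) for which the text says tables were computed.
TABLE_SNRS_DB = (0.0, 10.0, 20.0)


def snr_db(speech_power, noise_power):
    """Average signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(speech_power / noise_power)


def nearest_table_snr(speech_power, noise_power):
    """Pick the tabulated SNR closest to the measured one (assumed selection rule)."""
    snr = snr_db(speech_power, noise_power)
    return min(TABLE_SNRS_DB, key=lambda table_snr: abs(table_snr - snr))


print(nearest_table_snr(100.0, 10.0))
```
<BR>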
FIG. 4 gives an example of the SNR dependence for the complex spectrum estimate
at a frequency of 1063 Hz. The article explains how the table values for
the optimal estimators are arrived at and how the tables were generated to
provide the data shown in FIGS. 2 and 3. Similar figures are employed
in the article, and, essentially, the computation of optimal estimators is
well known as explained in the article. <BR>
<BR>
Before employing such estimators, one has to determine the estimate of
the expected value which minimizes the mean square error and the estimate
of the variance of the estimator in terms of pre-computed mean value tables.
Thus the objective is to provide an unbiased estimator. The mathematics for
determining the optimal estimators is included in the Appendix
as Section II. The technique of determining the mean and variance of transformed
optimal parameters is also included in the Appendix as Section III.
<BR>
<BR>
Section IV of the Appendix describes the technique of combining speech and
noise in the bandpass filter domain. Thus the information regarding
the expected values of the parameters and their variances is given
in Sections II, III and IV of the Appendix, which describe
in mathematical detail how to calculate the parameters.
<BR>
<BR>
Essentially, in consideration of the above, the new metric which will be
explained is conceptually midway between the squared Euclidean and log likelihood
noise immune metrics. The new metric is motivated by noticing that squared
Euclidean metrics become noisy when the signal has additive noise.
<BR>
<BR>
This noise can be reduced by using the minimum mean squared error estimate
of the actual squared Euclidean distance between template and unknown,
conditioned by the noisy observations. This results in a squared Euclidean
distance with the template and unknown values replaced by their optimal
estimators derived in regard to the above noted Appendices. To this Euclidean
distance is added the variance of the template estimate plus the variance
of the unknown estimate. Mathematically, the minimum mean squared error estimate
of the metric is defined as the expected value of the clean distance between
unknown and template frames, conditioned on the noisy observations, according to
the following algorithm:
<BR>
<BR>
$d^2 = E[(t_s - u_s)^2 \mid t_{s+n}, u_{s+n}, P_s, P_n]$ <BR>
<BR>
where <BR>
<BR>
$t_s$ = clean template <BR>
<BR>
$u_s$ = clean unknown <BR>
<BR>
$t_{s+n}$ = noisy template <BR>
<BR>
$u_{s+n}$ = noisy unknown <BR>
<BR>
$P_s$ = Average Power of Speech <BR>
<BR>
$P_n$ = Average Power of Noise <BR>
<BR>
Assume that the clean template and unknown are given by: <BR>
<BR>
$t_s = t + \epsilon_t$ <BR>
<BR>
$u_s = u + \epsilon_u$ <BR>
<BR>
where: <BR>
<BR>
$E[t_s] = t$ <BR>
<BR>
$E[u_s] = u$ <BR>
<BR>
The quantities $t$ and $u$ are the expected values of the template and unknown.
$\epsilon_t$ and $\epsilon_u$ are the zero mean estimation errors
associated with the optimal estimator. Expressing the expected value in terms
of these values gives: <BR>
<BR>
$E[d^2(t,u)] = E[((t + \epsilon_t) - (u + \epsilon_u))^2]$
<BR>
<BR>
Expanding the expected value, and noting that: <BR>
<BR>
$E[\epsilon_t] = 0$ <BR>
<BR>
$E[\epsilon_u] = 0$ <BR>
<BR>
$E[\epsilon_t \epsilon_u] = 0$ <BR>
<BR>
$E[\epsilon_t^2] = \sigma_t^2$ <BR>
<BR>
$E[\epsilon_u^2] = \sigma_u^2$ <BR>
<BR>
where $\sigma_t^2$ and $\sigma_u^2$ are the variances of these estimates,
gives <BR>
<BR>
$d^2 = (t-u)^2 + \sigma_t^2 + \sigma_u^2$ <BR>
<BR>
Notice that this metric model reduces to a standard Euclidean norm in the
absence of noise. The metric model is also symmetric and can be applied when
either the template or unknown or both are noisy. <BR>
<BR>
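The closed form of the metric can be checked numerically. The sketch below enumerates a small, made-up error distribution (zero mean, independent, equally likely values) and compares the directly averaged squared distance with the expression of squared difference plus both variances. The function name noise_immune_distance and all numbers are hypothetical. <BR>

```python
from itertools import product


def noise_immune_distance(t_hat, u_hat, var_t, var_u):
    """Squared distance between the estimates plus both estimator variances."""
    return (t_hat - u_hat) ** 2 + var_t + var_u


# Hypothetical expected values and zero-mean, equally likely estimation errors.
t_hat, u_hat = 3.0, 1.5
errors_t = (-0.5, 0.5)   # variance 0.25
errors_u = (-1.0, 1.0)   # variance 1.0

# Direct expectation of the clean squared distance over all error combinations.
pairs = list(product(errors_t, errors_u))
expected = sum(((t_hat + et) - (u_hat + eu)) ** 2 for et, eu in pairs) / len(pairs)

closed_form = noise_immune_distance(t_hat, u_hat, 0.25, 1.0)
print(expected, closed_form)
```

Both computations agree because the cross terms vanish for zero-mean, uncorrelated errors. Setting both variances to zero recovers the ordinary squared Euclidean distance, as noted above for the noise-free case. <BR>
<BR>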
Values for these means and variances are obtained by table lookup. These
tables are generated using filterbank parameters as described in Appendix
IV. <BR>
<BR>
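One way such mean and variance tables could be built is sketched below: clean parameter values are grouped by the bin of the corresponding noisy value, and the mean and variance of each group are stored for later lookup. This is only a sketch under stated assumptions. The uniform binning, the function names build_tables and lookup, and the sample data are hypothetical; the actual table generation is described in Appendix IV, which is not reproduced in this text. <BR>

```python
def build_tables(noisy_values, clean_values, bin_width):
    """Return {bin index: (mean, variance)} of clean values grouped by noisy value."""
    groups = {}
    for noisy, clean in zip(noisy_values, clean_values):
        groups.setdefault(int(noisy // bin_width), []).append(clean)
    tables = {}
    for index, samples in groups.items():
        mean = sum(samples) / len(samples)
        variance = sum((s - mean) ** 2 for s in samples) / len(samples)
        tables[index] = (mean, variance)
    return tables


def lookup(tables, noisy, bin_width):
    """Fetch the (mean, variance) pair for the bin containing a noisy value."""
    return tables[int(noisy // bin_width)]


# Hypothetical training pairs: noisy observation and the matching clean value.
noisy_values = [1.2, 1.7, 1.4, 3.1, 3.8]
clean_values = [0.5, 1.5, 1.0, 3.0, 3.4]

tables = build_tables(noisy_values, clean_values, bin_width=1.0)
mean, variance = lookup(tables, 1.5, bin_width=1.0)
print(mean, variance)
```

The mean plays the role of the optimal estimate and the variance supplies the extra term added to the squared distance in the metric. <BR>
<BR>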
To establish that the metric was working properly, two types of experiments
were conducted. First, scatter plots of clean distances versus noisy filterbank
parameters were generated and superimposed with the Euclidean metrics using
noisy and optimal parameters and with the optimal parameters plus the variance
terms. Second, wordspotting runs with these parameters and metrics were made.
<BR>
<BR>
VERIFICATION OF THE EXPECTED VALUE <BR>
<BR>
In the same manner as shown in Appendix IV the validity of the noise metric
as a conditional expected value can be examined by plotting clean distances
versus noisy parameters. The distance requires a noisy unknown frame and
a clean or noisy template frame. In order to plot in just two dimensions,
the template frame was held constant and a set of distances were generated
for various unknown conditions and metrics. Three template frames, 0, 10,
and 50 were selected from the Boonsburo template of speaker 50 representing
the minimum, average and maximum spectral values. Distances and spectral
outputs from the ninth filter were selected as approximately representing
the average signal to noise ratio over the entire baseband. FIGS. 9 through
11 show the scatter data along with the noisy distance (straight parabola),
the Euclidean distance with optimal parameters, and the noise metric. <BR>
<BR>
For this single channel, single template frame configuration, there is little
difference between using just the optimal parameters and the parameters plus
the variance term. However, in each case the noise metric passes through the
mean of the clean distances given the noisy unknown parameter. The dark band
in each figure corresponds to distances where the clean speech was near zero,
resulting in a distance equal to the square of the template parameter. Since
the optimal parameter tables were trained on speech-only frames, the mean
distance is not biased by this non-speech concentration. Note that for large
values of the noise parameter, all three distances agree. This is to
be expected, since the mean has approached the identity and the variance
has approached zero (see FIGS. 9 and 10). <BR>
<BR>
REDUCTION IN MEAN SQUARE ERROR <BR>
<BR>
The mean square error for each of these cases was also computed. The error
was calculated as: ##EQU1## As expected, the error decreased monotonically going
from the noisy parameters, to the optimal parameters, to the noise metric. Below is the computed
mean square error between the clean distance and the distances computed with
each of the following parameters: noisy, optimal estimator, and optimal estimator
plus variance, i.e., noise metric. The distance is straight Euclidean, i.e.
the sum of the squared differences between the unknown spectral values and the template
spectral values. These distances, for the mean square error calculation, were
computed by selecting the 10th frame from the Boonsburo template for speaker
50 and dragging it by 1100 speech frames from the first section of WIJA.
The average mean square error values are: <BR>
<BR>
<PRE>
TABLE 5.1
______________________________________
Average Mean Square Error Values
Condition                          mse
______________________________________
noisy - clean                      9.4
optimal parameters - clean         3.3
noise metric - clean               2.5
______________________________________
</PRE>
<P>
<BR>
<BR>
Although this represents only a coarse examination of performance, it does
demonstrate that the metric is performing as desired. A more realistic test
requires examining its performance in a wordspotting experiment as defined
below. <BR>
<BR>
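The error formula itself is not preserved in this text (it appears only as the placeholder ##EQU1##). Assuming it is the ordinary average of squared differences between the clean distance and the distance obtained under each condition, the comparison could be computed as sketched below. The function name and every number in the example are made up and are not the patent's data. <BR>

```python
def mean_square_error(clean_distances, estimated_distances):
    """Average squared difference between clean and estimated distances (assumed form)."""
    pairs = list(zip(clean_distances, estimated_distances))
    return sum((c - e) ** 2 for c, e in pairs) / len(pairs)


# Hypothetical distances for three frames.
clean = [1.0, 4.0, 9.0]
from_noisy_parameters = [3.0, 5.0, 12.0]
from_noise_metric = [1.5, 4.5, 8.0]

mse_noisy = mean_square_error(clean, from_noisy_parameters)
mse_metric = mean_square_error(clean, from_noise_metric)
print(mse_noisy, mse_metric)
```
<BR>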
WORDSPOTTING USING UNNORMALIZED PARAMETERS <BR>
<BR>
The wordspotter was modified to use unnormalized 4th root parameters and
Euclidean distance with or without the variance terms added. All other aspects
of the wordspotting program remained the same, i.e. standard blind deconvolution,
overlap removal, biasing, etc. Results are presented using the same scoring
procedure as described in App. III. The table shows the average ROC curve
differences for each template talker. <BR>
<BR>
<PRE>
________________________________________________________________
Wordspotting Results Using Unnormalized Parameters
Condition               50    51  joco  gara  chwa  caol     ave
________________________________________________________________
clean                  -19   -19    -7   -15   -10   -12   -13.6
noisy                  -25   -27   -21   -21   -21   -26   -13.3
Optimal Params Only    -20   -23    -8   -14   -22   -12   -16.6
Noise Metric           -20   -22    -9   -16   -22   -12   -16.7
________________________________________________________________
</PRE>
<P>
<BR>
<BR>
Although overall performance using unnormalized parameters is lower than
using normalized features, these experiments show some interesting
characteristics. Specifically, for five of the six template talkers, use
of the optimal parameters and/or the noise metric returned performance to
levels nearly equal to the clean unknown data. This degree of restoration
is not found in the normalized case. Stated another way, normalization tends
to minimize the deleterious effect of noise and the restoring effect of the
optimal parameters. <BR>
<BR>
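The row averages in the table above can be recomputed from the per-talker entries as a quick consistency check. The sketch assumes the "ave" column is a plain arithmetic mean over the six template talkers, which the text does not state. Small differences are to be expected if the per-talker entries were rounded before printing. The noisy row, however, recomputes to about -23.5 rather than the printed -13.3, which may indicate a transcription error in that entry. <BR>

```python
# Per-talker ROC curve differences copied from the table above
# (talker order: 50, 51, joco, gara, chwa, caol).
results = {
    "clean": [-19, -19, -7, -15, -10, -12],
    "noisy": [-25, -27, -21, -21, -21, -26],
    "optimal params only": [-20, -23, -8, -14, -22, -12],
    "noise metric": [-20, -22, -9, -16, -22, -12],
}

# Assumed definition of the "ave" column: arithmetic mean over talkers.
averages = {name: sum(row) / len(row) for name, row in results.items()}

for name, value in averages.items():
    print(f"{name:20s} {value:7.2f}")
```
<BR>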
NOISE METRIC USING NORMALIZED PARAMETERS <BR>
<BR>
In a preliminary development of the noise metric, the analysis used first
order terms in the power series expansion of the reciprocal square root.
Use of only first order terms leads to results which differ slightly from
the results when second order terms are included. The development with second
order terms is given below. Wordspotting performance is presented using the
corrected formulation. <BR>
<BR>
BACKGROUND <BR>
<BR>
Let $\tilde{x}$, $x$, $\hat{x}$ represent noisy, noise-free and estimated noise-free parameter
vectors, and let primes denote $l_2$ normalization: ##EQU2## <BR>
<BR>
The (unnormalized) estimator error is <BR>
<BR>
$\epsilon = \hat{x} - x$ <BR>
<BR>
We define $\eta$ to be the error in estimating the normalized noise-free
parameters by normalizing the (unnormalized) estimator, ##EQU3##
<BR>
<BR>
$\eta$ can be expressed to first order in $\epsilon$ as ##EQU4## <BR>
<BR>
The previous analysis proceeded to use this first order approximation as
a basis for computing the effect of second order statistics of $\epsilon$,
in the form of variances of the components of $\epsilon$. This leads to the
conclusion, for example, that the expected value of $\eta$ is zero since the expected
value of $\epsilon$ is <BR>
<BR>
$E(\epsilon) = E(\hat{x} - x \mid \tilde{x})$ <BR>
<BR>
which is zero when we ignore cross channel effects. This treatment is
inconsistent and leads to error, as second order effects are ignored in some
places and used in others. The previous analysis can be corrected by
carrying all second order terms in $\epsilon$ forward. This leads, among other
things, to the result that the expectation of $\eta$ is not zero. <BR>
<BR>
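The statement that the expectation of the normalized error is not zero, even when the unnormalized error has zero mean, can be illustrated with a small exact enumeration. The sketch assumes the sign conventions that the unnormalized error is the estimate minus the noise-free vector and that the normalized error is the normalized estimate minus the normalized noise-free vector; the defining equations are not preserved in this text, so these conventions, the vector values and the error distribution are all assumptions. <BR>

```python
from itertools import product
import math


def normalize(vector):
    """Scale a vector to unit l2 norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


# Hypothetical estimator output and zero-mean, equally likely estimation errors.
x_hat = [3.0, 4.0]
offsets = list(product((-0.5, 0.5), repeat=2))

# Mean of the unnormalized error: exactly zero by construction.
mean_eps = [sum(e[i] for e in offsets) / len(offsets) for i in range(2)]

# Normalized error for every equally likely noise-free vector x = x_hat - eps.
etas = []
for eps in offsets:
    x = [xh - e for xh, e in zip(x_hat, eps)]
    etas.append([a - b for a, b in zip(normalize(x_hat), normalize(x))])

mean_eta = [sum(eta[i] for eta in etas) / len(etas) for i in range(2)]
print(mean_eps, mean_eta)
```

The mean of the normalized error comes out small but nonzero because normalization is a non-linear operation, so second order terms in the error do not cancel on average. <BR>
<BR>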
The analysis is now repeated by carrying forward all second order terms in
$\epsilon$. Other than this change, the development is little different from
the previous one. When the expectation of the noise-free $l_2$-normalized
distance given the noisy observations has been expressed to second and higher
order in $\epsilon$, we will then assume third moments vanish and then ignore
cross-channel covariances. <BR>
<BR>
We start from the definition of the noise-free distance given the noisy
observations: ##EQU5## <BR>
<BR>
To simplify notation, we drop the notation specifying the noisy observation
conditioning. Then, using the fact that we are dealing with unit vectors,
the expected value can be expressed in terms of dot products as: ##EQU6##
where $\eta$ is defined above. Expanding the dot products gives: ##EQU7##
<BR>
<BR>
The product term in $\eta_u$ and $\eta_t$ is an interesting problem.
For the most part, the error term $\eta$ will be the result of noise, and
noise at the template recording and at the unknown recording are very reasonably
assumed to be uncorrelated, so approximately, <BR>
<BR>
$E(\eta_u \eta_t) = E(\eta_u) E(\eta_t)$ <BR>
<BR>
But this is not quite correct, as the expectation is over speech and noise.
Correlations between $\eta_u$ and $\eta_t$ can therefore arise due
to the speech aspect of the expectation (and, in fact, it can be expected
to differ in match and no-match conditions). Fortunately, in the present
analysis, where we're willing to have templates treated as noise free,
$\eta_t \equiv 0$ and the problem doesn't arise. <BR>
<BR>
We continue by computing the expectation of a clean normalized parameter
vector. Since the treatment applies to both template and unknown, we don't
distinguish between them. ##EQU8## <BR>
<BR>
Substituting these in the expression for the expectation of x gives ##EQU9##
<BR>
<BR>
This expression for the expectation of the noise-free normalized parameter
vector is true for any estimator x which is a function of the noisy observations.
It is complete in the second order of residual error of the estimator, hence
is an adequate model for computing the effect of second order statistics.
We now specialize to the optimal estimator we have been using. (Notice we
have made no simplifying assumptions yet.) <BR>
<BR>
First some subtler points. From the definition of .eta., we have ##EQU10##
<BR>
<BR>
Since it is not known whether a distance calculation is for a match or
no-match condition, correlations which exist between the template and the error
in the unknown cannot be used. It is therefore reasonable to make:
<BR>
<BR>
Assumption 1a: <BR>
<BR>
E(.epsilon..sub.u .vertline.x.sub.u,x.sub.t)=E(.epsilon..sub.u .vertline.x.sub.u)
<BR>
<BR>
Assumption 1b: <BR>
<BR>
cov(.epsilon..sub.u .vertline.x.sub.u,x.sub.t)=cov(.epsilon..sub.u .vertline.x.sub.u)
<BR>
<BR>
Since the expectation of a vector is a vector of expectations, ##EQU11##
and each component can be expressed <BR>
<BR>
E(x.sub.u,i -x.sub.u,i .vertline.x.sub.u)=x.sub.u,i -E(x.sub.u,i
.vertline.x.sub.u) <BR>
<BR>
where <BR>
<BR>
E(x.sub.u,i .vertline.x.sub.u)=E(x.sub.u,i .vertline.x.sub.u,1, . . . ,
x.sub.u,n) <BR>
<BR>
Our optimal estimators are derived independently for each channel; that is,
<BR>
<BR>
x.sub.u,i .ident.E(x.sub.u,i .vertline.x.sub.u,i) <BR>
<BR>
In doing this, we ignore inter-channel dependencies. Thus we make
<BR>
<BR>
Assumption (2a) For any i <BR>
<BR>
E(x.sub.u,i .vertline.x.sub.u)=E(x.sub.u,i .vertline.x.sub.u,i)
<BR>
<BR>
Assumption (2b) For any i <BR>
<BR>
var(x.sub.u,i .vertline.x.sub.u)=var(x.sub.u,i .vertline.x.sub.u,i)
<BR>
<BR>
The effect of assumptions (1a) and (2a) is to make ##EQU12## <BR>
<BR>
Next we make the assumptions needed to compute the estimator residual
error statistics. We have the within-channel variances, but don't want to
deal with the multitude of cross-channel covariances or higher moments. So
we make <BR>
<BR>
Assumption (3a) ##EQU13## <BR>
<BR>
Assumption (3b) Higher order moments of .epsilon..sub.u vanish, i.e.
<BR>
<BR>
E(O.sub.3 (.epsilon..sub.u))=0 <BR>
<BR>
Under these conditions, the expectation of the i.sup.th component of the
normalized parameter vector is ##EQU14## <BR>
<BR>
To find the noise immune metric, we first need the expectation of .eta..
##EQU15## and using the expression above, we find the components of this
expectation are given by ##EQU16## and similarly for the template vector,
when assumptions 1 through 3 are extended to it. <BR>
<BR>
As shown in the first part of this section, ##EQU17## <BR>
<BR>
To estimate it using the previous results, we formalize previous remarks
with Assumption 4: ##EQU18## and .gamma. and .beta. are defined similarly
for the template. <BR>
<BR>
In the wordspotting case, we generally assume the template is noise free,
so the .beta. and .gamma. terms for the template vanish. In that case the
result simplifies to ##EQU19## <BR>
<BR>
WORDSPOTTING RESULTS <BR>
<BR>
Wordspotting runs were made with and without the corrected metric on 10 dB
noisy speech. The table lists the results of the wordspotting experiments.
The standard scoring approach is given. That is, for each condition and each
template talker, the number represents the average amount that the ROC curve
differs from a selected baseline consisting of speaker 50 on clean unknown
speech. <BR>
<BR>
The legend for the table is as follows: <BR>
<BR>
base: clean templates vs. clean unknowns <BR>
<BR>
base.sub.-- noisy: clean templates vs. noisy unknowns at 10 dB SNR
<BR>
<BR>
base.sub.-- opt: clean templates vs. optimally restored unknowns. <BR>
<BR>
<PRE>
TABLE
______________________________________
Wordspotting Performance
Condition     50    51  joco  gara  chwa  caol   ave
______________________________________
base           3     7     9    15     3    17   9.0
base noisy   -12   -18    -2    -3   -10    -6  -8.5
base opt     -10   -12     2    -2   -17     1  -6.3
______________________________________
</PRE>
<P>
<BR>
<BR>
Referring to FIG. 18, there is shown a simple block diagram of a speech
recognizer apparatus which can be employed in this invention. Essentially,
the speech recognizer includes an input microphone 104 which microphone has
its output coupled to a preamplifier 106. The output of the preamplifier
is coupled to a bank of bandpass filters 108. The bank of bandpass filters
is coupled to a microprocessor 110. The function of the microprocessor is
to process the digital inputs from the bandpass filter bank in accordance
with the noise immune distance metric described above. <BR>
<BR>
Also shown in FIG. 18 are an operator's interface 111, a non-volatile mass
storage device 114 and a speech synthesizer 116. Examples of such apparatus
are well known in the field. See for example, a patent application entitled
APPARATUS AND METHOD FOR AUTOMATIC SPEECH RECOGNITION, Ser. No. 473,422,
filed on Mar. 9, 1983, now U.S. Pat. No. 4,624,008 for G. Vensko et al and
assigned to the assignee herein. <BR>
<BR>
As indicated above, the algorithm or metric which has been described is suitable
for operation with any type of speech recognizer system, and hence the structures
of such systems are not pertinent as the use of the above described algorithm
will enhance system operation. In any event, as indicated above, such speech
recognition systems operate to compare sound patterns with stored templates.
A template, which is also well known, is a plurality of previously created
processed frames of parametric values representing a word; taken together,
the templates form the reference vocabulary of the speech recognizer. Such templates
are normally compared in accordance with predetermined algorithms such as
the dynamic programming algorithm (DPA) described in an article by F. Itakura
entitled MINIMUM PREDICTION RESIDUAL PRINCIPLE APPLIED TO SPEECH RECOGNITION,
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23,
pages 67-72, Feb. 1975. <BR>
<BR>
The algorithm allows one to find the best time alignment path or match between
a given template and a spoken word. Hence as should be apparent from FIG.
18, modern speech recognition systems employ templates and incorporate such
templates in computer memory for making a comparison of the templates with
the digital signals indicative of unknown speech sounds which signals are
developed in the bandpass filter bank. The techniques for generating the
digital signal from unknown speech signals have been extensively described
in regard to the above noted co-pending application. <BR>
<BR>
See also a co-pending application entitled A DATA PROCESSING APPARATUS AND
METHOD FOR USE IN SPEECH RECOGNITION, filed on Mar. 9, 1983, Ser. No. 439,018,
now U.S. Pat. No. 4,567,606 by G. Vensko et al and assigned to the assignee
herein. This co-pending application describes a continuous speech recognition
apparatus which also extensively employs the use of templates. In any event,
as can be understood from the above, this metric compensates for noise by
replacing the noisy distance with its expected value. Hence a speech recognizer
operates to measure the similarity between segments of unknown and template
speech by computing, based on an algorithm, the Euclidean distance between
respective segment parameters. The addition of noise to either the unknown
speech or the template speech or both causes this distance to become either
too large or too small. Hence based on the algorithm of this invention, the
problem is solved by replacing the noisy distance with its expected value.
In order to do so, as explained above, there are two forms of information
required. The first is the expected values of the parameters and the second
is their variance. <BR>
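As a minimal sketch of why both quantities are needed, assume (beyond what is stated at this point in the specification) unnormalized parameters, independent channels, and independent estimation errors in the template and the unknown. Under those assumptions the expected squared Euclidean distance is the squared distance between the expected values plus the summed variances. The function name and the sample values are hypothetical. <BR>
<BR>

```python
import numpy as np

def expected_squared_distance(mean_u, var_u, mean_t, var_t=None):
    """Expected squared distance from per-channel means and variances.

    Assumes independent channels and independent errors in u and t, so that
    E[(u_i - t_i)^2] = (E[u_i] - E[t_i])^2 + var(u_i) + var(t_i).
    A noise-free template is modelled by var_t = None (zero variance).
    """
    mean_u = np.asarray(mean_u, dtype=float)
    mean_t = np.asarray(mean_t, dtype=float)
    var_u = np.asarray(var_u, dtype=float)
    if var_t is None:
        var_t = np.zeros_like(mean_t)
    else:
        var_t = np.asarray(var_t, dtype=float)
    return float(np.sum((mean_u - mean_t) ** 2 + var_u + var_t))

# Hypothetical two-channel frame compared against a noise-free template.
d = expected_squared_distance([1.0, 2.0], [0.1, 0.2], [1.5, 1.0])
print(d)
```

The variance terms are what distinguish this from simply computing the distance between restored parameters. <BR>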
<BR>
Thus based on the above description, as further supplemented by Appendices
II, III and IV, there is described the necessary calculations to enable one
to calculate the required parameters while the specification teaches one
how to combine the parameters to form the noise immune metric. As indicated,
the processing can be implemented by the system shown in FIG. 18 by storing
both the parameters and their variances in either memory 114 or in the
microprocessor memory 110. <BR>
<BR>
In accordance with the invention, a method of compensating for noisy input
speech in order to improve the recognition result of the speech recognition
apparatus comprises the following steps for producing an improved minimum
mean square error estimate conditioned by compensatory characteristics of
the noisy input speech: <BR>
<BR>
(a) computing optimal estimated distance values over the given range of
frequencies for noise-free template speech, based upon comparing known speech
segments, which are input in a noise-free environment and converted into
corresponding templates of known speech signals t.sub.s, with unknown speech
segments, which are input in a noise-free environment and converted to unknown
speech signals u.sub.s ; <BR>
<BR>
(b) computing estimated variance values corresponding to the optimal estimated
distance values for a sample population of noise-free speech segments;
<BR>
<BR>
(c) storing said optimal estimated distance values and variance values on
a look-up table associated with the template speech; <BR>
<BR>
(d) computing squared distance values over the given range of frequencies
for input noisy unknown speech signals u.sub.s+n compared with signals t.sub.s+n
representing template speech to which a spectral representation of noise
n in the actual input environment is added; <BR>
<BR>
(e) replacing the computed squared distance values for the unknown speech
signals with conditional expected distance values calculated using the optimal
estimated distance values and variance values obtained from the look-up table,
in order to derive noise-immune metric values for the unknown speech signals;
and <BR>
<BR>
(f) computing the minimum mean square error of the noise-immune metric values
for the unknown speech signals compared with the noise-free template speech
signals, whereby an improved recognition result is obtained. <BR>
<BR>
In regard to the above, the implementation of the noise immune distance metric
is mathematically explained in Appendix V. Appendix V describes how the metric
or algorithm can be incorporated into an existing metric which is widely known
in the field as the Olano metric. As indicated, noise immunity is obtained
in this system by replacing the Euclidean square distance between the template
and unknown frames of speech by the conditional expectation of the square
distance without noise, given the noisy observations. As can be seen from
Appendix V, the conditional expected value is the minimum mean square error
estimate of the distance. <BR>
<BR>
It, therefore, will reduce the noise on the frame-to-frame distance values
to its minimum possible value for given data. In order to implement the use
of the above described system, the noise metrics can be installed in any
system by substituting the optimal parameter values as derived and as explained
and by augmenting the feature vector with the variance information. Thus
for each signal frame which, as indicated above, is implemented in a voice
recognition system by means of the bandpass filter outputs after they have
been digitized, one performs the following steps: <BR>
<BR>
1. Replace (by table lookup) the noisy estimate with optimal estimate.
<BR>
<BR>
2. Obtain the variance, (also by table lookup). <BR>
<BR>
3. Normalize the filterbank parameters. <BR>
<BR>
4. Normalize the variance to account for parameter normalization. <BR>
<BR>
5. Augment the feature vectors with the variance information. <BR>
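<BR>
The five steps above can be sketched as follows. The lookup tables here are hypothetical placeholders, linear interpolation between table entries is an assumption, and scaling the variance by the squared norm in step 4 is a first-order simplification chosen for illustration rather than the normalization derived in Appendix V. <BR>
<BR>

```python
import numpy as np

def process_frame(noisy, grid, mean_table, var_table):
    """Apply steps 1 to 5 to one frame of filterbank parameters."""
    noisy = np.asarray(noisy, dtype=float)
    est = np.interp(noisy, grid, mean_table)   # step 1: optimal estimate by lookup
    var = np.interp(noisy, grid, var_table)    # step 2: variance by lookup
    norm = np.linalg.norm(est)
    unit = est / norm                          # step 3: normalize the parameters
    var_n = var / norm ** 2                    # step 4: rescale variance (assumed form)
    return np.concatenate([unit, var_n])       # step 5: augment the feature vector

# Hypothetical tables on a grid of noisy input values.
grid = np.linspace(0.0, 10.0, 51)
mean_table = np.maximum(grid - 1.0, 0.1)
var_table = 1.0 / (1.0 + grid)
features = process_frame([2.0, 4.0, 6.0], grid, mean_table, var_table)
```

The returned vector holds the normalized parameters in its first half and the matching variances in its second half, so a modified metric can read both from one feature vector. <BR>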
<BR>
The mathematics, as indicated, are explained in great detail in Appendix
V and particularly show how to modify the Olano metric. <BR>
<BR>
The following Appendices are included herein and are referred to during the
course of the specification to establish the mathematics used in accordance
with this invention: <BR>
<BR>
1. Appendix II--OPTIMAL ESTIMATORS FOR RESTORATION OF NOISY DFT SPECTRA.
<BR>
<BR>
2. Appendix III--MEAN AND VARIANCE OF TRANSFORMED OPTIMAL PARAMETERS.
<BR>
<BR>
3. Appendix IV--COMBINING SPEECH AND NOISE IN THE BANDPASS FILTER DOMAIN.
<BR>
<BR>
4. Appendix V--UNNORMALIZED NOISE METRIC STUDIES. <BR>
<BR>
APPENDIX II <BR>
<BR>
OPTIMAL ESTIMATORS FOR RESTORATION OF NOISY DFT SPECTRA <BR>
<BR>
This Appendix considers processes which optimally restore the corrupted spectrum,
x=s+n, to a spectrum which minimizes the expected value of the norm squared
error between a function of the clean speech, f (s), and the same function
of the estimate, f (s), using only the noisy spectrum x and the average noise
energy at each frequency, P.sub.N. The restoration is done for each frequency
individually, and any correlation which might exist between spectral values
at different frequencies is ignored. The functions, f , to be considered
include: ##EQU20## <BR>
<BR>
These compression functions are commonly used in both speech recognition
and speech compression applications. Having optimal estimators for each case
allows the estimation to be matched to the type of compression used prior
to the distance calculation. That is, if the recognizer matches cepstral
parameters, then the appropriate function to select would be log, etc. The
power function was estimated to measure performance differences with spectral
subtraction techniques based on the power function. Each of these minimizations
is described below. <BR>
<BR>
ESTIMATING THE MAGNITUDE SPECTRUM <BR>
<BR>
Many speech recognition algorithms are sensitive only to spectral magnitude
information. Human perception is also generally more sensitive to signal
amplitude than to phase. A speech enhancement system used to restore speech
for human listeners, or as a preprocessor for an automatic speech recognition
device, might therefore be expected to perform better if it is designed to
restore the spectral magnitude or power, ignoring phase. In this case,
appropriate optimization criterion functions, f , are:
<BR>
<BR>
f(s)=.vertline.s.vertline., <BR>
<BR>
or <BR>
<BR>
f(s)=.vertline.s.vertline..sup.2, <BR>
<BR>
and the optimal restoration function will minimize the ensemble average of
the error quantity
<BR>
<BR>
E[(.vertline.s.vertline.-.vertline.s.vertline.).sup.2 .vertline.x,P.sub.N]
<BR>
<BR>
or <BR>
<BR>
E[(.vertline.s.vertline..sup.2 -.vertline.s.vertline..sup.2).sup.2
.vertline.x,P.sub.N ]. <BR>
<BR>
ESTIMATING THE COMPRESSED MAGNITUDE SPECTRUM <BR>
<BR>
Studies of audition suggest that there is an effective compression active
in some perceptual phenomena (especially the sensation of loudness). Some
speech recognition devices also incorporate compression in the feature extraction
process. This suggests the criterion function:
<BR>
<BR>
f(s)=c(.vertline.s.vertline.) <BR>
<BR>
where c is a compression function. In this case, the optimal restoration
function will minimize the ensemble average of the error quantity
<BR>
<BR>
E[(c(.vertline.s.vertline.)-c(.vertline.s.vertline.)) .sup.2 .vertline.x,P.sub.N
] <BR>
<BR>
We shall consider two compression functions: the logarithm and the square
root. <BR>
<BR>
Note that since the cepstrum is the Fourier Transform of the logarithm of
the magnitude spectrum, and the Fourier Transform is a linear process,
minimization of the mean square error in the cepstrum is obtained when the
optimality criterion is f(s)=log .vertline.s.vertline.. <BR>
<BR>
ESTIMATING THE COMPLEX SPECTRUM <BR>
<BR>
Adopting the identity function for f , leads to a complex spectrum estimator
which minimizes the error quantity <BR>
<BR>
E[.vertline.s-s.vertline..sup.2 .vertline.x,P.sub.N ]. <BR>
<BR>
RELATION TO WIENER FILTER <BR>
<BR>
By integrating over all time, the Wiener filter minimizes the mean square
error of a time waveform estimate, subject to the constraint that the estimate
is a linear function of the observed values. In the time domain the Wiener
filtering operation can be represented as a convolution, and in the frequency
domain as multiplication by the filter gain function. At a single frequency
the Wiener filter spectrum estimate is therefore a constant times the corrupted
spectrum value, x, i.e., a linear function of the spectral magnitude. If speech
spectral values, s, had a Gaussian distribution, then the spectrum estimator
which minimizes the error quantity mentioned above would also be linear,
i.e., a constant times x. However, the distribution of speech differs greatly
from a Gaussian distribution, and the true optimal estimator function is
highly non-linear. FIG. 1 shows the cumulative distribution of real speech
spectral magnitudes and the cumulative distribution of the spectral magnitude
of a complex Gaussian time signal of equal power. The speech distribution
was obtained using a 1000 frame subset from the 27,000 magnitude frames used
to compute the estimators described in the implementation section. The Gaussian
signal was generated by averaging 20 uniformly distributed random numbers
of equal energy. The optimal linear estimator, corresponding to a Wiener
filter, is shown with the non-linear estimator averaged over all frequencies
and the mapping for Spectral Subtraction in FIG. 2. <BR>
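<BR>
A minimal sketch of the Gaussian reference signal described above: averaging 20 uniform random numbers gives an approximately Gaussian value by the central limit theorem. Building the complex signal from two such values and rescaling to a target power are assumptions made for this illustration, as are the function names. <BR>
<BR>

```python
import numpy as np

def approx_gaussian(rng, n_samples, n_terms=20):
    """Average n_terms zero-mean uniform numbers per output sample."""
    u = rng.uniform(-0.5, 0.5, size=(n_samples, n_terms))
    return u.mean(axis=1)

def approx_complex_gaussian(rng, n_samples, power=1.0):
    """Complex signal with approximately Gaussian parts, scaled to a given power."""
    z = approx_gaussian(rng, n_samples) + 1j * approx_gaussian(rng, n_samples)
    return z * np.sqrt(power / np.mean(np.abs(z) ** 2))

rng = np.random.default_rng(0)
z = approx_complex_gaussian(rng, 100_000, power=2.0)
magnitudes = np.sort(np.abs(z))  # sorted magnitudes give the cumulative distribution
```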
<BR>
MINIMUM MEAN SQUARE ERROR ESTIMATORS <BR>
<BR>
The minimum mean square error estimate of a function of the short-term speech
spectral value is the a posteriori conditional mean of that function given
the speech and noise statistics and the noisy spectral value. This estimate
can be calculated as follows. Let f represent the function of the spectrum
to be estimated. Let s and x be clean and noisy complex spectral values,
respectively, and n the complex noise. Let f (s) denote the optimal estimator
of the function f (s) and let E{.}.sub.p denote expectation with respect
to the probability distribution p. Then the minimum mean square estimate
is given by: <BR>
<BR>
f(s)=E{f(s)}.sub.s.vertline.x =.intg.f(s)p(s.vertline.x)ds. <BR>
<BR>
When speech and noise are independent,
<BR>
<BR>
p(x.vertline.s)=p(s+n.vertline.s)=p.sub.n (n)=p.sub.n (x-s), <BR>
<BR>
where p.sub.n is the a priori noise density function. Thus the density of
the joint distribution of clean and noisy spectral values is given by:
<BR>
<BR>
p(s,x)=p(x.vertline.s)p.sub.s (s)=p.sub.n (x-s)p.sub.s (s), <BR>
<BR>
where p.sub.s is the a priori speech probability density. Substituting gives
##EQU21## <BR>
<BR>
Thus, the optimal estimator, f (s), equals the ratio of expected values of
two random variables with respect to the distribution of clean speech spectra.
<BR>
<BR>
SPECIALIZATION TO A GAUSSIAN NOISE MODEL <BR>
<BR>
Assume that the noise has a zero mean, is uniform in phase, and has a Gaussian
distribution with power P.sub.N. Then the noise density function is:
<BR>
<BR>
p.sub.n (n)=.gamma.exp(-.vertline.n.vertline..sup.2 /P.sub.N) <BR>
<BR>
where .gamma. is a normalization factor. <BR>
<BR>
Substituting x-s for n in the expression for the optimal estimator gives:
##EQU22## Clean speech spectral values are observed to be uniformly distributed
in phase so p.sub.s (s) depends only on .vertline.s.vertline.. The density,
p.sub..vertline.s.vertline., of .vertline.s.vertline. on the positive real
line is then related to p.sub.s, the density of s, in the complex plane by:
##EQU23## The integrals in the expression for the optimal estimator are evaluated
in the complex plane using polar coordinates. Using the fact that ##EQU24##
where I.sub.n is the nth order modified Bessel function, the integrals can
be reduced to the real line. <BR>
<BR>
Two cases are considered, f(s)=s and f(s)=c(.vertline.s.vertline.), where
c is a compression function to be specified. <BR>
<BR>
In the first case the estimator reduces to: ##EQU25## where .psi. is the
phase of the corrupted spectral value, x. This shows that the phase of the
best estimate of the complex spectral value is the noisy phase. <BR>
<BR>
In the second case the estimator reduces to: ##EQU26## <BR>
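<BR>
The expressions marked ##EQU24## through ##EQU26## are not reproduced in this text. As a hedged sketch of the reduction they describe, assume the Gaussian noise density given above, write x and s in polar form, and use the standard integral representation of the modified Bessel function. The symbols r and .theta. are introduced here for illustration only, and the forms below are derived from the stated model rather than transcribed from the patent. <BR>
<BR>

```latex
% Polar forms: x = |x| e^{j\psi}, \quad s = r e^{j\theta}, \quad r = |s|
|x - s|^2 = |x|^2 + r^2 - 2\,|x|\,r\cos(\theta - \psi)

% Integral representation of the n-th order modified Bessel function
I_n(z) = \frac{1}{2\pi}\int_0^{2\pi} e^{z\cos\theta}\cos(n\theta)\,d\theta

% First case, f(s) = s: the estimate carries the noisy phase \psi
E\{s \mid x\} = e^{j\psi}\,
  \frac{\int_0^\infty r\, e^{-r^2/P_N}\, I_1\!\left(2|x|r/P_N\right) p_{|s|}(r)\,dr}
       {\int_0^\infty e^{-r^2/P_N}\, I_0\!\left(2|x|r/P_N\right) p_{|s|}(r)\,dr}

% Second case, f(s) = c(|s|)
E\{c(|s|) \mid x\} =
  \frac{\int_0^\infty c(r)\, e^{-r^2/P_N}\, I_0\!\left(2|x|r/P_N\right) p_{|s|}(r)\,dr}
       {\int_0^\infty e^{-r^2/P_N}\, I_0\!\left(2|x|r/P_N\right) p_{|s|}(r)\,dr}
```

The normalization factor .gamma. and the factor exp(-.vertline.x.vertline..sup.2 /P.sub.N) appear in both numerator and denominator and cancel. <BR>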
<BR>
EVALUATION USING A LARGE SAMPLE OF SPEECH <BR>
<BR>
The estimates given above can be evaluated by interpreting them as ratios
of expectations with respect to the distribution of .vertline.s.vertline.
on the real line. Each integral in the expressions above is an expected value
with respect to the distribution of .vertline.s.vertline., as characterized
by its density, p.sub..vertline.s.vertline.. These expected values are functions
of .vertline.s.vertline., .vertline.x.vertline. and P.sub.N. They can be
conveniently approximated as average values of the given functions summed
over a large sample of clean speech. <BR>
<BR>
Using the ratio of sample averages to approximate the optimal estimator
has the significant practical advantage that the a priori distribution of
.vertline.s.vertline. need not be known or approximated. In view of the
significant error introduced by the fairly common erroneous assumption that
speech spectral values have a Gaussian distribution, this distribution-free
approach to finding the optimal estimator is particularly attractive. From
a theoretical point of view, the ratio of sample averages can be defended
as giving a consistent estimate of the optimal estimator. Although it is
a biased estimate, the bias can, in practice, be made negligible by using
a large sample. For this study, 27,000 samples of spectral magnitude were
taken from the marked speech of the six males and two females in the X data
base. <BR>
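<BR>
The ratio of sample averages can be sketched directly on complex samples, without reducing to the real line first: each clean sample is weighted by the Gaussian noise likelihood of the observed value. The function name is hypothetical, and the Gaussian "speech" sample below is only a sanity check, chosen because in that case the optimal estimator is known to be the linear Wiener gain; real speech is stated above to differ greatly from Gaussian. <BR>
<BR>

```python
import numpy as np

def mmse_estimate(f, x, clean_samples, noise_power):
    """Approximate E[f(s) | x] by a ratio of sample averages over clean spectra."""
    s = np.asarray(clean_samples)
    # Gaussian noise likelihood p_n(x - s); the constant factor cancels in the ratio.
    w = np.exp(-np.abs(x - s) ** 2 / noise_power)
    return np.sum(f(s) * w) / np.sum(w)

rng = np.random.default_rng(0)
n = 200_000
P_S, P_N = 1.0, 1.0
# Stand-in clean sample: complex Gaussian with average power P_S.
clean = np.sqrt(P_S / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
x = 1.0 + 0.0j
s_hat = mmse_estimate(lambda s: s, x, clean, P_N)
wiener = x * P_S / (P_S + P_N)  # linear estimate expected for Gaussian speech
```

Replacing `clean` with spectral values taken from real speech changes the estimate without any change to the code, which is the distribution-free property noted above. <BR>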
<BR>
Of course, an optimal estimator obtained in this way is optimal with respect
to the distribution of .vertline.s.vertline. in the population of speech
from which the sample is taken. We have observed the distribution of
conversational speech spectral magnitude to be stable and reproducible when
averaged over twenty seconds or more, and normalized with respect to the
rms value after removal of silence. To make this normalization explicit,
with respect to speech power, we introduce the normalized spectral magnitude:
##EQU27## where P.sub.S is the average speech power in the sample S of speech.
<BR>
<BR>
TABLE GENERATION <BR>
<BR>
The expressions given above for the optimal estimators can be expressed as
tables in terms of the speech-to-noise ratio SNR=P.sub.S /P.sub.N, the noise
power, P.sub.N, and the distribution of the dimensionless clean speech spectral
magnitude, .sigma.. For restoration of speech it is convenient to implement
an optimal estimator in the form of tables which gives the spectral component
magnitude estimate as a function of the noisy spectral component magnitude,
.vertline.x.vertline., using a different table for each SNR value of interest.
It has been found useful to normalize the table input and output by
.sqroot.P.sub.N, since the tables are then only weakly dependent on SNR.
Accordingly, we introduce the dimensionless input quantity: ##EQU28## Tables,
t(.xi.,SNR), for estimating the complex spectrum, are then computed using
the expressions above for s, with the expectations converted to averages.
<BR>
<BR>
The estimator for the first case reduces to: ##EQU29## The estimate is then
implemented as: ##EQU30## <BR>
<BR>
Defining .vertline.s.vertline..sub.c as the spectral component magnitude
estimate which leads to the minimum mean square error in estimating
c(.vertline.s.vertline.), the tables for c(.vertline.s.vertline.) are defined by: ##EQU31## which, when the
compression function c is any power or the logarithm function, reduces to
##EQU32## The estimate is then implemented as: ##EQU33##
<BR>
<BR>
IMPLEMENTATION <BR>
<BR>
The restoration procedure consists of generating a table for mapping noisy
magnitude spectra into optimal estimates of the clean speech spectra. Values
for the table are calculated using a large population of talkers to obtain
a speaker independent process. The table is incorporated into a short time
spectral analysis-synthesis program which replaces noisy speech with restored
speech. <BR>
<BR>
TABLE GENERATION <BR>
<BR>
The optimal estimators are functions of the distribution of .vertline.s.vertline.
in the DFT frequency bin, the SNR in that bin, and the spectral magnitude
.vertline.x.vertline. of the noisy signal divided by .sqroot.P.sub.N. A large sample
of conversational speech (27,000 frames) was taken from the wordspotting
data base, and a Gaussian noise model was used to build a set of tables
specifying the optimal estimates at a preselected set of five frequencies
and three SNR values. The five frequencies selected were a subset of the
center frequencies of the bandpass filterbank used to measure the spectral
parameters in the speech recognition system. The frequencies were 300, 425,
1063, 2129, and 3230 Hz. The optimal estimator tables were calculated at
each of these node frequencies. For the initial experiments, estimates at
other DFT bin frequencies were obtained by linear interpolation from these
five tables. Subsequent experiments used a single table representing the
average over all frequencies. <BR>
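<BR>
A minimal sketch of the two table strategies described above, assuming each node table is stored as one row of an array: linear interpolation across frequency between the five node tables, and a single table averaged over the nodes. The array layout and function names are hypothetical; the node frequencies are those listed in the text. <BR>
<BR>

```python
import numpy as np

NODE_FREQS_HZ = np.array([300.0, 425.0, 1063.0, 2129.0, 3230.0])

def table_at_frequency(freq_hz, node_tables):
    """Interpolate the node tables (shape 5 x n_entries) to one DFT bin frequency.

    np.interp holds the end tables constant outside the 300 to 3230 Hz range.
    """
    node_tables = np.asarray(node_tables, dtype=float)
    return np.array([np.interp(freq_hz, NODE_FREQS_HZ, column)
                     for column in node_tables.T])

def averaged_table(node_tables):
    """Collapse the node tables into a single table by averaging over frequency."""
    return np.mean(np.asarray(node_tables, dtype=float), axis=0)

# Hypothetical two-entry tables, one row per node frequency.
tables = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]])
midway = table_at_frequency(362.5, tables)
single = averaged_table(tables)
```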
<BR>
GENERATION OF A REPRESENTATIVE SPEECH POPULATION <BR>
<BR>
A marked data base was used as a representative conversational speech sample
for calculating the estimators. The data base consists of eight speakers
(six males and two females). Each 10 ms frame of speech has been marked as
either speech or non-speech by a trained listener, and only frames marked
as speech were used. For each frame, the DFT complex spectrum is calculated
at each of the specified node frequency bins. A total of 27,000 frames of
speech were used to estimate each table. <BR>
<BR>
SIGNAL-TO-NOISE RATIO ESTIMATION <BR>
<BR>
Table values for the optimal estimator are dependent upon the speech distribution
and the noise power. Thus, they are dependent upon the local signal-to-noise
ratio in each frequency bin. Tables were generated based on average
signal-to-noise ratios, across all frequencies, of 0 dB, 10 dB, and 20 dB.
At each of these levels the average noise power was measured. <BR>
<BR>
Average speech power was measured separately by first generating a histogram
of speech power from the multi-speaker conversational data base. The contribution
in the histogram due to silence is suppressed by noting that non-speech manifests
itself in the histogram as a chi-squared distribution at the low end of the
histogram. Non-speech power is removed by subtracting a least squares fit
to the chi-square distribution using the low end histogram samples from the
overall distribution. Speech power is then calculated by summing the difference.
<BR>
<BR>
Table entries are computed for normalized magnitude values, .xi., from 0
to 10 in steps of 0.2. The table is linearly extended from 70. to 700. Each
entry is calculated by specifying the value of P.sub.N based upon the average
signal-to-noise ratio and the value of .xi.. The tables are calculated by
averaging over all speech samples at a given frequency. <BR>
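<BR>
A hedged sketch of how one such table might be filled in, assuming the Gaussian noise model and working in units where the noise power is 1. Each clean magnitude sample is weighted by exp(-a.sup.2) I.sub.0 (2 .xi. a), where a is the clean magnitude divided by .sqroot.P.sub.N ; this weighting is derived from the stated model and is not a transcription of the patent's expressions. Applying c to the normalized magnitude is valid for power and logarithm compression, where the scale factor passes through. The function name and the Gaussian stand-in for clean speech are hypothetical. <BR>
<BR>

```python
import numpy as np

def magnitude_table(sigma, snr, c, c_inv, xi_grid):
    """Table of magnitude estimates, in units of sqrt(P_N), versus xi.

    sigma:   clean magnitudes normalized by sqrt(P_S)
    snr:     P_S / P_N as a linear ratio (10 dB corresponds to 10.0)
    c, c_inv: compression function and its inverse
    """
    a = np.asarray(sigma, dtype=float) * np.sqrt(snr)  # |s| / sqrt(P_N)
    table = np.empty(len(xi_grid))
    for k, xi in enumerate(xi_grid):
        # Likelihood weight after integrating out the uniform phase.
        # np.i0 overflows for arguments above roughly 700.
        w = np.exp(-a ** 2) * np.i0(2.0 * xi * a)
        table[k] = c_inv(np.sum(c(a) * w) / np.sum(w))
    return table

rng = np.random.default_rng(0)
n = 100_000
# Hypothetical stand-in for clean speech: unit-power complex Gaussian magnitudes.
sigma = np.abs(rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
xi_grid = np.linspace(0.0, 10.0, 51)  # 0 to 10 in steps of 0.2
root_table = magnitude_table(sigma, 10.0, np.sqrt, np.square, xi_grid)
```

With real speech the `sigma` array would come from the marked data base, and one table would be built per node frequency and SNR. <BR>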
<BR>
OPTIMAL ESTIMATORS <BR>
<BR>
The optimal estimators for each criterion function, f, are presented in FIG.
3. These tables were calculated based upon an average signal-to-noise ratio
of 10 dB. <BR>
<BR>
The estimator is a function of the signal-to-noise ratio. This is demonstrated
by computing the tables based upon signal-to-noise ratios of 0, 10, and 20
dB. Examining the resulting estimator shows that the signal-to-noise ratio
dependence is similar for all frequencies. FIG. 4 gives an example of the
SNR dependence for the complex spectrum estimate at frequency 1063 Hz.
<BR>
<BR>
ANALYSIS-SYNTHESIS PROCEDURES <BR>
<BR>
The analysis-synthesis procedures were implemented using an algorithm similar
to that used to implement Spectral Subtraction. The input noisy speech was
analyzed using 200-point, half-overlapped Hanning windows. A 256-point DFT
is taken and converted to polar coordinates. The magnitude spectrum is normalized
at each frequency by the square root of the average noise spectrum, P.sub.N.
The restored magnitude spectrum is found using the optimal estimator tables
at the five node frequencies and linearly interpolating at other frequencies.
<BR>
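The procedure above can be sketched as an overlap-add loop. For brevity this sketch uses a single estimator table for all frequencies instead of interpolating between five node tables, uses a periodic Hanning window so that half-overlapped windows sum to one, and resynthesizes with the noisy phase; these choices and all names are assumptions made for illustration. <BR>
<BR>

```python
import numpy as np

FRAME, HOP, NFFT = 200, 100, 256

def restore(noisy, noise_power, table_in, table_out):
    """Restore a noisy waveform by table lookup on normalized DFT magnitudes.

    noise_power: average noise power per DFT bin, length NFFT // 2 + 1
    table_in, table_out: estimator table mapping xi to the normalized estimate
    """
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(FRAME) / FRAME)
    scale = np.sqrt(noise_power)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - FRAME + 1, HOP):
        frame = noisy[start:start + FRAME] * window
        spec = np.fft.rfft(frame, NFFT)              # 256-point DFT of a 200-point frame
        mag, phase = np.abs(spec), np.angle(spec)    # polar coordinates
        xi = mag / scale                             # normalize by sqrt(P_N)
        est = np.interp(xi, table_in, table_out) * scale
        restored = np.fft.irfft(est * np.exp(1j * phase), NFFT)[:FRAME]
        out[start:start + FRAME] += restored         # overlap-add
    return out

# With an identity table the interior of the signal is reproduced.
rng = np.random.default_rng(1)
signal = rng.standard_normal(1000)
identity = np.array([0.0, 1.0e6])
rebuilt = restore(signal, np.ones(NFFT // 2 + 1), identity, identity)
```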
<BR>
EVALUATION ON CONNECTED DIGITS <BR>
<BR>
The effectiveness of the estimator as a noise suppression preprocessor was
measured both qualitatively by listening to the synthesis, and quantitatively
by measuring the improvement in performance of a connected digit recognition
algorithm using noisy speech with and without noise stripping at a
signal-to-noise ratio of 10 dB. Recognition performance is compared with
other approaches to noise stripping [8], performance without noise stripping,
and performance using alternative optimality criterion functions, f.
<BR>
<BR>
RECOGNITION EXPERIMENT <BR>
<BR>
The recognition experiment used a 3, 4, 5, and 7 connected digit data base
spoken by eight talkers (four males and four females). Template information
consisted of nine tokens per digit per speaker. Three of the tokens were
spoken in isolation and six of the tokens were manually extracted. For each
speaker there were 680 trials. The recognition experiments were done speaker
dependently. The feature vectors from templates and unknowns were matched
using a prior art metric. White Gaussian noise was added to the unknown data
to give an average signal-to-noise ratio of 10 dB. <BR>
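<BR>
A minimal sketch of adding white Gaussian noise at a target average signal-to-noise ratio. Signal power is taken here as the plain mean square over all samples, which is a simplification: the measurement described earlier excludes silence. The function name and the test tone are hypothetical. <BR>
<BR>

```python
import numpy as np

def add_white_noise(signal, snr_db, rng):
    """Return (noisy signal, noise power) for a target average SNR in dB."""
    signal = np.asarray(signal, dtype=float)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / 10.0 ** (snr_db / 10.0)
    noise = rng.standard_normal(signal.shape) * np.sqrt(p_noise)
    return signal + noise, p_noise

rng = np.random.default_rng(0)
tone = np.sin(2.0 * np.pi * 0.01 * np.arange(100_000))
noisy, p_noise = add_white_noise(tone, 10.0, rng)
measured_db = 10.0 * np.log10(np.mean(tone ** 2) / np.mean((noisy - tone) ** 2))
```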
<BR>
SUMMARY OF RESULTS <BR>
<BR>
Results are presented in terms of recognition error rates averaged over eight
speakers as a function of the type of preprocessing. Also given are the error
rate and the percent recovery from the noisy error rate, i.e., ##EQU34##
<BR>
<BR>
The need for two dimensional interpolation was also tested by collapsing
the five frequency tables into a single averaged table. The averaged table
for the root estimator is presented in FIG. 2. <BR>
<BR>
<PRE>
______________________________________
The legend for the table is:
Clean:             Speech recorded using a 12 bit analog-to-digital
                   converter in a quiet environment.
Noisy:             Speech with Gaussian noise added to give a 10 dB
                   signal-to-noise ratio.
SS:                Noisy processed by Spectral Subtraction [8]
Spectrum:          Noisy processed by using f(s) = s.
Power:             Noisy processed by using f(s) = .vertline.s.vertline..sup.2.
Mag:               Noisy processed by using f(s) = .vertline.s.vertline..
Root:              ##STR1##
Single Table Root: Noisy processed by Single Table of Root
Log:               Noisy processed by using f(s) = log.vertline.s.vertline..
______________________________________
Unknown     Template    Score (%)   Error Rate (%)   Recovery (%)
______________________________________
Clean       Clean       98.4         1.6             100
10 dB       Clean       58.1        41.9               0
SS          Clean       88.7        11.3              76
Root        Clean       89.8        10.2              79
Root-Ave.   Clean       88.6        11.4              76
Log         Clean       91.1         8.9              82
Mag         Clean       87.9        12.1              74
Power       Clean       81.2        18.8              57
Spect       Clean       86.5        13.5              70
10 dB       10 dB       96.4         3.6              95
SS          SS          95.5         4.5              93
Spect       Spect       96.5         3.5              95
Power       Power       97.6         2.4              98
Mag         Mag         97.7         2.3              98
Log         Log         97.9         2.1              99
Root        Root        97.9         2.1              99
Root-Ave.   Root-Ave.   97.8         2.2              99
______________________________________
</PRE>
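<P>
The expression for percent recovery is not reproduced in this text. As a hedged sketch, the form below, which measures how much of the gap between the noisy and clean error rates has been closed, is an inference that reproduces the Recovery column of the table; the function name is hypothetical. <BR>
<BR>

```python
def percent_recovery(error_rate, noisy_error=41.9, clean_error=1.6):
    """Share of the noisy-to-clean error gap that has been recovered, in percent."""
    return 100.0 * (noisy_error - error_rate) / (noisy_error - clean_error)

# Error rates (%) for clean templates, taken from the table above.
error_rates = {"SS": 11.3, "Root": 10.2, "Log": 8.9,
               "Mag": 12.1, "Power": 18.8, "Spect": 13.5}
recovered = {name: round(percent_recovery(err)) for name, err in error_rates.items()}
print(recovered)
```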
<P>
<BR>
<BR>
OBSERVATIONS <BR>
<BR>
Use of the optimal estimators reduces the error rate for a speaker dependent
connected digit speech recognition experiment using a 10 dB signal-to-noise
data base from 42% to 10%. In addition, by processing the template data in
the same way as the unknown data, the error rate can be further reduced from
10% to 2%. Standard Spectral Subtraction techniques perform at a level near
those of the optimal estimator. <BR>
<BR>
The use of a single table reduced performance by 1.1% compared to multiple
tables when the recognizer used clean templates, but resulted in essentially
no degradation when the recognizer used processed templates.
<BR>
<BR>
LISTENING TESTS <BR>
<BR>
Informal listening tests were conducted to compare the alternative forms
of processing. The results fall roughly into three categories:
(1) speech plus musical noise; (2) speech plus white noise; and (3) a blend
of 1 and 2. Spectral subtraction (SS) and the complex spectral estimate (Spect)
clearly fall in the first category. The Mag and Pow estimates fall in the
second category. Finally, the Root and Log processes fall in the third
category. <BR>
<BR>
These results can be correlated with the transfer function characteristics
by noting how the low amplitude signals are treated. When the low amplitude
magnitudes are severely attenuated, as in the Spect and SS options, the spectrum
is "more spike-like" with many narrow bands of signal separated by low energy
intervals giving rise to the musical quality. When the low amplitude signals
are set to a constant, as in the Mag and Pow options, the effect is to fill
in between the spikes with white noise. <BR>
<BR>
APPENDIX III <BR>
<BR>
MEAN AND VARIANCE OF TRANSFORMED OPTIMAL PARAMETERS <BR>
<BR>
Introduction <BR>
<BR>
This section addresses two topics: estimation of the magnitude value which
minimizes the mean square error between compressed magnitudes; and estimation
of the variance of the estimator in terms of precomputed mean value tables.
This section shows that using this approach to magnitude estimation produces
an unbiased estimator. It also shows that for monomial compression functions
such as square root or power, the variance can be calculated directly from
the mean tables. <BR>
<BR>
Optimal Magnitude Estimator <BR>
<BR>
Define the output power in a bandpass channel (either BPF or DFT) to be
P in the absence of noise, with magnitude M=&radic;P. Define P* as the
noisy power due to the presence of stationary noise with mean power value
P<SUB>n</SUB>. Let the mean power value of the clean speech signal be P<SUB>s</SUB>.
The general form for the optimal estimator, ĉ(M), of c(M) (not necessarily
a compressed function of the magnitude), which minimizes the error quantity
##EQU35##, is the conditional expected value: ##EQU36## <BR>
<BR>
In Appendix IV, methods are derived for computing estimators of the spectral
magnitude, M=&radic;P, which minimize this mean square error with respect
to various compression functions, c. The compression functions considered
include the identity, log, square and square root. This section presents
this formulation again from a perspective which emphasizes the relation between
the compression function and the conditional expected value. <BR>
<BR>
The optimal magnitude estimator M<SUB>c</SUB> must satisfy <BR>
<BR>
c(M<SUB>c</SUB>)=ĉ(M). <BR>
<BR>
We can solve for M<SUB>c</SUB> by considering compression functions c which are
one-to-one on the positive real line, R<SUP>+</SUP>, and thus have inverses on this domain.
Then ##EQU37## <BR>
<BR>
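The relation c(M<SUB>c</SUB>) = E{c(M)} can be illustrated numerically. The sketch
below assumes we already hold a set of clean magnitudes that are consistent with
one noisy observation, and forms M<SUB>c</SUB> as the inverse of c applied to the mean
of c(M) for the four compression functions named above. The array clean_given_obs
is made-up data and optimal_magnitude is a hypothetical helper. <BR>
<BR>

```python
import numpy as np

# Each entry: (compression function c, its inverse on the positive reals).
COMPRESSIONS = {
    "log": (np.log, np.exp),
    "root": (np.sqrt, np.square),
    "identity": (lambda m: m, lambda y: y),
    "square": (np.square, np.sqrt),
}


def optimal_magnitude(clean_samples, name):
    """Return M_c = c^{-1}( mean of c(M) ) over the conditional samples."""
    c, c_inv = COMPRESSIONS[name]
    return float(c_inv(np.mean(c(clean_samples))))


# Illustrative (made-up) clean magnitudes compatible with one noisy reading.
clean_given_obs = np.array([0.5, 1.0, 2.0, 4.0])

estimates = {name: optimal_magnitude(clean_given_obs, name)
             for name in COMPRESSIONS}
for name, value in estimates.items():
    print(f"{name:8s} M_c = {value:.4f}")
```

Because each estimate is a generalized mean of the same samples, the values come
out ordered log, root, identity, square from smallest to largest, so the choice
of c changes the magnitude estimate itself. <BR>
<BR>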
The Optimal Estimator As a Table Lookup <BR>
<BR>
Our method of computing M<SUB>c</SUB> uses the distribution of the noise (assumed
to be Gaussian) and the distribution of clean speech. The Gaussian noise
is completely characterized by its mean power P<SUB>n</SUB>. We assume that the
speech power is scaled by P<SUB>s</SUB>. Thus normalizing the instantaneous speech
power by P<SUB>s</SUB> results in a fixed distribution, which can be obtained from
any sample of speech just by normalizing. <BR>
<BR>
Under these conditions, and provided that c is a power or the logarithm, it can
be shown that a scaling factor can be extracted from M<SUB>c</SUB>, permitting
the table lookup to be a function of two variables. We chose to use
&radic;P<SUB>n</SUB>, as it had the dominant effect. <BR>
<BR>
The optimal estimator is implemented as a table lookup with &radic;P*/P<SUB>n</SUB>
as the argument and SNR=P<SUB>s</SUB>/P<SUB>n</SUB> as a parameter, i.e., different
tables for different values of SNR. Define the estimator M<SUB>c</SUB> in terms
of the table t<SUB>c</SUB> as: ##EQU38## where t<SUB>c</SUB> represents the table lookup
function based upon compression function c. Solving for t<SUB>c</SUB> gives: ##EQU39##
The form actually implemented normalizes M by &radic;P<SUB>n</SUB> before
forming the expected values. <BR>
<BR>
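A table of this kind can be sketched by simulation. The example below builds one
table for the square-root compression at a single SNR by binning the normalized
noisy magnitude and averaging the compressed clean magnitude in each bin. Several
things here are assumptions made only for illustration: clean speech is drawn
from a complex Gaussian rather than measured from speech, the argument is taken
to be the square root of P*/P<SUB>n</SUB>, the scaling is taken to be M<SUB>c</SUB> =
&radic;P<SUB>n</SUB> times the table value, and the names build_root_table and
estimate_magnitude are hypothetical. <BR>
<BR>

```python
import numpy as np


def build_root_table(snr, edges, n_frames=200_000, seed=0):
    """Monte Carlo sketch of a lookup table for c = square root.

    Works in units where the mean noise power P_n is 1, so magnitudes are
    already normalized by sqrt(P_n).  Returns one table value per bin of
    the normalized noisy magnitude.
    """
    rng = np.random.default_rng(seed)
    # ASSUMPTION: clean speech is complex Gaussian with mean power `snr`
    # (a stand-in for a distribution measured from real speech).
    s = np.sqrt(snr / 2) * (rng.standard_normal(n_frames)
                            + 1j * rng.standard_normal(n_frames))
    # Complex Gaussian noise with mean power 1.
    n = np.sqrt(0.5) * (rng.standard_normal(n_frames)
                        + 1j * rng.standard_normal(n_frames))
    clean_mag = np.abs(s)
    noisy_mag = np.abs(s + n)

    bin_index = np.digitize(noisy_mag, edges) - 1
    table = np.full(len(edges) - 1, np.nan)
    for b in range(len(table)):
        in_bin = clean_mag[bin_index == b]
        if in_bin.size:
            # t_c = c^{-1}( E[c(M) | noisy bin] ) with c = sqrt.
            table[b] = np.mean(np.sqrt(in_bin)) ** 2
    return table


edges = np.linspace(0.0, 6.0, 13)        # 12 bins, each 0.5 wide
table_10db = build_root_table(snr=10.0, edges=edges)


def estimate_magnitude(noisy_power, noise_power, table, edges):
    """ASSUMED scaling: M_c = sqrt(P_n) * t_c( sqrt(P*/P_n) )."""
    arg = np.sqrt(noisy_power / noise_power)
    b = int(np.clip(np.digitize(arg, edges) - 1, 0, len(table) - 1))
    return np.sqrt(noise_power) * table[b]


print(np.round(table_10db, 3))
print(estimate_magnitude(noisy_power=18.0, noise_power=2.0,
                         table=table_10db, edges=edges))
```

A separate table would be built for each SNR value of interest, and a finer bin
grid or interpolation between bins could replace the nearest-bin lookup used
here. <BR>
<BR>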
Compression Functions <BR>
<BR>
We use various compression functions applied to the magnitude to form the
recognition parameters. For example, the Olano metric uses the square root
of magnitude. In general, we get a recognition parameter, x, from compression
function k on the magnitude: <BR>
<BR>
x=k(M). <BR>
<BR>
In the presence of noise we use the optimal estimator for M rather than the
noisy value. Suppose we use the optimal estimator M<SUB>c</SUB>. Then we will be
using the recognition parameter <BR>
<BR>
x=k(M<SUB>c</SUB>) <BR>
<BR>
which will differ from the true, noise-free value by
<BR>
<BR>
&epsilon;=k(M<SUB>c</SUB>)-k(M). <BR>
<BR>
Statistics of Recognition Parameters <BR>
<BR>
In this section we derive the statistics of the recognition parameters with
respect to noise effects. Thus ##EQU40## When the compression functions k
and c are the same, ##EQU41## Evaluating the bias of the estimator gives:
##EQU42## so that in this case the estimator is unbiased. <BR>
<BR>
The variance of the estimation parameters can be obtained as: ##EQU43## Since
<BR>
<BR>
E{c(M)}=c(M<SUB>c</SUB>) <BR>
<BR>
substituting for M<SUB>c</SUB> gives: ##EQU44## <BR>
<BR>
By the same approach, when c<SUP>2</SUP> is one-to-one on the positive real line, R<SUP>+</SUP>,
and thus has an inverse, ##EQU45## where t<SUB>c<SUP>2</SUP></SUB> is the table lookup estimator
for the compression function c<SUP>2</SUP>=c&times;c (multiplication of functions,
not composition). <BR>
<BR>
Variance of the Square Root Estimator <BR>
<BR>
If we use Olano parameters (before normalization) with a square-root c table,
then c=&radic;, denoted by r, and c<SUP>2</SUP>=id. In that case ##EQU46## where t<SUB>m</SUB>
and t<SUB>r</SUB> are the tables for the magnitude and root estimators.
<BR>
<BR>
Variance of the Magnitude Estimator <BR>
<BR>
If we were to use magnitude values themselves without compression as the
recognition parameters, then c=id and c<SUP>2</SUP> is the square law, which we have
called p. In that case ##EQU47## <BR>
<BR>
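Both variance cases rest on the identity var{c(M)} = E{c<SUP>2</SUP>(M)} - (E{c(M)})<SUP>2</SUP>
combined with E{c(M)} = c(M<SUB>c</SUB>). The sketch below checks this numerically on
made-up conditional samples. It works with the estimator values directly rather
than with the normalized tables, the names M_r, M_m and M_p are illustrative, and
the explicit forms in the comments are our reading of the identity rather than a
transcription of the patent's equations. <BR>
<BR>

```python
import numpy as np

# Made-up clean magnitudes consistent with a single noisy observation.
m = np.array([0.5, 1.0, 2.0, 4.0])

# Optimal estimators M_c = c^{-1}(E[c(M)]) for three compression functions.
M_r = np.mean(np.sqrt(m)) ** 2        # c = square root (r)
M_m = np.mean(m)                      # c = identity    (m)
M_p = np.sqrt(np.mean(m ** 2))        # c = square law  (p)

# Square-root parameters: c^2 = identity, so
#   var{sqrt(M)} = E{M} - (E{sqrt(M)})^2 = M_m - M_r
var_root_from_means = M_m - M_r

# Magnitude parameters: c^2 = square law, so
#   var{M} = E{M^2} - (E{M})^2 = M_p^2 - M_m^2
var_mag_from_means = M_p ** 2 - M_m ** 2

# Compare against the variance computed directly from the samples.
print(var_root_from_means, np.var(np.sqrt(m)))
print(var_mag_from_means, np.var(m))
```

The point of the identity is practical: no separate variance table needs to be
trained, because the variance falls out of mean tables that already exist. <BR>
<BR>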
Examples of Mean and Variance Data <BR>
<BR>
Mean and variance data based upon the square root function were generated
with frames marked with SxSy categories 0 through 5 and displayed with two
types of scatter plots. Spectral outputs from the ninth filter were selected
as approximately representing the average signal-to-noise ratio over the
entire baseband. FIG. 5 shows the clean versus noisy 4th root of power spectral
frames using just speech frames. In FIG. 6 both speech and non-speech frames
are included. Along any vertical axis the estimator lies at the mean value.
Likewise the standard deviation represents about 30 percent of the scatter
away from the mean. The dark band in FIG. 6 corresponds to frames where the
clean speech was near zero. Since the optimal parameter tables were trained
on speech-only frames, the mean distance is not biased by this non-speech
concentration. <BR>
<BR>
APPENDIX IV <BR>
<BR>
COMBINING SPEECH AND NOISE IN THE BANDPASS FILTER DOMAIN <BR>
<BR>
Introduction <BR>
<BR>
This section derives the probability density function for the noisy filterbank
parameter, X, given the clean filterbank parameter, S, P(X|S). This
density describes how speech and noise combine in the filterbank domain.
It is needed in order to generate the conditional expected value, Ŝ, and
its variance, of the clean filterbank parameter given the noisy parameter,
E[S|X]. As discussed in Appendix III, this conditional expected
value, Ŝ, minimizes the mean square error: <BR>
<BR>
E[(S-Ŝ)<SUP>2</SUP>|X]. <BR>
<BR>
Each bandpass filter (BPF) channel is modeled as the sum of independent DFT
channels. The number of independent channels will be less than or equal to
the number of bins combined to form a filterbank output. <BR>
<BR>
Each DFT channel is modeled as an additive complex process, with the signal
s<SUP>k</SUP>=&xi;<SUP>k</SUP>+i&eta;<SUP>k</SUP> in the kth channel and noise
n<SUP>k</SUP>=&xi;<SUB>n</SUB><SUP>k</SUP>+i&eta;<SUB>n</SUB><SUP>k</SUP>. The noise is assumed Gaussian and
uniformly distributed in phase. Define the noisy signal as <BR>
<BR>
x<SUP>k</SUP>=s<SUP>k</SUP>+n<SUP>k</SUP>=(&xi;<SUP>k</SUP>+&xi;<SUB>n</SUB><SUP>k</SUP>)+i(&eta;<SUP>k</SUP>+&eta;<SUB>n</SUB><SUP>k</SUP>) <BR>
<BR>
The noisy channel signals add as: ##EQU48## <BR>
<BR>
We assume the noise in each independent channel has the same standard deviation,
&sigma;<SUB>n</SUB> (to be determined). Then the density of the noisy signal given
the clean signal in the complex plane is: ##EQU49## <BR>
<BR>
Let X be the BPF channel output with noise and S the BPF channel output without
noise. <BR>
<BR>
The joint density of the individual channel observations (x<SUP>1</SUP>, . . . ,
x<SUP>&chi;</SUP>) given the signal values (s<SUP>1</SUP>, . . . , s<SUP>&chi;</SUP>) is the
product density: ##EQU50## where &chi; is the number of individual channels.
<BR>
<BR>
We see that the conditional distribution of X given the &chi; signal values
(s<SUP>1</SUP>, . . . , s<SUP>&chi;</SUP>) is just the distribution of a sum of squares
of 2&chi; normal variates which are independent, each having variance
&sigma;<SUB>n</SUB><SUP>2</SUP>, and means &xi;<SUP>1</SUP>, &eta;<SUP>1</SUP>, &xi;<SUP>2</SUP>, &eta;<SUP>2</SUP>,
. . . , &xi;<SUP>&chi;</SUP>, &eta;<SUP>&chi;</SUP>. This is the non-central chi-squared
distribution in 2&chi; degrees of freedom. <BR>
<BR>
Kendall and Stuart (Vol. II, page 244, Advanced Theory of Statistics) show
that the density of the quantity ##EQU51## where each x<SUB>i</SUB> is unit variance
Gaussian with mean &mu;<SUB>i</SUB> (all independent) is ##EQU52## <BR>
<BR>
(We note that the density of Z depends on the means &mu;<SUB>1</SUB>, . . . ,
&mu;<SUB>n</SUB> only through the sum of their squares, &lambda;, which is fortuitous,
as it makes the density of X depend on the individual DFT channel means
(&xi;<SUP>k</SUP>, &eta;<SUP>k</SUP>) only through the sum of their squares, which is S.)
<BR>
<BR>
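The claim that the distribution of X depends on the channel means only through
S can be checked by simulation. The sketch below draws the 2&chi; noisy components
for two different sets of channel means that share the same sum of squares and
compares the sample mean of X with the standard non-central chi-squared value,
S + 2&chi;&sigma;<SUB>n</SUB><SUP>2</SUP>. The helper name simulate_bpf_output and all numerical
values are made up for illustration. <BR>
<BR>

```python
import numpy as np


def simulate_bpf_output(xi, eta, sigma_n, n_frames=200_000, seed=0):
    """Sum of squared noisy DFT components over chi independent channels.

    xi, eta : real and imaginary parts of the clean signal, one per channel.
    sigma_n : standard deviation of each real noise component.
    """
    rng = np.random.default_rng(seed)
    xi = np.asarray(xi, dtype=float)
    eta = np.asarray(eta, dtype=float)
    re = xi + sigma_n * rng.standard_normal((n_frames, xi.size))
    im = eta + sigma_n * rng.standard_normal((n_frames, eta.size))
    return np.sum(re ** 2 + im ** 2, axis=1)     # X, one value per frame


sigma_n = np.sqrt(0.5)
# Two different sets of channel means with the same sum of squares S = 4.
x_a = simulate_bpf_output([2.0, 0.0, 0.0], [0.0, 0.0, 0.0], sigma_n, seed=1)
x_b = simulate_bpf_output([1.0, 1.0, 0.0], [1.0, 0.0, 1.0], sigma_n, seed=2)

chi, S = 3, 4.0
expected_mean = S + 2 * chi * sigma_n ** 2       # non-central chi-squared mean
print(x_a.mean(), x_b.mean(), expected_mean)
```

Both runs should give sample means close to the same expected value even though
the energy is concentrated in one channel in the first case and spread over all
three in the second. <BR>
<BR>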
To apply this, note that ##EQU53## showing that X/&sigma;<SUP>2</SUP> is distributed
as the sum of squares of 2&chi; unit variance, independent Gaussians with means
##EQU54## Therefore the density of ##EQU55## <BR>
<BR>
Simplification Using Bessel Functions <BR>
<BR>
Abramowitz and Stegun (AMS55) formula 9.6.1 shows that the modified Bessel
function of the first kind of order &upsilon; is ##EQU56## Comparing this
with the expression for P(X|S), we see that ##EQU57## For later use,
we note the special case, obtainable directly from the power series expansion,
##EQU58## This agrees exactly with the (central) chi-squared distribution
in 2&chi; degrees of freedom, as it should. <BR>
<BR>
Determination of the Number of Independent Degrees of Freedom <BR>
<BR>
The special case S=0 predicts that the BPF channel will pass Gaussian noise
yielding outputs with the statistics of the chi-squared distribution with
2&chi; degrees of freedom. The mean and variance are ##EQU59## Let us
define P<SUB>n</SUB> to be the mean noise power in the channel with no signal present;
that is, ##EQU60## <BR>
<BR>
By measuring the average and the variance of the output of the BPF channel,
we can therefore estimate the properties of the channel by the way it passes
Gaussian noise. The channel can be characterized by any two of the four
parameters &chi;, &sigma;<SUB>n</SUB><SUP>2</SUP>, P<SUB>n</SUB>, var(P<SUB>n</SUB>). <BR>
<BR>
We choose &chi; and P<SUB>n</SUB> because the former should be independent of the
channel input and the latter should be a constant gain times the variance
or power of the input noise. <BR>
<BR>
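The measurement described above can be sketched as a method-of-moments fit. For
a central chi-squared variable with 2&chi; degrees of freedom scaled by
&sigma;<SUB>n</SUB><SUP>2</SUP>, the standard moments are mean = 2&chi;&sigma;<SUB>n</SUB><SUP>2</SUP> and
variance = 4&chi;&sigma;<SUB>n</SUB><SUP>4</SUP>, so &chi; is the squared mean divided by the
variance. The example assumes these standard moments, uses a gamma distribution
to simulate noise-only output with a fractional &chi;, and introduces the
hypothetical helper estimate_channel; none of these names come from the text. <BR>
<BR>

```python
import numpy as np


def estimate_channel(noise_only_output):
    """Estimate (chi, P_n) from noise-only BPF channel output powers."""
    p_n = np.mean(noise_only_output)          # mean  = 2 * chi * sigma_n^2
    var = np.var(noise_only_output)           # var   = 4 * chi * sigma_n^4
    chi = p_n ** 2 / var                      # mean^2 / var = chi
    return chi, p_n


rng = np.random.default_rng(0)
true_chi, true_p_n = 1.81, 3.0
# Gamma(shape=chi, scale=2*sigma_n^2) has exactly the moments quoted above.
sigma_n_sq = true_p_n / (2 * true_chi)
frames = rng.gamma(shape=true_chi, scale=2 * sigma_n_sq, size=200_000)

chi_hat, p_n_hat = estimate_channel(frames)
print(f"chi ~ {chi_hat:.2f}, P_n ~ {p_n_hat:.2f}")
```

With far fewer frames the estimate of &chi; becomes noticeably noisier, since it
depends on a sample variance. <BR>
<BR>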
Density In Terms of Measurement Parameters <BR>
<BR>
The model predicts the distribution of noisy output, given the clean output,
in terms of the channel characteristic &chi; and the noise-only mean output
P<SUB>n</SUB>: ##EQU61## and, in the special case of noise only, ##EQU62## FIG.
7 shows an example of how well the noise-only case fits observation. Shown
is the actual cumulative distribution for Gaussian white noise through the
first channel and the predicted distribution, P(X|S=0). The first
channel has an independent equivalent count, &chi;, of 1.81. A sample
of 1103 frames was used to generate the distribution. FIG. 8 shows an example
of how well the noisy speech case fits the predicted distribution,
P(X|S). Shown are the fractiles for the distribution superimposed
on a scatter plot of WIJA's clean versus noisy channel parameters taken from
the 25th channel. <BR>
<BR>
Distribution Using Normalized Parameters <BR>
<BR>
We have been using quantities with the dimension of magnitude (versus power)
and normalizing by the rms magnitude of noise. What is actually required
is the distribution of the non-dimensional quantities ##EQU63## <BR>
<BR>
As one check of these results, we see that they reduce to the previous case
of individual DFT channels when &chi;=1. In that case, these distributions
are <BR>
<BR>
P<SUB>1</SUB>(&xi;|&eta;)=2&xi;e<SUP>-(&xi;<SUP>2</SUP>+&eta;<SUP>2</SUP>)</SUP>I<SUB>0</SUB>(2&xi;&eta;) <BR>
<BR>
and <BR>
<BR>
P<SUB>1</SUB>(&xi;|&eta;=0)=2&xi;e<SUP>-&xi;<SUP>2</SUP></SUP> <BR>
<BR>
The first formula is used to find optimal DFT channel estimators and the
second is a normalized Rayleigh distribution. <BR>
<BR>
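As a quick numerical check of the two single-channel densities, the sketch below
integrates P<SUB>1</SUB>(&xi;|&eta;) over &xi; for several values of &eta; and confirms
that the noise-only case has the Rayleigh mean &radic;&pi;/2. It uses NumPy's
np.i0 for the modified Bessel function I<SUB>0</SUB>; the grid limits and the helper
names p1 and trapezoid are choices made for this example only. <BR>
<BR>

```python
import numpy as np


def p1(xi, eta):
    """Normalized single-channel density P_1(xi | eta)."""
    return 2.0 * xi * np.exp(-(xi ** 2 + eta ** 2)) * np.i0(2.0 * xi * eta)


def trapezoid(y, x):
    """Plain trapezoidal rule, written out to avoid version-specific APIs."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


xi = np.linspace(0.0, 12.0, 24_001)
for eta in (0.0, 0.5, 2.0):
    area = trapezoid(p1(xi, eta), xi)
    print(f"eta = {eta}: integral of P_1 = {area:.6f}")

# With eta = 0 the density is Rayleigh; its mean is sqrt(pi)/2.
rayleigh_mean = trapezoid(xi * p1(xi, 0.0), xi)
print(rayleigh_mean, np.sqrt(np.pi) / 2)
```

For much larger arguments the product of the decaying exponential and the
growing Bessel function is better evaluated in scaled form, which is one reason
a different transformation is convenient for the Bessel calculation. <BR>
<BR>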
For scaling purposes we have found it convenient to use a different
transformation for the Bessel function calculation. We use ##EQU64##
<BR>
<BR>
APPENDIX V <BR>
<BR>
UNNORMALIZED NOISE METRIC STUDIES <BR>
<BR>
Introduction <BR>
<BR>
This section presents an analysis of the noise immune distance metric without
normalization. Parameters consist of the fourth root of filterbank output
power. To demonstrate that the metric has been calculated properly, scatter
plots of clean distances versus noise parameters superimposed with the noise
metric are presented. Also, in order to verify the installation of the variance
tables and new metric code, wordspotting runs using unnormalized parameters
were made. As expected, use of these parameters reduces overall performance
by 10% to 20%. However, the intent of these experiments was to verify the
code and correlate a reduction in rms error to increased wordspotting
performance. The results demonstrate that the combination of optimal parameters
plus variance terms improves performance to the same level as using clean
speech for five of the six template talkers. <BR>
<BR>
Minimum Error Estimate <BR>
<BR>
The noise metric is based upon the premise that adding noise to the unknown
or template generates noisy distances. Noise immunity is obtained by replacing
the Euclidean squared distance between the template and unknown frames by
the conditional expectation of the squared distance without noise, given
the noisy observations. That is, <BR>
<BR>
d<SUP>2</SUP>=E[(t<SUB>s</SUB>-u<SUB>s</SUB>)<SUP>2</SUP> | t<SUB>s+n</SUB>, u<SUB>s+n</SUB>, P<SUB>s</SUB>, P<SUB>n</SUB>] <BR>
<BR>
where <BR>
<BR>
t<SUB>s</SUB>=4th root of power for template <BR>
<BR>
u<SUB>s</SUB>=4th root of power for unknown <BR>
<BR>
t<SUB>s+n</SUB>=noisy template filterbank parameter <BR>
<BR>
u<SUB>s+n</SUB>=noisy unknown filterbank parameter <BR>
<BR>
P<SUB>s</SUB>=Average Power of Speech <BR>
<BR>
P<SUB>n</SUB>=Average Power of Noise. <BR>
<BR>
The conditional expected value is the minimum mean squared error estimate
of the distance, given the observations. It will reduce the noise on the
frame-to-frame distance values to its minimum possible value for the given
data. It is also an unbiased estimate. <BR>
<BR>
Expanding the expected value and replacing mean values by their optimal estimates
gives: <BR>
<BR>
d<SUP>2</SUP>=&Sigma;[(t<SUB>i</SUB>-u<SUB>i</SUB>)<SUP>2</SUP>+&sigma;<SUB>t<SUB>i</SUB></SUB><SUP>2</SUP>+&sigma;<SUB>u<SUB>i</SUB></SUB><SUP>2</SUP>] <BR>
<BR>
The quantities t<SUB>i</SUB> and u<SUB>i</SUB> are the expected values of the template
and unknown for each channel, and &sigma;<SUB>t<SUB>i</SUB></SUB><SUP>2</SUP> and &sigma;<SUB>u<SUB>i</SUB></SUB><SUP>2</SUP>
are the variances of these estimates for each channel. Notice that this metric
model reduces to a standard Euclidean norm in the absence of noise. The metric
model is also symmetric and can be applied when either the template or unknown
or both are noisy. <BR>
<BR>
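The expanded metric can be written in a few lines. The sketch below computes the
per-frame distance from estimated means and variances and compares it with a
Monte Carlo average of the squared distance, assuming the template and unknown
estimation errors are independent (they are drawn as Gaussian here purely for
convenience). The function name noise_metric and all numbers are made up for
illustration. <BR>
<BR>

```python
import numpy as np


def noise_metric(t_hat, u_hat, var_t, var_u):
    """Expected squared distance: sum_i (t_i - u_i)^2 + var_t_i + var_u_i."""
    t_hat = np.asarray(t_hat, dtype=float)
    u_hat = np.asarray(u_hat, dtype=float)
    per_channel = (t_hat - u_hat) ** 2 + np.asarray(var_t) + np.asarray(var_u)
    return float(np.sum(per_channel))


# Made-up per-channel estimates and variances for a 3-channel frame pair.
t_hat = np.array([1.0, 2.0, 0.5])
u_hat = np.array([1.5, 1.0, 0.5])
var_t = np.array([0.10, 0.20, 0.05])
var_u = np.array([0.30, 0.10, 0.05])

d2 = noise_metric(t_hat, u_hat, var_t, var_u)

# Monte Carlo check, ASSUMING independent template and unknown errors.
rng = np.random.default_rng(0)
t = t_hat + np.sqrt(var_t) * rng.standard_normal((200_000, 3))
u = u_hat + np.sqrt(var_u) * rng.standard_normal((200_000, 3))
d2_mc = np.mean(np.sum((t - u) ** 2, axis=1))

print(d2, d2_mc)
```

Setting both variance arrays to zero recovers the ordinary squared Euclidean
distance, which is the noise-free limit noted above. <BR>
<BR>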
Values for these means and variances are obtained by table lookup. These
tables are generated using filterbank parameters as previously described.
<BR>
<BR>
To establish that the metric was working properly, two types of experiments
were conducted. First, scatter plots of clean distances versus noisy filterbank
parameters were generated and superimposed with the Euclidean metrics using
noisy and optimal parameters and with the optimal parameters plus the variance
terms. Second, wordspotting runs with these parameters and metrics were made.
<BR>
<BR>
Verification of Expected Value <BR>
<BR>
In the same manner as used in Appendix III, the validity of the noise metric
as a conditional expected value can be examined by plotting clean distances
versus noisy parameters. The distance requires a noisy unknown frame and
a clean or noisy template frame. In order to plot in just two dimensions,
the template frame was held constant and a set of distances was generated
for various unknown conditions and metrics. Three template frames, 0, 10,
and 50, were selected from the Boonsburo template of speaker 50, representing
the minimum, average and maximum spectral values. Distances and spectral
outputs from the ninth filter were selected as approximately representing
the average signal-to-noise ratio over the entire baseband. FIGS. 9 through
11 show the scatter data along with the noisy distance (straight parabola),
the Euclidean distance with optimal parameters, and the noise metric. For this
single-channel, single-template-frame configuration, there is little difference
between using just the optimal parameters and the parameters plus the variance
term. However, in each case the noise metric passes through the mean of the
clean distances given the noisy unknown parameter. The dark band in each
figure corresponds to distances where the clean speech was near zero, resulting
in a distance equal to the square of the template parameter. Since the optimal
parameter tables were trained on speech-only frames, the mean distance is
not biased by this non-speech concentration. Note that for large values of
the noisy parameter, all three distances agree. This is to be expected,
since the mean has approached the identity and the variance has approached
zero (see FIGS. 10 and 11). <BR>
<BR>
Reduction in Mean Square Error <BR>
<BR>
The mean square error for each of these cases was also computed. The error
was calculated as: ##EQU65## As expected, the error decreased monotonically
going from the noisy parameters, to the optimal parameters, to the noise metric. Below is
the computed mean square error between the clean distance and the distances computed
with each of the following parameters: noisy, optimal estimator, and optimal estimator
plus variance, i.e., noise metric. The distance is straight Euclidean, i.e.,
the sum of the squares of the differences between the unknown and template
spectral values. These distances for the mean square error calculation were
computed by selecting the 10th frame from the Boonsburo template for speaker
50 and dragging it by 1100 speech frames from the first section of WIJA.
The average mean square error values are: <BR>
<BR>
<PRE>
______________________________________
Condition                      mse
______________________________________
noisy - clean                  9.4
optimal parameters - clean     3.3
noise metric - clean           2.5
______________________________________
</PRE>
<P>
<BR>
<BR>
Although this represents only a coarse examination of performance, it does
demonstrate that the metric is performing as desired. A more realistic test
requires examining its performance in a wordspotting experiment as defined
below. <BR>
<BR>
Wordspotting Using Unnormalized Parameters <BR>
<BR>
The wordspotter was modified to use unnormalized 4th root parameters and
Euclidean distance with or without the variance terms added. All other aspects
of the wordspotting program remained the same, i.e. standard blind deconvolution,
overlap removal, biasing, etc. FIGS. 12 through 17 show the ROC curves for
each template talker. <BR>
<BR>
Observations <BR>
<BR>
Although overall performance using unnormalized parameters is lower than
using normalized features, these experiments show some interesting
characteristics. Specifically, for five of the six template talkers, use
of the optimal parameters and/or the noise metric returned performance to
levels equal to the clean unknown data. This degree of restoration is not
found in the normalized case. Stated another way, normalization tends to
minimize the deleterious effect of noise and the restoring effect of the
optimal parameters. <BR>
<BR>
<CENTER>
<B>* * * * *</B>
</CENTER>
<P>
<HR>
<CENTER>
</CENTER>
</BODY></HTML>