As long as there have been hearing instruments with directional microphones, the common assumption has been that hearing instrument wearers look at the person to whom they are listening. That theory is interesting, widespread, reasonable – and often incorrect. Consider this example: you are seated in the third row at a lecture on hearing instruments. The lecturer stands at a podium about 60° to your left, speaking about directional microphone performance while using a laser pointer to walk through a polar plot on the screen in front of you. Are you looking at the speaker or the screen?

This is by no means an isolated case. A waiter speaks to you over your shoulder at a restaurant: "Would you like the fish or the beef?" Your children chat with you from the back seat of the minivan on the way to school. Are you watching them or the road? Your spouse leans over in church to whisper that you have a coffee stain on your shirt. A substantial percentage of the time, we are not looking at the person to whom we are listening.

How often are signals of interest located somewhere other than directly in front of us? Walden et al. (2004)1 asked 17 hearing instrument wearers to track various aspects of desired signals and noises for seven days over a four-week period. The participants reported on 1586 listening experiences in which speech was present. They indicated that speech came from the front in 1268 of those instances and from some other direction in the remaining 318. In other words, listeners reported that speech came from the front roughly 80% of the time and from another direction about 20% of the time. Speech arrives from somewhere other than the front a substantial share of the time, and if you assume that people always look at the person to whom they are listening, you would be correct only about 80% of the time.
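
As a quick check of that split, here is the arithmetic behind the 80/20 figures (a minimal Python sketch using only the counts reported in the text):

```python
# Counts reported by Walden et al. (2004), as cited in the text.
total_instances = 1586
speech_from_front = 1268
speech_from_elsewhere = total_instances - speech_from_front  # 318

print(f"front:     {speech_from_front / total_instances:.1%}")     # ~79.9%
print(f"elsewhere: {speech_from_elsewhere / total_instances:.1%}") # ~20.1%
```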

Why, then, have hearing instrument companies been so attached to this assumption? Part of the reason is historical: there was no choice. Hearing instruments were not yet able to determine which direction speech was coming from. That changed with the introduction of SpeechPro in Unitron hearing instruments. SpeechPro uses binaural acoustic signal processing to determine, on a moment-to-moment basis, whether speech is coming from the front, right, left or back. By using the inputs to both microphones on each hearing instrument and sharing information between the two devices, the binaural system as a whole can converge on the location of the desired speech with great accuracy.
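
The internals of the SpeechPro classifier are beyond the scope of this paper, but as a conceptual illustration, here is a minimal, hypothetical sketch of how a front/back/left/right decision could be derived from the four microphones of a binaural fitting. The interaural level threshold, the assumed microphone spacing, and the delay-and-subtract cardioid comparison are illustrative choices for the sketch, not the actual implementation.

```python
import numpy as np

FS = 48_000   # sample rate in Hz (assumed for this sketch)
D = 2         # front-to-rear mic delay in samples (hypothetical ~14 mm spacing)

def level_db(x, eps=1e-12):
    """Average signal level in dB (arbitrary reference)."""
    return 10 * np.log10(np.mean(np.square(x)) + eps)

def cardioid(toward, away, d=D):
    """Delay-and-subtract cardioid pointing toward the first microphone,
    with its null toward the second microphone."""
    return toward[d:] - away[:-d]

def classify_quadrant(lf, lr, rf, rr, ild_threshold_db=4.0):
    """Toy front/back/left/right decision from the four microphones of a
    binaural fitting (lf/lr = left front/rear, rf/rr = right front/rear).
    Illustrative only -- not Unitron's SpeechPro implementation."""
    ild = level_db(lf) - level_db(rf)        # > 0 means louder at the left ear
    if abs(ild) > ild_threshold_db:          # strong head-shadow cue -> a side
        return "left" if ild > 0 else "right"
    # Otherwise compare forward- and backward-facing cardioids on both ears;
    # the pair carrying more energy points toward the talker.
    front_energy = level_db(cardioid(lf, lr)) + level_db(cardioid(rf, rr))
    back_energy = level_db(cardioid(lr, lf)) + level_db(cardioid(rr, rf))
    return "front" if front_energy > back_energy else "back"
```

A production system would additionally smooth these momentary decisions over time and coordinate them across the wireless link between the two instruments, which is exactly the speed-versus-certainty balance examined in the benchmarking below.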

There are substantial potential advantages to knowing the direction of desired speech relative to background noise. Not only can we point the target area of directional microphones toward the speech and away from distracting noise, we can also make other direction-specific (azimuth-specific) adjustments to enhance hearing instrument performance. For example, the microphone location effect (MLE) for receiver-in-canal (RIC) instruments with microphones at the top of the pinna differs for sounds arriving from the front compared with the back or the side. The transfer function that accounts for the MLE has always assumed a signal from the front, because a single direction had to be chosen and the front was the obvious choice. If the hearing instruments can now determine the direction of speech, we can dynamically adjust the MLE to the best values for the orientation of the speech signal in the moment. This adjustment can improve sound quality and the natural perception of direction. How effective those dynamic adjustments are depends on how well the detection system actually works.
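
To show the shape of such an adjustment, here is a minimal sketch of a direction-dependent MLE correction. The band centers, correction values, and function names are invented for the example; real corrections would come from acoustic measurements of the specific device.

```python
import numpy as np

# Hypothetical per-band MLE corrections in dB for a RIC with microphones above
# the pinna. Values are invented for illustration, not measured Unitron data.
BANDS_HZ = [500, 1000, 2000, 4000, 8000]
MLE_CORRECTION_DB = {
    "front": [0.0, 1.0, 2.0, 3.0, 2.0],
    "back":  [0.0, 0.5, 1.0, 1.5, 3.0],
    "left":  [0.0, 0.5, 1.5, 2.5, 2.5],
    "right": [0.0, 0.5, 1.5, 2.5, 2.5],
}

def apply_mle_correction(band_levels_db, detected_direction):
    """Add the direction-specific MLE correction to per-band input levels.
    Falls back to the traditional front-facing assumption when the detector
    has no confident estimate."""
    key = detected_direction if detected_direction in MLE_CORRECTION_DB else "front"
    return list(np.add(band_levels_db, MLE_CORRECTION_DB[key]))

# Example: the detector reports speech from the back.
print(apply_mle_correction([55, 60, 58, 52, 45], "back"))
```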

We turned to the University of South Florida to help us benchmark the accuracy of our speech azimuth detectors. Digital detection proficiency of any type depends on a balance of speed, certainty and processing capability. In essence, the fewer samples you need to make a decision about a listening environment, the faster your detection. However, the more samples you take of the environment, the more certain you are that your detection is correct. In other words, the faster you go, the more mistakes you are likely to make. The slower you go, the more delayed your decisions will be and the further out of synch with the dynamic listening environment they can become. The ideal balance is to go fast enough so that your decisions are still relevant for the changing listening environment, but slow enough so you don’t make too many mistakes.
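
The tradeoff can be made concrete with a toy simulation: assume a detector whose individual short-time frames identify the speech direction correctly only 60% of the time (an invented figure), and let it take a majority vote over progressively longer observation windows. Accuracy climbs with the number of frames, but so does the delay before a decision is available.

```python
import numpy as np

rng = np.random.default_rng(0)

def vote_accuracy(frames_per_decision, p_frame_correct=0.6, trials=100_000):
    """Probability that a majority vote over N noisy per-frame estimates picks
    the true direction (binary front/back toy case). p_frame_correct is a
    hypothetical per-frame hit rate, not a measured value."""
    hits = rng.random((trials, frames_per_decision)) < p_frame_correct
    return float(np.mean(hits.sum(axis=1) > frames_per_decision / 2))

# Longer windows -> higher accuracy, but also a longer wait for each decision.
for n_frames in (1, 5, 15, 45):
    print(f"{n_frames:2d} frames -> {vote_accuracy(n_frames):.2f} accuracy")
```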

One important caveat to this rule is the processing capability of the device doing the sampling. As with personal computers, each generation of signal processing chip in our hearing instruments is progressively faster and more powerful than the ones that preceded it. We therefore benchmarked our more recent Tempus platform as well as the earlier North platform. See Figures 2a and 2b.

The performance of these two generations of Unitron devices was tested in the Auditory & Speech Sciences Laboratory at the University of South Florida (USF). Speech and noise passages were presented in a sound-treated room containing an array of 24 speakers, with presentations from four azimuths (see Figure 1). The investigators monitored detection over four hours of sampling using 40-second intervals. The speech and noise samples consisted of combinations of the following:

  • Four speech passages: male, female and M/F turn taking
  • Five different types of diffuse background noise
  • Four SNRs: -3, 0, 3 and 6 dB
  • Four azimuths: 0°, 90°, 180° and 270°

Figure 1. Speech (male, female or turn taking) was presented from any one of the speakers marked with a green + at any given time. One of the noise types was presented from all four speakers marked with a red – at all times.

Accuracy was calculated as the percentage of correct detections from each of three start times to the end of the full 40-second interval (a minimal sketch of this calculation follows the list below). The start times were:

  • 0-second delay = “Instantaneous Measure”, which occurs immediately upon a change of direction
  • 5-second delay = “Intermediate Offset”
  • the remaining interval following the average switch time = “Best Offset” for each device:
      • North platform (Quantum2 Pro) = 17 seconds
      • Tempus platform (Moxi Fit Pro) = 6.2 seconds
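
As a minimal sketch of that averaging (using an invented one-second detection log rather than the USF data), the three offsets simply discard different amounts of the interval before computing the percentage correct:

```python
import numpy as np

def offset_accuracy(correct_per_second, offset_s):
    """Fraction of correct detections in a 40-second interval after ignoring
    the first offset_s seconds (0 = instantaneous, 5 = intermediate,
    average switch time = best offset)."""
    return float(np.mean(correct_per_second[offset_s:]))

# Hypothetical log: the detector is wrong while it converges, then correct.
log = np.array([False] * 6 + [True] * 34)   # 40 one-second samples
for label, offset in [("instantaneous", 0), ("intermediate", 5), ("best offset", 6)]:
    print(f"{label:>13}: {offset_accuracy(log, offset):.0%}")  # 6 s ~ Tempus switch time
```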

Notice in Figure 2a that it took up to 17 seconds to reach the Best Offset for the Quantum2 Pro, but only 6.2 seconds for the Moxi Fit Pro (Figure 2b). The reduced time required to converge on a reliable detection with the Moxi Fit Pro reflects the improved processing capabilities of the Tempus platform relative to the North platform.

Figures 2a & 2b. Detection accuracy of two generations of Unitron products with SpeechPro: 2a) the Quantum2 Pro (North) device on top and 2b) the Moxi Fit Pro (Tempus) on the bottom. Results are displayed from left to right in each figure as the signal-to-noise ratio (SNR) goes from very difficult (-3 dB) to very easy (+6 dB). Accuracy is shown as a proportion from 0 (0%) to 1 (100%).

There are three trends visible in Figures 2a & 2b. First, the later Moxi Fit Pro (Tempus) yielded much more accurate detection results than the earlier Quantum2 Pro (North). The detection accuracy of the Quantum2 Pro ranged from just under 30% at -3 dB SNR to just under 70% at +6 dB SNR. Meanwhile, the Moxi Fit Pro accuracy scores ranged from approximately 70% at worst in the -3 dB SNR condition up to nearly 90% correct in the +6 dB SNR condition. Thus, there was a large improvement in detection accuracy associated with the switch to the newer Tempus platform. Furthermore, in the SNR range (+3 dB to +6 dB) where most hearing instrument wearers would choose to listen to speech in noise, the Moxi Fit Pro detection accuracy was consistently above 80% correct.

Second, beyond the large performance difference between the two platforms, there was an overall effect of SNR for both sets of hearing instruments: as the SNR became more favorable, from left to right on each graph, detection accuracy increased as well. There was also an offset delay effect. Giving the hearing instrument a few more seconds to monitor the speech direction improved accuracy when both platform and SNR were held constant. In other words, ignoring the first five seconds of each 40-second detection cycle and averaging only the last 35 seconds consistently yielded better detection accuracy for the intermediate offset than for the instantaneous measure on both platforms. Waiting for the best offset produced even more accurate detections. This effect was more pronounced with the Quantum2 Pro devices, but obtaining the best offset with the Quantum2 Pro required the investigators to ignore the entire first 17 seconds of the detection samples. Meanwhile, the Moxi Fit Pro converged on the best offset in only 6.2 seconds, nearly three times faster.

Third, the Moxi Fit Pro outperformed the Quantum2 Pro so thoroughly that the detection accuracy of the instantaneous measure for the Moxi Fit Pro at -3 dB SNR (the worst-case scenario) equaled the best offset measurement of the Quantum2 Pro at +6 dB SNR (the best-case scenario). That is a huge performance bump.

So we built a better speech detector for our hearing instruments and we got a big bump in detection accuracy. Having a hearing instrument that can accurately detect the direction of speech over 80% of the time at a positive SNR sounds pretty good. But what does it mean?

While we don’t have human detection accuracy data from this study to compare our results against, we can look to the literature for a study in which azimuth detection was measured in hearing impaired listeners. Then we can decide how excited we should be about these results.

By collapsing the Moxi Fit Pro data into a single table of azimuth directions by SNRs, we can make a reasonable comparison to a table from a study by Keidser et al. (2009).2 See Table 1 below for the Moxi Fit Pro detection data.

Table 1. Percent of correct detections for all measurements at each SNR and for each direction. The Overall column is the percentage of all correct detections by SNR averaged across all four azimuths tested.

The Moxi Fit Pro demonstrated near-perfect accuracy in detecting speech across all of the background noise types when the speech came from the front. However, accuracy drops off gradually as you move across and down Table 1. The overall average accuracy, collapsed across all four directions tested, is highest at the most favorable SNR (88.4% at +6 dB) and lowest at 0 dB SNR. The dip at 0 dB SNR relative to -3 dB SNR appears to be due to an increase in confusions for speech from the back at 0 dB SNR.

These results can be compared to the front/back confusion data from Keidser et al., as shown in Table 2.

Table 2. The average percentage of reversals out of 40 responses produced in the front/back (F/B) dimension.

Keidser et al. examined front/back confusions for the 51 participants in their study. As in most localization studies, listeners had the most difficulty correctly determining whether the test signal came from the front or the back. Front/back confusions are the most common type of localization error, even among people with normal hearing. Left/right confusions are far less common because of the relatively large interaural differences in level, time, frequency and phase produced by the head-related transfer function (HRTF) from one side of the head to the other. Those interaural differences are minimal for target signals arriving directly from the front or back; it is mainly external ear effects, primarily spectral, that contribute to front/back localization3, and those effects are very small relative to the much larger left/right differences.
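
To illustrate why front and back sound so alike, here is a minimal sketch using the textbook spherical-head (Woodworth) approximation of the interaural time difference. The head radius is an assumed typical value, and the model is a simplification rather than a measured HRTF.

```python
import numpy as np

def woodworth_itd_us(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Approximate interaural time difference in microseconds from the
    spherical-head Woodworth model (0 deg = front, 90 deg = side,
    180 deg = back). Head radius is an assumed typical value."""
    theta = np.radians(azimuth_deg % 360.0)
    theta = min(theta, 2.0 * np.pi - theta)              # fold left/right symmetry
    lateral = theta if theta <= np.pi / 2 else np.pi - theta
    return 1e6 * (head_radius_m / speed_of_sound) * (lateral + np.sin(lateral))

for az in (0, 60, 90, 120, 180):
    print(f"{az:3d} deg -> ITD ~ {woodworth_itd_us(az):4.0f} us")
# Front (0 deg) and back (180 deg) both give ~0 us, and 60 deg matches 120 deg:
# the interaural timing cue alone cannot separate front from back.
```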

Table 2, from Keidser et al. (2009), shows the percentage of front/back reversals out of 40 trials in responses from 30 normal hearing (NH) and 21 hearing impaired (HI) individuals, with standard deviations shown in brackets. In that study, participants were presented with five different speech or noise targets from any one of 20 loudspeakers spaced around them in a circle at 20° intervals. The most direct comparison between the human results in that study and the hearing instrument detection accuracy above is the percentage of front/back confusions for the speech signal. The normal hearing participants did quite well, averaging between 1% and 6% incorrect with standard deviations ranging from 1.8% to 8.1% across the test signals. The unaided responses from the hearing impaired group were not as good: their error rates ranged from 33% to 38% incorrect with standard deviations from 8% to 13.2% across test signals.

These unaided participants correctly located speech in the front/back dimension about 67% of the time with no competing noise present. We can compare that to the most similar test conditions for the hearing instrument detectors: the two most favorable SNRs, +3 dB and +6 dB. At +6 dB SNR the hearing instruments correctly detected the location of the speech 99.1% of the time from the front and 85.6% of the time from the back. The results were similar at +3 dB SNR: 100% correct from the front and 82.8% correct from behind. To be fair, we are comparing a speech-in-quiet paradigm to a speech-in-noise paradigm, so the comparison is not ideal. However, it provides a frame of reference for the detection performance of the hearing instrument against the known ability of hearing instrument wearers to locate speech in the most difficult test case (front/back). It is not unreasonable to say that the hearing instrument results are at least comparable to, and perhaps better than, what a hearing impaired person can detect.

We can make a few observations based on the results presented here. It should be clear that shifting from the North platform to the Tempus platform yielded a considerable increase in detection speed and accuracy, and this speed and accuracy continues to improve with every new platform iteration, from North to Tempus and now Discover and Discover Next. The accuracy of the North platform ranged from about 30% to nearly 70% in the most favorable condition, and that included a 17-second delay to improve accuracy. Meanwhile, the Tempus platform results ranged from approximately 70% to nearly 90% accuracy with at most a 6.2-second measurement delay for processing. The Tempus detection accuracy also remains high, at nearly 70%, even at -3 dB SNR with the instantaneous measure. Finally, we have seen that the front/back accuracy of the Tempus detectors is at least comparable to that of a group of hearing impaired listeners and, for speech from the front, much better. Hopefully, this paper demonstrates some of the value of the binaural signal processing that enables the hearing instruments to accurately determine the location of speech, even in a very noisy listening environment.

I would like to acknowledge the contributions of Dr. Ozmeral and Dr. Eddins who worked closely with us to develop the sound parkour and undertake the data collection in their lab at the University of South Florida.

References

1Walden, B.E., et al., Predicting Hearing Aid Microphone Preference in Everyday Listening. Journal of the American Academy of Audiology, 2004. 15: p. 365-396.
2Keidser, G., et al., The effect of frequency-dependent microphone directionality on horizontal localization performance in hearing-aid users. International Journal of Audiology, 2009. 48(11): p. 789-803.
3Van Den Bogaert, T., E. Carette, and J. Wouters, Sound localization with and without hearing aids. 2009.
