

Binaural sound source localization based on weighted template matching


Hong Liu | Yongheng Sun | Ge Yang | Yang Chen

1 Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, Shenzhen, China

2 College of Liangjiang Artificial Intelligence, Chongqing University of Technology, Chongqing, China

3 Yanka Kupala State University of Grodno, Grodno, Belarus

Abstract In robot binaural sound source localization (SSL), locating the direction of the sound source accurately in the shortest time is important. This concerns not only the complexity of the algorithm but, even more, the shortest duration of signal that is required. A novel binaural SSL method based on feature and frequency weighting is proposed. More specifically, in the training stage, the direction-related interaural cross-correlation function (CCF) and interaural intensity difference (IID) in each frequency band are calculated under noiseless conditions and are treated as templates. In the testing stage, the similarities between the CCF and IID of the test signal and the templates are first calculated over all features and frequency bands. Then, the direction likelihood is obtained by weighting these similarities. Finally, the direction with the maximum likelihood is taken as the direction of the sound source. Experiments carried out on CIPIC dataset subject 003 with different noises from the NOISEX-92 dataset demonstrate that the method can accurately locate the sound source with a short signal duration.

1|INTRODUCTION

The robotic auditory system is a natural and friendly way for a robot to interact with the outside world. Compared with the visual system, the auditory system has a 360-degree field of perception, is less affected by obstacles and does not require light. The robotic hearing system covers many tasks such as speech recognition, speaker recognition, emotion recognition and speech noise reduction [1]. Among them, robot sound source localization (SSL), as one of the front-end processing modules of the auditory system, plays an important role in applications such as hearing aids, human-computer interaction and video conferencing [2-5]. Binaural SSL is a branch of SSL that simulates the human auditory system and has an irreplaceable role in humanoid robots. Methods based on the time difference of arrival are an important way to realize SSL. They usually include two steps: the extraction of localization cues and the mapping of localization cues to the direction of the sound source. Commonly used binaural SSL cues include the interaural time difference (ITD), interaural intensity difference (IID) and interaural phase difference (IPD) [6,7]. The ITD is the time interval between the arrival of the sound wave from the source at the left and right ears, the IID is the intensity difference between the sound waves received by the two ears, and the IPD is the frequency-domain counterpart of the ITD. The time-frequency representation of binaural signals can be used to describe how the ITD and IID change with time and frequency [8,9]. The gammatone filter and the short-time Fourier transform (STFT) are two common methods for converting time-domain signals into time-frequency domain signals. The STFT considers only the signal itself, whereas the gammatone filter is designed after the structure of the human cochlea and can well simulate the frequency division characteristics of the basilar membrane of the human ear [10,11].

Under normal circumstances, it is assumed that the sound travels from the source to the microphone along a simple straight-line path. In a reverberant room, however, the signal received by the microphone is the superposition of the direct-path signal and the reflections from objects such as walls, the floor and furniture. As a result, the signals at the microphones no longer obey an idealized time-delay relationship, which makes the time difference between the microphones difficult to estimate. In addition, environmental noise is a problem that no SSL algorithm can avoid. When the environmental noise is strong, or the spectral characteristics of the noise are close to those of the source signal, the time difference of the target signal between the binaural microphones is often overwhelmed [12]. Therefore, many works are devoted to extracting robust binaural localization cues. For example, Zhang et al. used the consistency of the binaural signal to determine the reliability of the signal and exclude unreliable frames [13,14]. To deal with errors caused by reverberation, the signal can be dereverberated while preserving binaural cues [15]. Pang et al. [16,17] proposed a method based on reverberation weighting and noise error estimation: reverberation and noise are first removed from the received signal, and the binaural localization cues are then extracted. To deal with the influence of noise on binaural cues, Pak et al. used a deep neural network to extract clean binaural cues [18].

After the extraction of binaural cues, many methods are dedicated to linking the cues with the corresponding sound source direction. May et al. [19] described the joint distribution of ITD and IID using direction-dependent Gaussian mixture models and merged all frequency bands using the log-likelihood. Liu et al. proposed a two-layer Bayesian model to locate the sound source: the ITD is used to determine the rough direction of the sound, and the time-compensated IID is used to refine the localization. Ma et al. used multi-conditionally trained deep neural networks to learn the mapping from the interaural cross-correlation function (CCF) to the directions [20]. Karthik et al. proposed a subband weighting method [21] in which the weight of each frequency band is obtained according to its reliability, but only the ITD was considered. In Ma et al. [22], the spectrum of the background noise is taken into account for more precise localization. In Ma et al. [23], to deal with the limitations imposed by the angle of the sound source, a rotatable head model was proposed to extract robust binaural cues from signals corrupted by noise and reverberation and to learn the mapping from the cues to the direction of the sound. However, under noisy and reverberant conditions it is still difficult for current SSL methods to achieve high accuracy, especially when the signal duration is short.

A novel and simple template matching-based method is proposed, which requires only one frame of data for each test. The binaural CCF and IID features are used in our method. The method includes two parts: training and testing. In the training stage, the noiseless binaural signal is first divided into frequency bands by a gammatone filter. Then, the interaural CCF and IID are calculated from the frequency-divided binaural signal and used as the templates for each direction. The gammatone filter used here has 32 bands; therefore, in every direction there are 32 templates, and every template consists of the IID and the CCF. In the testing stage, the signal is likewise first divided by the gammatone filter, and the CCF and IID are calculated in each frequency band. Next, the directional similarities between the testing signal and the templates are calculated: the similarity between CCFs is defined as the cosine similarity, and the similarity between IIDs is defined as their ratio, inverted when it exceeds 1. Finally, the similarities are weighted across features and frequency bands, and the direction with the maximum similarity is taken as the direction of the sound source.

2|WEIGHTED TEMPLATE MATCHING

It is assumed that the signal from the sound source to the binaural microphones obeys a simple linear transmission relationship. The received binaural signal is formally modelled as the convolution of the head-related impulse response (HRIR) with the signal emitted from the sound source, plus a noise signal:

$$x_i(n) = s(n) * h_i(n) + v_i(n), \quad i \in \{l, r\}$$

where n represents the sampling point, s is the signal emitted from the sound source, h denotes the HRIR, v is the noise, x is the received signal, the symbol * represents the convolution operator and i ∈ {l, r} indexes the left and right microphones.
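As a rough illustration of this signal model, the sketch below convolves a source signal with left/right HRIRs and adds noise at a target SNR. The function name, the SNR scaling rule and the truncation to the source length are assumptions used for illustration, not details taken from the paper.

```python
import numpy as np

def simulate_binaural(source, hrir_left, hrir_right, noise_left, noise_right, snr_db=20.0):
    """Sketch of the signal model x_i = s * h_i + v_i with the noise scaled to a target SNR.
    All argument names and the SNR convention are illustrative assumptions."""
    x_l = np.convolve(source, hrir_left)[:len(source)]
    x_r = np.convolve(source, hrir_right)[:len(source)]

    def add_noise(x, v):
        v = v[:len(x)]
        # Scale the noise so that 10*log10(Px / (scale^2 * Pv)) equals snr_db
        scale = np.sqrt(np.sum(x ** 2) / (np.sum(v ** 2) * 10 ** (snr_db / 10) + 1e-12))
        return x + scale * v

    return add_noise(x_l, noise_left), add_noise(x_r, noise_right)
```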

Figure 1 shows the overall flowchart of the proposed method. In the training stage, the signal is first frequency-divided by a 32-band gammatone filter with a minimum center frequency of 80 Hz and a maximum center frequency of 7200 Hz. Then, the CCF and IID are calculated in each frequency band. The features calculated from the noise-free signal are used as the templates. In the testing stage, the signal is likewise divided by the gammatone filter and the CCF and IID features are extracted. Then, the similarities between the CCF and IID of the received signal and the templates in all directions are calculated; the calculation of the feature similarity is introduced in detail later. Finally, the weighted average of the similarities over features and frequency bands is taken as the directional likelihood, and the direction with the maximum likelihood is taken as the direction of the sound source; the calculation of the weights is also described in detail later.

2.1|Direction-related templates

Commonly used binaural localization cues include the ITD and IID, which are related to the signal frequency [24]; thus, we first use a gammatone filter to divide the signal into frequency bands. Then, feature extraction and template calculation are performed on the frequency-divided signal.

2.1.1|Gammatone filter

FIGURE 1 Flowchart of proposed weighted template matching. CCF, cross-correlation function; IID, interaural intensity difference

The gammatone function is a band-pass filter whose maximum amplitude appears at the center frequency. Different center frequencies have different bandwidths, and both sides of the pass band have steep edges, indicating that the gammatone filter has sharp frequency selection characteristics. The gammatone filter is designed according to the way the human cochlea processes sound signals, so it can well simulate the frequency division effect of the cochlea on signals. Its time-domain impulse response is:

$$g(t) = A\, t^{\,m-1} e^{-2\pi b t} \cos(2\pi f t), \quad t \ge 0$$

where m is the order of the filter, f is the subband center frequency of the filter, b is the bandwidth of the filter and A is the amplitude. In our experiments, a fourth-order, 32-band filter is used.
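A minimal sketch of such a filterbank is given below. The ERB-based bandwidth rule, the ERB-scale spacing of the center frequencies and the helper names are assumptions used for illustration; the paper only specifies a fourth-order, 32-band filter with center frequencies from 80 Hz to 7200 Hz.

```python
import numpy as np

def gammatone_ir(fc, fs, m=4, duration=0.03, amplitude=1.0):
    """Impulse response A * t^(m-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t) of one gammatone band.
    The bandwidth b follows the common ERB rule of thumb (an assumption, not from the paper)."""
    t = np.arange(0, duration, 1.0 / fs)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # equivalent rectangular bandwidth (Glasberg & Moore)
    b = 1.019 * erb
    return amplitude * t ** (m - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gammatone_filterbank(x, fs, num_bands=32, f_min=80.0, f_max=7200.0):
    """Filter signal x into num_bands channels with centre frequencies spaced on the ERB scale."""
    erb_lo = 21.4 * np.log10(4.37e-3 * f_min + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * f_max + 1.0)
    centres = (10 ** (np.linspace(erb_lo, erb_hi, num_bands) / 21.4) - 1.0) / 4.37e-3
    # Each row of the output is the input filtered by one gammatone band
    return np.stack([np.convolve(x, gammatone_ir(fc, fs), mode="same") for fc in centres])
```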

2.1.2|Cross-correlation function

The ITD is caused by the difference in distance between the sound source and the two microphones, and the CCF is a commonly used tool for estimating this time difference. The signal received by the left ear can be regarded as a delayed or advanced version of the signal received by the right ear; thus, the maximum of the binaural CCF corresponds to the ITD. However, owing to the interference of noise and reverberation, this correspondence is often disturbed, so the position of the CCF maximum does not necessarily fall at the true ITD. Nevertheless, the value of the CCF at the true ITD position is usually still relatively large, as shown in Figure 2. The sound source comes from -80 degrees and the true ITD corresponds to the x-coordinate -16; however, the CCF value at x-coordinate -10 is the largest. If the ITD were used as the localization feature, only the position of the CCF maximum (that is, only the number -10) would be retained, which is wrong in Figure 2. In addition, the CCF is defined for signals of infinite length, which cannot be achieved in practice. To deal with the influence of the signal length, we use the normalized CCF.

The normalized CCF is calculated as:

$$R_{l,r}(k,\tau) = \frac{G_{l,r}(k,\tau)}{\sqrt{G_{l,l}(k,0)\,G_{r,r}(k,0)}}$$

where:

$$G_{i,j}(k,\tau) = \sum_{n} x_i(k,n)\, x_j(k,n+\tau)$$

where k denotes the index of the frequency band, x_i(k,n) is the signal in band k after gammatone filtering, τ represents the time delay and G_{i,j} is the CCF.
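A minimal sketch of this computation for one frequency band follows; the lag range (here ±18 samples) and the small epsilon in the denominator are illustrative choices, not values from the paper.

```python
import numpy as np

def normalized_ccf(xl, xr, max_lag=18):
    """Normalized cross-correlation G_lr(tau) / sqrt(G_ll(0) * G_rr(0)) for lags -max_lag..max_lag.
    xl, xr: one gammatone band of the left/right frame (1-D arrays of equal length)."""
    norm = np.sqrt(np.sum(xl ** 2) * np.sum(xr ** 2)) + 1e-12   # avoid division by zero on silent bands
    lags = np.arange(-max_lag, max_lag + 1)
    ccf = np.empty(len(lags))
    for i, tau in enumerate(lags):
        if tau >= 0:
            ccf[i] = np.dot(xl[:len(xl) - tau], xr[tau:])       # sum_n xl[n] * xr[n + tau]
        else:
            ccf[i] = np.dot(xl[-tau:], xr[:len(xr) + tau])
    return ccf / norm
```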

Figure 3a shows the CCF of a noiseless binaural signal from the direction of -5 degrees. The yellow parts represent greater correlation. The x-coordinate positions of the maximum correlation value differ slightly between frequency bands; in other words, the brightest points in each line do not lie on the same vertical line, and this is caused not by errors but by the frequency. In Figure 3a, at low frequencies the peak of the CCF is narrow, which is helpful for extracting the binaural time difference, but the peaks repeat periodically, so the extracted time difference may correspond to the wrong CCF peak. At high frequencies there are fewer peaks, but the peak regions are wide, making the extraction of binaural ITD cues inaccurate. Therefore, the entire CCF is retained and used as the binaural cue.

2.1.3|Interaural intensity difference

The IID is caused by the head shadow effect at the microphone farther from the sound source. The size of the artificial head is generally larger than the wavelength of the sound, so the head scatters the incoming signal. After scattering, the signal loses energy to a certain extent, and the energy loss is related to the sound frequency. Equation (5) is used to calculate the IID in each frequency band:

$$\mathrm{IID}(k) = \frac{\sum_{n} x_l^2(k,n)}{\sum_{n} x_r^2(k,n)}$$

For simplicity, the logarithm is not used. Figure 3b shows the IID of a noiseless binaural signal from -5 degrees. The IIDs in different frequency bands are different, but because the sound source is on the left side of the head, the IID is greater than 1. In the high-frequency part, the IID is close to 1, which indicates that the energy difference between the two microphones is small. This is because the signal has little energy in this frequency range, so noise has a greater impact on the signal.
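As a small sketch, the per-band IID can be computed as a ratio of band energies; the epsilon guard and the array shapes are illustrative assumptions.

```python
import numpy as np

def band_iid(bands_l, bands_r):
    """Per-band interaural intensity difference: left-channel energy over right-channel energy.
    bands_l, bands_r: arrays of shape (num_bands, num_samples) from the gammatone filterbank."""
    energy_l = np.sum(bands_l ** 2, axis=1)
    energy_r = np.sum(bands_r ** 2, axis=1)
    return (energy_l + 1e-12) / (energy_r + 1e-12)   # epsilon avoids division by zero in silent bands
```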

2.2|Feature similarity

FIGURE 2 The cross-correlation function in direction -80 degrees for the eighth channel of the gammatone filter, with a center frequency of 513 Hz

FIGURE 3 Cross-correlation function and interaural intensity difference in direction -5 degrees. (a) Time delay and (b) frequency band

For signals from the same direction, the CCFs have similar waveforms and therefore a high similarity. Thus, we directly take the cosine similarity between the CCF of the test signal and the template in direction θ as the similarity of the sound in θ. A greater similarity indicates that the sound is more likely to have been emitted from this direction. The similarity of the CCF is calculated as:

$$\mathrm{sim}_{ccf}(\theta,k) = \frac{\sum_{\tau} R_{l,r}(k,\tau)\, R_{temp}(\theta,k,\tau)}{\sqrt{\sum_{\tau} R_{l,r}^2(k,\tau)}\,\sqrt{\sum_{\tau} R_{temp}^2(\theta,k,\tau)}}$$

where θ denotes the direction, k represents the index of the frequency band, R_temp is the CCF of the template, τ is the time delay and l, r denote the left and right microphones.

For the similarity of the IID, we use the ratio between the IID of the test signal and the IID template of each direction. We cannot know in advance whether the IID of the test signal is greater than the IID of the corresponding direction template, but we must ensure that, in theory, the similarity in the true direction is the largest. It can be assumed that two signals from the same direction have the closest interaural intensity difference, that is, their ratio is closest to 1. Thus, if the ratio is larger than 1, its inverse is taken:

$$\mathrm{sim}_{iid}(\theta,k) = \min\!\left(\frac{\mathrm{IID}(k)}{\mathrm{IID}_{temp}(\theta,k)},\ \frac{\mathrm{IID}_{temp}(\theta,k)}{\mathrm{IID}(k)}\right)$$

where IID_temp is the IID of the template and min denotes taking the smaller of the two values.
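Both similarity measures can be sketched as below, operating on the per-band CCF vectors and IIDs described in the previous subsections; the function names are illustrative.

```python
import numpy as np

def ccf_similarity(ccf, ccf_template):
    """Cosine similarity between the observed CCF and the template CCF of one frequency band."""
    denom = np.linalg.norm(ccf) * np.linalg.norm(ccf_template) + 1e-12
    return float(np.dot(ccf, ccf_template)) / denom

def iid_similarity(iid, iid_template):
    """Ratio of the observed IID to the template IID, inverted if it exceeds 1,
    so the value peaks at 1 when the two intensity differences match."""
    ratio = iid / iid_template
    return float(min(ratio, 1.0 / ratio))
```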

Figure 4 shows the directional similarity of signals from -80 degrees, that is, the first column. Each column indicates the likelihood that the source lies in that direction, and the yellow parts indicate that the source is more likely to lie in that direction. The upper 32 rows represent the directional likelihood from the CCF feature, whereas the lower 32 rows represent the directional likelihood from the IID feature. The first column, corresponding to the true sound direction, has the largest value. The figure shows that the similarity of the CCF feature decreases quickly on both sides of the true sound source direction, but an additional peak may appear. Therefore, using only the CCF feature, the SSL result is either accurate or deviates considerably from the true azimuth. The IID feature similarity has only one peak, but it decreases slowly from the direction of the true sound source to both sides. Combining these two kinds of directional similarity can alleviate the multiple-peak problem of the CCF feature similarity and the peak-width problem of the IID feature similarity.

2.3|Frequency and feature weights

The frequency range of human speech is usually 300-3400 Hz, but the sound energy is not evenly distributed, so the reliability of different frequencies for SSL is unequal. From the previous discussion, we can also see that different localization cues have different localization accuracies. Therefore, we propose a method for weighting the different features and different frequencies.

First, the features are weighted:

$$\mathrm{sim}(\theta) = \alpha\,\overline{\mathrm{sim}}_{ccf}(\theta) + (1-\alpha)\,\overline{\mathrm{sim}}_{iid}(\theta)$$

where:

$$\overline{\mathrm{sim}}_{f}(\theta) = \frac{1}{N}\sum_{k=1}^{N}\mathrm{sim}_{f}(\theta,k), \quad f \in \{ccf,\, iid\}$$

FIGURE 4 Directional likelihood in different features and frequency bands with signal from -80 degrees

where α is a hyperparameter, N represents the number of frequency bands and k denotes the index of the frequency band.

Then, the frequency bands are weighted:

$$\mathrm{sim}(\theta) = \sum_{k=1}^{N}\alpha_k\left[\mathrm{sim}_{ccf}(\theta,k) + \mathrm{sim}_{iid}(\theta,k)\right]$$

Finally, features and frequencies are weighted jointly:

$$\mathrm{sim}(\theta) = \sum_{k=1}^{N}\left[\alpha_{ccf,k}\,\mathrm{sim}_{ccf}(\theta,k) + \alpha_{iid,k}\,\mathrm{sim}_{iid}(\theta,k)\right]$$

where α_k denotes the weight of the kth frequency band, α_ccf,k denotes the weight of the CCF feature similarity in the kth frequency band and α_iid,k denotes the weight of the IID feature similarity in the kth frequency band.

For the hyperparameter in Equation (8), we use the grid search method to find the optimal value. For signals from direction θ, we define the label = [0, 0, …, 1, 0, …, 0], in which the position of the 1 corresponds to θ. For a signal, after these processing steps, a similarity vector is obtained from Equations (13) and (14). We define the loss function as the squared loss between the label and the similarity vector after a softmax layer:

$$L = \left\lVert \mathrm{label} - \mathrm{softmax}(\mathrm{sim}) \right\rVert^2$$

and the gradient descent method is used to adjust the weights.
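As a rough sketch of how the weighted combination and this training loss fit together: the per-band similarities are combined with softmax-normalized weights, and the loss is the squared error between the one-hot label and the softmaxed direction scores. The joint softmax over both features' logits, the names and the shapes are assumptions; the gradient descent loop over the weight logits is omitted.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def direction_scores(sim_ccf, sim_iid, w_ccf_logits, w_iid_logits):
    """sim_ccf, sim_iid: arrays of shape (num_directions, num_bands) of per-band similarities.
    The weight logits are softmaxed jointly over both features and all bands so that the
    weights stay positive (the exact normalization is an assumption)."""
    w = softmax(np.concatenate([w_ccf_logits, w_iid_logits]))  # shape (2 * num_bands,)
    w_ccf, w_iid = np.split(w, 2)
    return sim_ccf @ w_ccf + sim_iid @ w_iid                   # shape (num_directions,)

def square_loss(scores, true_index):
    """Squared error between the one-hot direction label and the softmaxed score vector."""
    label = np.zeros_like(scores)
    label[true_index] = 1.0
    return float(np.sum((softmax(scores) - label) ** 2))

# The estimated direction is the one with the largest weighted score, e.g.:
# theta_hat = directions[np.argmax(direction_scores(sim_ccf, sim_iid, w_ccf_logits, w_iid_logits))]
```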

3|EXPERIMENTS AND DISCUSSION

3.1|Experimental setup

To evaluate our proposed method, the HRIR of subject 003 in the CIPIC database [25] is used, and clean speech signals are taken from the TIMIT database [26]. The CIPIC dataset contains HRIRs for 25 directions ranging from -80 degrees to 80 degrees. The front of the head is 0 degrees, and the 25 directions in the CIPIC database from left to right are -80, -65 and -55 degrees, then -45 to 45 degrees in steps of 5 degrees, and 55, 65 and 80 degrees. To evaluate the anti-noise capability of the proposed method, different signal-to-noise ratios (SNRs) and different types of noise are added to the binaural signals. In our experiments, the noise includes white, babble, buccaneer1 and other noises from the NOISEX-92 database [27], with SNRs from -10 to 35 dB in steps of 5 dB. The signal sampling rate is 16 kHz and each frame is 32 ms. We assume the maximum of the ITD is less than 0.11 s; therefore, τ in Equation (3) ranges from -0.11 to 0.11 s. Unless specifically stated, the accuracy listed is the single-frame accuracy with white noise.

TABLE 1 Localization accuracy with different features and weights

FIGURE 5 Localization accuracy with different α for sim_ccf and sim_iid

3.2|Feature weights

Columns 2 and 3 of Table 1 show the localization accuracy obtained using only the CCF similarity or only the IID similarity, without weighting. The CCF feature performs better than the IID feature at all SNRs. The CCF and IID features are extracted from two different viewpoints: one from the perspective of time and the other from the perspective of energy. Therefore, it can be expected that better results can be obtained by appropriately weighting the CCF and IID features. We use the grid method to find the best parameter α in Equation (8); α is varied from 0 to 1 in steps of 0.01. The localization accuracy as a function of α is shown in Figure 5, where the SNR changes from -10 to 35 dB in steps of 5 dB and the asterisks mark the optimal α at each SNR. The localization accuracy first increases with α and then decreases. The best parameter value at all SNRs lies between 0.7 and 0.86. The thick line represents the average localization accuracy over all SNRs, and the average optimal α is 0.77, with an accuracy of 63.2%.
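A minimal sketch of this grid search is given below; `evaluate_accuracy` is a hypothetical callback that returns the localization accuracy obtained with a given α.

```python
import numpy as np

def grid_search_alpha(evaluate_accuracy, step=0.01):
    """Try alpha in [0, 1] with the given step and return the value with the best accuracy,
    where sim = alpha * sim_ccf + (1 - alpha) * sim_iid (feature weighting only)."""
    best_alpha, best_acc = 0.0, -np.inf
    for alpha in np.arange(0.0, 1.0 + step, step):
        acc = evaluate_accuracy(alpha)   # hypothetical callback: accuracy over the evaluation set
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc
```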

FIGURE 6 Weights in different frequency bands

3.3|Frequency weights

Considering that the localization reliability of different frequencies differs, the frequency bands are weighted. To prevent the weights from being negative, the softmax function is used, and the gradient descent method is used to find the optimal parameters. The trained weights are shown in Figure 6. Column 5 of Table 1 shows the localization accuracy with frequency weighting. Compared with the feature weighting method, the accuracy is higher when the SNR is low and lower when the SNR is high.

3.4|Feature and frequency weights

Considering that the localization reliabilities of different frequencies and features differ, all features and frequency bands are weighted. To prevent the weights from being negative, the softmax function is used, and the gradient descent method is used to find the optimal parameters. After training, the weights of relatively reliable frequencies and features are larger and the weights of unreliable ones are smaller, as shown in Figure 9. In this figure, the weights of the CCF are relatively large, indicating that the localization capability of the CCF features is better, and the weights of the CCF and IID are strongly correlated across frequency bands. In addition, the localization contribution of the low and high frequencies is not as good as that of the mid frequencies; the best frequency for localization is about 700 Hz. From Figures 6 and 9, the three sets of weights are highly similar: they all have larger weights in the middle frequencies and smaller weights in the high and low frequencies, which is consistent with the accuracy of the single-frequency bands.

FIGURE 7 Localization accuracy in different directions

FIGURE 8 Spectrograms of white, Volvo, leopard and machine gun noise

FIGURE 9 Weights in different features and frequency bands

The accuracies calculated using feature and frequency weights are shown in columns 4 to 6 of Table 1. The fourth column is the case in which the frequency bands are not weighted; only the features are weighted, with the best parameter selected. The fifth column is the case in which the features are not weighted; only the frequency bands are weighted, with the best parameters selected. The sixth column is the case in which both the features and the frequency bands are weighted, which gives the highest accuracy. The table shows that both feature weighting and frequency band weighting have a great effect on SSL.

Figure 7 shows the localization accuracy for different sound source directions with white noise. The left panel shows the accuracy with 0 degrees of error and the right panel shows the accuracy with 5 degrees of error allowed. The thick line represents the average accuracy over all SNRs. The accuracy decreases when the sound source moves from the front of the artificial head to either side, but the effect is not obvious. From the left side to the right side of the artificial head, the SSL accuracy fluctuates up and down, with only a slight rise and then a fall. On the one hand, this is related to the fact that the artificial head used is not completely symmetrical; on the other hand, it is related to the templates used.

Figure 10 shows the localization accuracy with different types of noise. The method performs worst against white noise, because white noise interferes with all frequency bands. The three kinds of noise with the highest localization accuracy are Volvo, machine gun and leopard, because their energy is mainly at low frequencies and machine gun noise is discontinuous in time. Figure 8 shows the spectrograms of white noise and these three types of noise; darker colors indicate greater intensity. White noise has uniform energy in all frequency bands, Volvo noise mainly affects signals below 200 Hz, leopard noise mainly affects signals below 500 Hz and machine gun noise is intermittent in time, so it has a great impact only at certain times.

3.5|Comparison with other algorithms

We compare our proposed method with that of Karthik et al. [21], who proposed a weighted-subband method for binaural SSL. They first used the gammatone filter to divide the sound signal and then extracted the ITD feature; finally, they used a Gaussian mixture model and frequency band confidences to fuse the different frequency bands. Table 2 shows the results of our method and the method of Karthik et al. [21]. The localization accuracy of our method exceeds that of the comparison method, especially under low SNR conditions.

FIGURE 10 Localization accuracy with different types of noise with different signal-to-noise ratios

TABLE 2 Localization accuracy compared with Karthik et al.[21]

4|CONCLUSIONS

A novel feature and frequency band weighted template-matching method is presented. Because the localization cues are related to the frequency band, it is necessary first to divide the signal into frequency bands. In addition, different localization features and different frequency bands contribute differently to SSL, so weighting is necessary. The direction likelihood is obtained as the weighted sum of the interaural CCF similarity and the IID similarity over all frequency bands, and the direction of the sound source is determined as the direction with the maximum likelihood. Through a series of experiments with different types of noise and different SNRs, the necessity and effectiveness of feature and frequency band weighting, and the effectiveness of the proposed method, are verified. In the future, we will explore how to use the relationships between different frequency bands to extract more robust binaural localization cues.

ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China (Nos. 61673030 and U1613209) and the National Natural Science Foundation of Shenzhen (No. JCYJ20190808182209321).

ORCID

Yongheng Sun https://orcid.org/0000-0002-0103-0215
