Low-Cost Speaker and Language Recognition Systems Running on a Raspberry Pi

Abstract

This paper describes two state-of-the-art, portable voice-based systems: a speaker authentication system and a language recognition system. The authentication system provides secure access to a home media center, while the language recognition system can be used as a first step toward automatically transcribing speech and translating the recognized text from its original language into another. The most important advantage of the developed systems is that they run on a low-cost embedded device, a Raspberry Pi (RPi), using only open-source projects, which makes them easy to replicate or to integrate into other systems, and also allows their use in educational electronics projects. Both systems have been tested on real data with very good results. The authentication system completes validation in 3.3 seconds on average, with an Equal Error Rate (EER) of 19% on 20-second test files, evaluated with up to 87 different speakers. The language recognition system can recognize up to six languages. For this system, considerable effort was devoted to reducing processing time and memory requirements while keeping the recognition rate high. The final system uses 64 Gaussians and 200-dimensional i-vectors, obtaining an average cost error rate (Cavg) of 8.6% over the six languages.

Keywords: Speaker recognition, language recognition, i-vectors, embedded devices, open-source tools.

I. INTRODUCTION

This article describes the implementation of two voice-based recognition systems. The first is a speaker recognition system that controls access to multimedia content in the home (such as movies, music, or photos) from different kinds of mobile devices (such as tablets, smartphones, or remote controls). Today, most speaker recognition systems are used as a security mechanism for remote authentication [1]. Our system takes a different approach: it is motivated by the growing number of devices capable of hosting, displaying, or recording all kinds of multimedia content, and by the efforts of several companies to provide multimodal applications and devices for accessing that content. Examples of such applications include Windows Media Center, marketed by Microsoft, or the AT&T U-verse Easy Remote. Despite their wide range of features, these applications generally lack an authentication mechanism that prevents other users from accessing a user's private content.

The second system automatically identifies the language spoken by a person. Such systems are widely applicable today as a preliminary step in more complex automated tasks. For example, a speech recognition system that transcribes the words spoken by a person needs to know in advance which language that person will speak in order to load the appropriate acoustic and language models. Another example is the automated call center: if a caller speaks English, the system must recognize that language and transfer the call to an English-speaking operator. Other applications include automated information kiosks [2] and speech-to-speech translation systems [3].
Finally, both systems were designed and programmed to run on a low-cost embedded device, a Raspberry Pi. To achieve this, a major effort was made to keep processing time and memory usage very low so that the systems can be used in practice. At the same time, we sought a high recognition rate, so the most robust and advanced recognition algorithms used in such systems were investigated and implemented. Both systems were implemented using code freely available on the web, in order to obtain an economically viable solution that can be continuously improved and easily replicated. Finally, and as an encouragement to higher-education institutions in Latin American countries, it is worth noting that both projects were developed as part of an initiative to integrate research and innovation into undergraduate courses, in this case with third-year students of the Digital Systems II and Telecommunication Services and Technologies courses of the engineering degree taught at the Universidad Politécnica de Madrid. Additional references and demonstration videos are available at http://sdii.die.upm.es and http://lsed.die.upm.es.

This article is organized as follows. Sections II and III describe each of the proposed systems in detail, starting with the speaker identification system and continuing with the language recognition system, specifying in each case the algorithms, the hardware, and the software configuration used. Section IV presents results and measurements for both systems. Finally, Section V presents the general conclusions and future work for each subsystem.

II. SID: SYSTEM DESCRIPTION

Fig. 1 shows the modules of the architecture for voice authentication and access to multimedia content. As can be seen, the architecture is distributed: users access the service through a single server (the Raspberry Pi), which in turn can connect to multiple media centers, each storing different content and possibly located in different places. The server also lets users access the system from different kinds of client devices, such as tablets, PCs, or smartphones. An additional advantage of this architecture is that each module is independent of the others, so a developer can replace any of them without affecting the proper functioning of the rest of the system. Finally, all modules were implemented with freely distributed applications, so their functionality can be extended continuously.

Figure 1. Architecture of the modules that make up the multimedia content access system.

Each module is explained in detail below, with emphasis on the speaker identification system.

A. Raspberry Pi (RPi)

The Raspberry Pi (http://www.raspberrypi.org/) is a low-cost computer developed by the Raspberry Pi Foundation and marketed with great success since the beginning of 2012. Its success is largely due to its small size, low power consumption, availability of interfaces to a wide variety of peripherals, and the fact that it runs the Linux operating system, which makes it well suited to embedded applications.
A Raspberry Pi Model B with 512 MB of RAM was used for this system. It includes an ARM1176JZF-S processor clocked at 700 MHz, which in this case was reconfigured with the "modest" overclocking profile that allows operation at 800 MHz. The purpose of this configuration was to make the media center software run more smoothly. The Raspbmc distribution (http://www.raspbmc.com/) was used as the operating system (although tests with the Raspbian distribution were also carried out without any problems). An 8 GB Class 10 SD card was used to hold the operating system, the programs, and the multimedia content. Regarding the RAM used by each module, tests showed that the media center used about 70 MB while playing a 1080p H.264 movie (http://www.bigbuckbunny.org/) with hardware decoding, the authentication system with all its modules and matrices used up to 6 MB, and the server used 1 MB. This confirmed that the available 512 MB were sufficient, so in the final system the memory was partitioned to give 256 MB to the GPU and leave the rest to the CPU.

B. Media center

XBMC (http://xbmc.org/) is a free, cross-platform video-decoding application developed by the XBMC Foundation. It allows users to play videos, music, podcasts, and other multimedia files from a local or remote server. It also includes a profile manager and an event server that accepts profile-loading requests via UDP packets; the identification system uses this mechanism to request loading the profile of the authenticated user and thus give access to that user's personal multimedia playlists. When the application starts, it automatically loads a public profile, which is used until authentication is performed.

C. Client for Android

This module allows users to control the media center through a graphical user interface (GUI). The interface is based on the Android client included in the XBMC project, which supports basic remote-control commands (e.g., play, stop, record, list browsing, etc.). The interface also lets the user specify the connection parameters of the server (i.e., IP address and port). The original interface was slightly modified to include a set of buttons and text boxes that allow users to record their voice and send the audio to the server (RPi) for authentication. In addition, in order to reduce data transmission and provide a responsive interface, a speech detection mechanism based on frame energy was added (a minimal sketch of this idea is shown at the end of this subsection). The detection uses heuristic rules that guarantee a minimum amount of speech and avoid sending recordings of spurious noise to the server. Finally, after sending the voice packets to the server, the client waits for the authentication result. If the result is positive, the client displays a welcome message while the XBMC interface loads the profile of the identified user and gives access to the private content. If the result is negative, i.e., the score is below a predefined configurable threshold, the client displays a rejection message while XBMC keeps the default profile loaded. In this way, the user still has access to the public content and can retry authentication. If authentication fails three times, the voice authentication interface is disabled and the user is prompted to enter a written password. If the password is also incorrect, the system suggests that the user improve his or her voice model by adding new recordings, and the profile is blocked for the next 30 minutes.
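The paper does not include source code, so the following is only a minimal sketch of the kind of frame-energy speech check described above. It is not the authors' Android implementation (which runs on the client device); the sampling rate matches the one used later in the paper, but the threshold and minimum-duration values are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of an energy-based speech detector.
# Assumptions: 16-bit mono PCM; the threshold and duration values below are
# illustrative placeholders, not the values used in the paper.
import numpy as np

FRAME_MS = 25                  # analysis window length (ms)
SHIFT_MS = 10                  # window shift (ms)
ENERGY_DB_THRESHOLD = -35.0    # hypothetical per-frame log-energy threshold (dBFS)
MIN_SPEECH_FRAMES = 50         # require roughly 0.5 s of speech before sending audio

def frame_log_energies(samples, sample_rate=8000):
    """Split a 1-D int16 signal into frames and return per-frame log energy in dBFS."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    shift = int(sample_rate * SHIFT_MS / 1000)
    x = samples.astype(np.float64) / 32768.0
    energies = []
    for start in range(0, len(x) - frame_len + 1, shift):
        frame = x[start:start + frame_len]
        energy = np.mean(frame ** 2) + 1e-12          # avoid log(0)
        energies.append(10.0 * np.log10(energy))
    return np.array(energies)

def contains_enough_speech(samples, sample_rate=8000):
    """Heuristic rule: enough high-energy frames -> worth sending to the server."""
    energies = frame_log_energies(samples, sample_rate)
    speech_frames = int(np.sum(energies > ENERGY_DB_THRESHOLD))
    return speech_frames >= MIN_SPEECH_FRAMES

if __name__ == "__main__":
    # Example: about 3 s of weak noise should be rejected rather than transmitted.
    noise = (np.random.randn(8000 * 3) * 50).astype(np.int16)
    print(contains_enough_speech(noise))   # expected: False
```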
D. Authentication Server

This module receives the voice packets sent by the Android client, stores the received audio in a predefined folder, and runs the speaker identification process. Once identification is done, the server returns the result to the Android client and waits; in the case of a positive identification, the client requests loading the corresponding profile, a request that the server forwards to XBMC.

E. Speaker Identification System

This module performs text-independent speaker identification. Although text-dependent identification would yield better results, text independence was preferred in order to avoid the use of spoken passwords and to reduce the chances of cheating the system. The identification steps are the following:

Parameterization: After receiving the audio samples captured by the Android client, this module extracts the relevant acoustic information using the SPro tool (http://spro.gforge.inria.fr/). In this case, 12 MFCCs (Mel-Frequency Cepstral Coefficients), the logarithm of the energy, and the dynamic parameters (delta and delta-delta) [4] are extracted, computed with a 25 ms Hamming window and a 10 ms shift. These parameters were chosen because they are the most commonly used for this kind of application. A future development is to also incorporate prosodic features, given the speaker information they contribute [5].

Voice/non-voice detection: This module keeps only the speech frames among those sent by the Android client. It complements the voice detection performed on the Android client, since it uses a more precise detection algorithm. The goal is to improve recognition rates by using only speech information and to reduce processing time. The module is implemented with the ALIZE/LIA_RAL toolkit (http://alize.univ-avignon.fr/) [6]. The algorithm fits a GMM with two or three Gaussians to the log-energy values of the frames; after an iterative training process, the higher-energy frames end up assigned to one Gaussian, and during evaluation any frame with high likelihood under that Gaussian is labeled as speech. Smoothing rules are then applied to avoid labeling short noisy segments as speech. Given the importance of this step, a proposed future line is to integrate other detectors that are more robust to noise, such as the one proposed in [7], which is available as open source (http://cs.uef.fi/pages/tkinnu/VQVAD/VQVAD.zip).
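The energy-based GMM detector of ALIZE/LIA_RAL is not reproduced here; purely as an illustration of the idea just described, the sketch below fits a two-Gaussian mixture to per-frame log-energies with scikit-learn, labels frames assigned to the higher-energy component as speech, and applies a simple smoothing pass. The library choice and the smoothing rule are assumptions, not the toolkit's implementation.

```python
# Illustrative sketch (not the ALIZE/LIA_RAL code) of the energy-based GMM
# voice/non-voice detector: fit 2 Gaussians to frame log-energies, keep frames
# most likely generated by the high-energy Gaussian, then smooth the decision.
import numpy as np
from sklearn.mixture import GaussianMixture

def energy_vad(log_energies, min_run=5):
    """Return a boolean speech mask for a 1-D array of per-frame log-energies."""
    gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
    gmm.fit(log_energies.reshape(-1, 1))
    speech_component = int(np.argmax(gmm.means_.ravel()))   # higher-energy Gaussian
    mask = gmm.predict(log_energies.reshape(-1, 1)) == speech_component

    # Simple smoothing: drop isolated speech runs shorter than `min_run` frames
    # (a placeholder for the heuristic smoothing rules mentioned in the text).
    smoothed = mask.copy()
    start = None
    for i, is_speech in enumerate(np.append(mask, False)):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start < min_run:
                smoothed[start:i] = False
            start = None
    return smoothed
```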
i-vector extraction: Described in [8], i-vectors are a low-dimensional representation of the acoustic characteristics of the voice that jointly model the variability of the speaker's voice over time, channel variations, and within-session voice variations. Thanks to their robustness, the good results obtained in international evaluations such as those organized by NIST (http://www.itl.nist.gov/iad/mig/tests/lang/), and their independence from the content of the recording, i-vectors are the state of the art in both speaker and language recognition, which is why they were used in this system. The implementation uses the ALIZE/LIA_RAL software, which allows training, extracting, and scoring the i-vectors.

One of the steps needed to extract i-vectors is training a speaker-independent model (UBM, Universal Background Model) and the i-vector extraction matrix (matrix T in i-vector terminology). For this we used a subset of audio files from the CallFriend Non-Caribbean Spanish database [9], taking 20 files of male speakers and 10 files of female speakers, with average durations of 28 and 23 minutes, respectively. Additionally, 7 files (of about 5 minutes each) with recordings taken from internet videos were added in order to improve the robustness of the models. The final UBM is a gender-independent model with 32 Gaussians, and the i-vector extraction matrix has dimensions (39 x 32) x 100 (39 acoustic parameters: 12 MFCCs + energy + deltas + delta-deltas; 32 Gaussians; and an i-vector dimension of 100). These values keep the computational cost and the RAM used at runtime low.

Scoring: In this stage, the i-vector generated from the test file is compared against the i-vectors trained for the speaker who attempts to access the system. The scores are obtained using the cosine distance [10], since it is fast to compute and well matched to the projection performed in the multidimensional i-vector space. The classification uses the average distance between the trained i-vectors and the i-vector of the utterance to be evaluated; access is granted or denied depending on whether this distance exceeds a preset threshold. Note that the i-vector normalization techniques most commonly used nowadays, such as WCCN or LDA [11], are not applied in the current system, although they will be included shortly, since they are part of the latest version of ALIZE/LIA_RAL.
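As a concrete illustration of this scoring step, the sketch below computes the cosine similarity between a test i-vector and the i-vectors enrolled for the claimed speaker and accepts or rejects by comparing the average score against a threshold. The threshold value, the number of enrollment i-vectors, and the toy data are assumptions made only for the example.

```python
# Sketch of the cosine-similarity scoring step: compare a test i-vector against
# the i-vectors enrolled for the claimed speaker and threshold the average score.
import numpy as np

def cosine_score(test_ivector, enrolled_ivectors):
    """Average cosine similarity between one test i-vector and the enrolled i-vectors."""
    t = test_ivector / np.linalg.norm(test_ivector)
    e = enrolled_ivectors / np.linalg.norm(enrolled_ivectors, axis=1, keepdims=True)
    return float(np.mean(e @ t))

def accept(test_ivector, enrolled_ivectors, threshold=0.5):
    """Grant access if the average cosine score exceeds the (configurable) threshold."""
    return cosine_score(test_ivector, enrolled_ivectors) > threshold

# Toy usage with 100-dimensional i-vectors, the dimension used in this system.
rng = np.random.default_rng(0)
speaker_model = rng.normal(size=(5, 100))                 # e.g., 5 enrollment i-vectors
test = speaker_model.mean(axis=0) + 0.1 * rng.normal(size=100)
print(round(cosine_score(test, speaker_model), 3), accept(test, speaker_model))
```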
III. LID: SYSTEM DESCRIPTION

The language recognition system is composed of the three main modules shown in Fig. 2. The first module is the Raspberry Pi, which provides the user interface that gives access to all the services, collects the voice samples, and executes the language recognition algorithms. The second module is the language identification system itself, which parameterizes the audio files, detects the speech segments, extracts the i-vectors, fuses the scores of the different subsystems, and performs the classification. The third module is a web server and web interface that allows users to record their voice or upload audio files, call the language recognition system, and then call the online transcription and translation services. Each module is described in detail below.

Figure 2. Architecture of the language recognition system.

A. Raspberry Pi (RPi)

The same Raspberry Pi used for the speaker verification system was used here. The only difference is that the system clock was reconfigured to operate at 1.0 GHz in order to make the system run faster. In addition, all programs were adapted to run on the RPi and executed using the Octave mathematical suite for Raspbian. Given the constraints of the RPi, the audio files were recorded with a USB microphone at a sampling rate of 8 kHz and 16 bits.

B. Language identification system

This module is configured similarly to the system used in the Albayzin 2012 international language recognition evaluation [12]. In that evaluation, the system performed very well thanks to the fusion of three different subsystems [13]: 1) an acoustic system based on MFCC parameters + SDC + CMVN + RASTA filtering + i-vectors, 2) an acoustic system based on SPWB features + SDC + CMVN + RASTA filtering + i-vectors, and 3) a phonotactic system based on trigram posteriorgram counts + i-vectors. Like most of today's most advanced systems, all subsystems use subspace projection by means of i-vectors [8], whose scores are then calibrated and fused using multi-class logistic regression. The main advantages of this system were the use of the SPWB features, which provide robustness to noise [14][15], and the incorporation of the phonotactic system [16], which uses a feature vector derived from the phoneme posteriorgram counts produced by the freely distributed phoneme recognizer from the Brno University of Technology (http://speech.fit.vutbr.cz/software). Despite the good results obtained with these three subsystems, in the context of the RPi application only the two acoustic subsystems were included, in order to reduce the computational load without significantly reducing the recognition rate. Since the system configuration is the same as for the Albayzin evaluation, this module can identify 6 different languages, some of them quite similar to each other: Catalan, Spanish, English, Galician, Portuguese, and Basque. The identification is performed in the following steps:

Parameterization: As in the speaker identification system, this step extracts the most relevant acoustic information. The parameters are extracted with SPro, taking the first 7 MFCCs and computing the SDC [17] parameters. CMVN normalization and a RASTA filter [18] are then applied. The SPWB parameters are extracted with the CtuCopy program (http://noel.feld.cvut.cz/speechlab), using a configuration equal to that of the MFCCs.

Voice activity detection: This step again uses the ALIZE/LIA_RAL tools, in the same way as in the biometric verification system.

i-vector extraction: For this system we resorted to the implementation provided by the Brno University of Technology (http://speech.fit.vutbr.cz/software/joint-factor-analysis-matlab-demo), with some changes to make it run in Octave.

Fusion and classification: The aim of this step is to take the test i-vectors previously generated from the different types of acoustic parameters (MFCC and SPWB) and produce a score that measures the similarity between the test i-vectors and the model trained offline from the i-vectors of all the files belonging to the same language. This comparison is performed with a multi-class logistic regression classifier. The generated scores are then calibrated and the outputs of the two acoustic subsystems are fused; for the fusion, a Gaussian back-end followed by a multi-class logistic regression discriminator is used. In this system the i-vectors are previously length-normalized and projected using a within-class covariance matrix (WCCN [19]), thus reducing the variations caused by differences in the length of the audio files as well as channel and speaker variations.
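The classification back-end is not distributed with the paper; purely as an illustration of the fusion step just described, the sketch below length-normalizes the i-vectors of each acoustic stream (MFCC and SPWB), scores them with a multi-class logistic regression classifier, and fuses the two streams by averaging their log-probabilities. scikit-learn is an assumed stand-in for the original Octave code, and the WCCN projection, calibration, and Gaussian back-end are omitted for brevity.

```python
# Illustrative sketch of the fusion/classification back-end: length-normalize the
# i-vectors of each acoustic subsystem, score them with multi-class logistic
# regression, and fuse the two subsystems by averaging log-probabilities.
# scikit-learn is an assumed substitute for the Octave code used in the paper;
# WCCN, calibration, and the Gaussian back-end are omitted for brevity.
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_normalize(ivectors):
    """Project i-vectors onto the unit sphere (length normalization)."""
    return ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)

def train_subsystem(train_ivectors, language_labels):
    """Train one multi-class logistic regression classifier per acoustic stream."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(length_normalize(train_ivectors), language_labels)
    return clf

def fused_language_scores(clf_mfcc, clf_spwb, iv_mfcc, iv_spwb):
    """Average per-language log-probabilities of the two acoustic subsystems.

    Assumes both classifiers were trained on the same set of language labels,
    so their class orderings match. Higher score = more likely language.
    """
    log_p_mfcc = clf_mfcc.predict_log_proba(length_normalize(iv_mfcc))
    log_p_spwb = clf_spwb.predict_log_proba(length_normalize(iv_spwb))
    return 0.5 * (log_p_mfcc + log_p_spwb)
```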
C. Web server and integration with Google services

To allow users to access the language identification system and then the online services, a web interface was developed (written in Javascript and HTML5) and hosted on a web server (in this case Node.js, http://nodejs.org/). The GUI allows the user to record his or her voice or upload an audio file whose language will be identified. After identification, the system sends a POST request to Google's online transcription service, passing the audio file and the recognized language. Upon receiving the transcribed text, a new POST request is made, this time to Google's online translation service, sending the transcribed text and the target language (selected by the user through the web interface). Since the language identification system can recognize up to 6 different languages, the interface allows translating transcribed texts from these languages. Future versions are expected to extend the system, and therefore the interface, to accept and translate into other languages supported by Google.

IV. RESULTS AND PERFORMANCE STATISTICS

A. Biometric Identification System

Fig. 3 shows the time taken by each step of the verification system depending on the resources available on the RPi (i.e., whether XBMC is stopped or playing a full-HD movie). These results were obtained using a voice file of 20 seconds. As can be seen, the system responds in 2.5 seconds when XBMC is stopped and in 5.0 seconds when a movie is playing. Although this time is not very high, further efforts to reduce it are clearly needed. To do this, we propose following some of the ideas in [20], which describes a real-time voice verification system implemented on an FPGA. Although that system is very fast, the identification algorithm it implements is less robust and the device does not offer all the functionality provided by the RPi, although some optimizations similar to the ones reported there are possible.

Fig. 4 shows the DET (Detection Error Tradeoff) curve of the system. Proposed in [21], this curve is a linearized version of the ROC (Receiver Operating Characteristic) curve in which the axes are on a logarithmic scale. The curve plots the false alarm errors (i.e., when the system lets an impostor in) against the false rejection errors (i.e., when the system rejects the correct user). From the curve one can also obtain the Equal Error Rate (EER), the value at which the false rejection and false alarm errors are equal; likewise, since the curve shows the results for all possible threshold values, it allows determining the optimum operating point of the system for a particular trade-off between the two types of error. For the results of Fig. 4, a test with 87 users (65 men and 22 women) was used, resulting in a total of 429 target trials (in which the speaker must be identified positively against his or her own model) and 36,894 impostor trials (in which the system should detect a false user). To make the tests more realistic, each speaker model was trained using only one minute of audio, and an average of 5 test files of 20 s per speaker was used. The content of all the audio files was different, so that the system is text-independent. The figure shows that the EER is approximately 19%, which is a fairly good result considering the short duration of the training and test files, as well as the text and gender independence.
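For reference, the EER reported above can be estimated from the lists of target and impostor scores as the operating point where the false-acceptance and false-rejection rates coincide. The sketch below shows one common way to compute it by sweeping the decision threshold; the score distributions are placeholders, not the paper's data, although the trial counts match the ones just mentioned.

```python
# Sketch: estimate the Equal Error Rate (EER) from target and impostor score
# lists by sweeping the decision threshold and finding where false-acceptance
# and false-rejection rates cross. The scores below are synthetic placeholders.
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = 1.0, None
    for thr in thresholds:
        frr = np.mean(target_scores < thr)      # genuine trials rejected
        far = np.mean(impostor_scores >= thr)   # impostor trials accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), 0.5 * (far + frr)
    return eer

# Toy usage with the same trial counts as in the text (429 target, 36,894 impostor).
rng = np.random.default_rng(1)
eer = equal_error_rate(rng.normal(1.0, 1.0, 429), rng.normal(0.0, 1.0, 36894))
print(f"EER ~ {100 * eer:.1f}%")
```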
For comparison, one can look at the results of the latest international speaker recognition evaluations [22] organized by the North American agency NIST (http://nist.gov/itl/iad/mig/spkr-lang.cfm), in which the best research groups in the world participate. For example, in the 2008 evaluation, the best system obtained an EER of 9.7% using training files of 3 to 5 minutes and test files of 10 seconds (short2-10sec condition). In the 2010 evaluation, using training files of 5 to 8 minutes with channel variations (microphone or telephone conversations) and test files of 10 seconds (core-10sec condition), the best system obtained an EER of 6.2%. As can be seen, the error rate of our system on the RPi is higher, but it must be kept in mind that the training time and the duration of the test files greatly affect the quality of the results, and that making the system work faster required greatly reducing the i-vector dimension and the number of Gaussians. The systems presented in the evaluations mentioned above commonly use around 600-dimensional i-vectors, 2048 Gaussians, and gender-dependent models [23], since they are not intended to work in real time but to obtain the lowest possible error rate, regardless of the resources used.

Figure 3. Processing time of each step of the recognition process, depending on whether XBMC is playing a movie or not.

Figure 4. DET curve with the false rejection and false acceptance rates of the system on 20-second recordings.

B. Language Identification System

As previously discussed, the language identification (LID) system is based on the fusion of the scores of two acoustic subsystems, one based on MFCCs and the other on SPWB features, both using the same settings. Fig. 5 shows the average cost error rate (Cavg) of the system on the 941 test files used in the Albayzin evaluation, as a function of the number of Gaussians and the dimensionality of the i-vectors; the CPU time is the time needed to run the identification system on the RPi on an audio file approximately 1 minute long. As can be seen, when the number of Gaussians and the i-vector dimension increase, the processing time and memory requirements grow while the error rate decreases. The smallest error is achieved with the same configuration used for the official evaluation, in which a 512-Gaussian UBM and 400-dimensional i-vectors are trained. However, its processing time and memory use are high, so the final system uses 64 Gaussians and 200-dimensional i-vectors. The reason is that, although its Cavg is 2.86% worse in absolute value, in terms of statistical significance this result is equivalent to that of the more complex configuration, while running 5 times faster and using only one tenth of the memory.

Figure 5. Comparison of the performance of the language identification system executed on the RPi as a function of the number of Gaussians and the i-vector dimension.

With the selected configuration, the entire recognition process for a 30-second media file takes approximately 23.5 seconds. This value, although somewhat high, is mainly due to two factors: a) the time required to start Octave (about 5 seconds), and b) the computational complexity of the i-vector extraction, which is O(K^3 + K^2*C + K*C*D), where K is the i-vector dimension, C the number of Gaussians, and D the size of the feature vectors (56 dimensions: 7 MFCC/SPWB + 49 SDCs). In the implemented system this calculation is performed twice (once for each acoustic subsystem), taking a total of 17.2 s. In the next subsection this result is compared with other hardware implementations. Under the same conditions, a PC-based system takes 1.9 seconds on average. A rough operation count illustrating this trade-off is sketched below.
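To make the trade-off concrete, the short calculation below evaluates the O(K^3 + K^2*C + K*C*D) term for the two configurations discussed: the Albayzin configuration (C = 512 Gaussians, K = 400) versus the final RPi configuration (C = 64, K = 200), both with D = 56 features. It counts only this dominant i-vector extraction term, so the resulting ratio is an upper bound on the speed-up rather than the measured overall factor of five.

```python
# Rough operation count for the i-vector extraction step, O(K^3 + K^2*C + K*C*D),
# comparing the full Albayzin configuration with the reduced RPi configuration.
# Only this term is counted, so it over-estimates the end-to-end speed-up.
def ivector_cost(K, C, D=56):
    """K: i-vector dimension, C: number of Gaussians, D: feature dimension."""
    return K**3 + K**2 * C + K * C * D

full = ivector_cost(K=400, C=512)    # configuration used in the Albayzin evaluation
small = ivector_cost(K=200, C=64)    # configuration finally used on the RPi
print(f"full: {full:.2e} ops, reduced: {small:.2e} ops, ratio ~ {full / small:.1f}x")
```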
As for the average cost error rate (Cavg), the system obtains 8.3%, which is close to the 5.6% achieved by the system running on a PC that was presented at the Albayzin LRE 2012 evaluation [24][13]. As a comparison between i-vector-based systems and other algorithms (e.g., UBM-GMM or SVMs), the results of the NIST LRE evaluations are available [25]. For example, in the 2011 evaluation, one of the best systems presented was based on i-vectors and was able to recognize a total of 24 languages with error rates of 3% for 30-second files, 7% for 10-second files, and 18% for 3-second files [26]. On the other hand, systems based on SVMs or UBM-GMM models were common several years earlier; in the 2003 evaluation they obtained 6.1% and 4.8%, respectively, for the 30 s task over a total of 12 languages.

C. Comparison with other algorithms and implementations

As indicated above, the algorithms included in the two developed systems perform quite well compared with other, more complex systems. In this section we compare them with implementations on other hardware platforms and with other recognition algorithms. Table I summarizes the systems studied. As can be seen, there are more speaker identification systems than language identification systems. We believe the main reason is that a portable speaker recognition system, easily installable anywhere to restrict access to a place or to data, is needed more often, whereas a language recognition system is generally required in offline applications (e.g., machine translation, subtitling, etc.) and can therefore run on a PC.

TABLE I. Comparison of recognition rate and processing speed with other hardware implementations [27]-[32]. For each system, the table summarizes the hardware platform (Cyclone II 2C35, Xilinx Virtex-5, Xilinx Virtex-II XC2V6000, Spartan-3E and Spartan-III FPGAs, and a Microchip dsPIC), the features and classifier used, the test data, the number of speakers or languages, the recognition rate, and the processing time.

In general, in all the implementations found, the hardware used was an FPGA because, as described in [33], it has the great advantage of allowing parallel data processing (it is common to compare a specific speaker model against many others simultaneously), provided that modules and mathematical libraries optimized for the architecture are available (e.g., FFT, filtering, logarithmic tables for computing scores, etc.). Moreover, in many of these systems the hardware implements only part of the recognition process (typically the classifier), while the less parallelizable tasks, such as parameterization or the execution of the user applications, are carried out on an external server. In any case, comparing these systems with each other is difficult given the variety of the data used, the number and variety of speakers/languages, the implemented algorithms, etc. Still, the table shows that the FPGA-based systems are much faster than the ones proposed here, although their recognition rates and the number of models are smaller.
The only exception is reference [28], but it identifies only 3 languages and requires 30 seconds of speech. Finally, we believe there is still much room for improvement in our system in terms of its implementation: optimizing some steps of the algorithm, re-implementing the software algorithms more efficiently, better reusing the hardware resources of the architecture, and computing the parameters of each frame as the user speaks, so that the system would work in real time.

V. CONCLUSIONS AND FUTURE WORK

This article has described two different systems running on a Raspberry Pi. The first is a speaker verification system that controls access to private multimedia content. The system is based on a distributed architecture that allows the incorporation of several modules implemented with free software, as well as control of and access to the system from different mobile devices. The verification system uses the i-vector technique, with which good verification rates are obtained. The second system performs language recognition, and automatic transcription and translation, using a Raspberry Pi, acoustic recognition algorithms based on i-vectors, and integration with online services. The recognition system uses a configuration similar to the one used in an international evaluation in which it obtained the best scores; the main differences are the number of Gaussians, the i-vector dimension, and the fusion of only the acoustic subsystems, all in order to reduce processing time and memory requirements.

As future work, the following lines are proposed for each system. For the biometric identification system: incorporating new i-vector and score normalization algorithms, such as Eigen Factor Radial [18] and PLDA, and starting the recognition process as soon as the user speaks, in order to have a system running in real time. For the language recognition system: first, using a mathematical environment faster than Octave, such as scikit-learn (http://scikit-learn.org) or Julia (http://julialang.org/); and second, reducing the complexity of the i-vector extraction process using mathematical approximations similar to those proposed in [34].

ACKNOWLEDGMENTS

This work was supported by the TIMPANO project (TIN2011-28169-C05-03) and by Project No. 563 of the Observatory of Academic and Educational Innovation of the UPM. Thanks also to Laura Alcocer Pérez-Regadera for her contributions to the writing of this article.

Paper submitted on October 15th, 2013. This work has been supported by the TIMPANO project (TIN2011-28169-C05-03) and the Academic and Educational Innovation Observatory of the Universidad Politécnica de Madrid, Project No. 563.

L. F. D'Haro, Dept. of Electronic Engineering, School of Telecommunications, Universidad Politécnica de Madrid, Madrid, Spain, lfdharo@die.upm.es
R. Cordoba, Dept. of Electronic Engineering, School of Telecommunications, Universidad Politécnica de Madrid, Madrid, Spain, cordoba@die.upm.es
J. I. Rojo, Universidad Politécnica de Madrid, Madrid, Spain, ji.rojo@alumnos.upm.es
J. Díez, Universidad Politécnica de Madrid, Madrid, Spain, jorge.diez.delafuente@alumnos.upm.es
D. Avendaño, Universidad Politécnica de Madrid, Madrid, Spain, diego.apeces@alumnos.upm.es
J. M. Bermudo, Universidad Politécnica de Madrid, Madrid, Spain, jose.bmera@alumnos.upm.es

REFERENCES

[1] F. L. Alegre. 2007. "Application of ANN and HMM to Automatic Speaker Verification", IEEE Latin America Transactions, Vol. 5, No. 5, pp. 329-337, Sept. 2007.
[2] Karpov, A. A., Ronzhin, A. L. 2009. "Information inquiry kiosk with multimodal user interface", Pattern Recognition and Image Analysis, September 2009, Volume 19, Issue 3, pp. 546-558.
[3] Eck, M., Lane, I., Zhang, Y., Waibel, A. 2010. "Jibbigo: speech-to-speech translation on mobile devices", IEEE Spoken Language Technology Workshop (SLT), pp. 165-166.
[4] Jurafsky, D., Martin, J. Speech and Language Processing. Pearson Education Limited, 2nd ed., 944 pp. ISBN-10: 1292025433.
[5] E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, A. Stolcke. 2005. "Modeling prosodic feature sequences for speaker recognition", Speech Communication, 46, 2005, pp. 455-472.
[6] Bonastre, J. F., Scheffer, N., Matrouf, D., et al. 2008. "ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition", Speaker Odyssey.
[7] T. Kinnunen, P. Rajan. 2013. "A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data", Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2013), pp. 7229-7233, Vancouver, Canada, May 2013.
[8] Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P. 2011. "Front-End Factor Analysis for Speaker Verification", IEEE Trans. on Audio, Speech and Language Processing, Vol. 19 (4).
[9] CallFriend Corpus, Linguistic Data Consortium, 1996. http://catalog.ldc.upenn.edu/LDC96S58
[10] Dehak, N., Dehak, R., Glass, J., Reynolds, D., Kenny, P. 2010. "Cosine similarity scoring without score normalization techniques", Speaker Odyssey.
[11] Garcia-Romero, D., and Espy-Wilson, C. Y. 2011. "Analysis of i-vector length normalization in speaker recognition systems", Int. Conf. on Speech Communication and Technology, 2011, pp. 249-252.
[12] Rodriguez-Fuentes, L. J., Brümmer, N., Penagarikano, M., Varona, A., Bordel, G., Diez, M. 2013. "The Albayzin 2012 Language Recognition Evaluation", Interspeech.
[13] D'Haro, L. F., Cordoba, R., Caraballo, M. A., Pardo, J. M. 2013. "Low-resource language recognition using a fusion of phoneme posteriorgram counts, acoustic and glottal-based i-vectors", ICASSP.
[14] Hönig, F., Stemmer, G., Hacker, C., Brugnara, F. 2005. "Revising Perceptual Linear Prediction (PLP)", Eurospeech 2005, pp. 2997-3000.
[15] Rajnoha, J., and Pollák, P. 2011. "ASR systems in Noisy Environment: Analysis and Solutions for Increasing Noise Robustness", Radioengineering, Vol. 20, No. 1, April 2011, pp. 74-84.
[16] D'Haro, L. F., Glembek, O., Plchot, O., Matejka, P., Soufifar, M., Cordoba, R., Cernocký, J. 2012. "Phonotactic language recognition using i-vectors and phoneme posteriogram counts", Interspeech.
[17] B. Bielefeld. 1994. "Language identification using shifted delta cepstrum", Fourteenth Annual Speech Research Symposium, 1994.
[18] Bousquet, P. M., Matrouf, D., and Bonastre, J. F. 2011. "Intersession compensation and scoring methods in the i-vectors space for speaker recognition", Interspeech, pp. 485-488.
[19] Hatch, A. O., Stolcke, A. 2006. "Generalized Linear Kernels for One-Versus-All Classification: Application to Speaker Recognition", ICASSP 2006.
[20] Ramos-Lara, R., Lopez-Garcia, M., Canto-Navarro, E., and Puente-Rodriguez, L. 2013. "Real-time speaker verification system implemented on reconfigurable hardware", Journal of Signal Processing Systems, 71(2), pp. 89-103.
[21] Martin, A. F., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M. 1997. "The DET Curve in Assessment of Detection Task Performance", Proc. Eurospeech '97, Rhodes, Greece, September 1997, Vol. 4, pp. 1899-1903.
[22] NIST SRE evaluations. Web: http://www.nist.gov/itl/iad/mig/sre.cfm. Page consulted in January 2014.
[23] Saeidi, R., et al. 2013. "I4U submission to NIST SRE 2012: A large-scale collaborative effort for noise-robust speaker verification", Interspeech 2013, pp. 1986-1990.
[24] D'Haro, L. F., and Cordoba, R. 2012. "The GTH-LID System for the Albayzin LRE12 Evaluation", Iberspeech 2012, pp. 528-539.
[25] NIST LRE evaluations. Web: http://www.nist.gov/itl/iad/mig/lre.cfm. Page consulted in January 2014.
[26] E. Singer, P. Torres-Carrasquillo, D. Reynolds, A. McCree, F. Richardson, N. Dehak, and D. Sturim. 2012. "The MITLL NIST LRE 2011 Language Recognition System", Proc. Odyssey, pp. 209-215, Singapore, June 2012.
[27] J. Li, D. An, L. Lang, and D. Yang. 2012. "Embedded Speaker Recognition System Design and Implementation Based on FPGA", Procedia Engineering, 29, pp. 2633-2637.
[28] Z. Nie, X. Zhang, and Z. Yang. 2012. "An FPGA Implementation of Multi-Class Support Vector Machine Classifier Based on Posterior Probability", Int. Proc. of Computer Science and Information Technology, Vol. 53 (2), pp. 296-302, October 2012.
[29] P. EhKan, T. Allen, and S. F. Quigley. 2011. "FPGA Implementation for GMM-Based Speaker Identification", International Journal of Reconfigurable Computing.
[30] A. S. Poudel, D. Lekhak, K. Bashyal, and S. Shrestha. 2013. "Text-Independent Speaker Recognition System Based on FPGA", final year project, Department of Electronics and Computer Engineering, Tribhuvan University.
[31] M. Lizondo, P. D. Aguero, A. J. Uriz, Tulli, and J. C. Gonzalez. 2012. "Embedded speaker verification in low cost microcontroller", Argentine Congress on Embedded Systems 2012, Buenos Aires, Argentina, 15-17 August 2012.
[32] R. Ramos-Lara, M. Lopez-Garcia, E. Canto-Navarro, and L. Puente-Rodriguez. 2009. "SVM Speaker Verification System Based on a Low-Cost FPGA", International Conference on Field Programmable Logic and Applications, pp. 582-586.
[33] A. Naufal, E. Phaklen, R. B. Ahmad, and S. Naseer. 2013. "Speaker Recognition System: Vulnerable and Challenges", International Journal of Engineering & Technology (0975-4024), Aug/Sep 2013, Volume 5, Issue 4, pp. 3191-3195.
[34] Li, M., Tsiartas, A., Segbroeck, M. V., Narayanan, S. 2013. "Speaker verification using simplified and supervised i-vector modeling", ICASSP 2013, Vancouver, Canada.