Browsing by Author "Santos, Rafael Geraldo dos"
- PT-PT synthetic speech detection. Publication. Santos, Rafael Geraldo dos; Oliveira, José Valente de; Sousa, Joana Coutinho. Recent developments in the field of artificial intelligence (AI) have led to the creation of powerful generative models. These models are now so capable that it is nearly impossible for a human to distinguish between generated and human utterances, that is, between synthetic and natural speech. A relatively recent example is the deepfake video of former U.S. President Barack Obama [1]. This video not only demonstrates the capabilities of AI models but also highlights the potential for misinformation, as these models can deceive individuals into believing fabricated scenarios. This extends to the realm of synthetic speech, where models like Google Duplex [2], which leverages WaveNet technology, a deep neural network for seamless speech creation, exhibit an impressive degree of realism and naturalness. As a result, two situations may arise. The first relates to new business opportunities, such as the creation of realistic voiceovers for films and animations or improved communication for individuals with hearing or speech impairments [3]. The second raises concerns about privacy and security, since voice impersonation is easily achievable with today's tools. Given this, an analysis of the approaches applied in the ASVspoof challenge [4] was carried out. The ultimate goal is to develop a system capable of distinguishing between real voices and cloned voices by adapting the research done for this challenge to European Portuguese (PT-PT). For this purpose, we first created a PT-PT dataset using both text-to-speech (TTS) and speech-to-speech (STS). Then, we implemented several models from the literature and tested them on several datasets that encompass both English and PT-PT voices, to evaluate their performance and reach conclusions.
From this, we found that, although the task is difficult, augmenting the data with different device impulse responses (IRs) and compression codecs improved generalization to different attacks from different datasets. Overall, after the evaluation process, statistical analysis identified ResNet-OC and ECAPA-TDNN as the best models. Since our goal is tailored to PT-PT, we fine-tuned them and further improved their performance. Finally, future steps are highlighted; one that may be especially important to complement the work done so far is the integration of the fraud detection component.
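The augmentation idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not say which impulse responses or codecs were used, so the synthetic decaying-noise IR and the 8-bit mu-law round trip below are assumptions chosen only to show the general shape of such an augmentation step. The function names `apply_impulse_response` and `mu_law_roundtrip` are hypothetical.

```python
import numpy as np


def apply_impulse_response(wave: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """Convolve a mono waveform with an impulse response, keeping length and peak level."""
    out = np.convolve(wave, ir, mode="full")[: len(wave)]
    peak_in = np.max(np.abs(wave))
    peak_out = np.max(np.abs(out))
    if peak_out > 0:
        out = out * (peak_in / peak_out)  # rescale so the peak matches the input
    return out


def mu_law_roundtrip(wave: np.ndarray, mu: int = 255) -> np.ndarray:
    """Crude stand-in for a lossy codec (assumption): mu-law compand, 8-bit quantize, expand."""
    x = np.clip(wave, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    quantized = np.round((compressed + 1.0) / 2.0 * mu) / mu * 2.0 - 1.0
    return np.sign(quantized) * np.expm1(np.abs(quantized) * np.log1p(mu)) / mu


# Toy input: one second of a 220 Hz tone at 16 kHz (placeholder for a real utterance).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = 0.5 * np.sin(2 * np.pi * 220.0 * t)

# Synthetic impulse response (assumption): direct path plus exponentially decaying noise.
rng = np.random.default_rng(0)
ir = rng.standard_normal(800) * np.exp(-np.arange(800) / 100.0)
ir[0] = 1.0

augmented = apply_impulse_response(wave, ir)  # simulates a device or room response
coded = mu_law_roundtrip(augmented)           # simulates a lossy transmission channel
```

In practice, measured device or room impulse responses and real codecs would replace these placeholders; the sketch only shows where each transformation sits in the chain before the audio reaches a detector.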