SynData4GenAI 2024

The first edition of SynData4GenAI will be co-located with Interspeech 2024 in Kos, Greece, on August 31, 2024.


Introduction

The field of speech technology has undergone a revolution in recent years, driven by the rise of foundational models underpinning automatic speech recognition (ASR) and text-to-speech synthesis (TTS). However, a key challenge persists: the reliance on large quantities of real human speech data, raising privacy concerns and regulatory hurdles.

Synthetic data offers a groundbreaking alternative, empowering researchers to develop speech models that are ethical, inclusive, and adaptable to diverse scenarios. While the use of synthetic data has been studied extensively, its role in the era of foundational models remains largely unexplored. This workshop brings together research on the themes below, specifically targeted at foundational models that have been pre-trained on every available and usable data source.


Schedule

Registration and Welcome

  • 8:00 AM - 8:30 AM: Registration and coffee in the lobby
  • 8:30 AM - 8:45 AM: Welcome and workshop overview

Keynote 1

8:45 AM - 9:30 AM

Presenter: Sakriani Sakti (Nara Institute of Science and Technology (NAIST), Japan)

Title: "Language Technology for All: Leveraging Foundational Speech Models to Empower Low-Resource Languages"

Abstract: The development of advanced spoken language technologies, such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), has enabled computers to listen and speak. While many applications and services are now available, they fully support fewer than 100 languages. Although recent research has extended coverage to as many as 1,000 languages, more than 6,000 living languages, spoken by 350 million people, remain uncovered. This gap exists because most systems are constructed using supervised machine learning, which requires large amounts of paired speech and corresponding transcriptions.

In this talk, I will introduce several successful approaches that aim to achieve "language technology for all" by leveraging foundational speech models to support linguistic diversity with less-resourced data. These approaches include self-supervised learning, visually grounded models, and the machine speech chain. I will also share insights and feedback from the indigenous community, gathered from events, workshops, and panel discussions over the years. The challenges are not only how to construct language technologies for language diversity, but also how to ensure that these technologies are truly beneficial to under-resourced language communities.

Biography: Sakriani Sakti is a full professor and head of the Human-AI Interaction Laboratory at the Nara Institute of Science and Technology (NAIST). She also holds adjunct positions at JAIST and the University of Indonesia and is a visiting research scientist at RIKEN AIP, Japan. A member of JNS, SFN, ASJ, ISCA, IEICE, and IEEE, she currently serves on the IEEE Speech and Language Technical Committee (2021-2026) and as an associate editor for IEEE/ACM TASLP, Frontiers in Language Sciences, and IEICE. Recently, she was appointed as the Oriental-COCOSDA Convener.

Previously, she served as the general chair for SLTU 2016, chaired the "Digital Revolution for Under-resourced Languages (DigRevURL)" Workshops at INTERSPEECH 2017 and 2019, and was part of the organizing committee for the Zero Resource Speech Challenge in 2019 and 2020. She played a pivotal role in establishing the ELRA-ISCA Special Interest Group on Under-resourced Languages (SIGUL), where she has been chair since 2021 and organizes the annual SIGUL Workshop. In collaboration with UNESCO and ELRA, she was the general chair of the "Language Technologies for All (LT4All)" Conference in 2019, focusing on "Enabling Linguistic Diversity and Multilingualism Worldwide," and will lead LT4All 2.0 in 2025 under the theme "Advancing Humanism through Language Technologies."

Keynote 2

9:30 AM - 10:15 AM

Presenter: Tara Sainath (Google)

Title: End-to-End Speech Recognition: The Journey from Research to Production

Abstract: End-to-end (E2E) speech recognition has become a popular research paradigm in recent years, allowing the modular components of a conventional speech recognition system (acoustic model, pronunciation model, language model) to be replaced by one neural network. In this talk, we will discuss a multi-year research journey of E2E modeling for speech recognition at Google. This journey has resulted in E2E models that can surpass the performance of conventional models across many different quality and latency metrics, as well as the productionization of E2E models for the Pixel 4, 5, and 6 phones. We will also touch upon future research efforts with E2E models, including multilingual speech recognition.

Biography: Dr. Tara Sainath holds S.B., M.Eng., and Ph.D. degrees in Electrical Engineering and Computer Science from MIT. She has many years of experience in speech recognition and deep neural networks, including 5 years at the IBM T.J. Watson Research Center and more than 10 years at Google. She is currently a Distinguished Research Scientist and the co-lead of the Gemini Audio Pillar at Google DeepMind, where she focuses on the integration of audio capabilities with large language models (LLMs).

Her technical prowess is recognized through her IEEE and ISCA Fellowships, and awards such as the 2021 IEEE SPS Industrial Innovation Award and the 2022 IEEE SPS Signal Processing Magazine Best Paper Award. She has served as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) as well as an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing. Dr. Sainath's leadership is exemplified by her roles as Program Chair for ICLR (2017, 2018) and her extensive work co-organizing influential conferences and workshops, including Interspeech (2010, 2016, 2019), ICML (2013, 2017), and NeurIPS 2020. Her primary research interests are in deep neural networks for speech and audio processing.

Keynote 3

10:15 AM - 11:00 AM

Presenter: Andrew Rosenberg (Google)

Title: From synthetic data to multimodal foundation models

Abstract: This talk will describe a thread of research that starts with the use of synthetic speech to train speech recognition models, and ends with the joint modeling of speech and text in multimodal foundation models. Along the way, I'll describe work using synthetic speech for training self-supervised pretraining models. This work serves as a transition into text-injection for speech recognition. Finally, I'll describe how this work results in a multimodal foundation model that can also perform speech synthesis (Virtuoso).

Biography: Andrew Rosenberg is currently a Senior Staff Research Scientist at Google, where he works on speech synthesis and speech recognition. He received his PhD from Columbia University. Previously, he was a Research Staff Member at IBM and a professor at CUNY Queens College and the CUNY Graduate Center. His lab was supported by NSF, DARPA, IARPA, and the Air Force Office of Scientific Research. He is an NSF CAREER award winner.

Morning Poster Session

11:00 AM - 12:30 PM

  1. Improving Text-To-Audio Models with Synthetic Captions. Zhifeng Kong (NVIDIA), Sang-gil Lee (NVIDIA), Deepanway Ghosal (SUTD), Navonil Majumder (Singapore University of Technology and Design), Ambuj Mehrish (SUTD), Rafael Valle (NVIDIA), Soujanya Poria (Singapore University of Technology and Design), Bryan Catanzaro (NVIDIA)
  2. Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition. Samuele Cornell (Carnegie Mellon University), Jordan Darefsky (University of Rochester), Zhiyao Duan (University of Rochester), Shinji Watanabe (Carnegie Mellon University)
  3. Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments. Pai Zhu (Google), Dhruuv Agarwal (Google LLC), Jacob Bartel (Google LLC), Kurt Partridge (Google), Hyun Jin Park (Google Inc.), Quan Wang (Google)
  4. Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model. Hyun Jin Park (Google Inc.), Dhruuv Agarwal (Google LLC), Neng Chen (Google Inc.), Rentao Sun (Google Inc.), Kurt Partridge (Google Inc.), Justin Chen (Google Inc.), Harry Zhang (Google Inc.), Pai Zhu (Google), Jacob Bartel (Google LLC), Kyle Kastner (Google), Yuan Wang (Google), Andrew Rosenberg (Google Inc.), Quan Wang (Google)
  5. On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition. Nick Rossenbach (RWTH Aachen University / AppTek GmbH), Sakriani Sakti (Nara Institute of Science and Technology / Japan Advanced Institute of Science and Technology), Ralf Schlüter (RWTH Aachen University)
  6. Leveraging LLM for Augmenting Textual Data in Code-Switching ASR: Arabic as an Example. Sadeen Alharbi (Saudi Data and Artificial Intelligence Authority (SDAIA)), Reem BINMUQBIL (Saudi Data and Artificial Intelligence Authority (SDAIA)), Ahmed Ali (Saudi Data and Artificial Intelligence Authority (SDAIA)), Raghad Aloraini (Saudi Data and Artificial Intelligence Authority (SDAIA)), Saiful Bari (Saudi Data and Artificial Intelligence Authority (SDAIA)), Areeb Alowisheq (Saudi Data and Artificial Intelligence Authority (SDAIA)), Yaser Alonaizan (Saudi Data and Artificial Intelligence Authority (SDAIA))
  7. SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data. Yichen Lu (Carnegie Mellon University), Jiaqi Song (Carnegie Mellon University), Xuankai Chang (Carnegie Mellon University), Hengwei Bian (Carnegie Mellon University), Soumi Maiti (CMU), Shinji Watanabe (Carnegie Mellon University)
  8. Using Voicebox-based Synthetic Speech for ASR Adaptation. Hira Dhamyal (Carnegie Mellon University), Leda Sari (Meta), Vimal Manohar (Meta Platforms Inc.), Nayan singhal (Meta), Chunyang Wu (Meta), Jay Mahadeokar (Meta AI), Matt Le (Meta), Apoorv Vyas (Meta), Bowen Shi (Meta), Wei-Ning Hsu (Meta), Suyoun Kim (Meta), Ozlem Kalinli (Meta)
  9. SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning. Chien-yu Huang (National Taiwan University), Min-Han Shih (National Taiwan University), Ke-Han Lu (National Taiwan University), Chi-Yuan Hsiao (National Taiwan University), Hung-yi Lee (National Taiwan University)
  10. On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures. Benedikt Hilmes (RWTH Aachen University), Nick Rossenbach (RWTH Aachen University / AppTek GmbH), Ralf Schlüter (RWTH Aachen University)

Lunch

12:30 PM - 1:30 PM

Afternoon Session

Keynote 4

1:30 PM - 2:00 PM

Presenter: Gustav Eje Henter (KTH Royal Institute of Technology, Sweden)

Title: Multimodal synthesis – an opportunity for synthetic data

Abstract: Just as we humans both perceive and produce information across multiple modalities, so should our generative AI models. However, the classic supervised approach to training multimodal systems requires parallel training data across all modalities simultaneously, which can be much more scarce than data from individual modalities. Foundation models and synthetic data offer a possible way to mitigate this problem. In this talk I review recent work in the multimodal synthesis of human communication – specifically speech audio and 3D motion (co-speech gestures) from text – and describe a straightforward method for creating synthetic data that improves the training of these models, as an example of possible uses of synthetic data for the benefit of multimodal GenAI.

Biography: Gustav Eje Henter is a docent and a WASP assistant professor in machine learning at the Division of Speech, Music and Hearing at KTH Royal Institute of Technology. His main research interests are deep probabilistic modeling for synthesis tasks, especially speech synthesis and 3D motion/animation generation. He has an MSc and a PhD from KTH, followed by post-docs in speech synthesis at the Centre for Speech Technology Research at the University of Edinburgh, UK, and in Prof. Junichi Yamagishi's lab at the National Institute of Informatics, Tokyo, Japan, before returning to KTH in 2018 and being promoted to faculty in 2020.

Afternoon Poster Session

2:00 PM - 3:15 PM

  1. Accent conversion using discrete units with parallel data synthesized from controllable accented TTS. Tuan-Nam Nguyen (Karlsruhe Institute of Technology), Quan Pham (Karlsruhe Institute of Technology), Alexander Waibel (Karlsruhe Institute of Technology)
  2. Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing. Hye-jin Shim (Carnegie Mellon University), Md Sahidullah (Institute for Advancing Intelligence, TCG CREST), Jee-weon Jung (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University), Tomi Kinnunen (University of Eastern Finland)
  3. Audio Dialogues: Dialogues dataset for audio and music understanding. Arushi Goel (NVIDIA), Zhifeng Kong (NVIDIA), Rafael Valle (NVIDIA), Bryan Catanzaro (NVIDIA)
  4. Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech. Dareen Alharthi (Carnegie Mellon University), Roshan Sharma (Google), Hira Dhamyal (Carnegie Mellon University), Soumi Maiti (CMU), Bhiksha Raj (Carnegie Mellon University), Rita Singh (Carnegie Mellon University)
  5. Improving Spoken Semantic Parsing using Synthetic Data from Large Generative Models. Roshan Sharma (Google), Suyoun Kim (Meta), Trang Le (Meta), Daniel Lazar (Meta), Akshat Shrivastava (Meta), Kwanghoon An (Meta), Piyush Kansal (Meta), Leda Sari (Meta), Ozlem Kalinli (Meta), Mike Seltzer (Meta)
  6. Exploring synthetic data for cross-speaker style transfer in style representation based TTS. Lucas Ueda (UNICAMP), Leonardo Marques (CPqD), Flávio Simões (CPQD), Mário Uliani Neto (CPQD), Fernando Runstein (CPQD), Bianca Dal Bó (CPQD), Paula Costa (UNICAMP)
  7. Investigating the Use of Synthetic Speech Data for the Analysis of Spanish-Accented English Pronunciation Patterns in ASR. Margot Masson (University College Dublin), Julie Carson-Berndsen (University College Dublin)
  8. Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting. Hyun Jin Park (Google Inc.), Dhruuv Agarwal (Google LLC), Neng Chen (Google Inc.), Rentao Sun (Google Inc.), Kurt Partridge (Google Inc.), Justin Chen (Google Inc.), Harry Zhang (Google Inc.), Pai Zhu (Google), Jacob Bartel (Google LLC), Kyle Kastner (Google), Yuan Wang (Google), Andrew Rosenberg (Google LLC), Quan Wang (Google)
  9. Navigating the United States Legislative Landscape on Voice Privacy: Existing Laws, Proposed Bills, Protection for Children, and Synthetic Data for AI. Satwik Dutta (The University of Texas at Dallas), John H Hansen (Univ. of Texas at Dallas)
  10. Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms. Joseph Konan (Carnegie Mellon University), Shikhar Agnihotri (Carnegie Mellon University), Ojas Bhargave (Carnegie Mellon University), Shuo Han (Carnegie Mellon University), Bhiksha Raj (Carnegie Mellon University), Ankit Parag Shah (Carnegie Mellon University), Yunyang Zeng (Carnegie Mellon University)
  11. Naturalness and the Utility of Synthetic Speech in Model Pre-training. Diptasree Debnath (University College Dublin), ASAD ULLAH (University College Dublin), Helard Becerra (University College Dublin), Andrew Hines (University College Dublin)

Keynote 5

3:15 PM - 4:00 PM

Presenter: Andros Tjandra (Meta)

Title: Machine Speech Chain

Abstract: Although speech perception and production are closely related, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has largely advanced independently, with little mutual influence. However, human communication relies heavily on a closed-loop speech chain mechanism, where auditory feedback plays a pivotal role in human perception. This talk will explore a novel approach where we bridge this gap by developing a closed-loop machine speech chain model utilizing deep learning techniques.

Our model employs a sequence-to-sequence architecture that leverages both labeled and unlabeled data, enhancing the training process. In this dual-direction setup, the ASR component transcribes unlabeled speech and the TTS component reconstructs the original speech features from the ASR transcription; conversely, the TTS component synthesizes speech from unlabeled text and the ASR component reconstructs the original text from the TTS-generated speech.

This integration not only mimics human speech behaviors but also marks the first application of this closed-loop mechanism in deep learning models. Our experimental results have demonstrated significant performance improvements over traditional systems trained solely on labeled data.
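
As a rough illustration of the closed-loop idea described above, the following minimal PyTorch-style sketch pairs a toy ASR module with a toy TTS module and combines a supervised loss with the two unpaired reconstruction cycles. The `TinyASR`/`TinyTTS` modules, tensor shapes, and unweighted loss sum are illustrative assumptions for exposition, not the speaker's actual models.

```python
# Minimal sketch of closed-loop "machine speech chain" training (illustrative only).
import torch
import torch.nn as nn

FEAT_DIM, VOCAB = 80, 32  # toy acoustic-feature and vocabulary sizes (assumptions)

class TinyASR(nn.Module):
    """Toy stand-in for a sequence-to-sequence ASR model: speech features -> token logits."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, VOCAB)

    def forward(self, speech):            # speech: (batch, time, FEAT_DIM)
        return self.proj(speech)          # logits: (batch, time, VOCAB)

class TinyTTS(nn.Module):
    """Toy stand-in for a sequence-to-sequence TTS model: token ids -> speech features."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, FEAT_DIM)

    def forward(self, tokens):            # tokens: (batch, time)
        return self.emb(tokens)           # features: (batch, time, FEAT_DIM)

asr, tts = TinyASR(), TinyTTS()
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)
ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()

def supervised_loss(speech, tokens):
    """Paired data: ordinary supervised losses in both directions (ASR and TTS)."""
    return ce(asr(speech).flatten(0, 1), tokens.flatten()) + l1(tts(tokens), speech)

def speech_cycle_loss(speech):
    """Unpaired speech: ASR transcribes, TTS reconstructs the original features.
    argmax blocks gradients into ASR, so this cycle mainly updates TTS."""
    pseudo_tokens = asr(speech).argmax(dim=-1)
    return l1(tts(pseudo_tokens), speech)

def text_cycle_loss(tokens):
    """Unpaired text: TTS synthesizes speech, ASR recovers the original tokens."""
    synthetic_speech = tts(tokens)
    return ce(asr(synthetic_speech).flatten(0, 1), tokens.flatten())

# One toy update step over random stand-in batches.
speech = torch.randn(4, 50, FEAT_DIM)        # fake speech features
tokens = torch.randint(0, VOCAB, (4, 50))    # fake transcriptions
loss = supervised_loss(speech, tokens) + speech_cycle_loss(speech) + text_cycle_loss(tokens)
opt.zero_grad()
loss.backward()
opt.step()
```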

Biography: Andros Tjandra is a research scientist at Meta AI (FAIR) in the United States. He received his B.S. degree in Computer Science (cum laude) in 2014 and his M.S. (cum laude) in 2015 from the Faculty of Computer Science, Universitas Indonesia. He later received his PhD in 2020 from the Graduate School of Information Science, Nara Institute of Science and Technology, Japan. He received the IEEE ASRU 2017 Best Student Paper Award and the Acoustical Society of Japan (ASJ) Student Excellence Presentation Award for his research work. His research interests include speech recognition, speech synthesis, natural language processing, and machine learning.

Coffee Break

4:00 PM - 4:30 PM (Lobby)

Keynote 6

4:30 PM - 5:15 PM

Presenter: Junichi Yamagishi (National Institute of Informatics, Japan)

Title: Can we 'generate' large, privacy-aware, unbiased, and fair datasets with speech-generative models?

Abstract: The success of deep learning in speech and speaker recognition relies heavily on using large datasets. However, ethical, privacy, and legal concerns arise when using large speech datasets collected from real human speech. In particular, there are significant concerns when collecting many speakers' speech data from the web.

On the other hand, the quality of synthesized speech produced by recent generative models is very high. Can we 'generate' large, privacy-aware, unbiased, and fair datasets with speech-generative models? Such studies have started not only for speech datasets but also for facial image datasets.

In this talk, I will introduce our efforts to construct a synthetic VoxCeleb2 dataset called SynVox2 that is speaker-anonymised and privacy-aware. In addition to the procedures and methods used in the construction, the challenges and problems of using synthetic data will be discussed by showing the performance and fairness of a speaker verification system built using the SynVox2 database.

Biography: Junichi Yamagishi received a Ph.D. degree from the Tokyo Institute of Technology (Tokyo Tech), Tokyo, Japan, in 2006. From 2007 to 2013, he was a research fellow at the Centre for Speech Technology Research, University of Edinburgh, U.K. He became an associate professor with the National Institute of Informatics, Japan, in 2013, where he is currently a professor. His research interests include speech processing, machine learning, signal processing, biometrics, digital media cloning, and media forensics. He was a co-organizer of the biennial ASVspoof Challenge and the biennial Voice Conversion Challenge. He also served as a member of the IEEE Speech and Language Technical Committee from 2013 to 2019, as an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) from 2014 to 2017, as a Senior Area Editor for IEEE/ACM TASLP from 2019 to 2023, as chairperson of ISCA SynSIG from 2017 to 2021, and as a member-at-large of the IEEE Signal Processing Society Education Board from 2019 to 2023.

Panel Discussion

5:15 PM - 6:00 PM

Participants: Keynote speakers. We aim to have a lively discussion where the organizers will pose a series of questions to the speakers and the audience.

Closing Remarks

6:00 PM - 6:30 PM


Call for Papers

We invite submissions from researchers and practitioners working on any aspect of synthetic data for speech and language processing. We encourage submissions that present novel research results, insightful case studies, and thought-provoking ideas on the future of this field. Submission topics include (but are not limited to):

  • Novel techniques for generating realistic and diverse speech and text datasets.
  • Multimodal modality matching to make effective use of synthetic data
  • Methods for fine-tuning/adapting foundational models with synthetic data to improve performance, reduce bias, and reduce time to production
  • Comparative studies evaluating the effectiveness of synthetic data versus real data in training models for speech and language processing.
  • Applications of synthetic data for language resource development in low-resource settings (including medical domains).
  • Tools that promote research, as well as generation of new corpora, for rapid prototyping and application development
  • Analysis/interpretability of foundational models' learning abilities
  • Ethical, legal, and regulatory considerations for the use of synthetic data in speech technology.

Submission Guidelines

  • Use the Interspeech 2024 templates (LaTeX, etc.) for your manuscript submission; see the Interspeech 2024 Author Resources.
  • Reviewing will be single-blind; author identities will be visible to reviewers.
  • Paper length is the same as for Interspeech 2024.

Important dates

  • Submission deadline: June 24, 2024
  • Notification of acceptance: July 21, 2024
  • Workshop date: August 31, 2024

All deadlines are 11:59pm UTC-12 (“anywhere on earth”)


Speakers

  • Sakriani Sakti
  • Tara Sainath
  • Andrew Rosenberg
  • Gustav Eje Henter
  • Andros Tjandra
  • Junichi Yamagishi

Key Themes

The Uncanny Realism of Synthetic Data

Advancements in generative modeling have made it possible to create synthetic speech and text data that is nearly indistinguishable from human-produced content. This paves the way for responsible research without sacrificing model performance.

Limitless Flexibility

Synthetic data offers unparalleled flexibility. Researchers can tailor datasets with specific accents, dialects, speaking styles, and noise conditions, simulating real-world scenarios. This precision is invaluable for building more robust speech technologies.

Privacy by Design

Using synthetic data mitigates the risk of user privacy violations. It reduces the need to collect, store, and process sensitive personal speech recordings, supporting strong ethical standards and user trust.

Navigating the Regulatory Landscape

Strict data privacy regulations (GDPR, CCPA) create a complex landscape for using real human data. Synthetic data offers a compliance-friendly solution, accelerating research and development while respecting user rights.

Bridging Data Gaps for Low Resource Languages

Many languages lack sufficient speech data resources. Synthetic data can be used to create balanced datasets, driving the development of speech models that are inclusive and accessible to wider audiences.

Domain Robustness

The acquisition of a genuinely representative, diverse corpus presents challenges. In speech recognition, for instance, there may be a scarcity of speech from children or elderly speakers, from users in specific underrepresented sociolinguistic groups, or from diverse acoustic environments with noise and reverberation. Alternatively, we may lack data from a specific domain, such as medical, legal, or financial, where we expect our speech technology to excel.

Scientific Committee

  • Mike Seltzer
  • Fadi Biadsy
  • Yu Zhang
  • Soumi Maiti
  • Jee-Weon Jung
  • Zhehuai Chen
  • Alex Acero
  • Andreas Stolcke
  • Jasha Droppo
  • Marco Tagliasacchi
  • Shammur Chowdhury
  • Dan Ellis
  • Ron Weiss
  • Yaser Al-Onaizan

Organizers

  • Pedro Moreno Mengibar
    • Pedro J. Moreno is Chief Scientist at the Saudi National AI Center (NCAI) in Riyadh. He received his M.S. and Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University. During his Ph.D. studies he conducted research in noise robustness, culminating in his Vector Taylor Series approach to noise robustness, which has received more than 1,000 citations over the years. Prior to joining Google, he was a research scientist at Hewlett-Packard's Cambridge Research Labs (CRL), where he led research in noise robustness, multimedia machine learning, and audio indexing. During his tenure at CRL he led the development of SpeechBot, the first publicly available audio indexing engine. In 2004 he joined Google, where he was a founding member of the speech team. Over his career at Google he has led important research in speech recognition internationalization, contextualization, language modeling, and foundational models. Together with his team he led the expansion of Google voice search to more than 80 languages. He has also been a founding member of the contextual modeling research effort in speech recognition at Google and a key contributor to the development of speech technologies for dysarthric speech: his team launched the Relate project to help users with speech disabilities interact productively with Google speech recognition services. Pedro has authored more than 150 papers and holds more than 70 patents.
  • Bhuvana Ramabhadran
    • Bhuvana Ramabhadran (IEEE Fellow 2017, ISCA Fellow 2017) currently leads a team of researchers at Google focusing on semi-/self-supervised learning and foundational models for multilingual speech recognition. Previously, she was a Distinguished Research Staff Member and Manager at IBM Research AI, at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM's worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She served as an elected member of the IEEE SPS Speech and Language Technical Committee (SLTC) for two terms since 2010, including as its elected Vice Chair and Chair (2014–2016), and currently serves as an Advisory Member. She has served as Area Chair for ICASSP (2011–2018), on the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011–2015), on the IEEE SPS conference board (2017–2018), as Regional Director-At-Large (2018–2020), as Chair of the IEEE Flanagan Speech & Audio Award Committee, and as a Member-at-Large of the IEEE SPS Board of Governors. She is the Vice President of the International Speech Communication Association (ISCA) board and has served as an area chair for Interspeech conferences since 2012. She has organized several workshops at ICML, HLT-NAACL, and NeurIPS, and served as co-general chair of SLT 2023. She has published over 150 papers and been granted over 40 U.S. patents. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on the use of speech synthesis to improve core speech recognition performance and on self-supervised learning.
  • Shinji Watanabe
    • Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. Prior to moving to Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 300 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from IEEE ASRU in 2019. He serves as a Senior Area Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP).
  • Ahmed Ali
    • Ahmed Ali is a Principal Scientist at the National Center for AI (NCAI), SDAIA, the founder of ArabicSpeech.org, and the co-founder and Chief Scientist Advisor of KANARI AI. He has over twenty-five years of high-impact experience in research and industry. His main expertise is speech processing and natural language processing (NLP), with more than 80 peer-reviewed publications in top-tier conferences and journals, as well as patents. At the Arabic Language Technologies (ALT) group at the Qatar Computing Research Institute (QCRI), Ahmed demonstrated a strong ability to generate novel ideas and bring research to fruition through tech transfer, later illustrated in KANARI AI, a multi-million-dollar start-up. He has experience in large corporations such as IBM and Nuance, and in startups such as SpinVox and KANARI, and has advised the UN Economic and Social Commission for Western Asia (ESCWA), MBC, Aljazeera, the BBC, and Deutsche Welle (DW), among others. His impact on the speech community is reflected in numerous articles featuring his research, including in MIT Tech Review, the BBC, and Speech magazine. He was General Chair for the first IEEE speech conference in the Middle East. His teaching and mentoring are demonstrated in running the annual ArabicSpeech event, the spoken languages of the world committee, a speech hackathon, and the Johns Hopkins summer school JSALT 2022. Ahmed is known for his leadership in Arabic language technologies; he won the prestigious Qatar Foundation Best Innovation Award in 2018 and the World Summit Awards in 2024.

Sponsors