SynData4GenAI 2024

The first edition of SynData4GenAI will be co-located with Interspeech 2024 in Kos, Greece, on August 31, 2024.


Introduction

The field of speech technology has undergone a revolution in recent years, driven by the rise of foundational models underpinning automatic speech recognition (ASR) and text-to-speech synthesis (TTS). However, a key challenge persists: the reliance on large quantities of real human speech data, raising privacy concerns and regulatory hurdles.

Synthetic data offers a groundbreaking alternative, empowering researchers to develop speech models that are ethical, inclusive, and adaptable to diverse scenarios. While the use of synthetic data has been studied extensively, its role in the era of foundational models remains largely unexplored. This workshop brings together research on the themes below, specifically targeted at foundational models that have been pre-trained on every available and usable data source.


Schedule

Registration and Welcome

  • 8:00 AM - 8:30 AM: Registration and coffee in the lobby
  • 8:30 AM - 8:45 AM: Welcome and workshop overview

Keynote 1

8:45 AM - 9:30 AM

Presenter: Sakriani Sakti (Nara Institute of Science and Technology (NAIST), Japan)

Title: "Language Technology for All: Leveraging Foundational Speech Models to Empower Low-Resource Languages"

Abstract: The development of advanced spoken language technologies, such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), has enabled computers to listen and speak. While many applications and services are now available, they fully support fewer than 100 languages. Although recent research has extended coverage to as many as 1,000 languages, more than 6,000 living languages, spoken by 350 million people, remain uncovered. This gap exists because most systems are constructed using supervised machine learning, which requires large amounts of paired speech and corresponding transcriptions.

In this talk, I will introduce several successful approaches that aim to achieve "language technology for all" by leveraging foundational speech models to support linguistic diversity with less-resourced data. These approaches include self-supervised learning, visually grounded models, and the machine speech chain. I will also share insights and feedback from the indigenous community, gathered from events, workshops, and panel discussions over the years. The challenges are not only how to construct language technologies for language diversity, but also how to ensure that these technologies are truly beneficial to under-resourced language communities.

Biography: Sakriani Sakti is a full professor and head of the Human-AI Interaction Laboratory at the Nara Institute of Science and Technology (NAIST). She also holds adjunct positions at JAIST and the University of Indonesia and is a visiting research scientist at RIKEN AIP, Japan. A member of JNS, SFN, ASJ, ISCA, IEICE, and IEEE, she currently serves on the IEEE Speech and Language Technical Committee (2021-2026) and as an associate editor for IEEE/ACM TASLP, Frontiers in Language Sciences, and IEICE. Recently, she was appointed as the Oriental-COCOSDA Convener.

Previously, she served as the general chair for SLTU 2016, chaired the "Digital Revolution for Under-resourced Languages (DigRevURL)" Workshops at INTERSPEECH 2017 and 2019, and was part of the organizing committee for the Zero Resource Speech Challenge in 2019 and 2020. She played a pivotal role in establishing the ELRA-ISCA Special Interest Group on Under-resourced Languages (SIGUL), where she has been chair since 2021 and organizes the annual SIGUL Workshop. In collaboration with UNESCO and ELRA, she was the general chair of the "Language Technologies for All (LT4All)" Conference in 2019, focusing on "Enabling Linguistic Diversity and Multilingualism Worldwide," and will lead LT4All 2.0 in 2025 under the theme "Advancing Humanism through Language Technologies."

Keynote 2

9:30 AM - 10:15 AM

Presenter: Tara Sainath (Google)

Title: End-to-End Speech Recognition: The Journey from Research to Production

Abstract: End-to-end (E2E) speech recognition has become a popular research paradigm in recent years, allowing the modular components of a conventional speech recognition system (acoustic model, pronunciation model, language model) to be replaced by one neural network. In this talk, we will discuss a multi-year research journey of E2E modeling for speech recognition at Google. This journey has resulted in E2E models that can surpass the performance of conventional models across many different quality and latency metrics, as well as the productionization of E2E models for the Pixel 4, 5, and 6 phones. We will also touch upon future research efforts with E2E models, including multilingual speech recognition.

Biography: Dr. Tara Sainath holds S.B., M.Eng., and Ph.D. degrees in Electrical Engineering and Computer Science from MIT. She has many years of experience in speech recognition and deep neural networks, including 5 years at the IBM T.J. Watson Research Center and more than 10 years at Google. She is currently a Distinguished Research Scientist and the co-lead of the Gemini Audio Pillar at Google DeepMind, where she focuses on the integration of audio capabilities with large language models (LLMs).

Her technical prowess is recognized through her IEEE and ISCA Fellowships, and awards such as the 2021 IEEE SPS Industrial Innovation Award and the 2022 IEEE SPS Signal Processing Magazine Best Paper Award. She has served as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) as well as an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing. Dr. Sainath's leadership is exemplified by her roles as Program Chair for ICLR (2017, 2018) and her extensive work co-organizing influential conferences and workshops, including Interspeech (2010, 2016, 2019), ICML (2013, 2017), and NeurIPS 2020. Her primary research interests are in deep neural networks for speech and audio processing.

Keynote 3

10:15 AM - 11:00 AM

Presenter: Andrew Rosenberg (Google)

Title: From synthetic data to multimodal foundation models

Abstract: This talk will describe a thread of research that starts with the use of synthetic speech to train speech recognition models, and ends with the joint modeling of speech and text in multimodal foundation models. Along the way, I'll describe work using synthetic speech for training self-supervised pretraining models. This work serves as a transition into text-injection for speech recognition. Finally, I'll describe how this work results in a multimodal foundation model that can also perform speech synthesis (Virtuoso).

Biography: Andrew Rosenberg is currently a Senior Staff Research Scientist at Google, where he works on speech synthesis and speech recognition. He received his PhD from Columbia University. Previously, he was a Research Staff Member at IBM and a professor at CUNY Queens College and the CUNY Graduate Center. His lab was supported by NSF, DARPA, IARPA, and the Air Force Office of Scientific Research. He is an NSF CAREER award winner.

Morning Poster Session

11:00 AM - 12:30 PM

  1. Improving Text-To-Audio Models with Synthetic Captions. Zhifeng Kong (NVIDIA), Sang-gil Lee (NVIDIA), Deepanway Ghosal (SUTD), Navonil Majumder (Singapore University of Technology and Design), Ambuj Mehrish (SUTD), Rafael Valle (NVIDIA), Soujanya Poria (Singapore University of Technology and Design), Bryan Catanzaro (NVIDIA)
  2. Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition. Samuele Cornell (Carnegie Mellon University), Jordan Darefsky (University of Rochester), Zhiyao Duan (University of Rochester), Shinji Watanabe (Carnegie Mellon University)
  3. Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments. Pai Zhu (Google), Dhruuv Agarwal (Google LLC), Jacob Bartel (Google LLC), Kurt Partridge (Google), Hyun Jin Park (Google Inc.), Quan Wang (Google)
  4. Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model. Hyun Jin Park (Google Inc.), Dhruuv Agarwal (Google LLC), Neng Chen (Google Inc.), Rentao Sun (Google Inc.), Kurt Partridge (Google Inc.), Justin Chen (Google Inc.), Harry Zhang (Google Inc.), Pai Zhu (Google), Jacob Bartel (Google LLC), Kyle Kastner (Google), Yuan Wang (Google), Andrew Rosenberg (Google Inc.), Quan Wang (Google)
  5. On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition. Nick Rossenbach (RWTH Aachen University / AppTek GmbH), Sakriani Sakti (Nara Institute of Science and Technology / Japan Advanced Institute of Science and Technology), Ralf Schlüter (RWTH Aachen University)
  6. Leveraging LLM for Augmenting Textual Data in Code-Switching ASR: Arabic as an Example. Sadeen Alharbi (Saudi Data and Artificial Intelligence Authority (SDAIA)), Reem BINMUQBIL (Saudi Data and Artificial Intelligence Authority (SDAIA)), Ahmed Ali (Saudi Data and Artificial Intelligence Authority (SDAIA)), Raghad Aloraini (Saudi Data and Artificial Intelligence Authority (SDAIA)), Saiful Bari (Saudi Data and Artificial Intelligence Authority (SDAIA)), Areeb Alowisheq (Saudi Data and Artificial Intelligence Authority (SDAIA)), Yaser Alonaizan (Saudi Data and Artificial Intelligence Authority (SDAIA))
  7. SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data. Yichen Lu (Carnegie Mellon University), Jiaqi Song (Carnegie Mellon University), Xuankai Chang (Carnegie Mellon University), Hengwei Bian (Carnegie Mellon University), Soumi Maiti (CMU), Shinji Watanabe (Carnegie Mellon University)
  8. Using Voicebox-based Synthetic Speech for ASR Adaptation. Hira Dhamyal (Carnegie Mellon University), Leda Sari (Meta), Vimal Manohar (Meta Platforms Inc.), Nayan singhal (Meta), Chunyang Wu (Meta), Jay Mahadeokar (Meta AI), Matt Le (Meta), Apoorv Vyas (Meta), Bowen Shi (Meta), Wei-Ning Hsu (Meta), Suyoun Kim (Meta), Ozlem Kalinli (Meta)
  9. SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning. Chien-yu Huang (National Taiwan University), Min-Han Shih (National Taiwan University), Ke-Han Lu (National Taiwan University), Chi-Yuan Hsiao (National Taiwan University), Hung-yi Lee (National Taiwan University)
  10. On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures. Benedikt Hilmes (RWTH Aachen University), Nick Rossenbach (RWTH Aachen University / AppTek GmbH), Ralf Schlüter (RWTH Aachen University)

Lunch

12:30 PM - 1:30 PM

Afternoon Session

Keynote 4

1:30 PM - 2:00 PM

Presenter: Gustav Eje Henter (KTH Royal Institute of Technology, Sweden)

Title: Multimodal synthesis – an opportunity for synthetic data

Abstract: Just as we humans both perceive and produce information across multiple modalities, so should our generative AI models. However, the classic supervised approach to training multimodal systems requires parallel training data across all modalities simultaneously, which can be much more scarce than data from individual modalities. Foundation models and synthetic data offer a possible way to mitigate this problem. In this talk I review recent work in the multimodal synthesis of human communication – specifically speech audio and 3D motion (co-speech gestures) from text – and describe a straightforward method for creating synthetic data that improves the training of these models, as an example of possible uses of synthetic data for the benefit of multimodal GenAI.

Biography: Gustav Eje Henter is a docent and a WASP assistant professor in machine learning at the Division of Speech, Music and Hearing at KTH Royal Institute of Technology. His main research interests are deep probabilistic modeling for synthesis tasks, especially speech synthesis and 3D motion/animation generation. He has an MSc and a PhD from KTH, followed by post-docs in speech synthesis at the Centre for Speech Technology Research at the University of Edinburgh, UK, and in Prof. Junichi Yamagishi's lab at the National Institute of Informatics, Tokyo, Japan, before returning to KTH in 2018 and being promoted to faculty in 2020.

Afternoon Poster Session

2:00 PM - 3:15 PM

  1. Accent conversion using discrete units with parallel data synthesized from controllable accented TTS. Tuan-Nam Nguyen (Karlsruhe Institute of Technology), Quan Pham (Karlsruhe Institute of Technology), Alexander Waibel (Karlsruhe Institute of Technology)
  2. Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing. Hye-jin Shim (Carnegie Mellon University), Md Sahidullah (Institute for Advancing Intelligence, TCG CREST), Jee-weon Jung (Carnegie Mellon University), Shinji Watanabe (Carnegie Mellon University), Tomi Kinnunen (University of Eastern Finland)
  3. Audio Dialogues: Dialogues dataset for audio and music understanding. Arushi Goel (NVIDIA), Zhifeng Kong (NVIDIA), Rafael Valle (NVIDIA), Bryan Catanzaro (NVIDIA)
  4. Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech. Dareen Alharthi (Carnegie Mellon University), Roshan Sharma (Google), Hira Dhamyal (Carnegie Mellon University), Soumi Maiti (CMU), Bhiksha Raj (Carnegie Mellon University), Rita Singh (Carnegie Mellon University)
  5. Improving Spoken Semantic Parsing using Synthetic Data from Large Generative Models. Roshan Sharma (Google), Suyoun Kim (Meta), Trang Le (Meta), Daniel Lazar (Meta), Akshat Shrivastava (Meta), Kwanghoon An (Meta), Piyush Kansal (Meta), Leda Sari (Meta), Ozlem Kalinli (Meta), Mike Seltzer (Meta)
  6. Exploring synthetic data for cross-speaker style transfer in style representation based TTS. Lucas Ueda (UNICAMP), Leonardo Marques (CPqD), Flávio Simões (CPQD), Mário Uliani Neto (CPQD), Fernando Runstein (CPQD), Bianca Dal Bó (CPQD), Paula Costa (UNICAMP)
  7. Investigating the Use of Synthetic Speech Data for the Analysis of Spanish-Accented English Pronunciation Patterns in ASR. Margot Masson (University College Dublin), Julie Carson-Berndsen (University College Dublin)
  8. Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting. Hyun Jin Park (Google Inc.), Dhruuv Agarwal (Google LLC), Neng Chen (Google Inc.), Rentao Sun (Google Inc.), Kurt Partridge (Google Inc.), Justin Chen (Google Inc.), Harry Zhang (Google Inc.), Pai Zhu (Google), Jacob Bartel (Google LLC), Kyle Kastner (Google), Yuan Wang (Google), Andrew Rosenberg (Google LLC), Quan Wang (Google)
  9. Navigating the United States Legislative Landscape on Voice Privacy: Existing Laws, Proposed Bills, Protection for Children, and Synthetic Data for AI. Satwik Dutta (The University of Texas at Dallas), John H Hansen (Univ. of Texas at Dallas)
  10. Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms. Joseph Konan (Carnegie Mellon University), Shikhar Agnihotri (Carnegie Mellon University), Ojas Bhargave (Carnegie Mellon University), Shuo Han (Carnegie Mellon University), Bhiksha Raj (Carnegie Mellon University), Ankit Parag Shah (Carnegie Mellon University), Yunyang Zeng (Carnegie Mellon University)
  11. Naturalness and the Utility of Synthetic Speech in Model Pre-training. Diptasree Debnath (University College Dublin), ASAD ULLAH (University College Dublin), Helard Becerra (University College Dublin), Andrew Hines (University College Dublin)

Keynote 5

3:15 PM - 4:00 PM

Presenter: Andros Tjandra (Meta)

Title: Machine Speech Chain

Abstract: Although speech perception and production are closely related, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has largely advanced independently, with little mutual influence. However, human communication relies heavily on a closed-loop speech chain mechanism, where auditory feedback plays a pivotal role in human perception. This talk will explore a novel approach where we bridge this gap by developing a closed-loop machine speech chain model utilizing deep learning techniques.

Our model employs a sequence-to-sequence architecture that leverages both labeled and unlabeled data, enhancing the training process. In this dual-direction setup, the ASR component transcribes unlabeled speech and the TTS component reconstructs the original speech features from the ASR transcription; conversely, the TTS component synthesizes speech from unlabeled text and the ASR component reconstructs the original text from the TTS-generated speech.

This integration not only mimics human speech behaviors but also marks the first application of this closed-loop mechanism in deep learning models. Our experimental results have demonstrated significant performance improvements over traditional systems trained solely on labeled data.
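
As a rough illustration of the closed-loop idea described above, the following minimal PyTorch-style sketch pairs a toy ASR module with a toy TTS module and combines a supervised loss with the two unpaired reconstruction cycles. The `TinyASR`/`TinyTTS` modules, tensor shapes, and unweighted loss sum are illustrative assumptions for exposition, not the speaker's actual models.

```python
# Minimal sketch of closed-loop "machine speech chain" training (illustrative only).
import torch
import torch.nn as nn

FEAT_DIM, VOCAB = 80, 32  # toy acoustic-feature and vocabulary sizes (assumptions)

class TinyASR(nn.Module):
    """Toy stand-in for a sequence-to-sequence ASR model: speech features -> token logits."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, VOCAB)

    def forward(self, speech):            # speech: (batch, time, FEAT_DIM)
        return self.proj(speech)          # logits: (batch, time, VOCAB)

class TinyTTS(nn.Module):
    """Toy stand-in for a sequence-to-sequence TTS model: token ids -> speech features."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, FEAT_DIM)

    def forward(self, tokens):            # tokens: (batch, time)
        return self.emb(tokens)           # features: (batch, time, FEAT_DIM)

asr, tts = TinyASR(), TinyTTS()
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)
ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()

def supervised_loss(speech, tokens):
    """Paired data: ordinary supervised losses in both directions (ASR and TTS)."""
    return ce(asr(speech).flatten(0, 1), tokens.flatten()) + l1(tts(tokens), speech)

def speech_cycle_loss(speech):
    """Unpaired speech: ASR transcribes, TTS reconstructs the original features.
    argmax blocks gradients into ASR, so this cycle mainly updates TTS."""
    pseudo_tokens = asr(speech).argmax(dim=-1)
    return l1(tts(pseudo_tokens), speech)

def text_cycle_loss(tokens):
    """Unpaired text: TTS synthesizes speech, ASR recovers the original tokens."""
    synthetic_speech = tts(tokens)
    return ce(asr(synthetic_speech).flatten(0, 1), tokens.flatten())

# One toy update step over random stand-in batches.
speech = torch.randn(4, 50, FEAT_DIM)        # fake speech features
tokens = torch.randint(0, VOCAB, (4, 50))    # fake transcriptions
loss = supervised_loss(speech, tokens) + speech_cycle_loss(speech) + text_cycle_loss(tokens)
opt.zero_grad()
loss.backward()
opt.step()
```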

Biography: Andros Tjandra is a research scientist at Meta AI (FAIR) in the United States. He received his B.S. degree in Computer Science (cum laude) in 2014 and his M.S. (cum laude) in 2015 from the Faculty of Computer Science, Universitas Indonesia. He later received his PhD in 2020 from the Graduate School of Information Science, Nara Institute of Science and Technology, Japan. He received the IEEE ASRU 2017 Best Student Paper Award and the Acoustical Society of Japan (ASJ) Student Excellence Presentation Award for his research work. His research interests include speech recognition, speech synthesis, natural language processing, and machine learning.

Coffee Break

4:00 PM - 4:30 PM (Lobby)

Keynote 6

4:30 PM - 5:15 PM

Presenter: Junichi Yamagishi (National Institute of Informatics, Japan)

Title: Can we 'generate' large, privacy-aware, unbiased, and fair datasets with speech-generative models?

Abstract: The success of deep learning in speech and speaker recognition relies heavily on using large datasets. However, ethical, privacy, and legal concerns arise when using large speech datasets collected from real human speech. In particular, there are significant concerns when collecting many speakers' speech data from the web.

On the other hand, the quality of synthesized speech produced by recent generative models is very high. Can we 'generate' large, privacy-aware, unbiased, and fair datasets with speech-generative models? Such studies have started not only for speech datasets but also for facial image datasets.

In this talk, I will introduce our efforts to construct a synthetic VoxCeleb2 dataset called SynVox2 that is speaker-anonymised and privacy-aware. In addition to the procedures and methods used in the construction, the challenges and problems of using synthetic data will be discussed by showing the performance and fairness of a speaker verification system built using the SynVox2 database.

Biography: Junichi Yamagishi received a Ph.D. degree from the Tokyo Institute of Technology (Tokyo Tech), Tokyo, Japan, in 2006. From 2007 to 2013, he was a research fellow at the Centre for Speech Technology Research, University of Edinburgh, U.K. He became an associate professor with the National Institute of Informatics, Japan, in 2013, where he is currently a professor. His research interests include speech processing, machine learning, signal processing, biometrics, digital media cloning, and media forensics. He was a co-organizer of the biennial ASVspoof Challenge and the biennial Voice Conversion Challenge. He also served as a member of the IEEE Speech and Language Technical Committee from 2013 to 2019, as an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) from 2014 to 2017, as a Senior Area Editor for IEEE/ACM TASLP from 2019 to 2023, as chairperson of ISCA SynSIG from 2017 to 2021, and as a member-at-large of the IEEE Signal Processing Society Education Board from 2019 to 2023.

Panel Discussion

5:15 PM - 6:00 PM

Participants: Keynote speakers. We aim to have a lively discussion where the organizers will pose a series of questions to the speakers and the audience.

Closing Remarks

6:00 PM - 6:30 PM


Call for Papers

We invite submissions from researchers and practitioners working on any aspect of synthetic data for speech and language processing. We encourage submissions that present novel research results, insightful case studies, and thought-provoking ideas on the future of this field. Submission topics include (but are not limited to):

  • Novel techniques for generating realistic and diverse speech and text datasets.
  • Multimodal modality matching to make effective use of synthetic data
  • Methods for fine-tuning/adapting foundational models with synthetic data to improve performance, reduce bias, and reduce time to production
  • Comparative studies evaluating the effectiveness of synthetic data versus real data in training models for speech and language processing.
  • Applications of synthetic data for language resource development in low-resource settings (including medical domains).
  • Tools that promote research, as well as generation of new corpora, for rapid prototyping and application development
  • Analysis/interpretability of foundational models' learning abilities
  • Ethical, legal, and regulatory considerations for the use of synthetic data in speech technology.

Submission Guidelines

  • Use the Interspeech 2024 templates (LaTeX, etc.) for your manuscript submission; see the Interspeech 2024 Author Resources.
  • Reviewing will be single-blind; author identities will be visible to reviewers.
  • Paper length is the same as for Interspeech 2024.

Important dates

  • Submission deadline: June 24, 2024
  • Notification of acceptance: July 21, 2024
  • Workshop date: August 31, 2024

All deadlines are 11:59pm UTC-12 (“anywhere on earth”)


Speakers

  • Sakriani Sakti
  • Tara Sainath
  • Andrew Rosenberg
  • Gustav Eje Henter
  • Andros Tjandra
  • Junichi Yamagishi

Key Themes

The Uncanny Realism of Synthetic Data

Advancements in generative modeling have made it possible to create synthetic speech and text data that is nearly indistinguishable from human-produced content. This paves the way for responsible research without sacrificing model performance.

Limitless Flexibility

Synthetic data offers unparalleled flexibility. Researchers can tailor datasets with specific accents, dialects, speaking styles, and noise conditions, simulating real-world scenarios. This precision is invaluable for building more robust speech technologies.

Privacy by Design

Using synthetic data mitigates the risk of user privacy violations. It reduces the need to collect, store, and process sensitive personal speech recordings, supporting strong ethical standards and user trust.

Navigating the Regulatory Landscape

Strict data privacy regulations (GDPR, CCPA) create a complex landscape for using real human data. Synthetic data offers a compliance-friendly solution, accelerating research and development while respecting user rights.

Bridging Data Gaps for Low Resource Languages

Many languages lack sufficient speech data resources. Synthetic data can be used to create balanced datasets, driving the development of speech models that are inclusive and accessible to wider audiences.

Domain Robustness

The acquisition of a genuinely representative, diverse corpus presents challenges. In speech recognition, for instance, there may be a scarcity of speech from children or elderly speakers, from users in specific underrepresented sociolinguistic groups, or from diverse acoustic environments with noise and reverberation. Alternatively, we may lack data from a specific domain, such as medical, legal, or financial, where we expect our speech technology to excel.

Scientific Committee

  • Mike Seltzer
  • Fadi Biadsy
  • Yu Zhang
  • Soumi Maiti
  • Jee-Weon Jung
  • Zhehuai Chen
  • Alex Acero
  • Andreas Stolcke
  • Jasha Droppo
  • Marco Tagliasacchi
  • Shammur Chowdhury
  • Dan Ellis
  • Ron Weiss
  • Yaser Al-Onaizan

Organizers

  • Pedro Moreno Mengibar
    • Pedro J. Moreno is Chief Scientist at the Saudi National AI Center (NCAI) in Riyadh. He received his M.S. and Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University. During his Ph.D. studies he conducted research in noise robustness, culminating in his Vector Taylor Series approach to noise robustness, which has received more than 1,000 citations over the years. Prior to joining Google, he was a research scientist at Hewlett-Packard's Cambridge Research Labs (CRL), where he led research in noise robustness, multimedia machine learning, and audio indexing. During his tenure at CRL he led the development of SpeechBot, the first publicly available audio indexing engine. In 2004 he joined Google, where he was a founding member of the speech team. Over his career at Google he has led important research in speech recognition internationalization, contextualization, language modeling, and foundational models. Together with his team he led the expansion of Google voice search to more than 80 languages. He has also been a founding member of the contextual modeling research effort in speech recognition at Google and a key contributor to the development of speech technologies for dysarthric speech: his team launched the Relate project to help users with speech disabilities interact productively with Google speech recognition services. Pedro has authored more than 150 papers and holds more than 70 patents.
  • Bhuvana Ramabhadran
    • Bhuvana Ramabhadran (IEEE Fellow 2017, ISCA Fellow 2017) currently leads a team of researchers at Google focusing on semi-/self-supervised learning and foundational models for multilingual speech recognition. Previously, she was a Distinguished Research Staff Member and Manager at IBM Research AI, at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM's worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She served as an elected member of the IEEE SPS Speech and Language Technical Committee (SLTC) for two terms since 2010, including as its elected Vice Chair and Chair (2014–2016), and currently serves as an Advisory Member. She has served as Area Chair for ICASSP (2011–2018), on the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011–2015), on the IEEE SPS conference board (2017–2018), as Regional Director-At-Large (2018–2020), as Chair of the IEEE Flanagan Speech & Audio Award Committee, and as a Member-at-Large of the IEEE SPS Board of Governors. She is the Vice President of the International Speech Communication Association (ISCA) board and has served as an area chair for Interspeech conferences since 2012. She has organized several workshops at ICML, HLT-NAACL, and NeurIPS, and served as co-general chair of SLT 2023. She has published over 150 papers and been granted over 40 U.S. patents. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on the use of speech synthesis to improve core speech recognition performance and on self-supervised learning.
  • Shinji Watanabe
    • Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. Prior to moving to Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 300 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from IEEE ASRU in 2019. He serves as a Senior Area Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP).
  • Ahmed Ali
    • Ahmed Ali is a Principal Scientist at the National Center for AI (NCAI), SDAIA, the founder of ArabicSpeech.org, and the co-founder and Chief Scientist Advisor of KANARI AI. He has over twenty-five years of high-impact experience in research and industry. His main expertise is speech processing and natural language processing (NLP), with more than 80 peer-reviewed publications in top-tier conferences and journals, as well as patents. At the Arabic Language Technologies (ALT) group at the Qatar Computing Research Institute (QCRI), Ahmed demonstrated a strong ability to generate novel ideas and bring research to fruition through tech transfer, later illustrated in KANARI AI, a multi-million-dollar start-up. He has experience in large corporations such as IBM and Nuance, and in startups such as SpinVox and KANARI, and has advised the UN Economic and Social Commission for Western Asia (ESCWA), MBC, Aljazeera, the BBC, and Deutsche Welle (DW), among others. His impact on the speech community is reflected in numerous articles featuring his research, including in MIT Tech Review, the BBC, and Speech magazine. He was General Chair for the first IEEE speech conference in the Middle East. His teaching and mentoring are demonstrated in running the annual ArabicSpeech event, the spoken languages of the world committee, a speech hackathon, and the Johns Hopkins summer school JSALT 2022. Ahmed is known for his leadership in Arabic language technologies; he won the prestigious Qatar Foundation Best Innovation Award in 2018 and the World Summit Awards in 2024.

Sponsors