Conference paper, Year: 2022

Bazinga! A Dataset for Multi-Party Dialogues Structuring

Claude Barras

Abstract

We introduce a dataset built around a large collection of TV (and movie) series. Those are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base that allows the collection of metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with the speaker, addressee, and entity linking information. Along with the dataset, we also provide a baseline for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of the tasks and of transfer learning from models trained on mono-speaker audio or written text, which is more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because (a large) part of Bazinga! is only partially annotated, we also expect this dataset to foster research towards self- or weakly-supervised learning methods.
Main file
2022.lrec-1.367.pdf (1.58 MB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-03737453, version 1 (27-07-2022)

License

Attribution

Identifiers

  • HAL Id: hal-03737453, version 1

Cite

Paul Lerner, Juliette Bergoënd, Camille Guinaudeau, Hervé Bredin, Benjamin Maurice, et al. Bazinga! A Dataset for Multi-Party Dialogues Structuring. 13th Conference on Language Resources and Evaluation (LREC 2022), European Language Resources Association (ELRA), Jun 2022, Marseille, France. pp. 3434-3441. ⟨hal-03737453⟩
