The Gutenberg Dialogue Dataset

Authors:

Richard Csaky

Gábor Recski

Type:

Proceedings contribution

Proceedings:

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Publisher:

The Association for Computational Linguistics

Pages:

138 - 159

ISBN:

978-1-954085-02-2

Year:

2021

Abstract:

Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and process dialogues from public-domain books made available by Project Gutenberg. We describe our dialogue extraction pipeline, analyze the effects of the various heuristics used, and present an error analysis of extracted dialogues. Finally, we conduct experiments showing that better response quality can be achieved in zero-shot and finetuning settings by training on our data than on the larger but much noisier Opensubtitles dataset. Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can be extended to further languages with little additional effort. Researchers can also build their versions of existing datasets by adjusting various trade-off parameters.

TU Focus:

Information and Communication Technology

Reference:

R. Csaky, G. Recski:
"The Gutenberg Dialogue Dataset";
in: "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume", edited by: Association for Computational Linguistics; The Association for Computational Linguistics, 2021, ISBN: 978-1-954085-02-2, pp. 138 - 159.

Additional Information

Last changed:

06.07.2021 10:22:12

TU Id:

296529

Accepted:

Accepted

Department Focus:

Business Informatics

Info Link:

https://publik.tuwien.ac.at/showentry.php?ID=296529&lang=1

Author List:

R. Csaky, G. Recski
