Automatic bilingual phrase dictionary construction from GIZA++ output

Authors

Albina M. Khusainova, Vitaly A. Romanov, Adil M. Khan

DOI:

https://doi.org/10.17308/sait/1995-5499/2022/4/189-201

Keywords:

phrase translation, collocation translation, bilingual dictionary, phrase dictionary, machine translation, automatic dictionary construction, language resources

Abstract

Modern neural machine translation (NMT) models based on the encoder-decoder architecture are normally trained on parallel sentences, so they give the best results when translating full sentences rather than sentence fragments. As a result, NMT models do not address the task of translating commonly used phrases, which often arises for language learners. Human-built phrase dictionaries exist for high-resource language pairs, but not for less-resourced ones. In this paper, we propose an automatic approach to building such a dictionary from the output of the statistical word alignment tool GIZA++, followed by heuristic filtering. We analyze the translation quality obtained with this approach and compare it both with reference translations and with phrase translations produced by an NMT system trained on sentences. The results show that, despite the problems identified, the phrase translations are correct in most cases, and even when they do not match the reference translation, they are valid alternative translations. Another important result is that this approach works significantly better than phrase translation with the NMT system. Using the proposed approach, we obtained a Russian-English dictionary of lexical expressions that can be used both as a ready-made dictionary and as a raw resource for manual dictionary construction. The resulting Russian-English phrase dictionary has been published online as a linguistic resource.

Author Biographies

  • Albina M. Khusainova, Innopolis University

    Fourth-year postgraduate student, assistant at the Machine Learning and Knowledge Representation Laboratory, Innopolis University

  • Vitaly A. Romanov, Innopolis University

    Fourth-year postgraduate student, assistant at the Industrial Software Production Laboratory, Innopolis University

  • Adil M. Khan, Innopolis University

    Candidate of Science in Physics and Mathematics, Professor, Head of the Machine Learning and Knowledge Representation Laboratory, Innopolis University

Published

2022-12-26

Issue

No. 4 (2022)

Section

Computer Linguistics and Natural Language Processing

How to Cite

Khusainova, A. M., Romanov, V. A., & Khan, A. M. (2022). Automatic bilingual phrase dictionary construction from GIZA++ output. Proceedings of Voronezh State University. Series: Systems Analysis and Information Technologies, 4, 189-201. https://doi.org/10.17308/sait/1995-5499/2022/4/189-201
