Accéder directement au contenu Accéder directement à la navigation
Article dans une revue

Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity

Abstract : We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets. Due to its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, M-BERT and XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step dataset creation protocol for creating consistent, Multi-Simlex-style resources for additional languages. We make these contributions -- the public release of Multi-SimLex datasets, their creation protocol, strong baseline results, and in-depth analyses which can be be helpful in guiding future developments in multilingual lexical semantics and representation learning -- available via a website which will encourage community effort in further expansion of Multi-Simlex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
Liste complète des métadonnées

https://hal.archives-ouvertes.fr/hal-02975786
Contributeur : Thierry Poibeau <>
Soumis le : mercredi 30 décembre 2020 - 13:08:04
Dernière modification le : jeudi 14 janvier 2021 - 15:14:13

Fichier

coli_a_00391.pdf
Fichiers éditeurs autorisés sur une archive ouverte

Licence


Distributed under a Creative Commons Paternité - Pas d'utilisation commerciale - Pas de modification 4.0 International License

Identifiants

  • HAL Id : hal-02975786, version 1
  • ARXIV : 2003.04866

Collections

Citation

Ivan Vulic, Simon Baker, Edoardo Maria Ponti, Ulla Petti, Ira Leviant, et al.. Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity. Computational Linguistics, Massachusetts Institute of Technology Press (MIT Press), In press. ⟨hal-02975786⟩

Partager

Métriques

Consultations de la notice

55

Téléchargements de fichiers

7