Easy-to-use combination of POS and BERT model for domain-specific and misspelled terms - Information, Langue Ecrite et Signée
Conference paper, Year: 2021

Easy-to-use combination of POS and BERT model for domain-specific and misspelled terms

Abstract

In this paper, we present BERT-POS, a simple method for encoding syntax into BERT embeddings, based on Part-Of-Speech (POS) tags and requiring neither retraining nor fine-tuning data. Although fine-tuning is the most popular way to apply BERT models to domain datasets, it remains expensive in terms of training time, computing resources, training-data selection, and retraining frequency. Our alternative works at the preprocessing level and relies on POS tagging the input sentences. It gives interesting results for word similarity on out-of-vocabulary items, both domain-specific words and misspellings. The experiments were conducted on French, but we believe the results would be similar for other languages.
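To make the preprocessing idea concrete, here is a minimal sketch of POS-augmented sentence encoding. It assumes spaCy's French model for POS tagging, CamemBERT as the French BERT, and a simple token/POS interleaving as the combination scheme; these choices, the `pos_augment` helper, and the example sentence are illustrative assumptions, not the paper's exact method.

```python
# Sketch: POS-augmented preprocessing before BERT encoding (assumed scheme).
import spacy
import torch
from transformers import AutoTokenizer, AutoModel

nlp = spacy.load("fr_core_news_sm")            # French POS tagger (assumed choice)
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base")

def pos_augment(sentence: str) -> str:
    """Rewrite a sentence so each token is followed by its POS tag (hypothetical scheme)."""
    doc = nlp(sentence)
    return " ".join(f"{tok.text} {tok.pos_}" for tok in doc)

def embed(sentence: str) -> torch.Tensor:
    """Mean-pooled BERT embedding of a (possibly POS-augmented) sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# A misspelled word ("déclanché") still carries its syntactic role via the POS tag.
raw = "Le disjoncteur différentiel a déclanché hier soir."
augmented = pos_augment(raw)
print(augmented)
print(torch.cosine_similarity(embed(raw), embed(augmented), dim=0))
```

Under this assumed scheme, out-of-vocabulary or misspelled tokens keep an explicit syntactic signal next to them, which is the kind of preprocessing-level information the paper reports as helpful for word-similarity comparisons.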
Main file
paper132.pdf (1.45 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03474696, version 1 (10-12-2021)

Identifiers

  • HAL Id: hal-03474696, version 1

Cite

Alexandra Benamar, Meryl Bothua, Cyril Grouin, Anne Vilnat. Easy-to-use combination of POS and BERT model for domain-specific and misspelled terms. NL4IA Workshop Proceedings, Nov 2021, Milan, Italy. ⟨hal-03474696⟩
298 views
1160 downloads
