Information Sciences

Entity mention aware document representation

Abstract

Representing variable-length texts (e.g., sentences, documents) with low-dimensional continuous vectors has been a topic of recent interest due to its successful application in various NLP tasks. During the learning process, most existing methods treat all words equally, regardless of their possibly different intrinsic nature. We believe that for some types of documents (e.g., news articles), entity mentions are more informative than ordinary words, and properly utilizing them can benefit certain tasks. In this paper, we propose a novel approach for learning low-dimensional vector representations of documents. The learned representations capture not only the words in documents, but also the entity mentions in documents and the connections between different entities. Experimental results demonstrate that our approach significantly improves text clustering and text classification performance and outperforms previous studies on the TAC-KBP entity linking benchmark.

Introduction

Learning low-dimensional vector representations for documents can be an effective approach in many NLP tasks and has the potential to outperform traditional representation methods like bag-of-words (BoW) [9], [15], [19]. However, such methods tend to treat all words equally, despite the fact that different words have different properties and are of different importance to the document as a whole.

In particular, existing document representation learning methods do not distinguish named entity mentions (e.g., the mention of "Hillary Clinton" in a document) from ordinary words. Entity mentions occur frequently in various types of documents and are usually more informative than most ordinary words. In news articles, mentioned entities such as persons, organizations and locations provide essential information about the who and where of the reported events. Two documents are more likely to have similar content if some of the entities they mention are the same. Sometimes, knowing which entities are mentioned, we can even infer the possible topics of a document. For example, if a news article mentions both "Hillary Clinton" and "Donald Trump", there is a high probability that the article is about the 2016 US presidential election. Thus, entity mention information can help to better capture the semantic similarities between documents while learning document representations. Moreover, different entities may be related to each other. For example, a person has connections with many other persons and is related to many locations and organizations. Documents that mention different but related entities are also more likely to have similar content. To leverage this property, the relatedness between different entities should also be considered while learning representations for documents.

Therefore, we believe it is possible to improve the quality of document representations by capturing the entity mention information of documents and the relatedness between different entities. Document representations learned with such information may achieve better performance when applied to tasks such as text clustering and text classification. They are also well suited to the task of entity linking, which aims to map the mentions in a document to the entities they refer to in a reference knowledge base, since existing research [6], [10], [21], [30] has shown that when linking a mention, the other mentions in the same context can be particularly helpful for the inference.

In this paper, we propose a novel approach to learn distributed representations of documents that are aware of the entity mentions in those documents. We name our approach EMADR (Entity Mention Aware Document Representations). EMADR generalizes the PV-DBOW model proposed by Le and Mikolov [19] to make it possible to incorporate multiple types of related information into document representations. The learned document representations capture three types of information: which words are used in each document, which entities are mentioned in each document, and the relatedness between different entities.
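For intuition, PV-DBOW trains each document vector to predict the words the document contains; EMADR enlarges the set of prediction targets. A sketch of such a generalized objective, in our own notation (the paper's exact formulation may differ), is

\mathcal{L} = \sum_{d} \left[ \sum_{w \in W_d} \log p(w \mid \mathbf{v}_d) + \sum_{e \in E_d} \log p(e \mid \mathbf{v}_d) + \sum_{e' \in R_d} \log p(e' \mid \mathbf{v}_d) \right]

where \mathbf{v}_d is the vector of document d, W_d its words, E_d its entity mentions, and R_d the entities related to those mentions.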

The main contributions of this paper are:

We propose EMADR, which, to the best of our knowledge, is the first document representation learning method that leverages entity mention information.

We apply EMADR to entity linking with a neural network model. Compared with existing neural network methods [12], [33] for this task, it has the advantage of requiring only a small amount of training data.

We study the performance of EMADR by conducting experiments on text clustering, text classification and entity linking. We find that EMADR significantly improves text clustering and classification performance. Its application to entity linking also outperforms previous studies on the TAC-KBP entity linking benchmark.

The rest of this paper is structured as follows: In Section 2, we discuss the technical details of learning document representations with our approach. Section 3 shows how we apply the learned representations to the task of entity linking. In Section 4, we conduct a series of experiments on text clustering, text classification and entity linking. Finally, we review related work in Section 5 and conclude in Section 6.

Section snippets

Entity mention aware document representations

As previously mentioned in the introduction, for each document, we want its representation to capture both which words are used and which entities are mentioned. We also aim to capture the relatedness between different entities so that the representations of documents with different but related entities may also be similar. In order to do this, we generalize the well-known document representation learning model PV-DBOW [19] by introducing the concept of prediction lists. Then we employ this idea
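To make the prediction-list idea concrete, here is a minimal training sketch (not the authors' released code): one shared document vector is updated to score items from each of its prediction lists (words, entity mentions, related entities) above randomly drawn negatives. The dimensionality, learning rate, and vocabulary sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
DIM = 100   # embedding dimensionality (assumed)
NEG = 5     # negative samples per positive target (assumed)
LR = 0.025  # learning rate (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(doc_vec, target_vecs, target_id, vocab_size):
    # One negative-sampling update: the document vector should score the
    # true target higher than NEG randomly drawn negatives.
    ids = [target_id] + list(rng.integers(0, vocab_size, NEG))
    labels = np.array([1.0] + [0.0] * NEG)
    vecs = target_vecs[ids]                  # (NEG+1, DIM), a copy
    grad = labels - sigmoid(vecs @ doc_vec)  # gradient of the log-likelihood
    target_vecs[ids] += LR * np.outer(grad, doc_vec)
    doc_vec += LR * (grad @ vecs)            # in-place update of the document row

word_vecs = rng.normal(scale=0.1, size=(50000, DIM))  # word vocabulary
ent_vecs = rng.normal(scale=0.1, size=(10000, DIM))   # entity vocabulary
doc_vecs = rng.normal(scale=0.1, size=(1000, DIM))    # one row per document

def train_document(d, word_ids, mention_ids, related_ids):
    for w in word_ids:       # prediction list 1: words in the document
        train_step(doc_vecs[d], word_vecs, w, len(word_vecs))
    for e in mention_ids:    # prediction list 2: entities mentioned
        train_step(doc_vecs[d], ent_vecs, e, len(ent_vecs))
    for e in related_ids:    # prediction list 3: entities related to mentions
        train_step(doc_vecs[d], ent_vecs, e, len(ent_vecs))

In this sketch the entity vectors are shared between the mention list and the related-entity list, which is one simple way to let relatedness between entities flow into the document vectors.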

Application to entity linking

Entity linking is the task of mapping mentions in documents to the entities they refer to in a knowledge base. It is required by many applications, such as relation extraction [36] and entity relationship explanation [35]. In this section, we describe how we apply our approach to this task.

Given a mention within a document, a typical approach for entity linking first finds the candidate entities that the mention may refer to, then ranks them and takes the one with the highest score as the
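As a hedged illustration of this generic candidate-and-rank pipeline (the paper ranks with a neural network over EMADR representations; the alias table and cosine scorer below are simplifying assumptions of ours):

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def link_mention(mention, doc_vec, candidate_dict, ent_vecs):
    # Return the highest-scoring candidate entity, or None (NIL) if the
    # mention has no candidates in the alias table.
    candidates = candidate_dict.get(mention, [])
    if not candidates:
        return None
    return max(candidates, key=lambda e: cosine(doc_vec, ent_vecs[e]))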

Experiments

We conduct experiments on text clustering, text classification and entity linking to validate the effectiveness of EMADR and EMADR-EL.

The Stanford NER tool is used to find entity mentions in documents throughout the experiments; it classifies mentions into four classes: Location, Person, Organization and Misc.
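For concreteness, mention detection of this kind can be run through, e.g., NLTK's wrapper around the Stanford NER tool; the model and jar paths below are placeholders for a local Stanford NER download, and this setup is ours for illustration rather than the paper's exact configuration.

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    "english.conll.4class.distsim.crf.ser.gz",  # model path (placeholder)
    "stanford-ner.jar",                         # jar path (placeholder)
)

tokens = word_tokenize("Hillary Clinton spoke in New York.")
for token, label in tagger.tag(tokens):  # e.g. PERSON, LOCATION, ORGANIZATION, MISC
    if label != "O":
        print(token, label)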

Related work

Our work is mainly related to variable-length text representation learning and entity linking. We review the related work on both topics.

Conclusion and future work

In this paper, we demonstrate that better document representations can be learned by exploiting entity mention information. We propose EMADR, a document representation approach that generalizes the PV-DBOW model. It differs from existing approaches in that it encodes information about entity mentions and the relatedness of different entities into the learned document representations. Empirical results show that EMADR achieves significant performance gains in text clustering and text

Acknowledgments

This work was supported in part by the NSFC (No. U1611461, No. 61402401), the China Knowledge Centre for Engineering Sciences and Technology (CKCEST), and the Qianjiang Talents Program of Zhejiang Province (2015).

References (40)

  • D.M. Blei et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)

  • P. Blunsom et al., A convolutional neural network for modelling sentences, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (2014)

  • A. Bordes et al., Translating embeddings for modeling multi-relational data, Advances in Neural Information Processing Systems (2013)

  • D. Ceccarelli et al., Learning relatedness measures for entity linking, Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (2013)

  • W.S. Cleveland, Robust locally weighted regression and smoothing scatterplots, J. Am. Stat. Assoc. (1979)

  • S. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, EMNLP-CoNLL (2007)

  • J.R. Finkel et al., Incorporating non-local information into information extraction systems by Gibbs sampling, Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (2005)

  • A. Globerson et al., Collective entity resolution with multi-focal attention, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL) (2016)

  • E. Grefenstette et al., Multi-step regression learning for compositional distributional semantics, Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) (2013)

  • Z. Guo et al., Robust entity linking via random walks, Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (2014)

  • X. Han et al., Collective entity linking in web text: a graph-based method, Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (2011)

  • Z. He et al., Learning entity representation for entity disambiguation, ACL (2) (2013)

  • G.E. Hinton et al., Replicated softmax: an undirected topic model, Advances in Neural Information Processing Systems (2009)

  • Z. Hu et al., Entity hierarchy embedding, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP) (2015)

  • H.K. Kim et al., Bag-of-concepts: comprehending document representation through clustering words in distributed representation, Neurocomputing (2017)

  • Y. Kim, Convolutional neural networks for sentence classification, EMNLP (2014)

  • G. Lample et al., Neural architectures for named entity recognition, Proceedings of NAACL-HLT (2016)

  • H. Larochelle et al., A neural autoregressive topic model, Advances in Neural Information Processing Systems (2012)

  • Q. Le et al., Distributed representations of sentences and documents, Proceedings of the 31st International Conference on Machine Learning (ICML-14) (2014)

  • S. Li et al., Generative topic embedding: a continuous representation of documents, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL) (2016)
