This paper is published in Volume-5, Issue-3, 2019
Area
Big Data and Text Mining
Author
Natasha J., Vijayarani J.
Org/Univ
Anna University, CEG Campus, Chennai, Tamil Nadu, India
Keywords
Short text, Conceptualization, Probase, BabelNet, Skipgram, Word2Vec, Concept2Vec
Citations
IEEE
Natasha J., Vijayarani J. Optimized short text embedding for bilingual similarity using Probase and BabelNet, International Journal of Advance Research, Ideas and Innovations in Technology, www.IJARIIT.com.
APA
Natasha J., Vijayarani J. (2019). Optimized short text embedding for bilingual similarity using Probase and BabelNet. International Journal of Advance Research, Ideas and Innovations in Technology, 5(3) www.IJARIIT.com.
MLA
Natasha J., Vijayarani J. "Optimized short text embedding for bilingual similarity using Probase and BabelNet." International Journal of Advance Research, Ideas and Innovations in Technology 5.3 (2019). www.IJARIIT.com.
Abstract
Most existing text classification methods represent text as vectors of words, that is, as a "bag of words." This representation yields a high-dimensional feature space and often suffers from surface mismatch, where texts with the same meaning share few or no words. For short texts these problems are more severe because the texts are brief and sparse, and measuring similarity across two languages makes the task harder still. This paper proposes an approach that addresses both the sparsity and the computational complexity of bilingual short text similarity. English short text is mapped to Probase, and Hindi short text is mapped to BabelNet, a knowledge base covering words and concepts in 248 languages. A semantic network is built to capture word-to-word and concept-to-concept correlations. Unlike earlier embedding approaches, words and concepts from the English and Hindi short texts are treated separately, yielding word embeddings (Word2Vec) and concept embeddings (Concept2Vec), respectively. The similarity between bilingual short texts is then computed from the skip-gram-based word and concept embeddings. Evaluated on the Pilot and STSS-131 short text benchmark datasets, the proposed optimized bilingual short text embedding gives better similarity scores.
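The abstract does not say how the word-level and concept-level embeddings are combined into one score. As a rough illustration only, the sketch below averages the vectors of each short text, takes the cosine similarity at the word level and at the concept level, and mixes the two with a weight. The averaging step, the weight `alpha`, the function names, the toy vectors, and the assumption that English and Hindi vectors already share one space are all hypothetical and are not taken from the paper.

```python
import math


def average_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (assumed pooling)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def short_text_similarity(words_a, concepts_a, words_b, concepts_b,
                          word_vecs, concept_vecs, alpha=0.5):
    """Mix word-level and concept-level cosine similarity.

    alpha is a hypothetical mixing weight, not a parameter from the paper.
    """
    word_sim = cosine(average_vector(words_a, word_vecs),
                      average_vector(words_b, word_vecs))
    concept_sim = cosine(average_vector(concepts_a, concept_vecs),
                         average_vector(concepts_b, concept_vecs))
    return alpha * word_sim + (1.0 - alpha) * concept_sim


# Toy vectors, invented for illustration; real ones would come from trained
# skip-gram models over words and over concepts.
word_vecs = {"apple": [0.9, 0.1], "सेब": [0.8, 0.2], "phone": [0.1, 0.9]}
concept_vecs = {"fruit": [1.0, 0.0], "device": [0.0, 1.0]}

score = short_text_similarity(["apple"], ["fruit"], ["सेब"], ["fruit"],
                              word_vecs, concept_vecs)
print(round(score, 3))
```

Keeping the two similarity terms separate mirrors the abstract's point that words and concepts are embedded independently; any other pooling or weighting scheme could be substituted without changing the overall shape of the computation.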