Twitter has become the main micro-blogging hub around the world, where opinions are shared at an incredible rate. But how can useful information be extracted from these vast repositories of data when it is written in many different languages? That is the question addressed by research published in the International Journal of Business Intelligence and Data Mining.
Bidhan Sarkar, Manob Roy, Pijush Kanti Dutta Pramanik, and Prasenjit Choudhury of the National Institute of Technology of West Bengal, in Durgapur, India, and colleague Nilanjan Sinhababu of the Sanaka Educational Trust’s Group of Institutions, also in Durgapur, suggest that interpreting, comprehending, and analyzing this emotion-rich information can unearth many valuable insights. They add that the job is relatively straightforward if the tweets are in English, given the ubiquity of that language on the internet and the range of tools and software available for mining English text.
Recently, however, there has been an increase in the use of languages other than English, and researchers would like to be able to access and analyze what is posted on Twitter and other platforms in those languages too. The team’s solution may seem unsubtle but will probably be the most effective way forward. They have developed a system that automatically identifies and classifies tweets written in a language other than English, irrespective of the linguistic script or “alphabet” used, and converts them into English so that they can be mined with existing tools.
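To give a flavor of what the first step, identifying the script of a tweet, might involve, here is a minimal Python sketch. It is not the team's method, which is described in the paper itself. It is a simple heuristic that assumes the first word of each letter's Unicode character name (for example "DEVANAGARI" or "BENGALI") is a good enough label for its script. The function name `dominant_script` and the sample tweets are invented for illustration.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script label among the letters of text, or None.

    Heuristic only: the label is the first word of the Unicode character name,
    which is an assumption made for this sketch, not the paper's approach.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, and combining marks
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "DEVANAGARI LETTER KA" -> "DEVANAGARI"
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical sample tweets in three scripts
tweets = ["Good morning", "नमस्ते दुनिया", "শুভ সকাল"]
by_script = {tweet: dominant_script(tweet) for tweet in tweets}
```

A real tweet would usually mix scripts, since hashtags, mentions, and links are often in Latin characters, so a practical system would need to strip or down-weight those before counting. Knowing the script also does not settle the language: Devanagari, for instance, is used for several languages, which is why a separate language analysis step is still needed.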
The team calls its system Script Identification, Language Analysis, and Clustered Mining, which makes for a faux acronym of SILC, although strictly speaking it should be abbreviated as SILACM, which is more accurate albeit unpronounceable. When the framework is used with the top two languages of India other than English, it performs with greater precision than current technology.
Sarkar, B., Sinhababu, N., Roy, M., Pramanik, P.K.D. and Choudhury, P. (2020) ‘Mining multilingual and multiscript Twitter data: unleashing the language and script barrier’, Int. J. Business Intelligence and Data Mining, Vol. 16, No. 1, pp.107–127.