Automatic typo reduction for text miners

Documents containing typographical errors, shoddy grammar, terminology invented and used only by the author, and domain-specific jargon can be problematic for the text miner. Now, Yinghao Huang of CommVault Systems Inc in Tinton Falls, New Jersey, working with Yi Lu Murphey of the University of Michigan-Dearborn and Yao Ge of the Ford Motor Company, also in Dearborn, Michigan, USA, has come up with an answer to this timeless problem.

Writing in the International Journal of Knowledge Engineering and Data Mining, Huang and colleagues explain how to fix typos and correct grammar in unstructured text documents, so that duplicated, omitted, transposed and substituted characters, complex spelling errors, and unconventional use of acronyms can be remedied. Ironically, the team coins its own unconventional abbreviation, intelligent typo detection and correction (ITDC), for its system. Nevertheless, the researchers have successfully tested it on vehicle diagnostic documentation from the car industry. “The experiment results show that the proposed system outperforms some of the state-of-art spell checking systems,” the team reports.
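To make the single-character error types mentioned above concrete, here is a minimal Python sketch of a classic edit-based candidate generator. It is not the ITDC system described in the paper, whose internals are not given here; the function names and the tiny automotive vocabulary are hypothetical and chosen only for illustration.

```python
import string

# Hypothetical vocabulary of known-good terms (an assumption for this sketch).
VOCABULARY = {"engine", "brake", "sensor", "coolant", "replaced"}


def single_edit_variants(word):
    """Return every string exactly one character edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Delete one character: undoes a duplicated or otherwise extra character.
    deletes = {a + b[1:] for a, b in splits if b}
    # Swap two adjacent characters: undoes a transposition.
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Replace one character: undoes a substitution.
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    # Insert one character: undoes an omission.
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | swaps | replaces | inserts


def suggest(word, vocabulary=VOCABULARY):
    """Return known words that match `word` or lie one edit away from it."""
    word = word.lower()
    vocabulary = set(vocabulary)
    if word in vocabulary:
        return [word]
    return sorted(single_edit_variants(word) & vocabulary)


if __name__ == "__main__":
    for typo in ["brrake", "engne", "senosr", "coolent"]:
        print(typo, "->", suggest(typo))
```

A generator like this only proposes candidates for isolated words. It does nothing about grammar, acronyms or ambiguous terms, which is where a context-aware approach of the kind the team reports would be needed.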

The team points out that such a correction system is vital if text mining is to move forward. Automatic text retrieval, categorization and classification are often stymied by typographical and other errors, which can leave a document open to multiple interpretations even within a simple and known context, especially when the document has a non-standard, unstructured format. The team’s machine-learning approach, which groups words and generates appropriate context, allows them to pluck typos and other errors from a document and fix them inline. Their success with automotive diagnostics documents does not preclude the use of the same system in other domains, such as text messages or updates on Twitter and other social media, given that those “documents” too are unstructured and usually contain many typos, non-standard terms and abbreviations.

Huang, Y., Murphey, Y.L. and Ge, Y. (2015) ‘Intelligent typo correction for text mining through machine learning’, Int. J. Knowledge Engineering and Data Mining, Vol. 3, No. 2, pp.115–142.

Author: David Bradley

Award-winning, freelance science writer based in Cambridge, England.