Researchers in China are developing a new approach to detecting the addresses of potentially malicious websites that might compromise an individual or corporate computing environment. The approach avoids a simplistic analysis based on keywords in the address, the URL (uniform resource locator), and instead uses statistical analyses based on gradient learning and feature extraction to train a machine learning algorithm that can quickly detect malicious website addresses.
Baojiang Cui, Shanshan He, and Peilin Shi of Beijing University of Posts and Telecommunications worked with Xi Yao of QIHU 360 Software Co. Limited on the study and report the details in the International Journal of High Performance Computing and Networking.
The approach has been validated against naïve Bayes, decision tree, and support vector machine (SVM) classifiers and found to be efficient, with an accuracy rate of 98.7%. Moreover, the team reports that their system is in practical use, analyzing approximately 2 terabytes of data every day and automatically classifying URLs as benign or malicious, blocking access to the latter. Unlike other security approaches, the system does not defer to a blacklist of sites, nor does it rely on any single characteristic of the URL being tested.
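To illustrate what feature extraction from a URL can look like in general, here is a minimal Python sketch. It is not the team's implementation: the feature names, the choice of features, and the helper functions `shannon_entropy` and `extract_features` are assumptions made for illustration only. The idea it shows is the one described above: several characteristics of a URL are measured together, so that a classifier does not depend on any single one of them.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Hypothetical helper pattern: matches a host written as a dotted IPv4 address.
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy (bits per character) of a string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Turn a URL into a small dictionary of lexical features.

    The features chosen here are illustrative assumptions, not the
    feature set used in the study.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    query_pairs = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "url_length": len(url),
        "host_is_ipv4": bool(IPV4_RE.match(host)),
        "host_dot_count": host.count("."),
        "digit_count": sum(ch.isdigit() for ch in url),
        "query_param_count": len(query_pairs),
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "special_char_count": sum(url.count(ch) for ch in "<>'\"%;()"),
        "entropy": shannon_entropy(url),
    }


if __name__ == "__main__":
    for example in (
        "https://example.com/docs/index.html",
        "http://192.168.0.1/a/b?x=1&y=2",
    ):
        print(example, extract_features(example))
```

A dictionary like this could be converted to a numeric vector and passed to any of the classifiers mentioned above, such as naïve Bayes, a decision tree, or an SVM.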
The method, the team says, represents “a comprehensive approach that utilises all the features of machine learning.” They hope to improve the accuracy to close to 99.99% through better keyword analysis and the extraction of additional features. The same technique might also be used to identify other types of web attack that appear not only in URLs but also in user agent strings, cookies, and other features of internet traffic.
Cui, B., He, S., Yao, X. and Shi, P. (2018) ‘Malicious URL detection with feature extraction based on machine learning’, Int. J. High Performance Computing and Networking, Vol. 12, No. 2, pp.166–178.