Phishing scam detection

Australians have lost almost half a million dollars this year to phishing scams, according to Scamwatch, with over 13,000 reports of attacks affecting individuals aged 18 to over 65. A form of online malicious activity, phishing has increased significantly since 2010, with the outbreak of COVID-19 and a shift to working from home exacerbating the practice.

Phishing is the fraudulent attempt to obtain sensitive information such as usernames, passwords, and credit card details by disguising oneself as a trustworthy entity via electronic communication.  

Methods such as blacklists, content analysis platforms and web-based filters are currently used to prevent attacks. However, scammers continue to develop and spread new and more elaborate attacks faster than effective solutions can be designed to counteract them.

This makes it challenging to create a system robust enough to reliably detect an attack, but a novel method recently developed by data science specialists at CSIRO’s Data61 is changing that.

By combining different algorithmic techniques, researchers at Data61, in collaboration with the University of New South Wales (UNSW) and the Cyber Security Cooperative Research Centre (CSCRC), are using file compression to efficiently and successfully identify phishing attempts.

PhishZip can correctly identify phishing websites with more than 83% accuracy, a marked improvement on current methods. “The technology could ultimately prevent significant financial losses for individuals and organisations,” explains Data61 Research Scientist Dr Arindam Pal, who is working on the project with professors Sanjay Jha and Alan Blair, and their PhD student Rizka Purwanto, all of whom are from the University of New South Wales, Sydney.

“Our goal is to detect and prevent phishing websites before they can do any harm to the users.”

Previous phishing detection methods employed machine learning algorithms that used traditional classification techniques like logistic regression, support vector machines, decision trees and artificial neural networks. These algorithms can’t cope with the dynamic nature of phishing, which often sees fraudsters constantly change the design and hyperlink of an illicit site every few hours.

 

PhishZip 

PhishZip applies file compression, a technique that encodes information using fewer bits than the original format to reduce file size, to distinguish phishing websites from their legitimate versions.

“We use the DEFLATE file compression algorithm to compress both legitimate and phishing websites and separate them by examining how much they get compressed. Legitimate and phishing websites have different compression ratios,” says Dr Pal.
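To make the idea of a compression ratio concrete, here is a minimal sketch using Python's standard `zlib` module, which implements DEFLATE (wrapped in a small zlib header and checksum). It is an illustration only, not PhishZip's implementation: the function name `compression_ratio` and the two toy pages are assumptions made for this example.

```python
import zlib


def compression_ratio(html: str, level: int = 9) -> float:
    """Return compressed size divided by original size for a page's HTML.

    Smaller values mean the page compressed more. Uses zlib, i.e. DEFLATE
    plus a few bytes of zlib framing.
    """
    raw = html.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress; treat as "no reduction"
    return len(zlib.compress(raw, level)) / len(raw)


# Hypothetical toy pages, invented purely for illustration.
repetitive_page = "<p>verify your account password</p>" * 50
varied_page = "".join(f"<p>item {i}: {i * i}</p>" for i in range(50))

print(f"repetitive page ratio: {compression_ratio(repetitive_page):.3f}")
print(f"varied page ratio:     {compression_ratio(varied_page):.3f}")
```

Pages with different content compress to different degrees, and it is this kind of difference that a compression-based detector can compare against a threshold.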


An example of a phishing scam that recreates a legitimate website login page.

“We then introduce a systematic process of selecting meaningful words which are associated with phishing and non-phishing websites and analyse the likelihood of those word occurrences, thereby calculating the optimal likelihood threshold.”

These words are then used as the pre-defined dictionary for our compression models and used to train the algorithm to identify instances where a proliferation of these key words indicates a malicious website.
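One way to picture a pre-defined dictionary in a DEFLATE compressor is zlib's preset dictionary (`zdict`): text that already appears in the dictionary can be encoded as short back-references, so pages rich in dictionary words shrink more. The sketch below measures that effect. It is a hedged illustration rather than PhishZip's actual procedure; the word list, the sample pages and the helper names `compressed_size` and `dictionary_gain` are all assumptions made for this example.

```python
import zlib
from typing import Iterable


def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    """Size in bytes of `data` after DEFLATE, optionally primed with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def dictionary_gain(html: str, words: Iterable[str]) -> float:
    """Fraction of compressed bytes saved by priming the compressor with `words`.

    A larger gain means more of the page's text was already in the dictionary.
    The value can be slightly negative: the dictionary ID adds a few header bytes.
    """
    raw = html.encode("utf-8")
    zdict = " ".join(words).encode("utf-8")
    plain = compressed_size(raw)
    primed = compressed_size(raw, zdict)
    return (plain - primed) / plain


# Hypothetical word list and pages, invented purely for illustration.
phishing_words = ["verify", "account", "password", "suspended", "login", "confirm"]
page_a = ("<h1>Please verify your account</h1>"
          "<p>Confirm your password to avoid your login being suspended.</p>")
page_b = ("<h1>Weekly gardening notes</h1>"
          "<p>Tomatoes prefer full sun and regular watering in summer.</p>")

print(f"page_a gain: {dictionary_gain(page_a, phishing_words):.3f}")
print(f"page_b gain: {dictionary_gain(page_b, phishing_words):.3f}")
```

In a sketch like this, a page would be flagged when its gain exceeds some cut-off; how that cut-off is chosen is exactly the kind of threshold calculation described above, and no particular value is assumed here.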

Unlike machine learning-based models, PhishZip’s approach does not require model training or HTML parsing, the process of extracting relevant information, such as the title of a page and its headings, from HTML code.

PhishZip has also allowed the team to contribute comprehensive phishing datasets to PhishTank, a free community site where anyone can submit, verify, track and share phishing data. This enables researchers and engineers around the world to leverage the techniques to improve the security of systems.

PhishZip is an evolving research project. If you’re interested in early access, contact us here.

An example of a phishing scam circulating throughout Australia.

Other methods to prevent phishing

As PhishZip isn’t yet publicly available, ensure your online safety and privacy with these expert tips: