BEC: The Impact of Foreign Languages on English-First Algorithms

Unlike other types of email threats, BEC (business email compromise) emails typically feature only a few lines of text, with no URLs, attached files, or other scannable elements. As a result, email security vendors have turned to AI algorithms to analyze textual content that could indicate a BEC attack.  

One AI field that is effective at detecting BEC is Natural Language Processing (NLP). By analyzing text, NLP algorithms can detect urgent language, as well as flag words and phrases common to BEC, such as requests for wire transfers, invoice payments, and gift cards. 
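
As a rough illustration of the idea, the sketch below flags urgency and payment cues with simple phrase matching. It is far simpler than a production NLP model, which would be trained on labeled email rather than on a fixed list. The term lists and the names find_terms and score_email are hypothetical and chosen only for this example.

```python
import re

# Hypothetical, illustrative term lists (assumption: not taken from any real product).
URGENCY_TERMS = ["urgent", "immediately", "as soon as possible", "asap", "today"]
PAYMENT_TERMS = ["wire transfer", "invoice", "gift card", "payment"]


def find_terms(text, terms):
    """Return the terms that appear in the text as whole words or phrases."""
    lowered = text.lower()
    found = []
    for term in terms:
        # Word boundaries avoid matches inside longer words; "s?" accepts simple plurals.
        if re.search(r"\b" + re.escape(term) + r"s?\b", lowered):
            found.append(term)
    return found


def score_email(text):
    """Collect urgency and payment cues; mark the email suspicious when both kinds appear."""
    urgency = find_terms(text, URGENCY_TERMS)
    payment = find_terms(text, PAYMENT_TERMS)
    return {
        "urgency": urgency,
        "payment": payment,
        "suspicious": bool(urgency) and bool(payment),
    }


if __name__ == "__main__":
    print(score_email("I need you to buy gift cards immediately."))
```

Because the term lists here are English only, the same request written in another language would produce no matches at all.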

Historically, most BEC emails have been written in English. Recently, however, we’ve noted an increase in BEC in other languages, including Italian, Spanish, German, and Slovenian. This change is a strategic response to the growth of AI in email security, and it presents a major challenge for AI algorithms that are English-first. 

The growth of BEC in additional languages 

The pivot to BEC in additional languages is in line with the growing sophistication of attacks overall. While many malicious emails still feature a comically loose grasp of the English language, sophisticated attacks are free of grammatical errors and other telltale signs of BEC.

Additionally, hackers are easing into conversations with victims, rather than getting right to the point upon first contact. This is for two reasons: First, engaging in pretexting brings down the victim’s guard. Second, by exchanging emails with a hacker, a victim unknowingly teaches some algorithms that the sender is legitimate. This can result in whitelisting of the hacker’s email address.

AI algorithms, including NLP, are getting better at recognizing the above tactics. English-first algorithms, however, are naturally most effective in English, which makes them less effective at recognizing BEC in other languages.

A recent article in Time highlighted these challenges with an exploration of Facebook’s hate-speech algorithms. Although Facebook can reportedly analyze content in 40 languages, its algorithms detect only 80 percent of harmful posts. Suffice it to say, an 80 percent detection rate in email security is both dismal and dangerous.

To be effective at analyzing other languages, AI algorithms need significant datasets from which to learn. The English language is the most widely spoken in the world, and email security vendors likely have an abundance of English-language data with which to train their algorithms. Datasets in other languages, particularly non-global languages, are likely much smaller, which affects the quality of what an algorithm can learn from them. The smaller the dataset, the less reliable the conclusions drawn from it.
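
One way to see why sample size matters is the margin of error on a measured rate. The sketch below uses the standard normal approximation for a proportion. It is a general statistical illustration, not a description of how any vendor evaluates its models, and the 90 percent rate and the sample sizes are hypothetical numbers picked for the example.

```python
import math


def margin_of_error(rate, n, z=1.96):
    """Approximate 95% margin of error for a rate measured on n samples.

    Uses the normal approximation to the binomial: z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(rate * (1 - rate) / n)


if __name__ == "__main__":
    # Hypothetical: the same measured detection rate at three dataset sizes.
    for n in (100, 10_000, 1_000_000):
        print(f"n={n:>9,}  margin of error = {margin_of_error(0.9, n):.4f}")
```

Under these assumptions, a rate measured on 100 samples is uncertain by several percentage points, while the same rate measured on a million samples is uncertain by well under a tenth of a point.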

To increase the language capabilities of AI algorithms, vendors need not only to expand their datasets but also to invest significant resources in updating their detection engines. This is time-consuming and expensive. Additionally, the data needs to be fresh, constantly updated with new samples of the target language. The number of mailboxes an email security vendor protects and its global footprint are ultimately the best indicators of the quality of its AI algorithms, which are trained with real email samples in many different languages.