Fighting spam on decentralized & federated messaging networks (ie. XMPP) is "essentially war with a multi-headed hydra, when 3 new heads are instantly grown up when you cut off just one." — A. on the XSF mailing lists.
After going through an extensive search on the available ways to fight XMPP spam and reading a lot of emails on the XMPP Standards Foundation mailing lists, I came to a sad conclusion: for now, neither users nor server operators have satisfying ways to alleviate the problem of spam on XMPP. This is huge, considering some users receive tens of spammy messages a day.
As I was once the operator of Jappix.com (a public XMPP server that grew very popular), I became aware of the spam problem very early. It first came from automated @jappix.com accounts, created through our once-open In-Band Registration (IBR) and used to attack our legitimate users, and then from external accounts across the federation (on both legitimate and illegitimate XMPP servers). Spam attacks add a tremendous load to your infrastructure by overloading the offline message database, and bother your users to the point that they want to stop using your service and move to another server, or even consider giving up on XMPP altogether. Thus, I decided to take action and contribute further to the XMPP community, following my work on Jappix and Giggle.
This post announces the inception of an open-source project that aims to solve the problem of spam on the XMPP network. It explains what is considered spam in the XMPP world, what is at stake if we do nothing to fix it, and how other protocols (successfully) cope with spam. I will then suggest a set of solutions that may help prevent spam at the server level and that can be plugged into existing XMPP servers on the market. Then, I will set a timeline for the project. This post is subject to change as I collect more data and ideas from the community.
If you want to stay in touch with the Providence community and discuss with other Providence users, join the Providence groupchat on XMPP.
A Brief Description Of XMPP SPIM
In this post, Spam over XMPP will be referred to as SPIM (Spam + IM).
XMPP, as a federated, Extensible Messaging and Presence Protocol, may allow the following kinds of SPIM (vectors of attack):
- Messaging SPIM: SPIM received via XMPP messages (of several message types, including groupchat)
- Presence SPIM: SPIM received via roster + presence subscription requests from spammy JIDs (mostly with type subscribe)
- XMPP Protocol Extension SPIM: SPIM that exploits weaknesses / features of widely implemented XEPs, such as VJUD (vCard directory).
This post and subsequent project will only cover Messaging SPIM and Presence SPIM. Considering that no SPIM on any XMPP Protocol Extension has been reported so far on the XSF mailing lists, and being aware of the complexity of hardening the wide range of available XMPP Protocol Extensions, this will be deferred to later (if it becomes needed in the future).
All XMPP servers participating in the federated network are likely to be subject to SPIM. Outgoing SPIM (sent from accounts hosted on your server to local or remote users) can be easily prevented by restricting and controlling registrations. Incoming SPIM is much harder to control and alleviate, as it may come from anywhere (legitimate and illegitimate servers, the latter being set up for the sole purpose of sending SPIM), in any form (Message stanzas, Presence stanzas, IQ stanzas), and varies among XMPP users (the language used, the content it promotes). The problem is very similar to email SPAM.
Example Of SPIMs
To illustrate this post, I collected some SPIM messages and anonymized them. I maintain a non-exhaustive corpus of them in the Providence project repository, for test and training purposes.
Here is one of them (Russian SPIM):
Приглашаем Вас на RAMP (marketplace), это Огромный русский (Silk Road), посвященный продаже ПАВ c 2012 года. Вас ждет самый огромный выбор исключительно НАТУРАЛЬНЫХ веществ от проверенных продавцов со всей России. Полный каталог магазинов (работает только через TOR браузер): [LINK OBFUSCATED] А здесь вы можете почитать отзывы и пообщаться с людьми, найти себе работу, открыть свой магазин и узнать массу полезной информации по веществам. (работает только через TOR браузер): [LINK OBFUSCATED]
The SPIM above basically tries to promote a Tor-based marketplace, as an alternative to the long-defunct Silk Road. I obfuscated the Onion links that followed the message to avoid promoting it here.
It has been reported on the XSF mailing lists that some (if not all) SPIM originates from the Matanga botnet. See discussion thread here.
Some well-known XMPP servers are known to host spammy JIDs, among which:
tigase.im (I got an email from the Tigase founder: they are aware of this and are working on mitigating outgoing spam)
To know which JIDs to target, I suspect that spammers built their lists by cross-referencing lists of known email addresses. Indeed, I received spam the day after setting up a new XMPP server with a previously unused XMPP address, which had however been my main email address for a long time. Another technique they may use: listing all users registered on an XMPP server through the VJUD (vCard-based Jabber User Directory) service that some XMPP servers expose (mostly ejabberd).
State Of The Market (2016)
As of 2016, some measures have been taken to give operators the ability to filter spam:
Server modules have been / are being developed:
- Prosody modules: mod_spam_reporting
- ejabberd modules: (upcoming Advanced Blocking module)
Some interesting talks have been given:
Ideas are being discussed:
Things have been improved in the defaults of XMPP servers:
Good practices have been implemented in the way XMPP servers are shipped. Default configurations come with registration and public user search disabled (ejabberd & Prosody).
But so far, operators have no "no-brainer" way to start blocking SPIM efficiently. Email server operators, by contrast, have a decent number of open-source filters to help them block spam, and these work reasonably well.
The Future If We Do Nothing
If no elegant solution is proposed / implemented to help block / help reduce the amount of SPIM XMPP users receive, we may soon see the following happening:
- Users no longer launching their XMPP client: to avoid the burden of filtering spam.
- Users giving up on XMPP altogether: there's too much spam out there.
- Operators applying improper emergency solutions, among which: blocking IPs (or even whole IP blocks), blocking domains, blocking alphabets (I am starting to see operators do this)
- Server offline message databases collapsing: under the amount of SPIM they retain (which has to be delivered to users once they come back online)
When I say that an elegant solution needs to be designed, I mean one that stops operators from implementing SPIM blocking practices that break the fundamental principles of a truly open XMPP network. Solving a problem via dirty configuration tricks is not sustainable in the long run.
SPIM/SPAM Prevention Techniques
Users on the XSF mailing lists have proposed the following solutions. Here are my comments on those ideas:
- Use client-based anti-spam: not portable, and a waste of time for client developers as everyone will implement their own code. Some popular clients will remain unprotected and may not want to implement such a system.
- Automated CAPTCHA responses to unknown message senders: would generate a lot of outgoing traffic to perform checks under large SPIM attacks, and is definitely not user-friendly.
- Block IPs: block remote IPs, as a single IP can host multiple XMPP domains (more efficient than blocking domains). However, it is very easy to move a server to a new IP today using popular cloud hosting providers (eg: DigitalOcean allows this very easily).
- Refuse servers with open In-Band Registration (IBR): would break the federation in its current state, and would not be very efficient, as servers would need to run checks against each other by trying to connect to each other's IBR services. Plus, a closed IBR doesn't guarantee that the server hosts no spammy accounts. Take, for instance, a server that was set up for the sole purpose of sending SPIM: it can register accounts using internal server commands without exposing an open IBR. It would be detected as non-spammy because its IBR appears closed, although it hosts spammy JIDs.
- Force servers to implement an anti-spam system: how do you ensure remote servers implement an anti-spam solution? The only way I see is to perform a remote check, for which a "yes, I implement an anti-spam system" response can be spoofed by spammy servers.
- Deny all messages from JIDs not in roster: too strict, as people may want to speak without being subscribed to each other. Plus, it only prevents Message SPIM, and does nothing against Presence Subscription SPIM.
- Block alphabets: as most SPIM messages are written in Russian (Cyrillic), some have suggested blocking any message containing characters from the Cyrillic alphabet. Note that this is a horrifying solution, as we should not discriminate against users by their language. Even if the current SPIM is mostly Russian in Cyrillic, I would definitely not like to see federated servers start blocking Cyrillic, as they would discriminate against what some of their users may say. As some said on the XSF ML, this is very Nazi, and I couldn't agree more with that.
Given my positive experience of using SPAM filters for other protocols (email), I would suggest:
- Host the filter on the server, implement SPIM reporting XEPs on the clients: keep the complexity on the server, show simple SPIM reporting buttons to the user. Possibly, give the user the ability to view messages that have been retained on the server because they were categorized as SPIM, and provide a way to report them as non-SPIM. Clients which don't implement SPIM reporting features still benefit from server-side blocking. This also reduces c2s network overhead on clients (critical for XMPP on mobile devices).
- Give a score to each incoming message from users who are not in roster: bypass the SPIM filters if the message is coming from a JID which is in the user's roster. JIDs don't need to be authenticated as spoofing is not possible on XMPP.
- Calculate message score as an aggregate of multiple filters: Bayesian Filter, Karma Filter, Authentication Filter (more on that below).
- Run a public DNSBL service, used by federated servers: will short-circuit the filter and block IPs that are being very spammy.
I came up with some more advanced ideas to alleviate SPIM:
- Proof-of-work mechanism on XMPP servers: if sending a message has a large cost (eg. computer resources) relative to receiving one, the economics no longer favor spammers, who need to send a massive amount of messages cheaply. See Hashcash as a starting point, which is used to fight email SPAM. As XMPP is extensible, we could slowly roll out such a protocol for Message and Presence stanzas, and end up enforcing it on the network (as the federation did with encryption in 2014).
This latter idea, putting economics to work against spammers, would make Providence obsolete in the long run. It should be considered a long-term solution that still needs approval from the community, as well as a standard (XEP). Then, both clients and servers will need to implement it, until we reach a point where it is widely deployed. Only then can the XSF flip the switch and enforce the proof-of-work mechanism in the protocol, finally defeating those pesky spammers.
However, there are also interesting counter-arguments against Hashcash, pointing out that it can be easily defeated by large botnets.
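To make the proof-of-work idea more concrete, here is a minimal Hashcash-style sketch. It is only an illustration under assumptions: the stamp format (stanza id, recipient and nonce joined by colons), the function names and the use of SHA-256 (the original Hashcash uses SHA-1) are all hypothetical, and no XEP defines any of this today.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the number of leading zero bits in a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # For a non-zero byte, bit_length() tells how many leading zeros it has.
        bits += 8 - byte.bit_length()
        break
    return bits


def stamp_digest(stanza_id: str, recipient: str, nonce: int) -> bytes:
    # Hypothetical stamp format: "<stanza id>:<recipient JID>:<nonce>".
    payload = f"{stanza_id}:{recipient}:{nonce}".encode()
    return hashlib.sha256(payload).digest()


def mint_stamp(stanza_id: str, recipient: str, difficulty: int) -> int:
    """Sender side: brute-force a nonce (expected work doubles per extra bit)."""
    for nonce in count():
        digest = stamp_digest(stanza_id, recipient, nonce)
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify_stamp(stanza_id: str, recipient: str, nonce: int, difficulty: int) -> bool:
    """Receiver side: a single hash is enough to check the stamp."""
    return leading_zero_bits(stamp_digest(stanza_id, recipient, nonce)) >= difficulty
```

The asymmetry is the point: with a difficulty of 20 bits the sender needs about a million hashes on average per recipient, while the receiving server still verifies with one. As noted above, a large botnet can absorb that cost, so this would only shift the economics, not settle them.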
More on what I propose is described below.
A (Short) Study On How Email Fights SPAM
There are always benefits to reviewing how others do similar things. As XMPP doesn't have any history of fighting SPIM, and given the similarities between how the XMPP and email/SMTP networks are operated, it's worth looking at how SPAM is being fought with email:
1) As email addresses can be spoofed and email content can be altered, senders address/message authenticity needs to be verified using SPF and/or DKIM/DMARC.
Though XMPP doesn't have the spoofing problem, DMARC is still interesting for XMPP for its automated abuse reporting capabilities. Indeed, DMARC provides a way to define a target abuse email in DNS, used by federated servers to report incoming emails that failed DMARC validation for the sender server. Transposed to XMPP, a feature similar to DMARC could be used to automatically report messages that were marked as SPIM, in a distributed way. Using this mechanism securely implies enforcing DNSSEC. This would provide a way for server operators to be aware of outgoing SPIM activity on their server, and take action by removing spammy accounts. We can already register some server administrator/abuse addresses using XEP-0157.
2) Email Bayesian filters are very efficient at telling HAM from SPAM on their own, independently of any sender reputation. I receive a lot of email SPAM and I use a Bayesian filter on my laptop (SpamSieve), which is able to filter SPAM with an accuracy of 98%, and generates very few false positives; fewer than 2-3 legitimate emails go to the Junk folder per month. SpamAssassin implements a Bayesian filter on the server, and is very popular on email servers. However, some Bayesian filters are not as efficient; their accuracy relies on fine-tuning of the Bayesian hyperparameters.
The SPAM emails I receive look, in some ways, very similar to the XMPP SPIM I collected in the Providence test corpus. Thus, I predict that using a Bayesian filter on XMPP messages from unknown senders (ie. not in roster) would be as efficient. The Bayesian database has to be user-specific, as it stores information on the words the user usually sends/receives in legitimate messages, and the words that have been reported as spam. It uses a probabilistic model that is very simple and very efficient. As not everyone uses the same words or speaks about the same topics, a Bayesian filter is very efficient at detecting anomalies in messages that drift from the vocabulary the user usually employs. If the Bayesian filter fails at predicting the spamicity of a message, we can still rely on the adjustment filters described right below.
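The per-user Bayesian database described above can be sketched as follows. This is a plain multinomial naive Bayes with add-one smoothing, one common variant among several; the class name, the tokenizer and the smoothing choices are assumptions made for illustration, not what Providence will necessarily ship.

```python
import math
import re
from collections import Counter


class UserBayesFilter:
    """Per-user word statistics, trained on messages labelled ham or spam."""

    def __init__(self) -> None:
        self.words = {"ham": Counter(), "spam": Counter()}
        self.messages = {"ham": 0, "spam": 0}

    @staticmethod
    def tokenize(text: str) -> list[str]:
        # \w is Unicode-aware in Python 3, so Cyrillic words stay whole tokens.
        return re.findall(r"\w+", text.lower())

    def train(self, text: str, label: str) -> None:
        self.words[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text: str) -> float:
        tokens = self.tokenize(text)
        vocabulary = len(set(self.words["ham"]) | set(self.words["spam"]))
        total_messages = sum(self.messages.values())
        log_odds = 0.0
        for label, sign in (("spam", 1.0), ("ham", -1.0)):
            # Class prior, smoothed so an untrained filter stays neutral.
            log_likelihood = math.log((self.messages[label] + 1) / (total_messages + 2))
            total_words = sum(self.words[label].values())
            for token in tokens:
                # Add-one smoothing: unseen words must not zero the product.
                seen = self.words[label][token] + 1
                log_likelihood += math.log(seen / (total_words + vocabulary + 1))
            log_odds += sign * log_likelihood
        # Clamp before the logistic function to avoid float overflow.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

One instance would be kept per local account, trained with "ham" on messages exchanged with roster contacts and with "spam" on messages the user reports. An untrained filter returns 0.5, which is why the other filters remain useful for new accounts.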
3) Email blocking lists are terribly efficient at preventing spammy SMTP servers from delivering emails (ie. servers that only/mostly send SPAM messages). Public DNSBL are maintained by centralized authorities and used by the federated network of SMTP servers.
DNS-based blocking lists can also work very well for XMPP. The technique relies on the DNS infrastructure and is very efficient, but requires central authorities to exist, and to be trusted by federated servers as a reliable source of trust. DNSBL should be used as a last-resort, circuit-breaker solution to prevent very spammy hosts involved in large SPIM attacks from triggering more elaborate filters, such as a Bayesian filter. Indeed, DNSBL requires minimal resources, while other filters may use more (if not a lot of) CPU / RAM / disk resources to return a score.
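For readers unfamiliar with how a DNSBL is queried, here is a short sketch of the usual convention: the IP is reversed and prepended to the list's zone, and any answer means "listed". The zone name used in the usage note is a placeholder, since no public XMPP DNSBL is named in this post.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IP in a DNSBL zone."""
    address = ipaddress.ip_address(ip)
    # reverse_pointer gives e.g. "4.3.2.1.in-addr.arpa" for 1.2.3.4 (nibbles
    # for IPv6); keep the reversed labels and drop the two ".arpa" labels.
    reversed_labels = address.reverse_pointer.rsplit(".", 2)[0]
    return f"{reversed_labels}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True when the DNSBL zone has a record for this IP."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except OSError:
        # No record found. Note that a resolver failure also lands here,
        # so this sketch fails open (the sender is treated as not listed).
        return False
    return True
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` yields the name `99.2.0.192.dnsbl.example.org`. Because this is one cached DNS lookup, it can run before any of the heavier filters.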
4) Email greylisting is good at minimizing the impact of a sudden SPAM attack on an email server, but it may delay the acceptance of some legitimate emails by the SMTP server. Still, it suggests the idea of a reputation system with temporary throttling of unknown sending servers.
A similar, fine-grained reputation system can be used for XMPP. A reputation filter will help throttle a sender at a level depending on its past reputation. If the server is known to be less likely to be spammy (given an aggregate result of multiple reputation scoring methods), then it is less likely to be throttled when sending many messages. Conversely, if it is unknown to the reputation database, its allowed rate should be lower. And finally, if it is known to the reputation database as more likely to be spammy, its allowed rate should be very low (ie. it cannot send a lot of messages per unit of time).
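The three throttle levels described above map naturally onto a token bucket whose refill rate depends on reputation. The sketch below assumes a reputation value between -1.0 and 1.0 (None when the sender is unknown); the concrete rates are made-up numbers chosen only to show the ordering.

```python
from typing import Optional


def allowed_rate(reputation: Optional[float]) -> float:
    """Map a reputation in [-1.0, 1.0] (None = unknown) to messages/second."""
    if reputation is None:
        return 0.5  # unknown sender: cautious default
    if reputation < 0.0:
        return 0.1  # known as likely spammy: very low rate
    return 1.0 + 9.0 * reputation  # known as hammy: up to 10 messages/second


class TokenBucket:
    """Classic token bucket; one bucket would be kept per sending server."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate  # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket built with `TokenBucket(rate=allowed_rate(-0.8), capacity=2.0)` lets a short burst of two messages through and then only one message every ten seconds. Passing the clock value explicitly keeps the logic easy to test.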
If you would like to learn more on the topic of email SPAM, I would recommend having a look at the essays by Paul Graham, notably A Plan for Spam. The latter explains the fundamentals of detecting SPAM with the Bayesian approach.
A Proper Solution: Providence
To alleviate the problem of SPIM on the federated XMPP network - in a proper way - I propose Providence. Providence is a SPIM filter, dedicated to XMPP, that runs on the server. It is a complex aggregate of simple filters that are known to generalize well.
What Providence stands for: Providence aims at being a lightweight spam/ham classifier for XMPP servers, and will be built in Rust - a robust systems programming language backed by Mozilla. Providence would run as a daemon on the XMPP server and would be reachable either via a UNIX socket or via a TCP socket (thus, it can also be hosted on a dedicated server, and connected to the XMPP server through a LAN/VLAN, which is useful for large server deployments). A TCP-based Providence protocol will be built (named Providence Network Protocol, a.k.a. PNP), inspired by the SpamAssassin Network Protocol.
Providence will be implementation-independent, meaning that it won't rely on the specifics of any XMPP server / software on the market. Full-fledged XMPP servers, as well as dedicated components, or even advanced XMPP clients, will be able to consume Providence services in a simple way.
What Providence means: the "providence" word is defined by the Oxford British Dictionary as "timely preparation for future eventualities" and "the protective care of God or of nature as a spiritual power". Though I am not a believer, I actually like the latter definition, as one can picture what the system does: it protects you from bad people.
Where Providence is hosted: the Providence project is available at: https://github.com/valeriansaliou/providence. Its development should start by the end of the first semester of 2017.
Providence makes things easy and elegant: the Grand Goal is to provide a well-designed, long-term solution to the problem of SPIM on XMPP. I expect to make it easy for XMPP server operators to install it. It should be pre-configured by default and shouldn't require much maintenance. Therefore, I plan to maintain DEB, RPM and Pacman packages for popular Linux distributions.
From the perspective of an operator, it should also be easily pluggable into any popular XMPP server. Providence will ship with mod_providence modules for ejabberd, Prosody and (probably) Openfire. Those modules will implement the PNP and provide some additional filtering logic on top of it. We can even imagine the Providence binary coming as part of XMPP servers and auto-configured, so that when operators install, eg. Prosody, they don't even have to consider installing the Providence standalone package. That said, running Providence as a standalone package is definitely better for portability, scalability and upgrade purposes.
Below, you can find an exhaustive list of all filters Providence will implement, which make up what I call the Spam/Ham Decisional Pipeline.
Each filter unit returns a score, which is aggregated in a weighted sum. Providence gives more importance to the Bayesian Filter than to the Karma Filter and the Authentication Filter. However, the Karma Filter can negatively affect the total score if the sender server has a reputation for spamming users. The weight for each filter has not yet been specified, as I still need to give some thought to it to find the Providence Magic Formula.
- Bayesian Filter: most accurate filter for spam detection, is tied down to user activity and learns / adapts (the messages the user is used to receive / not used to)
- Server-wide Bayes Database: local training database, linked to domain/virtualhost (is this really needed / safe?)
- User-wide Bayes Database: local training database, linked to users JIDs
- Karma Filter: reputation filter, weaker than Bayesian filter but still useful to improve accuracy
- JID Karma Database: holds the reputation of a remote (offending/or not) JID - weighted w/ spam/ham reports + number of messages this user sends to people not in roster (in local server) + if the user has a vcard w/ basic info (especially: avatar)
- Sender Client Crawler: crawl the sender client and discover its capabilities (given the message origin resource, this can help detect clients that don't reply to CAPS requests and thus may be spam bots)
- Sender User Rate Limit: apply a per sending user rate-limit threshold for sending parties that are not in recipient user roster
- Sender User Blacklist: holds status on the number of spam reports for the sending user (using spam reporting XEP on recipient clients)
- IP Karma Database: holds the reputation of a remote IP (prevents multiple spammy hosts attacks on same IP)
- Sender IP Rate Limit: apply a per sending IP rate-limit threshold for sending parties that are not in recipient user roster (an IP may hold multiple spamming XMPP hosts)
- Sender AS Greylist: some Autonomous Systems are less responsive to abuse requests, and those are known to hold more spammers (eg: AS from Russia and USA)
- Sender IP Blacklist: IPs known to be only related to spam can be safely blocked by a public mechanism similar to DNSBL
- Authentication Filter: gives a technical quality score to a server (the high complexity of a setup suggests the server is less likely to be spammy)
- TLS Certificate Hostname Match Checker: does the remote server certificate match the server hostname?
- TLS Certificate Valid Authority Checker: is the remote server certificate validated by a trusted authority?
- TLS Certificate Expire Checker: is the remote server certificate still valid? (not expired)
- SRV Records Validity Checker: advanced setups may have valid SRV records active; a spammer won't bother with this
- DNSSEC Authentication Checker: advanced setups may have valid DNSSEC records active; a spammer won't bother with this
- DANE/TLSA Checker: given the server has DNSSEC active, attempt to perform a DANE verification if there is a TLSA record
- Reverse DNS Checker: proper server setups have a proper reverse DNS that resolves back to the server IP
- Number Of XMPP Virtual Hosts On IP: approximate how many XMPP hosts/domains are active on the server IP (a spammy server may have more than normal)
- SMTP Checker: checks if the XMPP server domain holds a valid SMTP MX record (checks if the domain also handles email, and thus more likely to be a long-term setup)
- Website Checker: checks if the XMPP server has a website for users to land on (hammy XMPP servers are most likely to have a website)
- In-Band Registration Checker: registrations must not be open to the wild (Web redirect or CAPTCHA-protected)
- XMPP CAPS Checker: checks XMPP server capabilities (the more services the server provides, the more likely it is to be legit)
- Granular Filter Cache: retain a cache of each filter score result to speed up further checks (with pre-defined granular TTL values)
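As an illustration of the Granular Filter Cache idea, here is a small sketch where each filter has its own TTL and a score is recomputed only once its entry has expired. The class name, the filter names and the TTL values are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FilterCache:
    """Cache filter scores, with a time-to-live defined per filter."""

    def __init__(self, ttls: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttls = ttls  # filter name -> TTL in seconds
        self._clock = clock  # injectable clock, handy for tests
        self._entries: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def get_or_compute(self, filter_name: str, subject: str,
                       compute: Callable[[str], float]) -> float:
        key = (filter_name, subject)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # entry still fresh: skip the expensive check
        score = compute(subject)
        # Store the expiry time alongside the score.
        self._entries[key] = (now + self._ttls[filter_name], score)
        return score
```

Slow-changing checks (reverse DNS, SRV records, certificate authority) could get TTLs of hours, while rate-limit counters would not be cached at all.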
Filters are grouped in nested pipelines, as shown on the following (simplified) schema:
This list of filters can also be found in the project's README.md file. It may be subject to change as I refine the theories behind the filters.
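To give a rough feel for the Spam/Ham Decisional Pipeline, the sketch below averages the sub-filter scores inside each group, then combines the groups in a weighted sum, and skips everything when the sender is in the recipient's roster. The weights and the threshold are placeholders: as said above, the actual Providence Magic Formula has not been defined yet, only that the Bayesian Filter weighs more than the other two.

```python
from statistics import fmean
from typing import Dict, Set

# Placeholder weights and threshold (assumptions, not the Magic Formula).
GROUP_WEIGHTS = {"bayesian": 0.6, "karma": 0.25, "authentication": 0.15}
SPAM_THRESHOLD = 0.5

Scores = Dict[str, Dict[str, float]]  # group name -> {filter name: score}


def group_score(filter_scores: Dict[str, float]) -> float:
    """Inner pipeline: average the sub-filter scores (0.0 ham, 1.0 spam)."""
    return fmean(filter_scores.values())


def spamicity(groups: Scores) -> float:
    """Outer pipeline: weighted sum of the group scores, normalized."""
    total_weight = sum(GROUP_WEIGHTS[name] for name in groups)
    weighted = sum(GROUP_WEIGHTS[name] * group_score(scores)
                   for name, scores in groups.items())
    return weighted / total_weight


def classify(sender: str, roster: Set[str], groups: Scores) -> str:
    # Senders already in the recipient's roster bypass the filters entirely.
    if sender in roster:
        return "ham"
    return "spam" if spamicity(groups) >= SPAM_THRESHOLD else "ham"
```

Normalizing by the sum of the weights actually used means a group can be left out (for instance when a filter times out) without shifting the scale of the final score.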
Providence stays effective over time thanks to client training of the filters. Clients that implement some SPIM reporting XEPs will be able to make the Providence filters more efficient at SPIM filtering. I call this the Training Feedback Loop:
- Bayesian Training: training of Bayesian filter
- Server-wide Bayes Training: adjust probabilistic rules for spammy/hammy words server-wide (beware of database flood attacks) (is this really needed / safe?)
- User-wide Bayes Training: adjust probabilistic rules for spammy/hammy words user-wide
- Karma Filter: training of karma filter
- JID Karma Training: adjust the reputation of a remote (offending/or not) JID
- Host Karma Training: adjust the reputation of a remote host (+ sub-hosts)
- IP Karma Training: adjust the reputation of a remote IP
This list of training features can also be found in the project's README.md file. It may be subject to changes.
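The karma part of the Training Feedback Loop could look like the sketch below: one user report nudges the reputation of the sender JID, of its host and of its IP. The step size, the gentler adjustment for hosts and IPs, and the [-1.0, 1.0] range are assumptions picked for the example.

```python
from typing import Dict, Tuple


class KarmaDatabase:
    """Reputation scores in [-1.0, 1.0], keyed by (kind, identifier)."""

    STEP = 0.1  # hypothetical adjustment applied per user report

    def __init__(self) -> None:
        self.scores: Dict[Tuple[str, str], float] = {}

    def _adjust(self, kind: str, key: str, delta: float) -> None:
        current = self.scores.get((kind, key), 0.0)
        # Clamp so a flood of reports cannot push a score out of range.
        self.scores[(kind, key)] = max(-1.0, min(1.0, current + delta))

    def report(self, sender_jid: str, sender_ip: str, is_spam: bool) -> None:
        delta = -self.STEP if is_spam else self.STEP
        bare_jid = sender_jid.split("/", 1)[0]  # drop the resource part
        host = bare_jid.rsplit("@", 1)[-1]
        self._adjust("jid", bare_jid, delta)
        # Hosts and IPs are shared by many users, so they move more slowly.
        self._adjust("host", host, delta / 2)
        self._adjust("ip", sender_ip, delta / 2)
```

A "this is not SPIM" action from the Junk box would call `report(..., is_spam=False)` and move the same three scores back up.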
User Privacy & Bayesian Databases
As Providence will maintain a Bayesian database for each account on the server (this is needed as each user has a different vocabulary), user privacy needs to be taken very seriously.
Some XMPP users are self-hosted and thus, have total trust in the server they rely on. Some others are using public servers which they also trust, but for which they are less likely to accept seeing words extracted from the corpus of their messages and being collected in a database.
Considering the worst-case scenario of a server hack, a leaked Bayesian database could potentially reveal, among words, private information that has been exchanged at any point in time (eg. passwords). Thus, Providence will need to remove sensitive information from corpuses before storing words in the database. Such a "message cleaning" system cannot be 100% reliable, though.
I still need to think about a proper, elegant way to do it. I can picture a system where the XMPP server holds a user-specific Providence database key that's encrypted with the user's XMPP account password, and transmitted to Providence for each operation related to the user, in such a way that the server operator cannot see it, and thus cannot decrypt any of the Providence user databases. However, I can also picture weaknesses in such a system.
Avoiding False Positives
Well-trained Bayesian filters can still "leak" a few legitimate messages and categorize them as SPIM.
Thus, we need a way for the user to be able to fetch all undelivered SPIM messages at any point in time, similarly to how email works with Junk / Spam folders. We still need a XEP to formalize it. Servers need to implement a Junk message box. Clients also need to implement such a XEP to provide the user a way to "list, delete or train Messages as non-SPIM", for those that have been categorized as SPIM.
Similarly, Presence Subscription requests that have not been delivered to the user may still be found in the Junk folder, and the user should be able to train the filters and mark a Presence Subscription stanza as legitimate. This would trigger a feedback on Providence Bayesian and/or Reputation databases.
Interoperability With Existing XEPs
Some already-existing XEPs may be amended or implemented as part of a working Providence implementation:
- XEP-0161: Abuse Reporting
- XEP-0377: Spam Reporting
- XEP-0159: Spim-Blocking Control
- XEP-0268: Incident Handling
Though, a proper implementation of Providence on the XMPP server may involve new XEPs for features such as the retrieval and management of Messages & Presences classified as SPIM (the Junk box).
A Schedule For Initial Tests
Providence should not be used in "full mode" when the project is still in Alpha phase, as it may still prevent too many legitimate messages from being delivered to users if the filters are not well-tuned, and the Providence Magic Formula isn't yet perfect.
Thus, I propose a plan for Alpha tests, allowing the community to report any weakness in the filters, and benchmark how Providence does in real-world situations:
- Build test fixtures of SPIM corpus: Providence will build a database of plaintext SPIM corpus, used to train the test filters and improve our development process. You are more than welcome to submit all the SPIM messages you receive (more on that below). This will help train the Providence Magic Formula and perform automated unit and acceptance tests on the filters.
- Call for community tests: Providence will provide a simple framework for initial testing in "passive mode" on real / production XMPP servers, where the server operator gets logs of all Providence decisions, containing the corpus of each message. A Providence toolkit will provide a way to report wrong decisions to the Providence project. This can be useful on small servers where it is okay to log all non-OTR user messages. Regarding larger public servers, I need to think of a way for end users to enroll in the Providence test program and get reports of blocked messages directly to their JID via chat message. Then, the user can reply whether the decision is valid or invalid, without the server operator seeing the corpus of any message.
What Server Operators Should Do Now [Or Not Do] (Before Providence Is Available)
If you are an XMPP server operator, there are some actions you should take now to ensure your server is at least safe from Outgoing SPIM:
- Disable In-Band Registration (IBR): open servers are being targeted by spammers to register user accounts and use them as a relay for their spam. If you host a public server, you'd better enable a CAPTCHA or allow users to register from a Web form only (the latter is not very XMPP standard though).
- Do not block servers: avoid blocking servers suspected to be spammy, as some may also host legitimate accounts. In the long run, operators will forget that those configuration rules are still active, and this will hurt the federated network.
- Do not blacklist words: avoid blocking words that you know are spammy, as they may also be used by some of your users (and anyway, this is not a proper way to deal with the issue). Also, avoid blocking alphabets (eg. Cyrillic alphabet).
Also, do the following to limit Incoming SPIM (ie. SPIM targeting your users):
- Disable VJUD services and public directory services: they may help a spammer collect a list of your users' JIDs.
- Rate limit (suspect) incoming messages: if the Message / Presence sender JID is not in user's roster and is sending stanzas to different users in a short period of time, you may rate limit the sending server.
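The rate-limiting rule above (a sender outside the roster reaching many different users in a short time) can be sketched with a sliding window of recent recipients per sender. The class name, the window length and the recipient limit are illustrative values only; your server's own rate-limiting module will have its own knobs.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class FanOutLimiter:
    """Flag senders that contact too many distinct local users in a window."""

    def __init__(self, max_recipients: int, window: float) -> None:
        self.max_recipients = max_recipients
        self.window = window  # window length in seconds
        # sender bare JID -> recent (timestamp, recipient) pairs
        self._events: Dict[str, Deque[Tuple[float, str]]] = defaultdict(deque)

    def should_throttle(self, sender: str, recipient: str,
                        in_roster: bool, now: float) -> bool:
        if in_roster:
            return False  # known contacts are never counted
        events = self._events[sender]
        events.append((now, recipient))
        # Forget events that fell out of the sliding window.
        while events and events[0][0] <= now - self.window:
            events.popleft()
        distinct = {rcpt for _, rcpt in events}
        return len(distinct) > self.max_recipients
```

When `should_throttle` returns True, the operator-side reaction (delaying stanzas, or rate limiting the whole sending server as suggested above) is a separate policy decision.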
You can also help build the Providence test database by submitting the anonymized XML stanzas of SPIM messages your users receive. Make sure to remove the "to" attribute on stanzas, but keep the "from" (ie. the sender JID). You can either email me an archive of the XML files, or fork the Providence repository on GitHub, add them to the fixtures folder and submit a Pull Request (PR). I will be happy to accept it!
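If you have many stanzas to prepare, the anonymization step can be scripted. The sketch below only does what is asked above (drop the "to" attribute, keep the "from"); the function name is mine, and it assumes each stanza is stored as a standalone XML document in the jabber:client namespace.

```python
import xml.etree.ElementTree as ET

# Serialize jabber:client as the default namespace instead of an "ns0:" prefix.
ET.register_namespace("", "jabber:client")


def anonymize_stanza(xml_text: str) -> str:
    """Remove the recipient JID from a stanza, keeping the sender JID."""
    stanza = ET.fromstring(xml_text)
    stanza.attrib.pop("to", None)  # no error if the attribute is absent
    return ET.tostring(stanza, encoding="unicode")
```

For example, a message stanza with `from="bot@spam.example/x"` and `to="victim@example.org"` comes back with only the `from` attribute (and its `type`) left on the root element, and the body untouched.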
I will keep the community posted on any substantial progress on Providence. I will come up with a technical paper on the Providence Magic Formula as well as an early Alpha version by the end of the 1st semester of 2017.