Analyzing Preview Bot Behaviour

Philip Voß - Sep 3 - Dev Community

Introduction

You probably know link previews from a lot of apps. For example, when you share a link in Microsoft Teams, you can preview the webpage within your chat without actually visiting it. For a while I have been wondering how the behaviour of preview bots and crawlers differs between services and what the security and privacy implications might be. End-to-end encryption, for example, would mean that the preview has to be generated client-side, as the server never has access to the message content and so cannot fetch a preview itself.

Content

Experiment Setup

In order to properly track the requests I set up an nginx reverse proxy with the following paths:

  • / (just a simple HTML page as the standard use case)
  • /same-site-redirect (redirects to /)
  • /external-redirect (redirects to https://example.com)
  • /redirect-loop (redirects to /redirect-loop - this was a good one)
  • /ssrf-redirect-magic (redirects to the magic IP)
  • /ssrf-redirect-local (redirects to good ol' localhost)
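
To make the path list above concrete, here is a minimal Python sketch of the same routing idea. It is a stand-in for illustration, not the nginx configuration used in the experiment. The names `REDIRECTS`, `resolve`, `PreviewTestHandler` and `make_server` are hypothetical, and the target of /ssrf-redirect-magic is an assumption: the link-local cloud metadata address is used only as a plausible example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional, Tuple

# ASSUMPTION: the "magic IP" is not spelled out in the article; the
# link-local cloud metadata address is used here purely for illustration.
ASSUMED_MAGIC_URL = "http://169.254.169.254/"

# Path -> redirect target, mirroring the list of test paths.
REDIRECTS = {
    "/same-site-redirect": "/",
    "/external-redirect": "https://example.com",
    "/redirect-loop": "/redirect-loop",
    "/ssrf-redirect-magic": ASSUMED_MAGIC_URL,
    "/ssrf-redirect-local": "http://localhost/",
}


def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Return (status code, redirect target or None) for a request path."""
    path = path.split("?", 1)[0]  # ignore any query string
    if path == "/":
        return 200, None
    if path in REDIRECTS:
        return 302, REDIRECTS[path]
    return 404, None


class PreviewTestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, location = resolve(self.path)
        body = b""
        if status == 200:
            body = b"<html><body>Hello, preview bot</body></html>"
        self.send_response(status)
        if location is not None:
            self.send_header("Location", location)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) a local test server."""
    return HTTPServer(("127.0.0.1", port), PreviewTestHandler)
```

Calling `make_server().serve_forever()` would start it locally. Keeping the routing in a plain `resolve` function makes the redirect table easy to check without opening a socket.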

For easier testing I also configured nginx to log in JSON format and set up a small web server that reads the JSON and displays it in a data table. The source code, more or less documented, is available on GitHub.
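
As a rough sketch of what reading such a JSON log could look like, the following Python function counts requests per user agent and path. The field names `user_agent` and `uri` are assumptions, since the actual log format is defined in the repository and not shown here; `summarize` is a hypothetical helper name.

```python
import json
from collections import Counter


def summarize(log_lines):
    """Count requests per (user agent, path) from JSON-formatted log lines.

    ASSUMPTION: each line is one JSON object with "user_agent" and "uri"
    keys; adjust the names to whatever the nginx log_format really emits.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed or partially written lines
        if not isinstance(entry, dict):
            continue  # valid JSON, but not a log record
        key = (entry.get("user_agent", "-"), entry.get("uri", "-"))
        counts[key] += 1
    return counts
```

Grouping by user agent and path is what makes behaviour such as repeated hits on /redirect-loop stand out at a glance.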

Findings summary

1. All of the apps that issued requests did so from their datacenters, meaning they have access to the message content in some form.

2. Whenever you post a link on Twitter (aka X), the AppleBot will pay you a visit within a matter of seconds. At least they are kind enough to check the robots.txt.

Apple Bot

3. When you add a link to your Instagram bio, the Facebook bot will crawl it to get to know you better.
4. The Facebook bot handles the redirect loop the worst (it requested it about 20 times...). Interestingly, it used different IP addresses for handling the redirects.

Facebook Bot

5. When you click an Instagram bio link, Facebook automatically adds an "fbclid" parameter to track your interaction.

FBClid

6. Signal was one of the few apps that did not cause any request to the nginx server.
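
To illustrate finding 5, here is a small Python sketch that separates an "fbclid" parameter from a logged URL, so that clicks coming from an Instagram bio can be spotted and the clean URL recovered. The helper name `split_fbclid` is hypothetical and not part of the original tooling.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def split_fbclid(url: str):
    """Return (url without fbclid, fbclid value or None)."""
    parts = urlsplit(url)
    fbclid = None
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "fbclid":
            fbclid = value  # remember the tracking value, drop it from the URL
        else:
            kept.append((key, value))
    clean = urlunsplit(parts._replace(query=urlencode(kept)))
    return clean, fbclid
```

A non-`None` second value indicates that the request carried the tracking parameter.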

Results Table

Below are some notes from the observations. Feel free to check out the referenced GitHub repo and try it yourself.

Component | Request Source | Simple Request (http, https) | Follows Same Site Redirect | Follows External Redirect | Redirect Loop | SSRF Vulnerable
----------|----------------|------------------------------|----------------------------|---------------------------|---------------|----------------
Snapchat | Amazon Server | Resolving http and https | Following | Following, resolved in preview | Requested 1x | Not showing magic link response
Telegram Android | Telegram Servers | Same as Telegram Web | Yes | Unclear, not resolved in preview | Requested 2x | Unclear, not resolved in preview
TelegramBot (like TwitterBot) | Telegram Servers | Server side with “TelegramBot (like TwitterBot)”; resolving https only | Yes | Unclear, not resolved in preview | Requested 2x | Unclear, not resolved in preview
X / Twitter | Twitter Inc Servers | Retrieves the robots.txt in addition (Nice!) | Yes | Unclear, not resolved in preview | Requested 1x | Smartass! Not following
AppleBot | Apple Network Range | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot); retrieves http | Not following | Not following | Requested 1x | Not following
Gmail | / | No interaction | / | / | / | /
Instagram Bio | Facebook Inc | Resolves http and https | Yes, with different IP | Unclear | Requested 20x |
Signal | / | No interaction | / | / | / | /
Bitwarden | / | No interaction | / | / | / | /
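
The "Redirect Loop" and "SSRF Vulnerable" columns boil down to two checks a preview fetcher can make: cap the number of redirects, and refuse targets that point at loopback, private, or link-local addresses. The Python sketch below shows one way this could look. It is not how any of the tested services work. `is_internal_target` and `follow` are hypothetical names, and `fetch` is an injected function so the logic can be tried without network access.

```python
import ipaddress
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_internal_target(url: str) -> bool:
    """True if the URL's host is localhost or a loopback/private/link-local IP."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # no host to check; treat as unsafe
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A regular hostname. NOTE: a real fetcher would also have to check
        # the IP addresses the name resolves to; this sketch does not.
        return False
    return ip.is_loopback or ip.is_private or ip.is_link_local


def follow(start_url, fetch, max_hops=5):
    """Follow redirects with a hop limit.

    fetch(url) must return (status, location or None).
    Returns (last url, outcome), where outcome is "done", "blocked"
    or "too-many-redirects".
    """
    url = start_url
    for _ in range(max_hops):
        if is_internal_target(url):
            return url, "blocked"
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, "done"
        url = urljoin(url, location)  # resolve relative Location headers
    return url, "too-many-redirects"
```

With a hop limit of 5, a loop like /redirect-loop is requested at most five times before the fetcher gives up, and a redirect to localhost is rejected before any request is sent to it.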