If you follow me on Twitter, you may have seen that I have been working on a fun new project called WTF YAML. Not many details beyond the name have been shared. It's a tribute of sorts to our love/hate relationship with YAML. To help market the project, I thought it would be fun to create a Twitter bot (@wtfyaml) that tweets funny Git commit messages related to YAML.
The system at a high level looks something like this:
- Using the GitHub API, scrape public commit messages containing swear words where a YAML file is included
- With a commit message in hand, search for a random GIF from Giphy that matches the swear word
- Upload the GIF to Twitter and schedule the commit message as a tweet
This felt like a good opportunity to use some of the latest services inside of GCP. I could have chosen to stick with AWS, but I opted to learn some new services by actually using them in GCP. The services I chose for the foundation of the architecture:
- Cloud Run for the API that hosts the scraping, Giphy querying, and tweet-sending logic
- Cloud Tasks for my queue of tweets to be sent
- Cloud Scheduler for running the scraper for a given word on a CRON job
All the logic runs via an API container running in Cloud Run. Each endpoint in the API represents a single piece of logic. Each piece gets invoked externally by either a Cloud Task or a Cloud Scheduler job.
app.post('/schedule', async (request, reply) => {
const word = (request.body as {word: string}).word
const octokit = new Octokit({auth: GITHUB_TOKEN})
const tweetStateClient = new TweetStateClient()
const badwordCommits: Commit[] = []
console.log(`query for ${word}`)
const commits = await queryCommits(octokit, word)
console.log(`got ${commits.length} commits for ${word}`)
badwordCommits.push(...commits)
let scheduled = 0
for (const commit of badwordCommits) {
if (scheduled === 30) {
console.log(`scheduled 30 tweets`)
break
}
if (commit.message.length <= 240) {
const tweetState = await tweetStateClient.getTweetState(commit.tree.sha)
if (tweetState === null || dayjs(tweetState.tweetedTime).add(MONTHS_BETWEEN_RETWEETS, 'months') <= dayjs()) {
scheduled++
const scheduleTimeInSeconds = MIN_TIME_BETWEEN_TWEETS_IN_SECONDS * scheduled + Date.now() / 1000
await addTweetTask(commit.tree.sha, commit.message.trim(), commit.word, scheduleTimeInSeconds)
} else {
console.log(
`tweet for ${tweetState.treeShaHash} has already been sent in the past two months, don't schedule it again`,
)
}
}
}
reply.send({ran: true, scheduled: scheduled})
})
app.post('/send', async (request, response) => {
const tweetClient = new TweetGifClient()
const tweetStateClient = new TweetStateClient()
const tweet = request.body as {scheduledInSeconds: number; status: string; word: string; treeShaHash: string}
const tweetState = await tweetStateClient.getTweetState(tweet.treeShaHash)
if (tweetState === null || dayjs(tweetState.tweetedTime).add(MONTHS_BETWEEN_RETWEETS, 'months') <= dayjs()) {
console.log(`tweet for ${tweet.treeShaHash} at scheduled time ${tweet.scheduledInSeconds}`)
await tweetClient.tweet(tweet.status, tweet.word)
console.log(`tweet sent for ${tweet.treeShaHash}, writing state...`)
await tweetStateClient.addOrUpdateTweet(tweet.treeShaHash, tweet.status, tweet.word)
response.send({sent: true})
} else {
console.log(`tweet for ${tweetState.treeShaHash} has already been sent in the past two months, skipping...`)
response.send({sent: false})
}
})
Each endpoint here gets triggered by a service external to the API running in Cloud Run. For example, we want to scrape all the relevant commits for the word `damn` every 24 hours. To do this, we have a cron job inside of Cloud Scheduler. Every 24 hours it makes a POST to the `/schedule` endpoint with the payload `{ word: damn }`.
The `/schedule` logic goes through the following flow:
- Query GitHub for commit messages that contain the word `damn` and include a YAML file
- Check that the commit message is 240 characters or fewer (a comfortable margin under Twitter's 280-character limit)
- If we haven't recently sent this tweet, schedule it to be tweeted
When it comes to scheduling the actual tweet, the system leverages the queue provided by Cloud Tasks. Here is the function that pushes a new tweet onto the queue.
export async function addTweetTask(treeShaHash: string, status: string, word: string, scheduleTimeInSeconds: number) {
const task: protos.google.cloud.tasks.v2.ITask = {
httpRequest: {
url: `${WTF_YAML_API}/send`,
body: Buffer.from(
JSON.stringify({
scheduledInSeconds: scheduleTimeInSeconds,
status: status,
word: word,
treeShaHash: treeShaHash,
}),
).toString('base64'),
headers: {
'content-type': 'application/json',
},
},
scheduleTime: {
seconds: scheduleTimeInSeconds,
},
}
const [response] = await client.createTask({parent: queue, task})
console.log(`Schedule task ${response.name}`)
}
This is a slick feature of Cloud Tasks. It natively supports HTTP targets and a `scheduleTime` (a Unix timestamp in seconds), meaning the payload gets sent to the HTTP target at the `scheduleTime` specified. It's kinda like AWS SNS & SQS all in one.
In the code above, the url is `${WTF_YAML_API}/send`. `WTF_YAML_API` is the DNS endpoint for our Cloud Run service. The `/send` looks familiar, right? It's the `/send` in the first chunk of code above. This is the logic responsible for actually sending a tweet.
This `/send` endpoint gets invoked by the payload we pushed onto the queue in Cloud Tasks. Cloud Tasks pulls out the next message (based on its `scheduleTime`) and then invokes the endpoint contained in the target.
Once invoked, the following logic occurs inside of `/send`, as we can see above.
- Check again that we haven't already sent this tweet (this is so that we can opt to remove the check in `/schedule` if we want)
- If we haven't sent it, tweet it out now with a fun GIF to go with it
That's all there is to it.
With three Google Cloud services, there is a bot that can gather commits from GitHub and send tweets out on whatever schedule we want. This entire project took an afternoon to get running in GCP, and the bulk of that time went to writing the Terraform config to provision the various services.
Reflections
Cloud Run is great for this architecture.
You are only billed for the time your code is actually running. Here, costs would only accrue while one of these endpoints is handling a request. But Cloud Run also comes with 2 million free requests per month, so for this project it's always going to be free.
Cloud Run abstracts away all the infrastructure management. Write your code, create a container image, and deploy to Cloud Run. That's really all there is to it.
There is something to be said about all the logic for the system being in one container image. Each endpoint is its own set of logic, but all the logic lives in one image. There is no jumping across repositories or different projects to follow the logic. There is also less infrastructure to run when placing all the logic in one container. We can just give this to Cloud Run to run and we have a centralized API.
HTTP targets in both Cloud Scheduler and Cloud Tasks means there is less glue code to write.
Both Cloud Scheduler and Cloud Tasks support HTTP targets. I can add a job to Cloud Scheduler with an endpoint to hit, a payload to send, and a CRON expression for when to run it.
Cloud Tasks is a queue that I don't have to poll. Like Scheduler, I place tasks in the queue with an endpoint, a payload, and a time for when to push the task out to the target. Switching the queue processing model from pull to push removes quite a bit of code.
All in all, these services feel like they are making my life easier as a developer. This isn't to say that AWS doesn't have equivalents that could do the same things. It does, but in my opinion they often need more glue: more services to provision or more code to write.
This is the area to watch when it comes to cloud providers. We are moving in a direction where we want to do less glue work. The more glue we have to create and maintain, the less time and focus we can put into our actual products and services.