AWS re:Invent 2021 — A first look at SageMaker Canvas
Last night, AWS launched SageMaker Canvas, “a visual, no-code interface to build accurate machine learning models”. Let’s have at it, then.
Once upon a time…
Some of you may remember Amazon Machine Learning, AWS’ first attempt at no-code machine learning. Launched in April 2015, the service let users pull tabular data from S3 or Redshift to train and deploy models on managed infrastructure. Supported task types included binary classification, multi-class classification, and linear regression.
I reviewed it in depth at the time (part 1, part 2, part 3), and actually liked it. Unfortunately, almost everyone else disagreed. Adoption was awful, and the service was quietly deprecated in early 2019, a very rare event in AWS land. Why it failed is up for debate. Too simplistic for ML-savvy users, too fancy for business users, wrong timing, too expensive, etc. Oh well.
Licking their wounds, AWS went back to the drawing board and launched SageMaker in late 2017. Targeted at data scientists and ML engineers, this new service arguably did much better, thanks to a handy SDK that leverages managed infrastructure and powerful ML capabilities.
Of course, it was only a matter of time until the “ML for business users” discussion would pop up again. SageMaker Canvas, then.
Unboxing SageMaker Canvas
You can access Canvas from the SageMaker Studio console, and launch it just like you would launch SageMaker Studio. After a few minutes of initial setup, the Canvas UI pops up.
We’re greeted by a short intro that introduces the Canvas workflow. +1 for usability.
Selecting data
Basic users can upload files from:
- Their local machine, although this option was greyed out for me (“Contact your administrator”: huh?).
- An S3 bucket.
The documentation fails to list supported file types. Apparently, CSV is the only option for now. On the positive side, joins are supported using a drag and drop interface, without the need for any SQL code.
Alternatively, more advanced users are able to connect to a Redshift or a Snowflake database. They can then run their own SQL code to pull and join data.
Here, I’ll simply import the Titanic survivor dataset from an S3 bucket. It only takes a couple of clicks.
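If your CSV isn’t in S3 yet, getting it there is a one-liner with boto3. Here’s a minimal sketch; the bucket name and key are placeholders, not anything Canvas requires:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key; use a bucket your Canvas user is allowed to read.
s3.upload_file(
    Filename="titanic.csv",           # local CSV file
    Bucket="my-canvas-datasets",      # placeholder bucket name
    Key="titanic/titanic.csv",
)
```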
Building a model
Creating a new model, I first select the dataset that I just imported.
Next, I pick the column I’d like to predict (Survived).
The model type is automatically inferred from the target column, which is fine. I could also set it myself if I wanted. Now, why on earth would you say “2 category” instead of “binary classification”, “3+ category” instead of “multi-class classification”, or “number” instead of “linear regression”? This sounds weird (even a bit silly), and it will only confuse everyone. Industry standard terms, please!
I also see basic statistics and visualizations on the dataset. Good stuff, although I’d have preferred to get them right after uploading the dataset, without having to create a model.
Clicking on a column shows a summary.
I could also untick columns that I don’t want to include for training. Keeping all columns, I click on “Quick Build” to… build a quick model, I guess. Training fails almost immediately, and I get the following error message:
Spoiler: the 'Cabin' column is responsible. It’s 77.1% empty.
This is likely to frustrate and confuse your average business user:
- SHAP values? Whatever they are, I wasn’t told about them earlier.
- If my dataset has too many missing values, why wasn’t I told earlier? The offending column(s) should have been highlighted at the basic stats stage (a quick check like the one sketched after this list would do it).
- If SHAP values can’t be computed, why not proceed without them and add a mention in the model summary?
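Until Canvas flags this itself, a quick pandas pass before uploading would have caught the problem. A minimal sketch, with an arbitrary 50% threshold:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")

# Fraction of missing values per column, worst offenders first.
missing = df.isna().mean().sort_values(ascending=False)
print(missing.head())

# Drop columns that are mostly empty (the 0.5 threshold is arbitrary).
clean = df.drop(columns=missing[missing > 0.5].index)
clean.to_csv("titanic_clean.csv", index=False)
```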
Deleting the failed model (why can’t I retry right away?), I create a new one, select the dataset, and unselect the offending column.
Then, I run a quick build again.
Analyzing results
Results are available after a couple of minutes. This is way too fast for a regular SageMaker job. I’m guessing this ran in place, similar to the “Quick Model” feature in SageMaker Data Wrangler.
Model accuracy is not great, but it’s high enough to justify launching a full training later on. Feature importance (aka “column impact”) is nicely graphed with box plots and scatter plots, although I wish I could zoom in on them.
I also get a fancy representation of the confusion matrix.
… as well as proper ML metrics.
Creating another model (why can’t I reuse the quick build?), I launch a full training job, which runs for a few hours (SageMaker Autopilot, cough cough).
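For reference, and assuming Canvas really does hand the work off to Autopilot (AWS doesn’t say), here is roughly what the equivalent job looks like with the SageMaker Python SDK. The role ARN and S3 path are placeholders:

```python
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()

automl = AutoML(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    target_attribute_name="Survived",     # same target column as in Canvas
    max_candidates=50,                    # number of candidate pipelines to try
    sagemaker_session=session,
)

# Placeholder S3 path; the CSV must contain the target column.
automl.fit(
    inputs="s3://my-canvas-datasets/titanic/titanic_clean.csv",
    wait=False,
    logs=False,
)
```

Whether Canvas uses the same defaults (candidate count, objective metric, time budget) is anyone’s guess until the documentation says so.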
Metrics are significantly better.
Apparently, I could share the model in SageMaker Studio. The feature didn’t work for me.
Generating predictions
I can use my model to predict new data, either in batch mode or in single prediction mode. For simplicity, I’ll use the training set.
First, I launch batch prediction in a couple of clicks.
After a few seconds, results are available, and I can download them to a CSV file.
Predictions are archived for future use. I was unable to delete them, as the button was greyed out.
I also couldn’t test single predictions. Hmm.
Summing things up
So what do I think of Canvas? Here goes nothing.
The good
- SageMaker Canvas does what it says on the tin: zero-code ML for the most popular ML problems in the enterprise (classification, regression and time-series forecasting).
- The UI is reasonably clear and friendly, although I’d like to be able to resize panels (a long-standing plague of many AWS consoles) and to zoom in on visualizations.
- I would expect Canvas jobs to be as accurate as code-based jobs. Indeed, it’s safe to assume that model training is based on SageMaker Autopilot and Amazon Forecast, which support built-in feature engineering, model tuning, and so on.
- Batch predicting with Canvas models is simple enough, without having to deal with any inference-time muck (the one thing I dislike the most about SageMaker in general).
The bad
- Canvas only supports tabular data (categorical, numeric, text, datetime), and input file formats seem to be limited to uncompressed CSV (we won’t know for sure until the doc actually says something useful). Users with slightly more complex data will have to deal with data preparation, and…
- Canvas is not integrated with data preparation tools. If you have a pristine CSV file, lucky you. What if you need to transform or clean it a bit? UI integration with SageMaker Data Wrangler or Glue DataBrew would go a long way towards building a seamless experience.
- Prediction cannot be automated. Imagine that I need to predict a new CSV file every day. Do I need to manually upload it to S3 and use the Canvas console to predict it? Is there a path to automation? Maybe that’s what the model sharing capability in Studio is for, but I couldn’t test it.
- Canvas looks expensive. $1.90 per hour for the console? $30 for a one-million-cell training job? I could train the same dataset (50K rows with 20 features) on an ml.m5.large spot instance for a fraction of that, and write 50 lines of Python to automate it for my business users (see the sketch after this list)… Am I missing something here?
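To put that claim in concrete terms, here is a minimal sketch of what those 50 lines could look like: a spot-priced training job with the built-in XGBoost algorithm, via the SageMaker Python SDK. This is not what Canvas does internally; the role, bucket and hyperparameters are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name

# Built-in XGBoost container image for this region.
image_uri = sagemaker.image_uris.retrieve("xgboost", region, version="1.3-1")

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.large",
    use_spot_instances=True,   # spot pricing, typically a large discount
    max_run=3600,              # max training time in seconds
    max_wait=7200,             # must be >= max_run when using spot instances
    output_path="s3://my-canvas-datasets/output/",  # placeholder bucket
    hyperparameters={"objective": "binary:logistic", "num_round": 200},
)

# For built-in XGBoost, the CSV must have the label in the first column and no header.
train_input = TrainingInput(
    "s3://my-canvas-datasets/titanic/train.csv", content_type="text/csv"
)
estimator.fit({"train": train_input})
```

Schedule something like this with EventBridge and a small Lambda, add a batch transform step for the daily CSV, and the prediction-automation gap above is covered too.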
The ugly
I know all too well the pressure of launching services at re:Invent, but…
- Documentation is as minimal as it could be.
- Too many features are broken at launch.
Is Canvas an interesting step forward that will accelerate ML adoption for business users? Or is it too little, too late, and just the new Amazon Machine Learning?
I’m cautiously leaning towards the former, but it’s a bit early to tell.