Deep Graph Library, part 2 — Training on Amazon SageMaker
In a previous post, I showed you how to use the Deep Graph Library (DGL) to train a Graph Neural Network model on data stored in Amazon Neptune.
I used a vanilla Jupyter notebook, which is fine for experimentation, but what about training at scale on large datasets? Well, as DGL is available on Amazon SageMaker, I’ll show you in this post how to quickly and easily adapt your DGL code for SageMaker.
Adapting our code
Let’s take a look at the notebook I used in the previous post.
dgl/01_karate_club/karate_club.ipynb · master · Julien Simon / dlnotebooks
As you probably guessed, we’re going to use script mode to run this vanilla PyTorch code on Amazon SageMaker.
Script mode boils down to:
- Reading hyperparameters from command line arguments,
- Loading the dataset from a location defined by a SageMaker environment variable,
- Saving the trained model at a location defined by another SageMaker environment variable.
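To make these three steps concrete, here is a minimal sketch of how a script could gather them in one place. The helper name parse\_job\_config and the '.' fallbacks for local runs are my own assumptions, not part of the original script; the two environment variables are the ones SageMaker sets inside the training container.

```python
import argparse
import os


def parse_job_config(argv=None, environ=None):
    """Collect what script mode hands to a training script.

    parse_job_config is a hypothetical helper written for illustration.
    """
    environ = os.environ if environ is None else environ

    # 1. Hyperparameters arrive as command line arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=30)
    parser.add_argument('--node_count', type=int)
    args, _ = parser.parse_known_args(argv)

    return {
        'epochs': args.epochs,
        'node_count': args.node_count,
        # 2. Dataset location; the '.' fallback is an assumption for local runs.
        'training_dir': environ.get('SM_CHANNEL_TRAINING', '.'),
        # 3. Model output location; same assumed fallback.
        'model_dir': environ.get('SM_MODEL_DIR', '.'),
    }


# Simulate the values a training container would provide.
config = parse_job_config(
    ['--epochs', '30', '--node_count', '34'],
    environ={'SM_CHANNEL_TRAINING': '/opt/ml/input/data/training',
             'SM_MODEL_DIR': '/opt/ml/model'},
)
print(config)
```

Passing argv and environ explicitly keeps the function easy to exercise outside SageMaker; inside a training job, calling it with no arguments reads the real command line and environment.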
Reading Hyperparameters
In my script, I need two hyperparameters: the number of epochs to train for, and the number of nodes in the graph. SageMaker will pass them as command line arguments, which I extract with argparse.
import argparse
parser = argparse.ArgumentParser()
parser.add\_argument('--epochs', type=int, default=30)
parser.add\_argument('--node\_count', type=int)
args, \_ = parser.parse\_known\_args()
epochs = args.epochs
node\_count = args.node\_count
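As a rough illustration of that hand-off, the sketch below flattens a hyperparameter dictionary into command line arguments and parses them back. The to\_cli\_args function and the --extra\_flag argument are hypothetical, written only for this example; this is not how SageMaker builds the command internally. It also shows why parse\_known\_args is a convenient choice: flags the script does not declare are returned separately instead of causing an error.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a dict into ['--name', 'value', ...] (illustrative only)."""
    argv = []
    for name, value in hyperparameters.items():
        argv += ['--' + name, str(value)]
    return argv


def parse_hyperparameters(argv):
    """Parse the two hyperparameters used by the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=30)
    parser.add_argument('--node_count', type=int)
    # parse_known_args returns (namespace, leftovers) rather than exiting
    # when it meets a flag the parser does not declare.
    return parser.parse_known_args(argv)


# '--extra_flag' is a made-up argument standing in for anything undeclared.
argv = to_cli_args({'node_count': 34, 'epochs': 30}) + ['--extra_flag', 'x']
args, unknown = parse_hyperparameters(argv)
print(args.epochs, args.node_count, unknown)
```

Values travel as strings on the command line, so the type=int declarations are what turn them back into integers.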
Loading the Dataset
My dataset is stored in S3. SageMaker automatically copies it into the training container, so all I have to do is read its location from an environment variable and load the data.
import os
import pickle
training\_dir = os.environ['SM\_CHANNEL\_TRAINING']
with open(os.path.join(training\_dir, 'edge\_list.pickle'), 'rb') as f:
    edge\_list = pickle.load(f)
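If you want to check the loading logic before launching a job, a local round trip is enough. The sketch below assumes the edge list is a plain Python list of (source, destination) tuples, which is my guess at the format rather than something stated here; save\_edge\_list and load\_edge\_list are hypothetical helpers, and a temporary directory stands in for the training channel.

```python
import os
import pickle
import tempfile


def save_edge_list(edge_list, directory, filename='edge_list.pickle'):
    """Pickle the edge list into the given directory and return its path."""
    path = os.path.join(directory, filename)
    with open(path, 'wb') as f:
        pickle.dump(edge_list, f)
    return path


def load_edge_list(directory, filename='edge_list.pickle'):
    """Load the edge list the same way the training script does."""
    with open(os.path.join(directory, filename), 'rb') as f:
        return pickle.load(f)


# Assumed format: a list of (source, destination) node pairs.
sample_edges = [(0, 1), (0, 2), (1, 2)]

# The temporary directory plays the role of SM_CHANNEL_TRAINING.
with tempfile.TemporaryDirectory() as training_dir:
    save_edge_list(sample_edges, training_dir)
    restored = load_edge_list(training_dir)

print(restored == sample_edges)
```

Only unpickle files you produced yourself: pickle can execute arbitrary code when loading untrusted data.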
Saving the Model
Same thing: read an environment variable, and save the model.
import torch
model\_dir = os.environ['SM\_MODEL\_DIR']
torch.save(net.state\_dict(),
           os.path.join(model\_dir, 'karate\_club.pt'))
We’re done. Let’s train this code on SageMaker.
Training on Amazon SageMaker
I start by importing the SageMaker SDK. Then, I define the S3 bucket that I’ll use to store the dataset, and the IAM role allowing SageMaker to access the bucket.
import sagemaker
from sagemaker import get\_execution\_role
from sagemaker.session import Session
sess = sagemaker.Session()
bucket = sess.default\_bucket()
role = get\_execution\_role()
Next, I upload the dataset to S3.
prefix = 'dgl-karate-club'
training\_input\_path = sess.upload\_data('edge\_list.pickle',
                                       key\_prefix=prefix+'/training')
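Now, all I have to do is define a PyTorch estimator for this training job, pointing at my script, passing hyperparameters, and defining infrastructure requirements. The train\_instance\_count and train\_instance\_type parameters are the names used by version 1 of the SageMaker Python SDK; version 2 renames them to instance\_count and instance\_type.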
from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    entry\_point='karate\_club\_sagemaker.py',
    hyperparameters={'node\_count': 34, 'epochs': 30},
    framework\_version='1.3.1',
    py\_version='py3',
    train\_instance\_count=1,
    train\_instance\_type='ml.c4.xlarge',
    role=role,
    sagemaker\_session=sess
)
Finally, I launch the training job.
estimator.fit({'training': training\_input\_path})
2020-01-28 08:57:34 Starting - Starting the training job...
2020-01-28 08:57:36 Starting - Launching requested ML instances....
_<output removed>_
Invoking script with the following command:
/opt/conda/bin/python karate\_club\_sagemaker.py --epochs 30 --node\_count 34
_<output removed>_
Epoch 0 | Loss: 0.6188
Epoch 1 | Loss: 0.4804
Epoch 2 | Loss: 0.3139
Epoch 3 | Loss: 0.3143
Epoch 4 | Loss: 0.3152
Epoch 5 | Loss: 0.3158
Epoch 6 | Loss: 0.3152
Epoch 7 | Loss: 0.3142
Epoch 8 | Loss: 0.3136
Epoch 9 | Loss: 0.3134
Epoch 10 | Loss: 0.3133
Epoch 11 | Loss: 0.3133
Epoch 12 | Loss: 0.3133
Epoch 13 | Loss: 0.3133
Epoch 14 | Loss: 0.3133
Epoch 15 | Loss: 0.3133
Epoch 16 | Loss: 0.3133
Epoch 17 | Loss: 0.3133
Epoch 18 | Loss: 0.3133
Epoch 19 | Loss: 0.3133
Epoch 20 | Loss: 0.3133
Epoch 21 | Loss: 0.3133
Epoch 22 | Loss: 0.3133
Epoch 23 | Loss: 0.3133
Epoch 24 | Loss: 0.3133
Epoch 25 | Loss: 0.3133
Epoch 26 | Loss: 0.3133
Epoch 27 | Loss: 0.3133
Epoch 28 | Loss: 0.3133
Epoch 29 | Loss: 0.3133
_<output removed>_
2020-01-28 09:01:19 Uploading - Uploading generated training model
2020-01-28 09:01:19 Completed - Training job completed
Training seconds: 76
Billable seconds: 76
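The "Uploading generated training model" step packages whatever the script wrote to the model directory into a model.tar.gz archive in S3, whose location is available as estimator.model\_data. As a sketch of what comes next, and assuming the archive has already been downloaded locally, it can be unpacked with the standard library. The extract\_model helper is hypothetical, and the example builds a stand-in archive with placeholder bytes rather than real weights.

```python
import os
import tarfile
import tempfile


def extract_model(archive_path, target_dir):
    """Unpack a model.tar.gz archive and return the names it contained."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        names = tar.getnames()
        # Only extract archives you trust, such as your own training output.
        tar.extractall(target_dir)
    return names


with tempfile.TemporaryDirectory() as workdir:
    # Build a stand-in archive; the contents are placeholder bytes,
    # not an actual PyTorch state dict.
    model_file = os.path.join(workdir, 'karate_club.pt')
    with open(model_file, 'wb') as f:
        f.write(b'placeholder weights')
    archive = os.path.join(workdir, 'model.tar.gz')
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(model_file, arcname='karate_club.pt')

    out_dir = os.path.join(workdir, 'extracted')
    os.makedirs(out_dir)
    extracted = extract_model(archive, out_dir)
    restored_ok = os.path.exists(os.path.join(out_dir, 'karate_club.pt'))

print(extracted, restored_ok)
```

With a real archive, the extracted karate\_club.pt file would then be passed to torch.load and load\_state\_dict on a network with the same architecture as the one trained above.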
There you go! Once again, script mode makes it extremely simple to run existing code on SageMaker.
This feature is available for all built-in frameworks: if you’re curious about it, here’s a very detailed video example with Keras.
I hope this post was useful. You can find the training script and the notebook on GitLab.
- dgl/01_karate_club/karate_club_sagemaker.py · master · Julien Simon / dlnotebooks
- dgl/01_karate_club/karate_club_sagemaker.ipynb · master · Julien Simon / dlnotebooks
Happy to answer questions here, or on Twitter. Don’t forget to subscribe to my YouTube channel for more content!