Quick tip: The Plays of William Shakespeare, XML and SingleStoreDB

Akmal Chaudhri - Mar 21 '23 - - Dev Community

Abstract

In this short article, we'll see how to store the plays of William Shakespeare in XML format in SingleStoreDB. We'll then connect from a notebook and run some XPath expressions.

The notebook file used in this article is available on GitHub.

Introduction

In this article, we'll take 37 plays by William Shakespeare in an XML format and store them in SingleStoreDB after some data cleanup. Using Python, SQLAlchemy and an XML library, we'll connect to SingleStoreDB and run several example XPath expressions over the data.

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.

Data Download, Formatting and Loading

Data Download

We can obtain all the plays of William Shakespeare in XML format in a single zip file courtesy of Jon Bosak of Sun Microsystems. After unpacking the zip file, there is important information on copyright and distribution in an htm file. All the xml files are well-formed.

Data Formatting

In each xml file, we'll remove all hidden control characters and the meta information at the beginning of a file. The following bash script will achieve this and write the combined output into a single file called all_plays.csv:

#!/bin/bash

# Set the input directory
input_dir="/path/to/input_dir"

# Set the output directory
output_dir="/path/to/output_dir"

# Check if input_dir and output_dir are the same
if [ "$input_dir" -ef "$output_dir" ]; then
  echo "Error: input_dir and output_dir cannot be the same directory"
  exit 1
fi

# Initialise the loop counter
counter=1

# Loop over all files in the input directory
for file in "$input_dir"/*.xml
do
  # Check if the file is a regular file
  if [[ -f "$file" ]]; then
    # Extract the file name without the extension
    filename=$(basename "$file" .xml)

    # Remove control and newline characters from the file using tr command
    tr -dc '[:print:]' < "$file" > "$output_dir/$filename.csv.tmp"

    # Remove XML declaration and DOCTYPE declaration from the file using sed command
    sed -e '1s/<?xml version="1.0"?>//' -e '1s/<!DOCTYPE PLAY SYSTEM "play.dtd">//'< "$output_dir/$filename.csv.tmp" > "$output_dir/$filename.csv"

    # Add the loop counter at the beginning of the output file
    sed -i "1s/^/$counter|/" "$output_dir/$filename.csv"

    # Remove the temporary file
    rm "$output_dir/$filename.csv.tmp"

    # Increment the loop counter
    counter=$((counter+1))
  fi
done

# Concatenate all output files into a single file
for file in "$output_dir"/*.csv
do
  if [[ -f "$file" ]] && [[ "$file" != "$output_dir/all_plays.csv" ]]; then
    cat "$file" >> "$output_dir/all_plays.csv"
    echo >> "$output_dir/all_plays.csv"
    rm "$file"
  fi
done
Enter fullscreen mode Exit fullscreen mode

We'll replace /path/to/ with the actual path.

Data Loading

We'll upload the all_plays.csv file using Stage.

Notebook

First, we'll install the package:

!pip install lxml --quiet
Enter fullscreen mode Exit fullscreen mode

Next, we'll create a connection:

from sqlalchemy import *

db_connection = create_engine(connection_url)
Enter fullscreen mode Exit fullscreen mode

We'll create a table:

DROP TABLE IF EXISTS plays;

CREATE TABLE IF NOT EXISTS plays (
    id INT PRIMARY KEY NOT NULL,
    play LONGTEXT NOT NULL
);
Enter fullscreen mode Exit fullscreen mode

and read the data into the table using a Pipeline:

CREATE PIPELINE plays
AS LOAD DATA STAGE 'all_plays.csv'
SKIP DUPLICATE KEY ERRORS
INTO TABLE plays
FORMAT CSV
COLUMNS TERMINATED BY '|'
(id,play);

START PIPELINE plays;
Enter fullscreen mode Exit fullscreen mode

Let's list the titles of all the plays:

SELECT id,
    SUBSTR(play, LOCATE('<TITLE>', play) + LENGTH('<TITLE>'),
    LOCATE('</TITLE>', play) - LOCATE('<TITLE>', play) - LENGTH('<TITLE>')) AS title
FROM plays
ORDER BY id;
Enter fullscreen mode Exit fullscreen mode

Example output:

+----+----------------------------------------------------+
| id | title                                              |
+----+----------------------------------------------------+
|  1 | The Tragedy of Antony and Cleopatra                |
|  2 | All's Well That Ends Well                          |
|  3 | As You Like It                                     |
|  4 | The Comedy of Errors                               |
|  5 | The Tragedy of Coriolanus                          |
|  6 | Cymbeline                                          |
|  7 | A Midsummer Night's Dream                          |
|  8 | The Tragedy of Hamlet, Prince of Denmark           |
|  9 | The First Part of Henry the Fourth                 |
| 10 | The Second Part of Henry the Fourth                |
| 11 | The Life of Henry the Fifth                        |
| 12 | The First Part of Henry the Sixth                  |
| 13 | The Second Part of Henry the Sixth                 |
| 14 | The Third Part of Henry the Sixth                  |
| 15 | The Famous History of the Life of Henry the Eighth |
| 16 | The Tragedy of Julius Caesar                       |
| 17 | The Life and Death of King John                    |
| 18 | The Tragedy of King Lear                           |
| 19 | Love's Labour's Lost                               |
| 20 | Measure for Measure                                |
| 21 | The Merry Wives of Windsor                         |
| 22 | The Tragedy of Macbeth                             |
| 23 | The Merchant of Venice                             |
| 24 | Much Ado about Nothing                             |
| 25 | The Tragedy of Othello, the Moor of Venice         |
| 26 | Pericles, Prince of Tyre                           |
| 27 | The Tragedy of Romeo and Juliet                    |
| 28 | The Tragedy of King Richard the Second             |
| 29 | The Tragedy of Richard the Third                   |
| 30 | Twelfth Night, or What You Will                    |
| 31 | The Taming of the Shrew                            |
| 32 | The Tempest                                        |
| 33 | The Life of Timon of Athens                        |
| 34 | The Tragedy of Titus Andronicus                    |
| 35 | The History of Troilus and Cressida                |
| 36 | The Two Gentlemen of Verona                        |
| 37 | The Winter's Tale                                  |
+----+----------------------------------------------------+
Enter fullscreen mode Exit fullscreen mode

We'll now get one of the plays:

result = %sql SELECT play FROM plays WHERE play LIKE '%Romeo and Juliet%'

play = " ".join([row.play for row in result])
Enter fullscreen mode Exit fullscreen mode

We'll use the lxml library and get the root element of the document:

root = etree.fromstring(play)
Enter fullscreen mode Exit fullscreen mode

Next, let's get the title of the play:

xpath_q1 = "/PLAY/TITLE"
play_title = root.xpath(xpath_q1)[0]
print(play_title.text)
Enter fullscreen mode Exit fullscreen mode

The output should be, as follows:

The Tragedy of Romeo and Juliet
Enter fullscreen mode Exit fullscreen mode

Now, let's get the personae:

xpath_q2 = "/PLAY/PERSONAE/PERSONA"
personae = root.xpath(xpath_q2)
for persona in personae:
    print(persona.text)
Enter fullscreen mode Exit fullscreen mode

The result should be, as follows:

ESCALUS, prince of Verona. 
PARIS, a young nobleman, kinsman to the prince.
An old man, cousin to Capulet. 
ROMEO, son to Montague.
MERCUTIO, kinsman to the prince, and friend to Romeo.
BENVOLIO, nephew to Montague, and friend to Romeo.
TYBALT, nephew to Lady Capulet.
BALTHASAR, servant to Romeo.
PETER, servant to Juliet's nurse.
ABRAHAM, servant to Montague.
An Apothecary. 
Three Musicians.
Page to Paris; another Page; an officer.
LADY MONTAGUE, wife to Montague.
LADY CAPULET, wife to Capulet.
JULIET, daughter to Capulet.
Nurse to Juliet. 
Citizens of Verona; several Men and Women, relations to both houses; Maskers, Guards, Watchmen, and Attendants.
Chorus.
Enter fullscreen mode Exit fullscreen mode

Let's find the number of times that Juliet is the speaker:

xpath_q3 = "count(//SPEECH[SPEAKER='JULIET'])"
juliet_count = root.xpath(xpath_q3)
print(juliet_count)
Enter fullscreen mode Exit fullscreen mode

The result should be, as follows:

118.0
Enter fullscreen mode Exit fullscreen mode

We could also try the response to a question on Stackoverflow to find the number of times that Juliet speaks immediately after Romeo:

xpath_q4 = "count(//SPEECH[SPEAKER='JULIET' and preceding-sibling::SPEECH[1][SPEAKER='ROMEO']])"
romeo_and_juliet_count = root.xpath(xpath_q4)
print(romeo_and_juliet_count)
Enter fullscreen mode Exit fullscreen mode

The result should be, as follows:

37.0
Enter fullscreen mode Exit fullscreen mode

The lxml library provides many more capabilities, such as streaming large documents, for example.

Summary

In this short article, we have seen one method of storing XML data in SingleStoreDB. We have also seen how to use XPath expressions to query our data.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terabox Video Player