Turn the data in your Twitter archive download into JSON.
Twitter Archive Extractor
This Go program extracts and processes JavaScript files from a ZIP archive, specifically targeting files in the /data directory. It replaces certain window. assignments with var data = and outputs the JSON representation of the processed data.
How It Works
The program opens a ZIP file specified as a command-line argument.
It scans for JavaScript files within the /data directory.
For each JavaScript file, it replaces any window. assignments (e.g., window.__THAR_CONFIG = {) with var data =.
It then uses the goja JavaScript interpreter to execute the modified script and extract the data variable.
The extracted data is marshaled into JSON format and output to the console.
Todo
Select a destination for export
separate files? via an ORM to a database?
Release to go.dev
Prerequisites
Ensure you have Go installed on your system.
Setup
Clone this repository or download the source files.
We're going to hack this process together. It seems to be one of the most important features for attracting people who feel too attached to TwitterX.
The first step is to create a main.go file. In this file we'll GO (hah) and do some STUFF:
os.Args: This is a slice that holds command-line arguments.
os.Args[0] is the program's name, and os.Args[1] is the first argument passed to the program.
Argument Check: The function checks if at least one argument is provided. If not, it prints a message asking for a path.
run function: This function simply prints the path passed to it, for now.
package main

import (
	"fmt"
	"os"
)

func run(path string) {
	fmt.Println("Path:", path)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("Please provide a path as an argument.")
		return
	}

	path := os.Args[1]
	run(path)
}
At every step, we'll run the file like so:
go run main.go twitter.zip
If you don't have a Twitter archive export, create a simple manifest.js file and give it the following JavaScript.
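Put it in a data/ folder and compress that into the twitter.zip file that we'll use throughout. The data/ folder matters because the later steps only look at .js files under data/.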
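If you'd rather not build the test archive by hand, here is a minimal sketch that writes one from Go. The JavaScript in sampleManifest is a made-up stand-in (the userInfo fields are hypothetical, not the real manifest schema), and buildSampleZip is just an illustrative helper name. It places the file under data/ because the later steps only look there.

```go
package main

import (
	"archive/zip"
	"bytes"
	"log"
	"os"
)

// sampleManifest is a made-up stand-in for a real archive file.
// Only the window.__THAR_CONFIG assignment shape matters here.
const sampleManifest = `window.__THAR_CONFIG = {
  "userInfo": {"userName": "example", "displayName": "Example User"}
}`

// buildSampleZip returns the bytes of a zip archive that contains
// a single entry, data/manifest.js.
func buildSampleZip() ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)

	w, err := zw.Create("data/manifest.js")
	if err != nil {
		return nil, err
	}
	if _, err := w.Write([]byte(sampleManifest)); err != nil {
		return nil, err
	}
	// Close flushes the central directory; without it the zip is unreadable.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := buildSampleZip()
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("twitter.zip", b, 0o644); err != nil {
		log.Fatal(err)
	}
}
```

Run it once from a scratch directory (it is its own package main), then point the extractor at the resulting twitter.zip.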
Read a Zip file
The next step is to read the contents of the zip file. We want to do this as efficiently as possible, and avoid extracting data to disk wherever we can.
There are many files in the zip that don't need to be extracted, too.
We'll edit the main.go file:
Opening the ZIP file: The zip.OpenReader() function is used to open the ZIP file specified by path.
Iterating through the files: The function loops over each file in the ZIP archive using r.File, which is a slice of zip.File. The Name property of each file is printed.
package main

import (
	"archive/zip"
	"fmt"
	"log"
	"os"
)

func run(path string) {
	// Open the zip file
	r, err := zip.OpenReader(path)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	// Iterate through the files in the zip archive
	fmt.Println("Files in the zip archive:")
	for _, f := range r.File {
		fmt.Println(f.Name)
	}
}

func main() {
	// Example usage
	if len(os.Args) < 2 {
		log.Fatal("Please provide the path to the zip file as an argument.")
	}

	path := os.Args[1]
	run(path)
}
JS only! We're hunting structured data
This archive file is seriously unhelpful. We want to check for just .js files, and only in the /data directory.
Opening the ZIP file: The ZIP file is opened using zip.OpenReader().
Checking the /data directory: The program iterates through the files in the ZIP archive. It uses strings.HasPrefix(f.Name, "data/") to check if the file resides in the /data directory.
Finding .js files: The program also checks if the file has a .js extension using filepath.Ext(f.Name).
Reading and printing contents: If a .js file is found in the /data directory, the program reads and prints its contents.
package main

import (
	"archive/zip"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"strings"
)

func readFile(file *zip.File) {
	// Open the file inside the zip
	rc, err := file.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	// Read the contents of the file
	// (ioutil.ReadAll is deprecated since Go 1.16; io.ReadAll is the drop-in replacement)
	contents, err := ioutil.ReadAll(rc)
	if err != nil {
		log.Fatal(err)
	}

	// Print the contents
	fmt.Printf("Contents of %s:\n", file.Name)
	fmt.Println(string(contents))
}

func run(path string) {
	// Open the zip file
	r, err := zip.OpenReader(path)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	// Iterate through the files in the zip archive
	fmt.Println("JavaScript files in the zip archive:")
	for _, f := range r.File {
		// Use filepath.Ext to check the file extension
		if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
			readFile(f)
			return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
		}
	}
}

func main() {
	// Example usage
	if len(os.Args) < 2 {
		log.Fatal("Please provide the path to the zip file as an argument.")
	}

	path := os.Args[1]
	run(path)
}
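The filter is easy to get subtly wrong, so it can help to pull it into a small predicate and poke at it directly. This is only a sketch: isDataJS is a hypothetical helper name, and it swaps filepath.Ext for path.Ext on the grounds that archive/zip entry names always use forward slashes, whatever the host OS.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// isDataJS reports whether a zip entry name is a .js file under data/.
// The extension check is case-insensitive.
func isDataJS(name string) bool {
	return strings.HasPrefix(name, "data/") &&
		strings.EqualFold(path.Ext(name), ".js")
}

func main() {
	names := []string{
		"data/manifest.js",
		"data/ACCOUNT.JS",
		"data/tweets_media/picture.jpg",
		"assets/js/app.js",
	}
	for _, name := range names {
		fmt.Printf("%-32s %v\n", name, isDataJS(name))
	}
}
```

Inside run, the condition in the loop would then read if isDataJS(f.Name) { ... }.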
Parse the JS! We want that data
We've found the structured data. Now we need to parse it. The good news is there are existing packages for using JavaScript inside Go. We'll be using goja.
If you're on this section, familiar with Goja, and you've seen the output of the file, you may see we're going to have errors in our future.
Install goja (if the directory has no go.mod yet, run go mod init with a module name of your choice first):
go get github.com/dop251/goja
Now we're going to edit the main.go file to do the following:
Parsing with goja: The goja.New() function creates a new JavaScript runtime, and vm.RunString() runs the JavaScript source we read from the file within that runtime.
Handle errors in parsing
package main

import (
	"archive/zip"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"strings"

	"github.com/dop251/goja"
)

func readFile(file *zip.File) {
	// Open the file inside the zip
	rc, err := file.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	// Read the contents of the file
	// (ioutil.ReadAll is deprecated since Go 1.16; io.ReadAll is the drop-in replacement)
	contents, err := ioutil.ReadAll(rc)
	if err != nil {
		log.Fatal(err)
	}

	// Parse the JavaScript file using goja.
	// RunString takes a string, so the raw bytes are converted first.
	vm := goja.New()
	_, err = vm.RunString(string(contents))
	if err != nil {
		log.Fatalf("Error parsing JS file: %v", err)
	}

	fmt.Printf("Parsed JavaScript file: %s\n", file.Name)
}

func run(path string) {
	// Open the zip file
	r, err := zip.OpenReader(path)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	// Iterate through the files in the zip archive
	fmt.Println("JavaScript files in the zip archive:")
	for _, f := range r.File {
		// Use filepath.Ext to check the file extension
		if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
			readFile(f)
			return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
		}
	}
}

func main() {
	// Example usage
	if len(os.Args) < 2 {
		log.Fatal("Please provide the path to the zip file as an argument.")
	}

	path := os.Args[1]
	run(path)
}
SURPRISE. window is not defined might be a familiar error. Basically, goja runs a plain ECMAScript runtime. window belongs to the browser context and is sadly unavailable.
ACTUALLY Parse the JS
I ran into a few issues at this point, including not being able to return data from the script, because a return statement isn't allowed at the top level of a JS file.
Long story short, we need to modify the contents of the files before loading them into the runtime.
Let's modify the main.go file:
reConfig: A regex that matches any assignment of the form window.someVariable = { and replaces it with var data = {.
reArray: A regex that matches any assignment of the form window.someObject.someArray.somePart = [ (three dotted names after window, for example window.YTD.account.part0 = [) and replaces it with var data = [.
Extracting data: Running the script, we use vm.Get("data") to retrieve the value of the data variable from the JavaScript context.
package main

import (
	"archive/zip"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"regexp"
	"strings"

	"github.com/dop251/goja"
)

func readFile(file *zip.File) {
	// Open the file inside the zip
	rc, err := file.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	// Read the contents of the file
	contents, err := ioutil.ReadAll(rc)
	if err != nil {
		log.Fatal(err)
	}

	// Regular expressions to replace specific patterns
	reConfig := regexp.MustCompile(`window\.\w+\s*=\s*{`)
	reArray := regexp.MustCompile(`window\.\w+\.\w+\.\w+\s*=\s*\[`)

	// Replace patterns in the content
	processedContents := reConfig.ReplaceAllStringFunc(string(contents), func(s string) string {
		return "var data = {"
	})
	processedContents = reArray.ReplaceAllStringFunc(processedContents, func(s string) string {
		return "var data = ["
	})

	// Parse the JavaScript file using goja
	vm := goja.New()
	_, err = vm.RunString(processedContents)
	if err != nil {
		log.Fatalf("Error parsing JS file: %v", err)
	}

	// Retrieve the value of the 'data' variable from the JavaScript context
	value := vm.Get("data")
	if value == nil {
		log.Fatalf("No data variable found in the JS file")
	}

	// Output the parsed data
	fmt.Printf("Processed JavaScript file: %s\n", file.Name)
	fmt.Printf("Data extracted: %v\n", value.Export())
}

func run(path string) {
	// Open the zip file
	r, err := zip.OpenReader(path)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	// Iterate through the files in the zip archive
	for _, f := range r.File {
		// Check if the file is in the /data directory and has a .js extension
		if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
			readFile(f)
			return // Exit after processing the first .js file so we don't end up printing a gazillion lines when testing
		}
	}
}

func main() {
	// Example usage
	if len(os.Args) < 2 {
		log.Fatal("Please provide the path to the zip file as an argument.")
	}

	path := os.Args[1]
	run(path)
}
Hurrah. Assuming I didn't muck up the copypaste into this post, you should now see a rather ugly print of the extracted data in Go's default formatting.
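The rewrite step can also be tried on its own, without goja. The sketch below is a variant rather than the code from main.go: it assumes each file begins with a single window.<name> = assignment and uses one regex anchored to the start of the input, so text inside a tweet that merely looks like window.foo = { is left alone. rewriteAssignment and assignRe are illustrative names.

```go
package main

import (
	"fmt"
	"regexp"
)

// assignRe matches a leading "window.<dotted name> =" at the very start
// of the input. Without the (?m) flag, ^ only matches at the beginning.
var assignRe = regexp.MustCompile(`^\s*window\.[\w.]+\s*=\s*`)

// rewriteAssignment swaps the leading window assignment for "var data = "
// and leaves the rest of the source untouched.
func rewriteAssignment(src string) string {
	return assignRe.ReplaceAllString(src, "var data = ")
}

func main() {
	fmt.Println(rewriteAssignment(`window.__THAR_CONFIG = {"a": 1}`))
	// var data = {"a": 1}
	fmt.Println(rewriteAssignment(`window.YTD.account.part0 = [{"b": 2}]`))
	// var data = [{"b": 2}]
}
```

Whether one anchored pattern covers every file in a real archive is an assumption worth checking against your own export before swapping it in.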
JSON would be nice
Edit the main.go file to marshal the data into JSON.
Use value.Export() to convert the goja value into plain Go types (maps, slices, strings, numbers).
Use json.MarshalIndent() for pretty-printed JSON (use json.Marshal if you want to minify the output).
package main

import (
	"archive/zip"
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"regexp"
	"strings"

	"github.com/dop251/goja"
)

func readFile(file *zip.File) {
	// Open the file inside the zip
	rc, err := file.Open()
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	// Read the contents of the file
	// (ioutil.ReadAll is deprecated since Go 1.16; io.ReadAll is the drop-in replacement)
	contents, err := ioutil.ReadAll(rc)
	if err != nil {
		log.Fatal(err)
	}

	// Regular expressions to replace specific patterns
	reConfig := regexp.MustCompile(`window\.\w+\s*=\s*{`)
	reArray := regexp.MustCompile(`window\.\w+\.\w+\.\w+\s*=\s*\[`)

	// Replace patterns in the content
	processedContents := reConfig.ReplaceAllStringFunc(string(contents), func(s string) string {
		return "var data = {"
	})
	processedContents = reArray.ReplaceAllStringFunc(processedContents, func(s string) string {
		return "var data = ["
	})

	// Parse the JavaScript file using goja
	vm := goja.New()
	_, err = vm.RunString(processedContents)
	if err != nil {
		log.Fatalf("Error parsing JS file: %v", err)
	}

	// Retrieve the value of the 'data' variable from the JavaScript context
	value := vm.Get("data")
	if value == nil {
		log.Fatalf("No data variable found in the JS file")
	}

	// Convert the data to a Go-native type
	data := value.Export()

	// Marshal the Go-native type to JSON
	jsonData, err := json.MarshalIndent(data, "", " ")
	if err != nil {
		log.Fatalf("Error marshalling data to JSON: %v", err)
	}

	// Output the JSON data
	fmt.Println(string(jsonData))
}

func run(zipFilePath string) {
	// Open the zip file
	r, err := zip.OpenReader(zipFilePath)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	// Iterate through the files in the zip archive
	for _, f := range r.File {
		// Check if the file is in the /data directory and has a .js extension
		if strings.HasPrefix(f.Name, "data/") && strings.ToLower(filepath.Ext(f.Name)) == ".js" {
			readFile(f)
			return // Exit after processing the first .js file
		}
	}
}

func main() {
	// Example usage
	if len(os.Args) < 2 {
		log.Fatal("Please provide the path to the zip file as an argument.")
	}

	zipFilePath := os.Args[1]
	run(zipFilePath)
}
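To see the difference between the two marshalling calls without running the whole extractor, here is a small standalone sketch. toJSON is a hypothetical helper, and the sample map is invented to resemble the kind of map[string]interface{} that Export() hands back for a JavaScript object.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// toJSON renders v as JSON: indented with two spaces when pretty is true,
// minified otherwise.
func toJSON(v interface{}, pretty bool) (string, error) {
	var (
		b   []byte
		err error
	)
	if pretty {
		b, err = json.MarshalIndent(v, "", "  ")
	} else {
		b, err = json.Marshal(v)
	}
	return string(b), err
}

func main() {
	// Invented sample data, standing in for value.Export().
	data := map[string]interface{}{
		"userName": "example",
		"ids":      []interface{}{int64(1), int64(2)},
	}

	for _, pretty := range []bool{false, true} {
		out, err := toJSON(data, pretty)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(out)
	}
}
```

encoding/json sorts map keys when marshalling, so the output order is stable from run to run even though Go maps are unordered.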