So far I have used the PlayDrone data to find out what the most downloaded free games, paid games and paid apps are. There are many useful queries that can be run to try to find a profitable app idea. For example, you could look at apps with lots of downloads but a bad user rating, then read the comments for that app, find out why users are dissatisfied and make a better app! You could even get a list of all paid apps that have made at least £10,000 if you wanted. It could also be useful to look at apps that have failed, by running a query to find apps that have never been downloaded (see the sketch below) and then working out why, so your apps don’t suffer the same fate.
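To give a flavour, once the data is in SQL Server that last query is a couple of lines of Linq. A minimal sketch, assuming an Apps set with a Downloads column (the MarketDbContainer context class turns up later in this post):

// Find apps that have never been downloaded. The Apps set and the
// Downloads property are illustrative names, not the exact schema.
using (var db = new MarketDbContainer())
{
    var neverDownloaded = db.Apps
        .Where(a => a.Downloads == 0)
        .ToList();
}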
From the Internet Archive you can download a 735MB Json (JavaScript Object Notation) file that contains details of the 1.4 million apps that were on Google Play as of 31/10/2014. I had some fun writing a console app that inserts the details of all these apps into SQL Server, so that queries can be run against them. You can find the source code for this on my GitHub account under the imaginative name of PlayDrone2SQL.
The Making of PlayDrone2SQL
With the 735MB Json file downloaded, I got to work turning it into a SQL model. I made the console app using Visual Studio 2015, SQL Server Express and Entity Framework 6. I am a big fan of Entity Framework and Linq, having used them extensively at work. I started with the code first approach and created classes to represent the information stored in the Json file.
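The classes looked roughly like this. A simplified sketch: the property names here are illustrative rather than an exact copy of the PlayDrone Json schema:

// Simplified code first entities; property names are illustrative.
public class App
{
    public int Id { get; set; }
    public string AppId { get; set; }         // e.g. "com.example.game"
    public string Title { get; set; }
    public bool Free { get; set; }
    public long Downloads { get; set; }
    public string CategoryName { get; set; }  // category as it appears in the Json
    public int CategoryId { get; set; }
    public virtual Category Category { get; set; }
}

public class Category
{
    public int Id { get; set; }
    public string Name { get; set; }
}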
I then let Entity Framework create the database and tables, and wrote some code to read the file into memory and output it to SQL.
I used Newtonsoft’s Json.NET to de-serialise the Json file into a list of my app objects. Each app has a category, which I store in a separate table; there are only 42 of them.
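The de-serialisation itself is only a couple of lines with Json.NET (the file name here is illustrative):

// Needs Newtonsoft.Json, System.IO and System.Collections.Generic.
// Read the whole 735MB file into memory and de-serialise it in one go.
string json = File.ReadAllText("playdrone_apps.json");
List<App> apps = JsonConvert.DeserializeObject<List<App>>(json);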
The program’s logic was quite simple. For each app, it first checks whether the app already exists in the database. If it doesn’t, it then checks whether the category exists in the database, adding it if it is missing. It then maps the app onto the Entity Framework model and saves it to the database.
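A sketch of that first version, using the illustrative field names from earlier:

foreach (var app in apps)
{
    // Skip apps that are already in the database.
    if (db.Apps.Any(a => a.AppId == app.AppId))
        continue;

    // Make sure the category exists, adding it if it is missing.
    var category = db.Categories.FirstOrDefault(c => c.Name == app.CategoryName);
    if (category == null)
    {
        category = new Category { Name = app.CategoryName };
        db.Categories.Add(category);
        db.SaveChanges();
    }

    // Map the app onto the Entity Framework model and save it.
    app.Category = category;
    db.Apps.Add(app);
    db.SaveChanges();
}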
I then ran this on the Json file and watched its progress in the command window. Loading the Json file from disk into memory took a couple of minutes on my slow laptop; once this was done it de-serialised the file into my model. I then watched as the apps got added to the database. So far so good. With 1.4 million apps I expected it to take a little while, so I went off and did something else for half an hour. On my return I hoped it had got quite a way through the list, but it had only done about 25,000. Hmm, at this rate it was going to take about 28 hours to complete!
Performance Tweaks
So I decided to make some performance tweaks. The first was to remove the per-app database lookup of the category. With only 42 categories, all of them had been written to the database after the first couple of hundred apps had been processed. So on initial load I read them into a dictionary to use as an ID lookup, and only hit the database for a category that wasn’t already in the dictionary, as in the sketch below.
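Something along these lines (a sketch):

// Cache the 42 categories once, keyed by name, instead of querying
// the database for every single app.
var categories = db.Categories.ToDictionary(c => c.Name, c => c.Id);

// Then, inside the per-app loop:
int categoryId;
if (!categories.TryGetValue(app.CategoryName, out categoryId))
{
    var category = new Category { Name = app.CategoryName };
    db.Categories.Add(category);
    db.SaveChanges();          // the database assigns the new ID
    categoryId = category.Id;
    categories[app.CategoryName] = categoryId;
}
app.CategoryId = categoryId;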
Apps were now being processed at about 1,000 a minute, which would bring the time down to 23 hours. However, on each restart the program was now spending a long time checking its way through all the apps it had already added to the database.
As the apps were being added in sequence, I counted the rows already in the database and used that count to jump straight to the next unprocessed app in the list. I also did a bit of searching on StackOverflow and found a couple of tweaks to speed up Entity Framework.
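The resume logic then became a count and a Skip (a sketch):

// Apps are inserted in file order, so the row count tells us exactly
// where to pick up after a restart.
int alreadyDone = db.Apps.Count();
foreach (var app in apps.Skip(alreadyDone))
{
    // ... process the app as before ...
}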
Bloated dbContext
When Entity Framework processes a large number of records, the dbContext gets bloated with tracked entities and this causes it to slow down. One way of fixing this is to save the data in bulk and recreate the dbContext. So every 100 apps I would do a db.SaveChanges() and then recreate the dbContext.
Turn off Track Changes
The other common fix is to turn off Track Changes. Every time Entity Framework loads an entity into memory, it creates a change tree to keep track of any modifications made to that entity. Turning this off when it isn’t needed can speed up your application significantly.
So I ended up with a method like this to dispose and create my dbContext.
private static MarketDbContainer DisposeAndCreate(MarketDbContainer db)
{
    if (db != null)
    {
        db.Dispose();
    }
    db = new MarketDbContainer();
    db.Configuration.AutoDetectChangesEnabled = false;
    // The timeout was required as my laptop was being slow reading from disk
    db.Database.CommandTimeout = 300;
    return db;
}
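The processing loop then called it every 100 apps, something like this (a sketch):

var db = DisposeAndCreate(null);
int processed = 0;
foreach (var app in apps.Skip(alreadyDone))
{
    db.Apps.Add(app);
    if (++processed % 100 == 0)
    {
        // Flush the batch and start again with a fresh, lightweight context.
        db.SaveChanges();
        db = DisposeAndCreate(db);
    }
}
db.SaveChanges();   // don't forget the final partial batch
db.Dispose();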
Now you would have thought that after all these changes there would be a huge increase in processing speed. Unfortunately not: it was now processing around 1,200 a minute, which is still ~19 hours to process the data. There had to be a quicker way.
SqlBulkCopy to the rescue
After some more Googling I came across SqlBulkCopy. I had used it previously, but hadn’t realised just how much quicker it was compared to Entity Framework. I found a useful article over at developer fusion on how to perform high performance inserts, with a code example that would allow me to use my existing objects with SqlBulkCopy. I chose to insert in batches of 100,000, just in case an error (such as running out of memory) caused me to lose the lot.
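The heart of the SqlBulkCopy version ends up looking something like this. A minimal sketch using a hand-built DataTable rather than the developer fusion helper, with the illustrative column names from earlier (needs System.Data and System.Data.SqlClient):

// Build a DataTable whose columns match the Apps table, then push it
// to SQL Server in one bulk operation per batch.
var table = new DataTable();
table.Columns.Add("AppId", typeof(string));
table.Columns.Add("Title", typeof(string));
table.Columns.Add("Free", typeof(bool));
table.Columns.Add("Downloads", typeof(long));
table.Columns.Add("CategoryId", typeof(int));

foreach (var app in batch)   // e.g. 100,000 apps at a time
{
    table.Rows.Add(app.AppId, app.Title, app.Free, app.Downloads, app.CategoryId);
}

using (var bulkCopy = new SqlBulkCopy(connectionString))
{
    bulkCopy.DestinationTableName = "dbo.Apps";
    bulkCopy.BulkCopyTimeout = 300;

    // Map by name so the identity Id column on the table is left alone.
    foreach (DataColumn column in table.Columns)
    {
        bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
    }

    bulkCopy.WriteToServer(table);
}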
After a quick rewrite I launched my console application and waited in anticipation. Luckily I didn’t have to wait long: about 5 minutes for all 1.4 million records to be inserted into the database.
So in short, don’t use Entity Framework for large inserts. It is also recommended to disable any indexes on the table while you load, and rebuild them afterwards.
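Entity Framework lets you issue that sort of one-off SQL directly. A sketch, assuming a hypothetical non-clustered index called IX_Apps_AppId:

// Disable a (hypothetical) non-clustered index before the bulk load.
// Don't use ALTER INDEX ALL here, as that would also disable the
// clustered index and make the table unreadable.
db.Database.ExecuteSqlCommand("ALTER INDEX IX_Apps_AppId ON dbo.Apps DISABLE");

// ... bulk insert here ...

// Rebuild it once the load has finished.
db.Database.ExecuteSqlCommand("ALTER INDEX IX_Apps_AppId ON dbo.Apps REBUILD");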
So if you fancy running some queries of your own you can get my console app from my GitHub account.
When you download the Json file, you will find each app has a property called “metadata_url”, which points to another Json file with even more details about the app, such as its price and rating breakdown. I haven’t done anything with these yet, but in the future I plan to extend the tool to process them in a similar manner, as well as to process user comments.