This article was originally published at: Free datasets for machine learning
One of the main characteristics of the early 21st century is an outstanding rise in the amount of available data. This rise has been accompanied by significant improvements in computational power and storage capacity, as well as in the algorithms and software for data processing, interpretation, and prediction.
Skills related to data analytics, data science, machine learning, and artificial intelligence are in high demand and well appreciated. Acquiring them takes significant effort and months or years of learning. But that’s not all.
To learn to work with data, implement your own data-processing code, and understand the math behind it, you need something else as well: data. To be more precise, you need appropriate, well-sized, well-balanced, and easily understood datasets.
This article presents several interesting and suitable datasets that you can use when learning data science or when testing your own approaches.
Toy Datasets
Toy datasets are relatively small, well-balanced datasets that are still large enough to be useful. They are suitable for learning how to implement algorithms, as well as for testing your own approaches to data processing. Libraries for data science and machine learning like Scikit-Learn, Keras, and TensorFlow ship with their own datasets, readily available to their users.
We’ll mention several such datasets included in Scikit-Learn and show how to use them.
Boston House Prices is one of the best-known datasets for regression. It has historically been available in Scikit-Learn, though note that load_boston was deprecated in version 1.0 and removed in version 1.2, so loading it this way requires an older Scikit-Learn. These are its main characteristics:
- Number of observations: 506
- Number of input features: 13
- Input data domain: positive real numbers
- Output data domain: positive real numbers
- Suitable for regression
This dataset contains the data related to houses in Boston like the crime rate, nitric oxides concentration, number of rooms, distances to employment centers, tax rates, etc. The output feature is the median value of homes.
To use this dataset, you should import and call the function load_boston from sklearn.datasets:
>>> from sklearn.datasets import load_boston
>>> dataset = load_boston()
Now, the dataset is ready. You can extract the inputs and outputs as NumPy arrays like this:
>>> x, y = dataset['data'], dataset['target']
You can check their shapes:
>>> x.shape, y.shape
((506, 13), (506,))
You can also get the names of the features:
>>> dataset['feature_names']
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
If you want to load the description of this dataset, you can do it programmatically with the statement dataset['DESCR'].
Optical Recognition of Handwritten Digits is one of the most famous datasets for classification. It’s used for image recognition. Its main characteristics are:
- Number of observations: 1,797
- Number of input features: 64
- Input data domain: integers from 0 to 16
- Output data domain: integers from 0 to 9
- Suitable for classification
Each input feature represents the grayscale shade of a single pixel of an image 8 px wide and 8 px high. Thus, the 64 input features define all the pixels of one image. The output is the digit that was actually written.
You can import and handle this dataset in a very similar manner to the previous one:
>>> from sklearn.datasets import load_digits
>>> dataset = load_digits()
>>> x, y = dataset['data'], dataset['target']
>>> x.shape, y.shape
((1797, 64), (1797,))
>>> x.min(), x.max(), y.min(), y.max()
(0.0, 16.0, 0, 9)
>>> dataset['target_names']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
The Wine Recognition dataset is also used for classification. The task is to recognize the wine class given features like alcohol content, magnesium, phenols, color intensity, etc. Its main characteristics are:
- Number of observations: 178
- Number of input features: 13
- Input data domain: positive real numbers
- Output data domain: integers from 0 to 2
- Suitable for classification
You can import and handle this dataset like the previous one:
>>> from sklearn.datasets import load_wine
>>> dataset = load_wine()
>>> x, y = dataset['data'], dataset['target']
>>> x.shape, y.shape
((178, 13), (178,))
>>> x.min(), x.max(), y.min(), y.max()
(0.13, 1680.0, 0, 2)
>>> dataset['target_names']
array(['class_0', 'class_1', 'class_2'], dtype='<U7')
The Iris Plants dataset is suitable for classification, like the two previous sets. It contains sepal and petal lengths and widths for three classes of plants. Its main characteristics are:
- Number of observations: 150
- Number of input features: 4
- Input data domain: positive real numbers
- Output data domain: integers from 0 to 2
- Suitable for classification
You can import and handle this dataset like the previous ones:
>>> from sklearn.datasets import load_iris
>>> dataset = load_iris()
>>> x, y = dataset['data'], dataset['target']
>>> x.shape, y.shape
((150, 4), (150,))
>>> x.min(), x.max(), y.min(), y.max()
(0.1, 7.9, 0, 2)
>>> dataset['target_names']
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
Real-World Datasets
Real-world datasets are usually larger than toy datasets and often contain missing or “garbage” data, which makes them harder to use and understand. Libraries for data science and machine learning contain their own real-world datasets in addition to toy datasets. There are also Web sites that provide many interesting and useful datasets, like the Machine Learning Repository by the Center for Machine Learning and Intelligent Systems (University of California, Irvine), Awesome Public Datasets on GitHub, or Kaggle.
The famous MNIST Database of Handwritten Digits is one of the datasets included in Keras. It contains 60,000 training images of the digits from 0 to 9, along with 10,000 images for testing. The images are grayscale, 28 px high and 28 px wide. This dataset can be used for classification, i.e., image recognition.
You can load these data like this:
>>> from keras.datasets import mnist
>>> dataset = mnist.load_data()
>>> (x_train, y_train), (x_test, y_test) = dataset
California Housing dataset is included in Scikit-Learn and is to some extent similar to Boston House Prices. However, this dataset is much larger. The input features describe the median incomes of residents, house age, number of rooms, etc. The output feature is the median house value. Its main characteristics are:
- Number of observations: 20,640
- Number of input features: 8
- Input data domain: positive real numbers
- Output data domain: positive real numbers
- Suitable for regression
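Unlike the toy sets, this one is downloaded on first use, so Scikit-Learn exposes it through a fetch_* function rather than a load_* one. A minimal sketch of loading it:

```python
# Fetch the California Housing data; Scikit-Learn downloads and caches it
# on the first call (in ~/scikit_learn_data by default).
from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing()
x, y = dataset['data'], dataset['target']
print(x.shape, y.shape)          # (20640, 8) (20640,)
print(dataset['feature_names'])  # 'MedInc', 'HouseAge', and so on
```

From here, the Bunch object is handled exactly like the toy datasets above.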
20 Newsgroups Text is a text-classification dataset included in Scikit-Learn. It contains 18,846 observations, i.e., posts, each belonging to one of 20 classes (topics). Its main characteristics are:
- Number of observations: 18,846
- Number of input features: 1
- Input data: text (string)
- Output data domain: integers
- Suitable for classification
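This set is also fetched on demand. Note that fetch_20newsgroups returns the posts as raw Python strings, so you still need to vectorize the text (e.g., with TfidfVectorizer) before feeding it to a model. A quick sketch:

```python
# Download the training split of 20 Newsgroups (cached after the first call).
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train')
print(len(train['data']))          # 11314 training posts (of 18846 total)
print(len(train['target_names']))  # 20 topic names
```

Passing subset='test' or subset='all' fetches the test split or the whole set, respectively.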
Forest Covertypes is a classification dataset for predicting the cover types of forest patches in the U.S. There are seven output classes. Its main characteristics are:
- Number of observations: 581,012
- Number of input features: 54
- Input data domain: integers
- Output data domain: integers
- Suitable for classification
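It can be fetched the same way. One detail worth noting is that the seven classes are labeled 1 through 7 rather than 0 through 6:

```python
# Fetch the Forest Covertypes data (downloaded and cached on first use).
from sklearn.datasets import fetch_covtype

dataset = fetch_covtype()
x, y = dataset['data'], dataset['target']
print(x.shape)           # (581012, 54)
print(y.min(), y.max())  # the classes run from 1 to 7
```

Keep the 1-based labels in mind if your model or metric assumes classes starting at 0.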
Google Play Store Apps datasets are available on Kaggle. There are two datasets. One has 10,841 observations and 13 features, including application names, categories, ratings, sizes, numbers of reviews and installs, genres, etc. The other contains reviews of the applications; it has 64,295 observations and five features.
They are licensed under the Creative Commons Attribution 3.0 Unported License.
The FIFA 19 Complete Player dataset is probably going to interest you if you like football or computer games. It contains various data related to football players from the game FIFA 19. This dataset contains 18,207 observations (one for each player) and as many as 89 features. There are all sorts of data about the players: clubs, nationalities, positions, reputations, ages, wages, skills (like speed, dribbling, heading, crossing, and so on), height, weight, etc. It’s also available on Kaggle.
This dataset is licensed under the CC BY-NC-SA 4.0 license.
The Stanford Cars dataset contains 16,185 images of cars. It’s suitable for classification, i.e., image recognition. The set is split into 8,144 training observations and 8,041 test observations, covering 196 classes of cars. This dataset can be downloaded from Kaggle as well.
Barcelona datasets are the sets from the Portal Open Data BCN. They can be found on Kaggle as well. There are 17 datasets on Kaggle under the CC0: Public Domain license and 425 datasets on Open Data BCN. The data are related to births, deaths, population, immigrants, name frequencies, air quality, transport, etc.
The Graduate Admissions dataset is another Kaggle set released under the CC0: Public Domain license. It has 500 observations and nine features. It can be used to predict applicants’ chances of graduate admission given their GRE and TOEFL scores, university ratings, etc.
The Web site TuTiempo.net offers large and impressive Global Climate datasets with climate parameters from 1929 onwards for most, if not all, countries in the world. If you’re into weather prediction or climate science, you should check it out.
The U.S. Government’s Open Data is a Web portal with the links to the datasets related to agriculture, climate, education, energy, finance, health, manufacturing, safety, science, etc.
Conclusions
These are just some of the many available datasets; there are many more out there. Many public, non-profit, and commercial institutions publish their data, and scientific articles sometimes come with the raw data they used.
If you want to dive into data science and machine learning, you need to learn to work with large amounts of data. Toy datasets, as well as some others, can be a good starting point.
However, you should always pay attention to the license and conditions under which data is published!
Thank you for reading.