
Top 32 Datasets in Machine Learning | Machine Learning Datasets


A dataset is one of the main parts of building a machine learning model. Before starting with any algorithm, we need a proper understanding of the data. The machine learning datasets listed here are mostly used for research purposes, and most of them are homogeneous in nature.

We use a dataset to train and evaluate our model, so it plays a vital role in the whole process. If the dataset is structured, low in noise, and properly cleaned, the model is more likely to achieve good accuracy at evaluation time.


Top 32 datasets that are easily available online to train your machine learning algorithms:

  1. ImageNet
  2. Coco dataset
  3. Iris Flower dataset
  4. Breast cancer Wisconsin (Diagnostic) Dataset
  5. Twitter sentiment Analysis Dataset
  6. MNIST dataset (handwritten data)
  7. Fashion MNIST dataset
  8. Amazon review dataset
  9. Spam SMS classifier dataset
  10. Spam-Mails Dataset
  11. Youtube Dataset
  12. CIFAR -10
  13. IMDB reviews
  14. Sentiment 140
  15. Facial image Dataset
  16. Wine Quality Dataset
  17. The Wikipedia corpus
  18. Free Spoken digit dataset
  19. Boston House price dataset
  20. Pima Indian Diabetes dataset
  21. Iris Dataset
  22. Diamond Dataset
  23. mtcars Dataset
  24. Boston Dataset
  25. Titanic Dataset
  26. Pima Indian Diabetes Dataset
  27. Beavers Dataset
  28. Cars93 Dataset
  29. Car-seats Dataset
  30. msleep Dataset
  31. Cushings Dataset
  32. ToothGrowth Dataset

1. ImageNet:

Size of the Dataset: ~ 150 GB

  • Each record consists of an image with bounding boxes and the respective class labels
  • ImageNet aims to provide about 1,000 images for each synset
  • The URLs of the images are provided by ImageNet
  • Because it is such a large-scale image dataset, it is widely used by researchers

Download the Dataset

2. Coco dataset:

COCO stands for Common Objects in Context. It is a large-scale object detection, segmentation, and captioning dataset with 1.5 million object instances across 80 object categories.

COCO uses five types of annotation:

  • object detection
  • keypoint detection
  • stuff segmentation
  • panoptic segmentation
  • image captioning

In the COCO dataset, annotations are stored in JSON files.
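As a minimal sketch of what reading such a file can look like, the snippet below parses a tiny, made-up annotation document with Python's standard json module. It assumes the usual COCO object detection layout (top-level "images", "annotations", and "categories" lists, with each bounding box given as [x, y, width, height]); the file name and box values are invented for illustration.

```python
import json
from collections import Counter, defaultdict

# A tiny, made-up annotation document that follows the COCO detection layout.
coco_text = """
{
  "images": [
    {"id": 1, "file_name": "example.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 200]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [300, 40, 80, 160]},
    {"id": 12, "image_id": 1, "category_id": 18, "bbox": [200, 250, 120, 90]}
  ],
  "categories": [
    {"id": 1, "name": "person"},
    {"id": 18, "name": "dog"}
  ]
}
"""

data = json.loads(coco_text)

# Map category ids to readable names.
category_names = {c["id"]: c["name"] for c in data["categories"]}

# Count object instances per category.
instances_per_category = Counter(
    category_names[a["category_id"]] for a in data["annotations"]
)

# Group boxes by image, converting [x, y, w, h] to (x1, y1, x2, y2) corners.
boxes_per_image = defaultdict(list)
for a in data["annotations"]:
    x, y, w, h = a["bbox"]
    boxes_per_image[a["image_id"]].append((x, y, x + w, y + h))

print(dict(instances_per_category))
print(boxes_per_image[1])
```

For a real download you would replace the inline string with the contents of the annotation file, for example via json.load on an open file handle.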

Features provided by the COCO dataset:

  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labelled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with keypoints

Download the Dataset

3. Iris Flower Dataset:

The Iris flower dataset is well suited to beginners who are just starting to learn machine learning techniques and algorithms. With it, you can start building a simple machine learning project. The dataset is small and needs little or no pre-processing. It covers three species of iris (Setosa, Versicolour, and Virginica) with their sepal and petal measurements, stored in a 150×4 numpy.ndarray.

Features

  • The dataset consists of four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm.
  • The dataset has three classes.
  • Each class has 50 instances, and the classes are Virginica, Setosa, and Versicolor.
  • The dataset characteristics are multivariate.
  • All of the attributes are real-valued.

Download the Dataset

4. Breast cancer Wisconsin (Diagnostic) Dataset:

The Breast Cancer Wisconsin (Diagnostic) dataset is one of the most popular datasets for classification problems in machine learning. It is based on breast cancer analysis. Its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and they describe the characteristics of the cell nuclei present in the image.

Features

  • Three types of attributes are present in the dataset: an ID, a diagnosis, and 30 real-valued input features.
  • For each cell nucleus, ten real-valued features are calculated, e.g., radius, texture, perimeter, and area.
  • There are two classes to predict: benign and malignant.
  • The dataset contains 569 instances in total: 357 benign and 212 malignant.

Attribute Information:

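  1.  ID number
  2.  Diagnosis (M = malignant, B = benign)
  3–32.  Thirty real-valued input features

Ten real-valued features are computed for each cell nucleus; the mean, standard error, and worst value of each give the 30 input features: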

  • Radius (mean of distances from the centre to points on the perimeter)
  • texture (standard deviation of grey-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area – 1.0)
  •  concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” – 1)
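The compactness formula in the list above can be checked on simple shapes. The sketch below is only an illustration of the arithmetic (perimeter squared over area, minus 1.0); the function name and the two test shapes are assumptions made here, not part of the dataset.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0


# A circle of radius 1: perimeter 2*pi, area pi, so the value is 4*pi - 1.
circle_value = compactness(2 * math.pi * 1.0, math.pi * 1.0 ** 2)

# A unit square: perimeter 4, area 1, so the value is 16 - 1.
square_value = compactness(4.0, 1.0)

print(round(circle_value, 3), square_value)
```

A circle gives the smallest possible value for this measure, so less regular contours score higher.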

Download the Dataset

5. Twitter sentiment Analysis Dataset:

Sentiment analysis is one of the most popular applications of natural language processing (NLP), and this dataset will help you build a sentiment analysis model. It is a text dataset with which you can start building your first NLP model.

Structure of the dataset:

The dataset has three main columns:

  • ItemID – ID of the tweet
  • Sentiment – sentiment label
  • SentimentText – text of the tweet


Features

  • The dataset covers three tones of data: neutral, positive, and negative.
  • The format of the dataset is CSV (comma-separated values).
  • The dataset is divided into two parts: train.csv and test.csv.
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • Build your model using train.csv and evaluate it using test.csv.
  • Two data fields are present: ItemID (ID of the tweet) and SentimentText (text of the tweet).
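A minimal sketch of reading a file with this column layout, using only Python's standard csv module. The three rows and the label codes ("0", "1") are made up for illustration; only the column names come from the description above.

```python
import csv
import io
from collections import Counter

# Made-up rows; the header uses the column names described above.
sample_csv = (
    "ItemID,Sentiment,SentimentText\n"
    "1,1,great day at the beach\n"
    "2,0,my flight got cancelled\n"
    '3,1,"loving this song, again"\n'
)


def load_rows(handle):
    """Read every row into a dict keyed by the header names."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(sample_csv))

# Count how many tweets carry each (hypothetical) label code.
label_counts = Counter(row["Sentiment"] for row in rows)

print(len(rows), dict(label_counts))
print(rows[2]["SentimentText"])
```

With the real train.csv you would pass an open file handle instead of the io.StringIO wrapper; the csv module takes care of commas inside quoted tweet text.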

Download the Dataset

6. MNIST dataset (handwritten data):

The MNIST dataset is built on handwritten digits. It is one of the most popular image classification datasets in deep learning, and it can be used for classical machine learning as well. It has 60,000 examples for training and 10,000 examples for model evaluation. The dataset is beginner-friendly and helps you understand pattern recognition techniques in deep learning on real-world data. The data does not take much time to preprocess, so a beginner who is keen to learn deep learning or machine learning can start a first project with it.

Size: ~50 MB

Number of Records: 70,000 images in 10 classes (including train and test part)

Features

  • MNIST is one of the best datasets for learning ML techniques and deep learning pattern recognition methods on real-world data.
  • The dataset consists of four files: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, and t10k-labels-idx1-ubyte.gz.
  • The dataset is divided into a training part (the train-* files) and a test part (the t10k-* files).
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • Build your model using the training files and evaluate it using the test files.
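The four files listed above use the IDX binary format. As a rough sketch, assuming the usual IDX layout (two zero bytes, a data-type byte of 0x08 for unsigned bytes, a dimension-count byte, then one big-endian 32-bit size per dimension), a header can be parsed with the standard struct module. The helper names are hypothetical, and the buffers below are tiny synthetic stand-ins rather than real MNIST data.

```python
import gzip
import struct


def parse_idx(raw: bytes):
    """Parse an uncompressed unsigned-byte IDX buffer into (shape, values)."""
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    header_end = 4 + 4 * ndim
    shape = struct.unpack(">" + "I" * ndim, raw[4:header_end])
    values = raw[header_end:]
    expected = 1
    for dim in shape:
        expected *= dim
    if len(values) != expected:
        raise ValueError("payload size does not match the header")
    return shape, list(values)


def read_idx_gz(path: str):
    """Read a gzip-compressed IDX file such as train-labels-idx1-ubyte.gz."""
    with gzip.open(path, "rb") as handle:
        return parse_idx(handle.read())


# Synthetic stand-ins: three labels, and two 2x2 "images".
labels_raw = struct.pack(">HBBI", 0, 0x08, 1, 3) + bytes([5, 0, 4])
images_raw = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))

label_shape, labels = parse_idx(labels_raw)
image_shape, pixels = parse_idx(images_raw)
print(label_shape, labels)
print(image_shape, pixels)
```

For the real files the image shape would be (60000, 28, 28) for training and (10000, 28, 28) for testing, with the flat pixel list reshaped accordingly.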

Download the Dataset

7. Fashion MNIST dataset:

Fashion MNIST is also one of the most used datasets and is built on clothing images. It can be used for deep learning image classification problems as well as for classical machine learning. It has 60,000 examples for training and 10,000 examples for model evaluation. The dataset is beginner-friendly and helps you understand pattern recognition techniques in deep learning on real-world data. The data does not take much time to preprocess, so a beginner who is keen to learn deep learning or machine learning can start a first project with it. Fashion MNIST was created as a replacement for the MNIST dataset. All the images are grayscale and belong to 10 classes.

Size: 30 MB

Number of Records: 70,000 images in 10 classes

Features

  • Fashion MNIST is one of the best datasets for learning ML techniques and deep learning pattern recognition methods on real-world data.
  • The dataset is divided into two parts: train.csv and test.csv.
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • Build your model using train.csv and evaluate it using test.csv.

Download the Dataset

8.  Amazon review dataset:

The Amazon review dataset is also used for natural language processing. Sentiment analysis is one of the most popular applications of NLP, and this dataset will help you build a sentiment analysis model. It is a text dataset with which you can start building your first NLP model. It contains reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In total it contains 142.8 million reviews spanning May 1996 to July 2014. The dataset gives you a feel for a real business problem and helps you understand sales trends over the years.

Features

  • The Amazon review dataset consists of Amazon product reviews.
  • It includes both product and user information, ratings, and reviews.
  • Official paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
  • The data contains duplicates as well.

Download the Dataset

9. Spam SMS classifier dataset:

Detecting spam messages is an important task today, so data scientists came up with the idea of training a model on labelled messages so that it can predict whether a new message is spam. This dataset will help you train such a model. A machine learning classification algorithm can be used to build it, and the dataset is beginner-friendly and easy to understand. The Spam SMS classifier dataset is a set of labelled SMS messages collected for SMS spam analysis.

Features

  • The Spam SMS classifier dataset has 5,574 messages.
  • The messages are written in English.
  • Each line of the dataset contains one message.
  • The dataset has two columns: one holds the label (spam or not spam) and the other the raw text.
  • The dataset is in CSV format (comma-separated values).

Download the Dataset

10. Spam-Mails Dataset: 

Detecting spam mail is an important task today, so data scientists came up with the idea of training a model on labelled data so that it can predict whether a mail is spam. This dataset will help you train such a model. A machine learning classification algorithm can be used to build it, and the dataset is beginner-friendly and easy to understand. The Spam-Mails dataset is a set of tagged messages drawn from several sources:

  • A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site, a UK forum where cell phone users make public claims about SMS spam messages. Most of them were receiving a huge number of spam messages every day. Identifying those spam messages was a hard and time-consuming task that involved carefully scanning hundreds of web pages. The Grumbletext Web site is http://www.grumbletext.co.uk/.
  • A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.
  • A list of 450 SMS ham messages collected from Caroline Tag’s PhD Thesis.

Features

  • Most of the dataset, about 86%, is not spam.
  • The dataset does not come with a train/test division, so you need to split the data yourself.
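Because the classes are imbalanced, a stratified split (one that keeps the ham/spam ratio roughly equal in both parts) is a common choice when splitting by hand. The sketch below is one possible way to do it with the standard library; the function name, the 20% test fraction, and the 100 placeholder labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with similar label shares in both."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_indices, test_indices = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


# Placeholder labels only: 86 ham and 14 spam, mirroring the roughly 86% ham share.
labels = ["ham"] * 86 + ["spam"] * 14
train_indices, test_indices = stratified_split(labels, test_fraction=0.2, seed=0)

spam_in_test = sum(1 for i in test_indices if labels[i] == "spam")
print(len(train_indices), len(test_indices), spam_in_test)
```

Splitting per label first and only then combining the parts is what keeps the rare spam class represented in the test set.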

Download the Dataset

11. Youtube Dataset: 

The YouTube video dataset is based on information about YouTube videos. It helps you build a video classification model using a machine learning algorithm. YouTube-8M is a video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations covering numerous visual entities, plus audio-visual features from billions of frames and audio segments. The dataset is useful for learning both machine learning and computer vision. Its annotations and machine-generated labels have been improved in quality, and it has 6.1 million URLs labelled with a vocabulary of 3,862 visual entities. All the videos are annotated with one or more labels (an average of 3 labels per video).

Features

  • It is a large-scale labelled dataset with high-quality machine-generated annotations.
  • The videos in the dataset are sampled uniformly.
  • Each video is associated with at least one entity from the target vocabulary.
  • The vocabulary of the dataset is available in CSV format (comma-separated values).

Download the Dataset

12. CIFAR -10: 

CIFAR-10 is another image classification dataset, consisting of images of various objects. With it, we can run many experiments in machine learning and deep learning. CIFAR stands for Canadian Institute For Advanced Research. It is one of the most commonly used datasets for machine learning research. The CIFAR-10 dataset has 60,000 32×32 color images in 10 different classes. The classes are:

  1. aeroplanes
  2. cars
  3. birds
  4. cats
  5. deer
  6. dogs
  7. frogs
  8. horses
  9. ships
  10. trucks

Each of these classes has 6,000 images. CIFAR-10 is used to train computer vision algorithms in deep learning to recognize objects. The image resolution is 32×32, which is considered low, so learners can try different algorithms in less time. CIFAR-10 is beginner-friendly as well, and it is a well-known benchmark for convolutional neural networks.

Features:

  • CIFAR-10 is one of the best datasets for learning ML techniques and deep learning object recognition methods on real-world data.
  • The dataset is divided into two parts: train and test.
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • Build your model using the training data and evaluate it using the test data.
  • In total, CIFAR-10 has 50,000 training images and 10,000 test images.
  • The dataset is divided into 6 parts: 5 training batches and 1 test batch.
  • Each batch has 10,000 images.
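The counts above can be checked with a little arithmetic, and the same sketch shows how one image might be indexed once a batch is loaded. It assumes the layout used by the Python version of the download, which is not stated above: each image is a flat row of 3,072 values, with the 1,024 red values first, then green, then blue, each plane in row-major order. The function name and the synthetic image are illustrative only.

```python
IMAGE_SIDE = 32
CHANNEL_SIZE = IMAGE_SIDE * IMAGE_SIDE  # 1,024 values per colour plane
ROW_SIZE = 3 * CHANNEL_SIZE             # 3,072 values per image

# Batch arithmetic from the list above: 5 training batches plus 1 test batch.
IMAGES_PER_BATCH = 10_000
total_images = (5 + 1) * IMAGES_PER_BATCH


def pixel_rgb(flat_row, y, x):
    """Return the (R, G, B) values at row y, column x of one flattened image."""
    if len(flat_row) != ROW_SIZE:
        raise ValueError("expected 3,072 values per image")
    offset = y * IMAGE_SIDE + x
    return (
        flat_row[offset],
        flat_row[CHANNEL_SIZE + offset],
        flat_row[2 * CHANNEL_SIZE + offset],
    )


# Synthetic image: every red value is 10, every green 20, every blue 30.
flat_image = bytes([10] * CHANNEL_SIZE + [20] * CHANNEL_SIZE + [30] * CHANNEL_SIZE)

print(total_images, ROW_SIZE)
print(pixel_rgb(flat_image, 5, 7))
```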

Size: 170 MB

Number of Records: 60,000 images in 10 classes

Download the Dataset

13.  IMDB reviews: 

The IMDB dataset is also known as the Large Movie Review Dataset. Sentiment analysis is one of the most popular applications of natural language processing (NLP), and the IMDB movie review dataset will help you build a sentiment analysis model. It has 25,000 highly polar movie reviews (clearly positive or clearly negative) for training and the same number for testing. The dataset is often used for sentiment analysis with machine learning or deep learning algorithms. It was prepared by Stanford researchers in 2011 and comes with a 50/50 split for training and testing. An accuracy of 88.89% was reported on it. IMDB data was also used for a Kaggle competition titled “Bag of Words Meets Bags of Popcorn” from 2014 to early 2015, in which accuracy above 97% was achieved, with winners reaching 99%. IMDB is popular with movie lovers as well, and the dataset is mostly used for binary sentiment classification. In addition to the training and test review examples, there is further unlabeled data for use.

Size: 80 MB

Number of Records: 25,000 highly polar movie reviews for training, and 25,000 for testing

Features:

  • The IMDB dataset is one of the best datasets for learning ML techniques and deep learning methods on real-world data.
  • The dataset is divided into two parts: train and test.
  • Because of this, you do not need to split the data yourself for training and evaluation.
  • Build your model using the training data and evaluate it using the test data.

Download the Dataset

14. Sentiment 140:

The Sentiment 140 dataset is built on Twitter data. Sentiment analysis is one of the most popular applications of natural language processing (NLP), and the Sentiment 140 dataset will help you build a sentiment analysis model. It is a text dataset with which you can start building your first NLP model, and it is beginner-friendly for starting a new NLP project. Emoticons have been removed from the data in advance, and it has six features altogether:

  • polarity of the tweet
  • id of the tweet
  • date of the tweet
  • the query
  • username of the tweeter
  • text of the tweet

Features:

  • It has 1,600,000 tweets, which were extracted using the Twitter API.
  • The tweets are annotated with a polarity (0 = negative, 2 = neutral, 4 = positive).
  • These annotations are used to detect the sentiment of a particular tweet.

Fields in the dataset:

  • target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  • ids: The id of the tweet ( 2087)
  • date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  • flag: The query (lyx). If there is no query, then this value is NO_QUERY.
  • user: the user that tweeted (robotickilldozr)
  • text: the text of the tweet (Lyx is cool)
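As a small sketch of how these six fields might be read, the snippet below parses one CSV row and maps the numeric polarity to a readable label. It assumes the file has no header row and that every field is quoted; the sample row reuses the example values from the field list above, and its polarity of 4 is a guess made purely for illustration.

```python
import csv
import io

FIELDS = ["target", "ids", "date", "flag", "user", "text"]
POLARITY = {"0": "negative", "2": "neutral", "4": "positive"}

# One row built from the example values above; the polarity "4" is assumed.
sample = (
    '"4","2087","Sat May 16 23:58:44 UTC 2009",'
    '"lyx","robotickilldozr","Lyx is cool"\n'
)


def read_tweets(handle):
    """Yield one dict per row, adding a readable 'sentiment' key."""
    for row in csv.reader(handle):
        record = dict(zip(FIELDS, row))
        record["sentiment"] = POLARITY[record["target"]]
        yield record


tweets = list(read_tweets(io.StringIO(sample)))
print(tweets[0]["user"], tweets[0]["sentiment"], tweets[0]["text"])
```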

Size: 80 MB (Compressed)

Number of Records: 1,600,000 tweets

Download the Dataset

15. Facial image Dataset:

The facial image dataset consists of face images of both male and female subjects. Using it, machine learning and deep learning algorithms can be trained to detect gender and emotion. The data varies in background, scale, and expression.

Information about the dataset:

  • Total number of individuals: 395
  • Number of images per individual: 20
  • Total number of images: 7900
  • Gender:  contains images of male and female subjects
  • Race:  contains images of people of various racial origins
  • Age Range:  the images are mainly of first year undergraduate  students, so the majority of individuals are between 18-20 years old but some older individuals are also present.

Features

  • The dataset has four directories.
  • You can download the part of the dataset that fits your system requirements and needs.
  • Every version of the data is available as a zipped archive.
  • There are 395 individuals in total, each with 20 images.
  • The images have a resolution of 180 × 200 pixels and are stored in 24-bit RGB JPEG format.

Download the Dataset

16. Red Wine Quality Dataset:

The red wine quality dataset is also popular and interesting for machine learning and deep learning enthusiasts. It is beginner-friendly, and machine learning algorithms are easy to apply to it. With it you can train a model to predict wine quality from the wine’s physicochemical properties. Both regression and classification approaches can be used. The data relates to the red and white variants of the Portuguese “Vinho Verde” wine. Because of privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

Information about input variables based on physicochemical tests:

1 – Fixed acidity

2 – Volatile acidity

3 – Citric acid

4 – Residual sugar

5 – Chlorides

6 – Free sulfur dioxide

7 – Total sulfur dioxide

8 – Density

9 – pH

10 – Sulphates

11 – Alcohol

Output variable (based on sensory data):

12 – Quality (score between 0 and 10)

Features

  • There are two types of variables in the dataset: input and output variables.
  • Input variables are fixed acidity, volatile acidity, citric acid, residual sugar, and so forth.
  • The output variable is quality.
  • 12 attributes are present, and the attribute characteristics are real.
  • The white wine file has 4,898 records and the red wine file has 1,599.

Download the Dataset

 

17. The Wikipedia corpus:

The Wikipedia corpus consists of Wikipedia data only. It is a collection of the full text of Wikipedia and contains almost 1.9 billion words from more than 4 million articles. The dataset is mainly used for natural language processing. It is a very powerful dataset, and you can search it by word, phrase, or part of a paragraph.

Size: 20 MB

Number of Records: 4,400,000 articles containing 1.9 billion words

Features

  • It is a large-scale dataset that can be used for machine learning and natural language processing.
  • Because the dataset is large, it provides plenty of data for training a model.
  • It has 4,400,000 articles containing 1.9 billion words.

Download the Dataset

18. Free Spoken digit dataset:

The Free Spoken Digit Dataset is a simple audio/speech dataset consisting of recordings of spoken English digits. The files are in WAV format at 8 kHz. All the recordings are trimmed to have near-minimal silence at the beginning and end. The dataset was created for the task of identifying spoken digits in audio. It is open, so anyone can contribute to the repository, and it is expected to grow over time.

 Characteristics of the Dataset:

  • 4 speakers
  • 2,000 recordings (50 of each digit per speaker)
  • English pronunciations

File name format: {digitLabel}_{speakerName}_{index}.wav

Example: 7_jackson_32.wav
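Since the label is encoded in the file name, it can be recovered without opening the audio. The sketch below parses names of the form shown above with a regular expression; the class and function names are hypothetical, and the pattern assumes a single-digit label and a speaker name without underscores.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    digit: int
    speaker: str
    index: int


# {digitLabel}_{speakerName}_{index}.wav
_NAME_PATTERN = re.compile(r"^(\d)_([^_]+)_(\d+)\.wav$")


def parse_recording_name(file_name: str) -> Recording:
    """Split a recording file name into its digit label, speaker, and index."""
    match = _NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    digit, speaker, index = match.groups()
    return Recording(int(digit), speaker, int(index))


example = parse_recording_name("7_jackson_32.wav")
print(example.digit, example.speaker, example.index)
```

Grouping recordings by the speaker field is one way to build a train/test split in which no speaker appears on both sides.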

Features:

  • Open source
  • Helps with the problem of recognizing spoken digits
  • Anyone can contribute

Download the Dataset

19. Boston House price dataset: 

The Boston House price dataset was collected by the U.S. Census Service and concerns housing in the area of Boston, Mass. It is used to predict house prices from a few attributes, which makes it a machine learning regression problem. The dataset has 506 cases in total.

Total columns in the dataset:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per \$10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in \$1000s.

Features:

  • The dataset has 506 cases in total.
  • Each case has 14 attributes, such as CRIM, AGE, and TAX.
  • The format of the dataset is CSV (comma-separated values).
  • The dataset is suited to machine learning regression problems.

Download the Dataset

20. Pima Indian Diabetes dataset:

Artificial intelligence is now widely used in the healthcare and medical industry as well. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Diabetes is one of the most common and dangerous diseases, and it is becoming increasingly widespread. It is a chronic condition in which the body does not produce enough insulin or develops a resistance to it; insulin is the hormone that lets the body use glucose from food. Diabetes affects many people worldwide and comes in two main forms, Type 1 and Type 2, which have different characteristics. The Pima Indian Diabetes dataset is used to predict diabetes based on certain diagnostic measurements. A machine learning model built on it can help both society and patients by detecting diabetes quickly, and it is one of the best datasets for building a diabetes prediction model. All patients here are females at least 21 years old of Pima Indian heritage. There are a total of nine columns in the dataset:

  1. Pregnancies
  2. Glucose
  3. Blood pressure
  4. Skin thickness
  5. Insulin
  6. BMI
  7. DiabetesPedigreeFunction
  8. Age
  9. Outcome

Features:

  • The format of the dataset is CSV (Comma separated value)
  • All patients in this dataset are female and at least 21 years old.
  • The dataset has several predictor variables, such as number of pregnancies, BMI, insulin level, and age, and one target variable.
  • It has a total of 768 rows and 9 columns

Download the Dataset

21. Iris Dataset:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Format of the dataset:

iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

Download the Dataset.


22. Diamonds Dataset:

This is a dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

Price: price in US dollars (\$326–\$18,823)

Carat: weight of the diamond (0.2–5.01)

Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

Color: diamond colour, from D (best) to J (worst)

Clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

X: length in mm (0–10.74)

Y: width in mm (0–58.9)

Z: depth in mm (0–31.8)

Depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

Table: width of top of diamond relative to widest point (43–95)
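The depth formula above can be reproduced directly from the x, y, and z columns. The sketch below assumes the result is scaled by 100, since the listed range (43–79) is a percentage; the function name and the sample dimensions are made up for illustration. It also guards against zero dimensions, because the listed ranges for x, y, and z start at 0.

```python
def depth_percentage(x_mm: float, y_mm: float, z_mm: float) -> float:
    """Total depth percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)."""
    if x_mm + y_mm <= 0:
        raise ValueError("x and y must not both be zero")
    return 100.0 * 2.0 * z_mm / (x_mm + y_mm)


# Made-up dimensions: a 4.0 mm by 4.0 mm stone that is 2.4 mm deep.
example_depth = depth_percentage(4.0, 4.0, 2.4)
print(round(example_depth, 1))
```

Rows where x and y are both 0 are best treated as missing measurements rather than passed through the formula.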

Download the dataset.


23. mtcars Dataset: (Motor Trend Car Road Tests)

This data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

This dataset comprises the following columns:

mpg: Miles/(US) gallon

cyl: Number of cylinders

disp: Displacement (cu.in.)

hp: Gross horsepower

drat: Rear axle ratio

wt: Weight (1000 lbs)

qsec: 1/4 mile time

vs: Engine (0 = V-shaped, 1 = straight)

am: Transmission (0 = automatic, 1 = manual)

gear: Number of forward gears

carb: Number of carburetors

Download this dataset.


24. Boston Dataset: Housing Values in Suburbs of Boston

The Boston data frame has 506 rows and 14 columns.

Description of columns:

Crim: per capita crime rate by town.

Zn: proportion of residential land zoned for lots over 25,000 sq.ft.

Indus: proportion of non-retail business acres per town.

Chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

Nox: nitrogen oxides concentration (parts per 10 million).

Rm: average number of rooms per dwelling.

Age: proportion of owner-occupied units built prior to 1940.

Dis: weighted mean of distances to five Boston employment centres.

Rad: index of accessibility to radial highways.

Tax: full-value property-tax rate per \$10,000.

Ptratio: pupil-teacher ratio by town.

Black: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.

Lstat: lower status of the population (percent).

Medv: median value of owner-occupied homes in \$1000s.

Download this dataset.


25. Titanic Dataset: Survival of passengers on the Titanic

This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age and survival.

Format:

A 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables. The variables and their levels are as follows:

Class: 1st, 2nd, 3rd, Crew

Sex: Male, Female

Age: Child, Adult

Survived: No, Yes
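To make the "4-dimensional array" concrete: the four variables have 4, 2, 2, and 2 levels, so the table has 4 × 2 × 2 × 2 = 32 cells, each holding a count. The sketch below builds such a table from individual observations using a dictionary keyed by level combinations; the four observations are made up for illustration and are not taken from the dataset.

```python
from collections import Counter
from itertools import product

LEVELS = {
    "Class": ["1st", "2nd", "3rd", "Crew"],
    "Sex": ["Male", "Female"],
    "Age": ["Child", "Adult"],
    "Survived": ["No", "Yes"],
}


def cross_tabulate(observations):
    """Count observations for every combination of levels, including empty cells."""
    counts = Counter(observations)
    return {cell: counts.get(cell, 0) for cell in product(*LEVELS.values())}


# Made-up observations in (Class, Sex, Age, Survived) order.
observations = [
    ("1st", "Female", "Adult", "Yes"),
    ("3rd", "Male", "Adult", "No"),
    ("3rd", "Male", "Adult", "No"),
    ("Crew", "Male", "Adult", "Yes"),
]

table = cross_tabulate(observations)
print(len(table))
print(table[("3rd", "Male", "Adult", "No")])
```

In the real table, the 32 cell counts add up to the 2201 observations mentioned above.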

Details about the event:

The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts—from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class—are reflected in the survival rates for various classes of passenger

Download this dataset.


26. Pima Indian Diabetes Dataset:

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data was collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.

This data frame comprises the following columns:

Npreg: number of pregnancies.

Glu: plasma glucose concentration in an oral glucose tolerance test.

Bp: diastolic blood pressure (mm Hg).

Skin: triceps skin fold thickness (mm).

Bmi: body mass index (weight in kg/(height in m)^2).

Ped: diabetes pedigree function.

Age: age in years.

Type: Yes or No, for diabetic according to WHO criteria.
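The Bmi column follows the standard body mass index formula given in its description. As a quick worked example (the function name and the sample weight and height are invented for illustration):

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by the square of height in m."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Made-up example: 70 kg at 1.75 m.
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2))
```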

Download this dataset.


27. Beavers Dataset:

This data set is part of a long study into body temperature regulation in beavers. Four adult female beavers were live-trapped and had a temperature-sensitive radio transmitter surgically implanted. Readings were taken every 10 minutes. The location of the beaver was also recorded and her activity level was dichotomized by whether she was in the retreat or outside of it since high-intensity activities only occur outside of the retreat.

This data frame contains the following columns:

Day: The day number. The data includes only data from day 307 and early 308.

Time: The time of day formatted as hour-minute.

Temp: The body temperature in degrees Celsius.

Activ: The dichotomized activity indicator. 1 indicates that the beaver is outside of the retreat and therefore engaged in high-intensity activity.
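Because readings span two days, the Day and Time columns have to be combined before computing intervals. The sketch below assumes the time is stored as an integer in hour-minute form (for example 2350 for 23:50 and 10 for 00:10), which is an interpretation of "formatted as hour-minute" rather than something stated above; the function names are hypothetical.

```python
def to_minutes(day: int, time_hhmm: int) -> int:
    """Convert a (day number, hour-minute integer) pair to absolute minutes."""
    hours, minutes = divmod(time_hhmm, 100)
    if not (0 <= hours < 24 and 0 <= minutes < 60):
        raise ValueError(f"invalid hour-minute value: {time_hhmm}")
    return day * 24 * 60 + hours * 60 + minutes


def elapsed_minutes(day_a: int, time_a: int, day_b: int, time_b: int) -> int:
    """Minutes between two readings, possibly on different days."""
    return to_minutes(day_b, time_b) - to_minutes(day_a, time_a)


# Two readings on either side of midnight: day 307 at 23:50 and day 308 at 00:10.
gap = elapsed_minutes(307, 2350, 308, 10)
print(gap)
```

A gap larger than the 10-minute sampling interval indicates that readings in between are missing.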

Download this dataset.


28. Cars93 Dataset: Data from 93 Cars on Sale in the USA in 1993

The Cars93 data frame has 93 rows and 27 columns. Below is the description of columns:

Manufacturer: Manufacturer of the vehicle

Model: Model of the vehicle

Type: a factor with levels “Small”, “Sporty”, “Compact”, “Midsize”, “Large” and “Van”.

Min.Price: Minimum Price (in \$1,000): price for a basic version.

Price: Midrange Price (in \$1,000): average of Min.Price and Max.Price.

Max.Price: Maximum Price (in \$1,000): price for “a premium version”.

MPG.city: City MPG (miles per US gallon by EPA rating).

MPG.highway: Highway MPG.

AirBags: Air Bags standard. Factor: none, driver only, or driver & passenger.

DriveTrain: Drive train type: rear wheel, front wheel or 4WD; (factor).

Cylinders: Number of cylinders (missing for Mazda RX-7, which has a rotary engine).

EngineSize: Engine size (litres).

Horsepower: Horsepower (maximum).

RPM: RPM (revs per minute at maximum horsepower).

Rev.per.mile: Engine revolutions per mile (in highest gear).

Man.trans.avail: Is a manual transmission version available? (yes or no, Factor).

Fuel.tank.capacity: Fuel tank capacity (US gallons).

Passengers: Passenger capacity (persons)

Length: Length (inches).

Wheelbase: Wheelbase (inches).

Width: Width (inches).

Turn.circle: U-turn space (feet).

Rear.seat.room: Rear seat room (inches) (missing for 2-seater vehicles).

Luggage.room: Luggage capacity (cubic feet) (missing for vans).

Weight: Weight (pounds).

Origin: Of non-USA or USA company origins? (factor).

Make: Combination of Manufacturer and Model (character).
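Two details in the column list matter when working with this data: Price is defined as the average of Min.Price and Max.Price, and some columns are missing for certain vehicles. The sketch below illustrates both with made-up rows (not copied from Cars93); missing values are represented as `None`, and the helper names are hypothetical.

```python
from statistics import mean

# Illustrative rows only; these values are invented, not taken from Cars93.
cars = [
    {"Make": "Example A", "Min.Price": 10.0, "Max.Price": 14.0, "Luggage.room": 12.0},
    {"Make": "Example B", "Min.Price": 20.0, "Max.Price": 26.0, "Luggage.room": None},  # e.g. a van
    {"Make": "Example C", "Min.Price": 30.0, "Max.Price": 40.0, "Luggage.room": 16.0},
]


def midrange_price(row: dict) -> float:
    """Midrange price as described for the Price column."""
    return (row["Min.Price"] + row["Max.Price"]) / 2


def mean_ignoring_missing(rows: list, column: str):
    """Average a column while skipping rows where the value is missing."""
    values = [row[column] for row in rows if row[column] is not None]
    return mean(values) if values else None


print([midrange_price(row) for row in cars])        # [12.0, 23.0, 35.0]
print(mean_ignoring_missing(cars, "Luggage.room"))  # 14.0
```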

Download this dataset.


29. Car-seats Dataset:

This is a simulated data set containing sales of child car seats at 400 different stores. It is a data frame with 400 observations on the following 11 variables:

Sales: Unit sales (in thousands) at each location

CompPrice: Price charged by competitor at each location

Income: Community income level (in thousands of dollars)

Advertising: Local advertising budget for company at each location (in thousands of dollars)

Population: Population size in region (in thousands)

Price: Price company charges for car seats at each site

ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

Age: Average age of the local population

Education: Education level at each location

Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location

US: A factor with levels No and Yes to indicate whether the store is in the US or not
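Most learning algorithms need the three factor columns converted to numbers first. One possible encoding is sketched below; treating ShelveLoc as ordered (Bad < Medium < Good) is an assumption, the store record is made up, and `encode_factors` is a hypothetical helper.

```python
# Assumed ordering of shelf quality; the dataset only lists the three levels.
SHELVE_ORDER = {"Bad": 0, "Medium": 1, "Good": 2}
YES_NO = {"No": 0, "Yes": 1}


def encode_factors(row: dict) -> dict:
    """Return a copy of the row with ShelveLoc, Urban and US as integers."""
    encoded = dict(row)
    encoded["ShelveLoc"] = SHELVE_ORDER[row["ShelveLoc"]]
    encoded["Urban"] = YES_NO[row["Urban"]]
    encoded["US"] = YES_NO[row["US"]]
    return encoded


# Made-up store record, not a row from the dataset.
store = {"Sales": 9.5, "Price": 120, "ShelveLoc": "Good", "Urban": "Yes", "US": "No"}
print(encode_factors(store))
```

If the shelf levels should not be treated as ordered, one indicator column per level is a common alternative.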

Download this dataset.

30. msleep Dataset:

This is an updated and expanded version of the mammals sleep dataset. It is a dataset with 83 rows and 11 variables.

Name: common name

Genus: taxonomic genus

Vore: carnivore, omnivore or herbivore?

Order: taxonomic order

Conservation: the conservation status of the animal

Sleep_total: total amount of sleep, in hours

Sleep_rem: rem sleep, in hours

Sleep_cycle: length of sleep cycle, in hours

Awake: amount of time spent awake, in hours

Brainwt: brain weight in kilograms

Bodywt: body weight in kilograms
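Since Sleep_total and Awake are both measured in hours, a simple sanity check is that they add up to a 24-hour day. The sketch below assumes that relationship holds within a small tolerance and also derives a brain-to-body weight ratio; both helpers and all sample numbers are hypothetical.

```python
def sleep_columns_consistent(sleep_total: float, awake: float, tol: float = 0.1) -> bool:
    """True when sleep plus awake time is within `tol` hours of a 24-hour day."""
    return abs((sleep_total + awake) - 24.0) <= tol


def brain_to_body_ratio(brainwt, bodywt):
    """Brain weight over body weight (both in kg); None if either is unusable."""
    if brainwt is None or bodywt is None or bodywt <= 0:
        return None
    return brainwt / bodywt


# Invented values for illustration only.
print(sleep_columns_consistent(12.1, 11.9))  # True
print(sleep_columns_consistent(10.0, 10.0))  # False
print(brain_to_body_ratio(None, 50.0))       # None
```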

Download this dataset.


31. Cushings Dataset: Diagnostic Tests on Patients with Cushing’s Syndrome

Cushing’s syndrome is a hypertensive disorder associated with over-secretion of cortisol by the adrenal gland. The observations are urinary excretion rates of two steroid metabolites.

The Cushings data frame has 27 rows and 3 columns. The description of the columns is below:

Tetrahydrocortisone: urinary excretion rate (mg/24hr) of Tetrahydrocortisone.

Pregnanetriol: urinary excretion rate (mg/24hr) of Pregnanetriol.

Type: underlying type of syndrome, coded a (adenoma) , b (bilateral hyperplasia), c (carcinoma) or u for unknown.
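The single-letter Type codes can be expanded for readability, and the rows coded u are natural candidates to hold out for prediction once a classifier has been fitted on the known types. A minimal sketch, using made-up measurements rather than the real 27 rows; `split_known_unknown` is a hypothetical helper.

```python
TYPE_NAMES = {
    "a": "adenoma",
    "b": "bilateral hyperplasia",
    "c": "carcinoma",
    "u": "unknown",
}

# Invented measurements for illustration only.
patients = [
    {"Tetrahydrocortisone": 3.0, "Pregnanetriol": 1.2, "Type": "a"},
    {"Tetrahydrocortisone": 8.5, "Pregnanetriol": 0.6, "Type": "b"},
    {"Tetrahydrocortisone": 5.0, "Pregnanetriol": 2.0, "Type": "u"},
]


def split_known_unknown(rows: list) -> tuple:
    """Separate rows with a diagnosed type from rows coded 'u'."""
    known = [row for row in rows if row["Type"] != "u"]
    unknown = [row for row in rows if row["Type"] == "u"]
    return known, unknown


known, unknown = split_known_unknown(patients)
print([TYPE_NAMES[row["Type"]] for row in known])  # ['adenoma', 'bilateral hyperplasia']
print(len(unknown))                                # 1
```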

Download this dataset.


32. ToothGrowth Dataset:

The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

This is a data frame with 60 observations on 3 variables.
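The experimental design described above can be enumerated directly: three doses crossed with two delivery methods gives six treatment groups. The sketch below assumes the design is balanced (equal group sizes) and that orange juice is coded OJ; only the VC code is stated above.

```python
from itertools import product

DOSES_MG_PER_DAY = (0.5, 1.0, 2.0)
SUPPLEMENTS = ("OJ", "VC")  # "OJ" is an assumed code for orange juice
TOTAL_ANIMALS = 60

# Every combination of delivery method and dose is one treatment group.
groups = list(product(SUPPLEMENTS, DOSES_MG_PER_DAY))

# Assumes a balanced design, i.e. the same number of animals in each group.
animals_per_group = TOTAL_ANIMALS // len(groups)

print(len(groups), animals_per_group)  # 6 10
```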

Download this dataset.


A dataset is the foundation and first step in building a machine learning application. Datasets are available in different formats such as .txt and .csv. Supervised machine learning uses a labelled training dataset, in which the labels act as a supervisor for the model. Unsupervised learning does not require labels: the model learns structure from the data itself rather than from labels.
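The difference between learning with and without labels can be sketched in a few lines of plain Python. This is an illustrative toy, not a recommended implementation: the data are made up, the supervised part is a nearest-centroid classifier, and the unsupervised part is a one-dimensional k-means with two clusters.

```python
from statistics import mean

# Made-up one-dimensional measurements and their labels.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]


def fit_centroids(xs, ys):
    """Supervised: use the labels to compute one centroid per class."""
    by_label = {}
    for x, y in zip(xs, ys):
        by_label.setdefault(y, []).append(x)
    return {label: mean(points) for label, points in by_label.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def two_means(xs, iterations=10):
    """Unsupervised: 1-D k-means with k=2; the labels are never used."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iterations):
        near_c0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        near_c1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        if not near_c0 or not near_c1:
            break
        c0, c1 = mean(near_c0), mean(near_c1)
    return c0, c1


centroids = fit_centroids(values, labels)
print(predict(centroids, 4.5))  # high
print(two_means(values))        # two cluster centres, found without labels
```

Both approaches recover roughly the same two groups here, but only the supervised one can attach the names "low" and "high" to them.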

Please read the full article to understand which dataset is preferable for your machine learning algorithm.

I hope this article helps you understand the 32 freely available datasets covered above.

For free upskilling courses on Machine Learning and data science, visit GL Academy. Also, explore our postgraduate programs on data science here.

Happy Learning!




Sampriti Chatterjee
