Dataset

Hadoop
Wikipedia Dump Dataset https://dumps.wikimedia.org/enwiki/
Airline on-time Performance Dataset http://stat-computing.org/dataexpo/2009/the-data.html
Freebase Triples Dataset https://developers.google.com/freebase/
AWS Public Datasets (download requires an Amazon account) https://aws.amazon.com/public-datasets/
Sample Datasets for Hadoop Testing and Eval https://streever.atlassian.net/wiki/pages/viewpage.action?pageId=491580
Hadoop-bigdata Datasets https://github.com/algorithmica-repository/hadoop-bigdata/tree/master/datasets
PUMA Benchmarks Dataset https://engineering.purdue.edu/~puma/datasets.htm
Google Books Ngrams http://books.google.com/ngrams/
1000 Genomes (200 TB dataset) ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/
The ClueWeb09 Dataset http://lemurproject.org/clueweb09/
Collections of Datasets Weka http://www.cs.waikato.ac.nz/~ml/weka/datasets.html
NOAA (27 GB dataset) ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
Cornell Movie–Dialogs Corpus https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
AREALM Dataset https://drive.google.com/file/d/0B1jY75xGiy7eZV93eGxlZ2YwSFE/view
AREAWATER https://drive.google.com/file/d/0B1jY75xGiy7eR3VpNC1XMzB5cWs/view
EDGES SpatialHadoop Dataset https://drive.google.com/file/d/0B1jY75xGiy7eOG85SHM3TzFVd2c/view
ZCTA5 Dataset https://drive.google.com/file/d/0B1jY75xGiy7eLWhNUll0ZWFRT0U/view
OpenStreetMap Datasets https://drive.google.com/file/d/0B1jY75xGiy7eNjJuRy1KWjRieVU/view
Machine Learning Datasets https://blog.bigml.com/2013/02/28/data-data-data-thousands-of-public-data-sources/
Hackspark Dataset http://hackspark.github.io/environment/download-sample-data/
The USC-SIPI Image Database http://sipi.usc.edu/database/
Criteo Labs Terabyte Dataset http://labs.criteo.com/2013/12/download-terabyte-click-logs/
Data Science Datasets http://blog.mortardata.com/post/67652898761/6-dataset-lists-curated-by-data-scientists?goback=%2Egde_4989164_member_5820574831720022020#%21
Big Data
US Government Web Services and XML Data Sources Dataset http://usgovxml.com/
Soybean (Large) Data Set https://archive.ics.uci.edu/ml/datasets/Soybean+(Large)
Online Retail Data Set http://archive.ics.uci.edu/ml/datasets/online+retail
IMO-IMDG-Codes-Dataset https://github.com/datasets/IMO-IMDG-Codes/tree/master/data
History Data of ProShares ETFs Dataset http://www.ee.columbia.edu/~cylin/course/bigdata/getdatasetinfo.html
2010/11 Regional Household Travel Survey Public Use Dataset http://njtpa.org/data-maps/surveys/household-travel-survey/2010-11-regional-household-travel-survey-data-set.aspx
IMDB Database http://www.imdb.com/interfaces
DataHub Datasets https://datahub.io/dataset
Yahoo Stock Dataset https://finance.yahoo.com/most-active#mkt-movers
Yahoo Labs Datasets https://webscope.sandbox.yahoo.com/
Statistical Computing Datasets http://stat-computing.org/dataexpo/
Deceptive Opinion Spam Corpus v1.4 (download requires login) http://myleott.com/op_spam/
StarPlus fMRI dataset http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/
MMTD – Million Musical Tweets Dataset http://www.cp.jku.at/datasets/MMTD/
Web data: Amazon movie reviews dataset http://snap.stanford.edu/data/web-Movies.html
IEMOCAP Database http://sail.usc.edu/iemocap/
Book-Crossing Dataset http://www2.informatik.uni-freiburg.de/~cziegler/BX/
Large Movie Review Dataset http://ai.stanford.edu/~amaas/data/sentiment/
20 Newsgroups Dataset http://qwone.com/~jason/20Newsgroups/
Expedia Hotel Recommendations Dataset https://www.kaggle.com/c/expedia-hotel-recommendations/data
Population-city dataset https://github.com/datasets/population-city/tree/master/data
MusicBrainz Database https://musicbrainz.org/doc/MusicBrainz_Database
Color FERET Database https://www.nist.gov/itl/iad/image-group/color-feret-database
Google Public Dataset https://www.google.com/publicdata/directory
Large Health Data Sets https://www.ehdp.com/vitalnet/datasets.htm
Freebase Dataset http://www.bigfastblog.com/how-to-get-experience-working-with-large-datasets
Billboard Hot 100 (years) Dataset http://ddmal.music.mcgill.ca/research/billboard
Internet of Things
WiFi-City http://data.gov.au/dataset/7d61bac5-a1a1-4203-aa6c-16663920ba9c
Smartcity data set http://smartcity.linkeddata.es/datasets/
Sii-Mobility https://datahub.io/dataset/sii-mobility
INERTIA datasets https://datahub.io/dataset/inertia-energy-consumption-example-linked-data
Parking Sensors dataset https://data.melbourne.vic.gov.au/w/naqr-e4vh/spy9-nmud?cur=o8NvwvOLyZA&from=root
RBKC Parking Bay Locations dataset https://data.gov.uk/dataset/rbkc-parking-bay-locations
Pedestrian Counters City of Melbourne datasets https://data.melbourne.vic.gov.au/Transport-Movement/Pedestrian-volume-updated-monthly-/b2ak-trbp
Gas sensor arrays in open sampling settings Data Set https://archive.ics.uci.edu/ml/datasets/Gas+sensor+arrays+in+open+sampling+settings
Air Quality Sensor Locations (Environment) https://data.bathhacked.org/Environment/Air-Quality-Sensor-Locations/fjzm-5zsg
National Street Gazetteer dataset https://data.gov.uk/dataset/national-street-gazetteer
Street Works dataset (Bath: Hacked) https://data.bathhacked.org/w/eh5p-x3eu/n5sb-br6x?cur=q9lz7QP4XLw&from=root
UKCP09 data sets http://www.metoffice.gov.uk/climatechange/science/monitoring/ukcp09/download/
Key technologies for the internet of things dataset https://www.europeandataportal.eu/data/en/dataset/dgl9y8a8yw0hnlpnqvoybq
Tableau – Data visualization for IoT datasets http://semanticommunity.info/Data_Science/Tableau_Public_Data_Sets
Linked Sensor Data set https://datahub.io/dataset/knoesis-linked-sensor-data
Non-PHI/PII IOT Data Collections https://idash.ucsd.edu/data-collections
GeoLife GPS Trajectories dataset https://www.microsoft.com/en-us/download/details.aspx?id=52367&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2Fb16d359d-d164-469e-9fd4-daa38f2b2e13%2F
CityPulse Dataset Collection http://www.ict-citypulse.eu/page/content/tools-and-data-sets