Some pointers to where real data can be found on the web.
Time-series:
- Economic: http://www.economicswebinstitute.org/ecdata.htm
- Industrial: http://homes.esat.kuleuven.be/~smc/daisy/daisydata.html
- TSDL: http://robjhyndman.com/TSDL/
- gov data: yougov.se, yougov.co.uk, data.gov
- EEG: http://sccn.ucsd.edu/~arno/fam2data/publicly_available_EEG_data.html
- Mike West: http://www.stat.duke.edu/~mw/ts_data_sets.html
- UWO: http://www.stats.uwo.ca/faculty/aim/epubs/datasets/default.htm
- MLdata: http://mldata.org/
- UCI data: http://archive.ics.uci.edu/ml/index.html
- INEX: http://inex.otago.ac.nz/, http://webspam.lip6.fr/
- PASCAL: http://pascallin2.ecs.soton.ac.uk/Challenges/
- Clopinet: http://clopinet.com/challenges/
- KD nuggets: http://www.kdnuggets.com/datasets/competitions.html
- Delicious: http://www.delicious.com/pskomoroch/dataset, http://www.datawrangling.com/some-datasets-available-on-the-web
- Datamob: http://datamob.org
- Ranking: http://learningtorankchallenge.yahoo.com/, http://research.microsoft.com/en-us/projects/mslr/
- ed.ac.uk: http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html
- Million Song: http://labrosa.ee.columbia.edu/millionsong/
- Nokia: http://research.nokia.com/mdc
- Yandex: http://imat-relpred.yandex.ru/en
- biomed: http://datam.i2r.a-star.edu.sg/datasets/krbd/
- Kaggle: http://www.kaggle.com/
- Mindboggle: http://mindboggle.info/index.html
- CAMrA: http://2011.camrachallenge.com/
- Statistical Machine Translation: http://www.statmt.org/
- ENRON email dataset: https://www.cs.cmu.edu/~./enron
- Movielens: https://www.researchgate.net/publication/305682388_Mise-en-Scene_Dataset_Stylistic_Visual_Features_of_Movie_Trailers_description
- Youtube dataset: http://netsg.cs.sfu.ca/youtubedata/, https://research.googleblog.com/2016/09/announcing-youtube-8m-large-and-diverse.html
- Images: http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm
- Big data:
- http://usgovxml.com
- http://aws.amazon.com/datasets
- http://databib.org
- http://datacite.org
- http://figshare.com
- http://linkeddata.org
- http://reddit.com/r/datasets
- http://thedatahub.org alias http://ckan.net
- http://quandl.com
- Social Network Analysis Interactive Dataset Library (Social Network Datasets)
- Datasets for Data Mining
- http://enigma.io
- Single datasets and data repositories
- http://archive.ics.uci.edu/ml/
- http://crawdad.org/
- http://data.austintexas.gov
- http://data.cityofchicago.org
- http://data.govloop.com
- http://data.gov.uk/
- http://data.medicare.gov
- http://data.seattle.gov
- http://data.sfgov.org
- http://data.sunlightlabs.com
- https://datamarket.azure.com/
- http://developer.yahoo.com/geo/g...
- http://econ.worldbank.org/datasets
- http://en.wikipedia.org/wiki/Wik...
- http://factfinder.census.gov/ser...
- http://ftp.ncbi.nih.gov/
- http://gettingpastgo.socrata.com
- http://googleresearch.blogspot.c...
- http://books.google.com/ngrams/
- http://medihal.archives-ouvertes.fr
- http://public.resource.org/
- http://rechercheisidore.fr
- http://snap.stanford.edu/data/in...
- http://timetric.com/public-data/
- https://wist.echo.nasa.gov/~wist...
- http://www2.jpl.nasa.gov/srtm
- http://www.archives.gov/research...
- http://www.bls.gov/
- http://www.crunchbase.com/
- http://www.dartmouthatlas.org/
- http://www.data.gov/
- http://www.datakc.org
- http://dbpedia.org
- http://www.delicious.com/jbaldwi...
- http://www.faa.gov/data_research/
- http://www.factual.com/
- http://research.stlouisfed.org/f...
- http://www.freebase.com/
- http://www.google.com/publicdata...
- http://www.guardian.co.uk/news/d...
- http://www.infochimps.com
- http://www.kaggle.com/
- http://build.kiva.org/
- http://www.nationalarchives.gov....
- http://www.nyc.gov/html/datamine...
- http://www.ordnancesurvey.co.uk/...
- http://www.philwhln.com/how-to-g...
- http://www.imdb.com/interfaces
- http://imat-relpred.yandex.ru/en...
- http://www.dados.gov.pt/pt/catal...
- http://knoema.com
- http://daten.berlin.de/
- http://www.qunb.com
- http://data.reegle.info/
- http://data.wien.gv.at/
- http://data.gov.bc.ca
- https://pslcdatashop.web.cmu.edu/ (interaction data in learning environments)
- http://www.icpsr.umich.edu/icpsrweb/CPES/ (Collaborative Psychiatric Epidemiology Surveys: a collection of three national surveys, focused on the major ethnic groups, that study psychiatric illnesses and health services use)
- http://www.dati.gov.it
- http://dati.trentino.it
Biology
- 1000 Genomes
- Collaborative Research in Computational Neuroscience (CRCNS)
- Gene Expression Omnibus (GEO)
- Human Microbiome Project (HMP)
- ICOS PSP Benchmark
- MIT Cancer Genomics Data
- NIH Microarray data (FTP)
- Protein Data Bank
- PubChem Project
- PubGene (now Coremine Medical)
- Stanford Microarray Data
- The Personal Genome Project or PGP
- UCSC Public Data
- UniGene
Climate and weather
- Australian Weather
- Canadian Meteorological Centre
- Climate Data from UEA (updated monthly)
- Global Climate Data Since 1929
- NASA Global Imagery Browse Services
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- WU Historical Weather Worldwide
Complex networks
- CrossRef DOI URLs
- DBLP Citation dataset
- NBER Patent Citations
- NIST complex networks data collection
- Small Network Data
- UCI Network Data Repository
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Stanford GraphBase (Steven Skiena)
- Stanford Large Network Dataset Collection
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- The Nexus Network Repository
- UFL sparse matrix collection
- WSU Graph Database
Computer networks and web data
- 3.5B Web Pages from CommonCrawl 2012
- 53.5B Web clicks of 100K users in Indiana Univ.
- CAIDA Internet Datasets
- ClueWeb09 - 1B web pages
- ClueWeb12 - 733M web pages
- CommonCrawl Web Data over 7 years
- CRAWDAD Wireless datasets from Dartmouth Univ.
- Criteo click-through data
- Open Mobile Data by MobiPerf
- UCSD Network Telescope, IPv4 /8 net
Data challenges
- Challenges in Machine Learning
- D4D Challenge of Orange
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- Kaggle Competition Data
- KDD Cup by Tencent 2012
- Localytics Data Visualization Challenge
- Netflix Prize
- Space Apps Challenge
- Telecom Italia Big Data Challenge
- Yelp Dataset Challenge
Energy
Finance
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- OANDA
- OSU Financial data
- Quandl
- St Louis Federal
- Yahoo Finance
Geospatial
- BODC - marine data of ~22K vars
- Cambridge, MA, US, GIS data on GitHub
- EOSDIS - NASA's earth observing system data
- Factual Global Location Data
- Geo Spatial Data from ASU
- GeoNames Worldwide
- Global Administrative Areas Database (GADM)
- Landsat 8 on AWS
- Natural Earth - vectors and rasters of the world
- Open Street Map (OSM)
- TIGER/Line - U.S. boundaries and roads
- TwoFishes - Foursquare's coarse geocoder
- TZ Timezones shapefiles
- World countries in multiple formats
- OpenAddresses
Government
- Australia (abs.gov.au)
- Australia (data.gov.au)
- Brazil
- Cambridge, MA, US
- Canada
- Chicago
- Dallas Open Data
- Denver Open Data
- EuroStat
- FedStats
- Finland
- France
- Germany
- Glasgow, Scotland, UK
- Guardian world governments
- Indian Government Data
- London Datastore, UK
- MassGIS, Massachusetts, U.S.
- Netherlands
- New Zealand
- NYC betanyc
- NYC Open Data
- OECD
- Open Government Data (OGD) Platform India
- San Francisco Data sets
- Seattle
- South Africa
- The World Bank
- U.K. Government Data
- U.S. American Community Survey
- U.S. CDC Public Health datasets
- U.S. Census Bureau
- U.S. Department of Housing and Urban Development (HUD)
- U.S. Federal Government Agencies
- U.S. Federal Government Data Catalog
- U.S. Food and Drug Administration (FDA)
- U.S. Open Government
- UK 2011 Census Open Atlas Project
- United Nations
Healthcare
- EHDP Large Health Data Sets
- Gapminder World, demographic databases
- Medicare Coverage Database (MCD), U.S.
- Medicare Data Engine of medicare.gov Data
- Medicare Data File
- Number of Ebola Cases and Deaths in Affected Countries (2014)
Images
- 10k US Adult Faces Database
- 2GB of Photos of Cats
- Affective Image Classification
- Face Recognition Benchmark
- ImageNet (in WordNet hierarchy)
- International Affective Picture System, UFL
- Massive Visual Memory Stimuli, MIT
- SUN database, MIT
Machine learning
- Delve Datasets for classification and regression (Univ. of Toronto)
- Discogs Monthly Data
- eBay Online Auctions (2012)
- IMDb Database
- Keel Repository for classification, regression and time series
- Lending Club Loan Data
- Machine Learning Data Set Repository
- Million Song Dataset
- More Song Datasets
- MovieLens Data Sets
- RDataMining - "R and Data Mining" ebook data
- Registered Meteorites on Earth
- Restaurants Health Score Data in San Francisco
- UCI Machine Learning Repository
- Yahoo! Ratings and Classification Data
Museums
- Cooper-Hewitt's Collection Database
- Minneapolis Institute of Arts metadata
- Tate Collection metadata
- The Getty vocabularies
Natural language
- Blogger Corpus
- ClueWeb09 FACC
- ClueWeb12 FACC
- DBpedia - 4.58M things with 583M facts
- Flickr Personal Taxonomies
- Google Books Ngrams (2.2TB)
- Google Web 5gram (1TB, 2006)
- Gutenberg eBooks List
- Hansards text chunks of Canadian Parliament
- Machine Translation of European languages
- SMS Spam Collection in English
- USENET postings corpus of 2005~2011
- Wikidata - Wikipedia databases
- Wikipedia Links data - 40 Million Entities in Context
- WordNet databases and tools
Public Domains
- Amazon
- Archive.org Datasets
- CMU JASA data archive
- CMU StatLab collections
- Data360
- Datamob.org
- Infochimps
- KDNuggets Data Collections
- Numbray
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- Stats4Stem R data sets
- StatSci.org
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
- Academic Torrents of data sharing from UMB
- Archive-it from Internet Archive
- Datahub.io
- DataMarket (Qlik)
- Freebase.com of people, places, and things
- Harvard Dataverse Network of scientific data
- ICPSR (UMICH)
- Open Data Certificates (beta)
- Statista.com - statistics and studies
- Ancestry.com Forum Dataset over 10 years
- CMU Enron Email of 150 users
- Facebook Data Scrape (2005)
- Facebook Social Networks from LAW (since 2007)
- Foursquare Social Network in 2010, 2011
- Foursquare from UMN/Sarwat (2013)
- General Social Survey (GSS) since 1972
- GetGlue - users rating TV shows
- GitHub Collaboration Archive
- MIT Reality Mining Dataset
- Mobile Social Networks from UMASS
- PewResearch Internet Survey Project
- SourceForge.net Research Data
- StackExchange Data Explorer
- Titanic Survival Data Set
- Twitter Graph of entire Twitter site
- UCB's Archive of Social Science Data (D-Lab)
- UCLA Social Sciences Data Archive
- UNIMI/LAW Social Network Datasets
- Universities Worldwide
- UPJOHN for Labor Employment Research
- Yahoo! Graph and Social Data
- Youtube Video Social Graph in 2007, 2008
- Google Scholar citation relations
- Political Polarity Data
Sports
- Betfair Historical Exchange Data
- Cricsheet Matches (cricket)
- Ergast Formula 1, from 1950 up to date (API)
- Football/Soccer resources (data and APIs)
- Lahman's Baseball Database
- Retrosheet Baseball Statistics
Transportation
- Airlines OD Data 1987-2008
- Bike Share Systems (BSS) collection
- Bay Area Bike Share Data
- GeoLife GPS Trajectory from Microsoft Research
- Hubway Million Rides in MA
- Marine Traffic - ship tracks, port calls and more
- NYC Taxi Trip Data 2013 (FOIA/FOILed)
- OpenFlights - airport, airline and route data
- RITA Airline On-Time Performance data
- RITA/BTS transport data collection (TranStat)
- Transport for London (TFL)
- Travel Tracker Survey (TTS) for Chicago
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Domestic Flights 1990 to 2009
- U.S. Freight Analysis Framework since 2007
- DataWrangling: Some Datasets Available on the Web
- Inside-r: Finding Data on the Internet
- Quora: Where can I find large datasets open to the public?
- RS.io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives
- OpenDataMonitor: An overview of available open data resources in Europe
Copy-pasted, to be processed:
- Big data sets available for free
BioMed:
- Statlib: http://lib.stat.cmu.edu/datasets/
- StatSci: http://www.statsci.org/datasets.html
- Klein-book: http://www.mcw.edu/biostatistics/Faculty/Faculty/JohnPKleinPhD/SurvivalAnalysisBook/DataSetsBothEditions.htm
- PhysioBank: http://physionet.caregroup.harvard.edu/physiobank/database/
- PhysioNet: http://www.physionet.org/challenge/
- GLIMs: http://www.sci.usq.edu.au/staff/dunn/Datasets/tech-glms.html
- NeuroVault: http://neurovault.org/
- OpenfMRI: https://openfmri.org/
Software:
- CVX: http://cvxr.com/cvx/
- Tfocs: http://tfocs.stanford.edu/
- Mosek: http://www.mosek.com/
- Shogun: http://www.shogun-toolbox.org/
- Weka: http://www.cs.waikato.ac.nz/ml/weka/
- Mahout: http://mahout.apache.org/
- Google TensorFlow: https://www.tensorflow.org/
- IBM DataWorks: http://www.ibm.com/analytics/us/en/watson-dataworks-project/
- MS: https://azure.microsoft.com/en-us/services/machine-learning/
ML Networks:
- NERF: http://www.nerf.be/
- Kurzweil: http://www.kurzweilai.net/
- Sciencemag: http://www.sciencemag.org/site/feature/data/compsci/machine_learning.xhtml
- PASCAL: http://www.pascal-network.org/
Blogs:
- Hunch: http://hunch.net/
- Nuit Blanche: http://nuit-blanche.blogspot.se/
- My Biased Coin: http://mybiasedcoin.blogspot.se/
- Mark Reid’s: http://mark.reid.name/
- InherentUncertainty: http://www.inherentuncertainty.org/