Dataset Resources

Blog articles which provide dataset directories

http://www.datawrangling.com/some-datasets-available-on-the-web.html
http://www.daniel-lemire.com/blog/data-for-data-mining/ – has blog, tag cloud, wiki dataset categories
http://www.kirix.com/blog/category/data-tagssearch/
http://mobblog.cs.ucl.ac.uk/datasets/
http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php – Article containing a list of available dataset websites
http://rs.io/2014/05/29/list-of-data-sets.html – Article describing 100+ datasets

Dataset directories

http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public – Public datasets listed on a Quora Q&A thread.
http://caw2.barcelonamedia.org/node/7 – Content Analysis for the Web 2.0 (CAW 2.0) Workshop – part of the 18th International Conference of the World Wide Web. Contains training and test datasets from Twitter, MySpace, Slashdot, Ciao and Kongregate.
http://kdd.ics.uci.edu/ – has a machine learning repository
http://archive.ics.uci.edu/ml/datasets.html
http://ckan.net/ – listing of links to various datasets
http://www.ldc.upenn.edu/Obtaining/ – Linguistic data consortium catalog
http://www.swivel.com/data_sets
http://datamob.org/datasets
http://infochimps.org/
http://www.freebase.com/
http://numbrary.com/
http://theinfo.org/
http://www.trustlet.org/wiki/Repositories_of_datasets
http://del.icio.us/kirixstrata/publicdata
http://services.alphaworks.ibm.com/manyeyes/browse/data?q=null
http://googleresearch.blogspot.com/ – Google Research has stated that http://research.google.com will soon host open-source scientific datasets (see http://blog.wired.com/wiredscience/2008/01/google-to-provi.html) – watch this space.
http://data.un.org/
http://www.data360.org/index.aspx
http://tunedit.org/search?q=arff – 800 datasets in ARFF format for different problems and application domains
http://wikiposit.com
http://gsociology.icaap.org/dataupload.html – The Global Social Change Research Project – social, political and economic datasets
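
Several of the directories above distribute data in ARFF format. As a rough illustration of what such a file looks like, here is a minimal sketch of a reader for simple, dense ARFF files. The function name `read_arff` and the toy sample are made up for this example, and the parser deliberately ignores quoted values, sparse rows and type conversion, so treat it as a starting point rather than a complete implementation.

```python
import io


def read_arff(lines):
    """Parse a simple dense ARFF file into (attribute_names, rows).

    Assumptions (hypothetical helper, not a full ARFF parser):
    no quoted names or values, no sparse rows, values kept as strings.
    """
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name
                names.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = [v.strip() for v in line.split(",")]
        # "?" marks a missing value in ARFF
        rows.append([None if v == "?" else v for v in values])
    return names, rows


# Toy file invented for this example, not taken from any repository.
SAMPLE = """% toy file
@relation weather
@attribute outlook {sunny,rainy}
@attribute temperature numeric
@data
sunny,85
rainy,?
"""

names, rows = read_arff(io.StringIO(SAMPLE))
print(names)
print(rows)
```

For real files, an existing parser such as `scipy.io.arff.loadarff` is likely a better choice, since it handles typed attributes.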

Data sets for a specific field

http://kaggle.com/ – machine learning competitions with prize money, using data provided by organisations
http://theinfo.org/get/data – good list here – pay attention to web/news/blogs and Text/Language categories as well as trust network data
http://research.microsoft.com/nlp/ – look under data sets
http://nlp.stanford.edu/links/statnlp.html – look under corpora
http://trec.nist.gov/data/reuters/reuters.html – Reuters Corpora – contains a large collection of news stories for use in Natural Language Processing, Information Retrieval and Machine Learning systems (need to order CDs)

http://trec.nist.gov/data.html – Text retrieval. Has spam, web, question answering, blog and ad hoc (e.g. relevance judgement) tracks
http://plg.uwaterloo.ca/~gvcormac/treccorpus/ (300MB) – Spam Corpus 2005
http://plg.uwaterloo.ca/~gvcormac/treccorpus06/ (75MB English, 60MB Chinese) – Spam Corpus 2006
http://trec.nist.gov/data/reljudge_eng.html – Relevance Judgement
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html (25GB – costs 400 GBP) – Blog 06 data
http://trec.nist.gov/data/qamain.html – Question Answering (many tracks)
http://trec.nist.gov/data/novelty.html – Novelty (some relevance)

http://infochimps.org/tag/language/datasets – languages
http://infochimps.org/tag/lexicon/datasets – lexicon
http://infochimps.org/tag/lexical/datasets – lexical

http://wordnet.princeton.edu/ – Lexical database that is handy for computational linguistics and natural language processing
http://www.dmoz.org/Computers/Artificial_Intelligence/Machine_Learning/Datasets/ – Machine learning datasets
http://cervisia.org/machine_learning_data.php – Machine learning datasets – for benchmark data to compare different classifier algorithms, http://www.ci.tuwien.ac.at/~meyer/benchdata/ is recommended
http://mill.ucsd.edu/index.php?page=Datasets&subpage=Overview
http://www.trustlet.org/wiki/Trust_network_datasets#Released_datasets – Trust datasets – includes Epinions
http://stuff.metafilter.com/infodump/ – Metafilter – contains posts, comments, tags, favourites, contact and user data
http://an.kaist.ac.kr/traces/IMC2007.html – YouTube dataset
http://socialnetworks.mpi-sws.mpg.de/ – social network dataset
http://people.csail.mit.edu/jrennie/20Newsgroups/ – newsgroup dataset
http://www.yr-bcn.es/webspam/datasets/ – Webspam datasets

Link Analysis / Social Networks

http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html
http://www.cs.toronto.edu/~tsap/experiments/download/download.html
http://strict.dista.uninsubria.it/?p=364 – Twitter dataset – friends network for 2009 and 2013

Natural Language Processing

http://www.certifiedchinesetranslation.com/openaccess/WordNet/ – Multilingual WordNet List containing many languages

Recommender systems

http://www.grouplens.org/ – MovieLens
http://www.ieor.berkeley.edu/~goldberg/jester-data/ – Jester
http://www.netflixprize.com/ – Netflix
http://www.informatik.uni-freiburg.de/~cziegler/BX/ – Book Crossing

Forums

http://weimo.de/node/642 – Nabble.com + user ratings of posts

Blogs

http://ebiquity.umbc.edu/resource/html/id/212/Splog-Blog-Dataset – Spam blogs (splogs)
http://www.icwsm.org/data.html – 14 million posts, 3 million weblogs – apparently no longer available since Dec 8, 2006
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html – Blog 06 data (25GB), but costs 400 GBP!

Wikis

http://labs.systemone.at/wikipedia3 – Wikipedia 3, providing Wikipedia datasets
http://download.wikipedia.org/ – official Wikipedia database dumps (very large)
http://download.freebase.com/wex/ – English Wikipedia articles that have been transformed into XML – all files ~ 55GB
http://dbpedia.org/About – structured information from Wikipedia – a dataset of this is available
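
Because the Wikipedia dumps are very large, they are usually processed as a stream rather than loaded into memory. The sketch below shows one way to do that with Python's standard library. It assumes the usual MediaWiki export layout of `page` elements containing a `title` and a `revision` with a `text` child. The function names, the sample document and its namespace URL are illustrative assumptions, not taken from any of the sites above.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(fileobj):
    """Yield (title, text) pairs from a MediaWiki-style XML export.

    Hypothetical helper: assumes <page> elements that contain a <title>
    and a <revision><text>. Pages are cleared once yielded so that memory
    use stays small on very large files.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children


# Tiny invented sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello, world.</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text></text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Real dumps are typically distributed compressed. Passing a file object opened with `bz2.open(path, "rb")` instead of the in-memory sample should work the same way, assuming the file follows the layout above.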

Webpages

http://www.archive.org/web/web.php – 85 billion webpages archived since 1996

Misc

http://opentick.com/ – Stock data
http://lib.stat.cmu.edu/datasets/ – miscellaneous datasets
http://lib.stat.cmu.edu/jasadata/ – datasets from Journal of the American Statistical Association
http://musicbrainz.org/ – music dataset
http://www.jigsaw.com/ – directory of companies and business professionals
http://www.librarything.com/ – library catalogue
http://www.imeem.com/developers – media library
http://www.scribd.com/doc/9582/integrating-wikipediawordnet – article on integrating WordNet and Wikipedia with YAGO (an extensible and light-weight ontology)
http://wiki.openstreetmap.org/index.php/Potential_Datasources – country maps
http://rdf.dmoz.org/ – open directory project dataset
http://personality-testing.info/_rawdata/ – online personality tests data