ClueWeb09 Dataset 25TB

Admin Posts : 14 Join date : 2015-10-27

This is a collection of webpages, about 25 TB in total, divided into two components: the actual pages in HTML (plus JavaScript, etc.) and a web graph of unique URLs. While you could feasibly write a web scraper to collect this kind of data, the dataset contains over a billion webpages, which is probably beyond the scope of anything we could do ourselves without paying for a huge Amazon Web Services instance.

Unfortunately, you have to apply for access (information in the link below). From the description I doubt they'd release it to just anybody, so unless you're in an academic setting, the best way to get access would probably be to talk to Cal and see if he has any advice or would be willing to support a group of us in applying for the dataset. Alternatively, under online services it appears there is an API to access the data, and it is also available through some cloud services, so that could be another possibility.

This could be a very interesting dataset to work with. 25 TB is much too large for the machines most of us probably have, so we'd probably have to use Spark. A project on data of this scale would probably be very impressive, and there are a lot of interesting questions to ask about the data.
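
To put the 25 TB figure in perspective, here is a rough back-of-the-envelope sketch in Python. The 150 MB/s read speed and the 50-worker cluster are assumptions I picked for illustration, not measurements or anything from the dataset description, and the page count is rounded down to an even billion.

```python
# Back-of-the-envelope estimate only. Throughput and worker count are
# assumed values for illustration, not measured figures.
DATASET_BYTES = 25 * 10**12   # 25 TB, as stated for the dataset
PAGE_COUNT = 10**9            # "over a billion" pages, rounded down


def scan_hours(total_bytes, bytes_per_second, workers=1):
    """Hours to read total_bytes once, split evenly across workers."""
    return total_bytes / (bytes_per_second * workers) / 3600


# Average size of one page in kilobytes.
avg_page_kb = DATASET_BYTES / PAGE_COUNT / 1000

# One machine reading at an assumed 150 MB/s.
single = scan_hours(DATASET_BYTES, 150 * 10**6)

# A hypothetical 50-worker cluster at the same per-worker speed.
cluster = scan_hours(DATASET_BYTES, 150 * 10**6, workers=50)

print(f"average page size: {avg_page_kb:.0f} KB")
print(f"one machine:       {single:.1f} hours per full pass")
print(f"50 workers:        {cluster:.1f} hours per full pass")
```

Under those assumptions a single full read of the data takes about two days on one machine and under an hour when split across the cluster, which is roughly why something like Spark looks necessary here. Real numbers would depend on storage, network, and how the data is compressed.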

http://lemurproject.org/clueweb09/