Top Performers in Data Science

Forum for Top Performance course users in Data Science
 ClueWeb09 Dataset 25TB

Posts : 14
Join date : 2015-10-27

Post subject: ClueWeb09 Dataset 25TB   Fri Oct 30, 2015 11:34 pm

This is a collection of webpages (about 25 TB in total), divided into two components: the actual pages in HTML (plus JavaScript, etc.) and a web graph of unique URLs. While you could in principle write a web scraper to collect this data yourself, the collection contains over a billion webpages, which is probably beyond anything we could do without paying for a very large Amazon Web Services instance.

Unfortunately, you have to apply for access (information in link), and from the description I doubt they'd release it to just anybody. So unless you're in an academic setting, the best way to get access is probably to talk to Cal and see if he has any advice or would be willing to support a group of us in applying for the dataset. Alternatively, under online services there appears to be an API for accessing the data, and it is also available through some cloud services, so that could be another possibility.

This could be a very interesting dataset to work with. 25 TB is far too large for the machines most of us probably have, so we'd probably have to use Spark. A project on data of this scale would probably be very impressive, and there are a lot of interesting questions to ask about the data.
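
To get a rough feel for why 25 TB is too much for one machine, here is a back-of-the-envelope sketch in Python. The 25 TB and one billion page figures come from the description above; the 100 MB/s read rate and the 100-machine cluster are purely my assumptions for illustration, not measurements, and the function names are made up for this example.

```python
# Back-of-the-envelope scale estimate for a 25 TB, ~1 billion page collection.
# The read rate and worker count are assumptions, not measurements.

TB = 10**12  # decimal terabyte, in bytes


def avg_page_bytes(total_bytes, n_pages):
    """Average size of one page in bytes."""
    return total_bytes / n_pages


def scan_hours(total_bytes, bytes_per_sec_per_worker, n_workers=1):
    """Hours to read the data once, assuming perfectly parallel reads."""
    seconds = total_bytes / (bytes_per_sec_per_worker * n_workers)
    return seconds / 3600


total = 25 * TB
pages = 1_000_000_000      # "over a billion", rounded down
disk_rate = 100 * 10**6    # assumed 100 MB/s sequential read per machine

print(f"average page size: {avg_page_bytes(total, pages) / 1000:.0f} KB")
print(f"one machine:  {scan_hours(total, disk_rate):.1f} hours per full pass")
print(f"100 machines: {scan_hours(total, disk_rate, 100):.1f} hours per full pass")
```

Under those assumptions a single pass over the data takes roughly 69 hours on one machine, and well under an hour if the reads are spread evenly over 100 machines, which is the kind of parallelism something like Spark is meant to provide. Real numbers would depend on compression, network, and how the data is stored.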