NLP Data Sets?

Posts : 2 Join date : 2015-10-28

Hey Theo & All,

I had a thought today that seemed interesting. Would you consider the PDFs of all the WikiLeaks cables an interesting data set? Suppose for a minute that you could extract the text from all of them; I'm wondering whether you could use it to do some interesting natural language processing.

Do you think that could be interesting? I can't think of a practical use for it that I could state in a sentence, but it could be a good exercise for learning more about NLP. I guess it's a novelty thing, because you could do the same learning with a data set such as "text from every book in 2015", but the WikiLeaks cables just sound novel and fun haha.

Admin Posts : 14 Join date : 2015-10-27

There are a bunch of text data sets you can use; I'll try to dig up some alternatives for you this weekend. The difficulty with NLP is figuring out what you're trying to do. Build a better generative model? Get lower perplexity on a given corpus? Etc. I think with NLP the goal is actually pretty open. One interesting area being pushed forward is memory neural networks for question answering. You can also do sentiment analysis or community detection with Twitter data. The difficulty with the WikiLeaks data is that I can't think of much to do with it beyond just running latent Dirichlet allocation. At the same time, if you could post a link or explain how to get "text of all books", that would be great.
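
To make the "lower perplexity for a given corpus" goal concrete, here is a minimal sketch of how perplexity can be computed for about the simplest possible language model: a unigram model with add-one smoothing. The function name, the toy sentences, and the choice of smoothing are all assumptions made for illustration, not something from a particular library or from the cables themselves. A real experiment would use a much larger corpus and a stronger model.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model on test_tokens.

    Hypothetical helper for illustration only.
    """
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    counts = Counter(train_tokens)
    # Vocabulary = word types seen in training + one slot for any unseen word.
    vocab_size = len(counts) + 1
    total = len(train_tokens)
    log_prob = 0.0
    for tok in test_tokens:
        # Add-one smoothing: unseen words get a small non-zero probability.
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    # Perplexity is exp of the average negative log-probability per token.
    return math.exp(-log_prob / len(test_tokens))


# Toy data (made up), standing in for text extracted from a real corpus.
train = "the cable was sent from the embassy to the ministry".split()
test = "the embassy sent the cable".split()
print(unigram_perplexity(train, test))
```

Lower is better: text that looks like the training data gets a lower perplexity than text full of words the model has never seen, which is why it works as a yardstick for comparing models on the same corpus.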
