Hey Theo & All,
I had an interesting thought today. Would you consider the PDFs of all the WikiLeaks cables an interesting data set? Suppose for a minute that you could extract the text from all of them; I'm wondering whether you could use it to do some interesting natural language processing.
Do you think that could be interesting? I can't envision a practical use for it that I could state in a sentence, but it could be a good exercise for learning more about NLP. I guess it's a novelty thing, because you could do the same learning with a data set like "text from every book in 2015", but the WikiLeaks cables are just novel and fun-sounding haha.