I have managed to contact Diccon Close at SIRCA, who is able to supply access to some much-needed ASX equities data, including full order book and tick data. Access to this would be critical for evaluating the real-world performance of the sentiment indicators produced by the system. SIRCA also has distribution rights to Reuters corpora. I've been thinking that it may be a good idea to crawl parts of this Reuters corpus to incorporate more recent information, even if only to obtain tf-idf scores for equities news. Also, time stamps have been stripped from the original corpus, which makes it difficult to see the exact timing of an information release relative to the stock price.
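
As a rough illustration of the tf-idf scoring mentioned above, here is a minimal sketch. It assumes a tiny in-memory list of made-up headlines rather than the real Reuters data, uses whitespace tokenisation, and takes tf as relative term frequency with idf as log(N / df); the function name `tf_idf` and the sample headlines are hypothetical, and a real pipeline would likely want proper tokenisation and smoothing.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf is the term count divided by the document length;
    idf is log(N / df), where df is the number of documents containing the term.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: count each term once per document it appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical headlines, not taken from any real corpus.
headlines = [
    "bhp profit rises on iron ore demand",
    "rio profit falls on weak iron ore prices",
    "bank shares rally after rate decision",
]
scores = tf_idf(headlines)
```

Under these assumptions, a term that appears in only one headline (such as "bhp") scores higher than one shared across several (such as "profit"), and a term present in every document scores zero.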

I have made a diagram of how news articles should be classified. It may need to be refined over time to distinguish more fine-grained document types:

