I have parsed each of the 60,000 abstracts with C&C and Boxer, and have obtained some promising results. For the moment I am only using the parts of speech, named-entity tags, and words/lemmas from the C&C XML output. I prepare each abstract by combining the abstract information from the CSV file with the parsed XML, extracting each word's lemma, performing some currency operations, looking up SentiWordNet scores, and then serialising the resulting object. After that, I can quickly load the serialised objects and analyse them (analysing 10,000 takes approximately 2 minutes). I then write the features and annotations to a file for SVM Multiclass to learn from. Using 10,000 annotated abstracts, I split the data into 80% training and 20% test. I will soon implement 10-fold cross-validation (splitting the data ten times into 90%/10% train/test parts), which should give more robust results.
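The serialise-then-split workflow above can be sketched as follows. This is a minimal illustration, assuming the abstracts were pickled one object per file in an earlier pass; the function names and the fixed random seed are mine, not taken from the project's code.

```python
import pickle
import random

def load_abstracts(paths):
    """Load the serialised abstract objects (pickled in an earlier pass)."""
    abstracts = []
    for path in paths:
        with open(path, "rb") as f:
            abstracts.append(pickle.load(f))
    return abstracts

def ten_fold_splits(items, seed=42):
    """Yield (train, test) pairs for 10-fold cross-validation:
    each fold holds out a different 10% of the data for testing
    and trains on the remaining 90%."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so folds are reproducible
    fold_size = len(items) // 10
    for k in range(10):
        test = items[k * fold_size:(k + 1) * fold_size]
        train = items[:k * fold_size] + items[(k + 1) * fold_size:]
        yield train, test
```

Averaging the evaluation scores over the ten folds is what smooths out the luck of any single 80/20 split.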
Current results are:
per-class precision: 0.5774    overall precision: 0.5811
per-class recall:    0.5589    overall recall:    0.5811
per-class F-score:   0.5423    overall F-score:   0.5811
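The "overall" figures are identical across precision, recall and F-score, which is exactly what micro-averaging produces on single-label multiclass data: every false positive for one class is a false negative for another, so the micro scores all collapse to accuracy. A minimal sketch of the two averaging schemes (the function is illustrative, not taken from the evaluation script):

```python
from collections import Counter

def per_class_and_overall(gold, pred):
    """Macro ('per class') and micro ('overall') scores.
    Returns ((macro_prec, macro_rec, macro_f1), micro), where for
    single-label data the micro score equals accuracy."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p wrongly: false positive for p
            fn[g] += 1  # missed g: false negative for g
    precs, recs, f1s = [], [], []
    for label in labels:
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    macro = (sum(precs) / len(labels),
             sum(recs) / len(labels),
             sum(f1s) / len(labels))
    micro = sum(tp.values()) / len(gold)  # == accuracy for single-label data
    return macro, micro
```

The macro scores weight every class equally, so the gap between the per-class F-score (0.5423) and the overall figure (0.5811) suggests the classifier does worse on the rarer classes.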
So with a per-class F-score of 54% there is still a lot of room for improvement before the results are substantial. More experimentation with the features is needed: at the moment I am using the top 1,000 unigrams, bigrams and trigrams, plus finance, economics and accounting gazetteers, for feature inclusion. More work is needed on the contextual features around these important terms.
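The n-gram and gazetteer features described above, together with a simple context window around gazetteer hits, might be extracted along these lines. This is only a sketch of the general technique; the feature names, window size and inputs are hypothetical, not the project's actual feature set.

```python
def extract_features(tokens, top_ngrams, gazetteer, window=2):
    """Fire an indicator feature for each top-ranked n-gram present in the
    abstract, plus gazetteer membership and the words in a small window
    around each gazetteer hit (the contextual features)."""
    features = set()
    # top unigram/bigram/trigram indicator features
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in top_ngrams:
                features.add(f"ngram={gram}")
    # gazetteer hits and their surrounding context words
    for i, tok in enumerate(tokens):
        if tok in gazetteer:
            features.add(f"gaz={tok}")
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    features.add(f"ctx[{j - i}]={tokens[j]}")
    return features
```

To feed SVM Multiclass, each feature set would then be mapped to sparse `index:value` pairs with the class label in front, one line per abstract.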