Monday, August 30, 2010

Current Results & Presentation

I did a presentation last Friday on my current progress in the thesis, which can be found below.
Since then, I have managed to get Naive Bayes running in NLTK and it performs particularly well, almost in line with SVM and above Maximum Entropy. I have also been able to combine the three classifiers in a max-votes strategy, with the following results:
weighted precision: 66.74%    modified precision: 78.12%
weighted recall:    67.29%    modified recall:    75.86%
weighted F-score:   67.01%    modified F-score:   76.97%
It's important to remember that these (near-final) results are based on single annotations.
As I am currently gathering more annotations, I will update these figures with hopefully better performance
(though on a smaller subset of articles).
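The max-votes combination of the three classifiers can be sketched as a simple majority vote. This is a minimal illustration only; the tie-breaking rule (fall back to the first-listed classifier when all three disagree) is my assumption, not necessarily how the thesis code resolves ties:

```python
from collections import Counter

def max_votes(predictions):
    """Combine per-classifier labels for one document by majority vote.

    `predictions` holds the label from each classifier in a fixed order,
    e.g. [naive_bayes_label, svm_label, maxent_label].
    """
    counts = Counter(predictions)
    top_label, top_count = counts.most_common(1)[0]
    # Three-way tie (all classifiers disagree): fall back to the
    # first classifier's answer.
    if list(counts.values()).count(top_count) > 1:
        return predictions[0]
    return top_label

# Two classifiers agree, the third dissents: the majority wins.
print(max_votes(['pos', 'pos', 'neu']))  # pos
```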

Monday, August 9, 2010

Progress over the holidays

I have been hard at work and finally got some machine-learning results from the corpus data.
I have parsed each of the 60,000 abstracts through C&C and Boxer, and have obtained some promising results. For the moment, I am just using the parts of speech, entity-recognition tags and words/lemmas from the C&C XML output.

I prepare each abstract by combining the abstract information in the CSV file with the parsed XML, finding each word's lemma, performing some currency operations and attaching SentiWordNet scores, and then serialising the object. After that, I can quickly load the serialised objects and analyse them (10,000 take approx. 2 minutes to analyse). I then write the features and annotations to a file for SVM Multiclass to learn from.

Using 10,000 annotated abstracts, I split the data into 20% test and 80% train. I will soon implement code for 10-fold cross-validation (splitting the data 10 times into 10%/90% parts), which will give me more rounded results.
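The planned 10-fold cross-validation split can be sketched as follows; a minimal stand-alone version (the shuffling seed and generator interface are my own choices, not the thesis code's):

```python
import random

def ten_fold_splits(items, seed=0):
    """Yield (train, test) pairs for 10-fold cross-validation.

    Shuffles once, then rotates a different ~10% slice out as the
    test set on each of the 10 rounds, training on the remaining 90%.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    fold = len(items) // 10
    for i in range(10):
        lo = i * fold
        hi = (i + 1) * fold if i < 9 else len(items)  # last fold takes the remainder
        test = items[lo:hi]
        train = items[:lo] + items[hi:]
        yield train, test
```

Averaging the evaluation metrics over the 10 rounds gives the "more rounded" figure, since every abstract is tested exactly once.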

Current results are:
per-class precision: 0.5774    overall precision: 0.5811
per-class recall:    0.5589    overall recall:    0.5811
per-class F-score:   0.5423    overall F-score:   0.5811

Confusion matrix:


So with an F-score of 54%, there is still a lot of improvement needed to obtain more substantial results. More experimentation with the features is required: at the moment I am using the top 1,000 unigrams, bigrams and trigrams, plus finance, economics and accounting gazetteers for feature inclusion. More work is needed on the contextual features around these important terms.
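The top-1,000 n-gram feature selection described above can be sketched like this; a simplified version that counts raw frequency and emits binary presence features (the real feature set also folds in the gazetteers and contextual information):

```python
from collections import Counter
from itertools import chain

def top_ngrams(tokenised_docs, n, k=1000):
    """Return the k most frequent n-grams across all token lists."""
    def ngrams(tokens):
        return zip(*(tokens[i:] for i in range(n)))
    counts = Counter(chain.from_iterable(ngrams(d) for d in tokenised_docs))
    return [gram for gram, _ in counts.most_common(k)]

def binary_features(tokens, vocab_ngrams, n):
    """1/0 vector: does each selected n-gram occur in this document?"""
    present = set(zip(*(tokens[i:] for i in range(n))))
    return [1 if gram in present else 0 for gram in vocab_ngrams]
```

These binary vectors (one slice per n-gram order) are the sort of sparse input SVM Multiclass expects, written out one line per document.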

Friday, June 4, 2010

Progress Report

I have completed a progress report on the treatise, which includes approx. 20 pages of introductory and background material, as well as a 'work carried out thus far' report. I have decided to look into combinatory categorial grammar as a way to disambiguate the deep semantic features of the text, and to use these in my sentiment analysis, with the aid of James Curran's C&C parser and Bos's Boxer. Looking forward, I plan to start coding the system and experimentation in the first week of the holidays (28th June onwards). The system will be written in Python due to its strong support for natural language processing.

Friday, May 21, 2010

Sentence Examples

I believe the language model of finance to be inherently different to that of other domains (which is rather justified, as it is a unique language domain with its own vocabulary). I propose a language model based on a combination of several factors: entities (companies, CEOs, management), financial terms (net income, profit, price etc.), industry-specific terms (technology, resources: iron ore, aluminium etc.), quantitative values ($ million, $ billion, per cent), directions (positive: rise, increase, outperform; negative: decrease, decline, fall) and the general sentiment of ordinary English words (SentiWordNet: good, bad, successful, poor). Combinations of these are expected to have a significant impact on the polarity of the (annotated) news abstracts. Direction and values are included specifically because the sentiment of these abstracts is largely driven by magnitude (larger percentage increases, bumper profits of $100 million). Of course, as in any natural language, the general rule always comes with exceptions. In addition, each abstract is typically about a single subject (takeover, reports, government, resources etc.).
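The proposed feature categories can be illustrated with a toy tagger; the tiny gazetteers below are hand-made stand-ins for the real finance/economics lexicons and SentiWordNet:

```python
# Hypothetical miniature gazetteers, one per proposed feature category.
GAZETTEERS = {
    'financial_term': {'profit', 'price', 'income'},
    'direction_pos': {'rise', 'increase', 'outperform'},
    'direction_neg': {'fall', 'decline', 'decrease'},
    'quantity': {'million', 'billion', 'per cent'},
}

def categorise(tokens):
    """Map each token to its feature category, or None if unmatched."""
    out = []
    for tok in tokens:
        cat = next((c for c, words in GAZETTEERS.items()
                    if tok.lower() in words), None)
        out.append((tok, cat))
    return out

print(categorise(['Profit', 'to', 'rise']))
# [('Profit', 'financial_term'), ('to', None), ('rise', 'direction_pos')]
```

A combination like financial_term + direction_pos + quantity is exactly the pattern ("profit rises $100 million") expected to carry strong polarity.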

I have made a couple of example sentences, with desired system output, which sum up the problem:

Wednesday, May 19, 2010


I have been able to contact Diccon Close at SIRCA, who is able to supply access to some much-needed ASX equities information, including full order-book and tick data. Access to this would be critical for evaluating the real-world performance of the sentiment indicators produced by the system. SIRCA also has distribution rights to the Reuters corpora. I have been thinking it may be a good idea to crawl parts of this Reuters corpus to incorporate more recent information, even if only to obtain tf-idf scores for equities news. Also, in the original corpus the time stamps have been removed, which makes it difficult to align the exact timing of an information release with the stock price.
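The tf-idf scoring mentioned above is standard; a minimal sketch over tokenised documents, using raw term frequency and log(N/df) weighting (one common variant among several):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document tf-idf scores for a list of token lists.

    tf  = raw count of the term in the document
    idf = log(N / df), where df counts documents containing the term
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: count * math.log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]
```

Terms appearing in every document (e.g. boilerplate equities-news phrasing) score zero, so the ranking surfaces what is distinctive about each article.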

I have made a diagram of how news articles should be classified; this may need to be refined over time into more fine-grained document types:

Tuesday, May 18, 2010

Biomedical Named Entity Recognition

This was an SNLP assignment which is relevant to my thesis: an in-depth look at biomedical named entity recognition. Although the task is complicated and not directly applicable to financial news, the material on POS tagging, entity recognition and machine learners will definitely be useful.

Biomedical Named Entity Recognition

Wednesday, April 21, 2010

Preliminary Corpora Analysis

After annotating a few hundred articles, I've noticed some unique problems inherent in the data set.
  • With such a large data set, it may be nearly impossible to annotate all 60,000 articles twice. Although I can do approximately 150-200 articles an hour, that still requires between 300 and 400 man-hours (150-ish days) for each annotator. Another issue is that the other annotator may be slower and 'worry' over the classification of each article.
  • I find the classification 'ambiguous' adds too much overhead for both the annotator and the machine learner (as in one of the previously annotated corpora). For this task, ambiguous should be equivalent to neutral, as an ambiguous article, like a neutral one, would have no effect on the sentiment towards a company.
  • Much of the ambiguity came from a vague article title rather than the content of the paragraph. Where the paragraph itself is ambiguous, it is most likely a case where there are rumours of company X being taken over by company Y, but the article is tagged to company Y while being primarily about company X. To me, this is neutral. In the case of an article about something silly, like Alcoa acquiring famous paintings for its head office, I think it should either be removed or labelled neutral.
  • With the large data set in mind, I have decided to split the articles into sectors, based on their GICS (Global Industry Classification Standard) sector labels (provided and adhered to by the ASX). We want to look at the 10 GICS sectors across 24 industry groups:
    1. GICS Australian Real Estate Investment Trusts (A-REITs)
    2. GICS Consumer Discretionary
      • Consumer Services
      • Automobile & Components
      • Media
      • Retailing
      • Consumer Durables & Apparel
    3. GICS Consumer Staples
      • Food Beverage & Tobacco
      • Food & Staples Retailing
      • Household & Personal Products
    4. GICS Energy
    5. GICS Financials
      • Diversified Financials
      • Banks
      • Insurance
    6. GICS Health Care
      • Health Care Equipment & Services
      • Pharmaceuticals, Biotechnology & Life Sciences
    7. GICS Industrials
      • Transportation
      • Capital Goods
    8. GICS Information Technology
      • Semiconductors & Semiconductor Equipment
      • Software & Services
      • Technology Hardware & Equipment
    9. GICS Materials
      • Materials
      • Metal & Mining
    10. GICS Telecommunication Services
      • Utilities
      • Telecommunication Services
  • The reason for this sector list is that the language and key words used in news articles about Materials would differ from those about Information Technology. For example, Materials articles might discuss gold prices or alumina pushing up the price of stock X, whereas IT stocks would be affected by the flow of work offshore to Bangladesh. At the same time, Consumer Discretionary covers widely varied topics (automobiles, media, retailing etc.), and Financials include both banks and insurance. I propose that each sector, and even each industry group, would invariably show large differences in domain language.
  • We should use the fact that we already know which sector a company is in (just by a lookup in a list, i.e. not a classification task), and put higher emphasis on terms and orthographic cues we have seen in this type of article previously. Of course, there are some terms shared across all articles, such as 'profit', 'acquisition' and 'merger', and all sectors would be affected by 'government policy' changes.
  • This list is perhaps too comprehensive in parts, and in others too specific. For instance, Materials contains many metal and mining stocks (not necessarily similar ones), and from a brief inspection I would expect quite an overlap between Telecommunications and IT. Thus, some testing in the design phase would be beneficial.
  • The corpus does contain all 24 industry groups, so at this stage it might be appropriate to get an overall view of these highly granular groups, before collapsing them down to the 10+ sectors for sentiment analysis.
  • So, if we get a good overall selection of articles based on sectors, I believe we can cut the size of the corpus to be annotated by at least half, to 30,000.
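Cutting the corpus in half while preserving each sector's share of articles amounts to stratified sampling. A minimal sketch (the `sector_of` lookup function and the per-sector rounding are my assumptions):

```python
import random
from collections import defaultdict

def stratified_sample(articles, sector_of, fraction=0.5, seed=0):
    """Sample `fraction` of the corpus, keeping each GICS sector's
    proportion of articles roughly intact.

    `sector_of` maps an article to its (already known) sector label.
    """
    rng = random.Random(seed)
    by_sector = defaultdict(list)
    for article in articles:
        by_sector[sector_of(article)].append(article)
    sample = []
    for group in by_sector.values():
        rng.shuffle(group)
        # Keep at least one article from every sector.
        sample.extend(group[:max(1, round(len(group) * fraction))])
    return sample
```

Because the sector is a simple lookup rather than a classification task, the split is cheap, and the halved corpus still reflects the full sector mix.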
Also, I found that over 450 companies have either been delisted or changed names since the data was gathered, which poses a largely manual task for their industry classification. I was able to automatically query AspectHuntley for 245 of them, but this still leaves about 200 unclassified companies! :(

Friday, April 16, 2010

Document Classification & Thesis Proposal

A scientific research article I wrote on Document Classification, evaluating the performance of different approaches to parsing, thresholding and machine-learning classifiers.

Also, here is my Thesis Proposal, including background, method of attack and plan for the year.

Wednesday, March 17, 2010

Tuesday, December 8, 2009

Meeting Minutes

Meeting with Rafael Calvo, Robert Dale
  • Academic paper: "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews", P. D. Turney, July 2002
  • Much work has been done in the past on sentiment in product/movie reviews, but it may not transfer to the domain of finance; some finance domain knowledge is necessary in the analysis.
  • AFR titles may tend to be more extreme (sensationalised) than their actual content.
  • It is necessary to find sources of positive/negative terms, and to see how well they correlate with the data set/annotations and vice versa.
  • This may not be achievable on word-frequency counts alone (taking out company names and terms such as 'the', 'a' etc.).
  • Could use synonym trees to help with the large term set (a lexical database such as WordNet).
  • Found information on some research done in this area by SIRCA.
  • Interesting abstract and presentation given by C. Robertson: "Enabling Sophisticated Financial Text Mining"
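The word-frequency baseline discussed in the meeting can be sketched as a toy polarity counter. The stopword list and the positive/negative seed sets below are tiny hand-made stand-ins for a proper lexicon:

```python
from collections import Counter

# Hypothetical miniature resources for illustration only.
STOPWORDS = {'the', 'a', 'an', 'to', 'of', 'in', 'as'}
POSITIVE = {'profit', 'rise', 'record'}
NEGATIVE = {'loss', 'fall', 'slump'}

def polarity_counts(tokens, company):
    """Count positive and negative terms, after removing stopwords
    and the company's own name (as suggested in the meeting)."""
    kept = [t.lower() for t in tokens
            if t.lower() not in STOPWORDS and t.lower() != company.lower()]
    freq = Counter(kept)
    pos = sum(c for term, c in freq.items() if term in POSITIVE)
    neg = sum(c for term, c in freq.items() if term in NEGATIVE)
    return pos, neg
```

As the minutes note, such raw counts are unlikely to suffice on their own, which motivates the synonym-tree (WordNet) expansion and the domain-aware features explored later.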