Wednesday, April 21, 2010

Preliminary Corpora Analysis

After annotating a few hundred articles, I've noticed some unique problems inherent in the data set.
  • With such a large data set, it may be nearly impossible to annotate all 60,000 articles twice. Although I can do approximately 150-200 articles an hour, that still requires between 300 and 400 man-hours (150ish days) for each annotator. Another issue is that the other annotator may be slower and 'worry' about the classification of each article.
  • I find the classification 'ambiguous' to be too much overhead for both the annotator and the machine learner (as in one of the previously annotated corpora). For this task, ambiguous should be equivalent to neutral, as neither an ambiguous article nor a neutral article would have any effect on the sentiment towards a company.
  • Much of the ambiguity came from a vague article title, and not from the content of the paragraph. Where the paragraph itself is ambiguous, it is likely to be a case where company X is rumoured to be a takeover target of company Y, and the article is primarily about company X, not Y (in other words, it has been classified under company Y and not X). To me, this is neutral. In the case of an article talking about something silly, like Alcoa acquiring famous paintings for its head office, I think it should either be removed or labelled neutral.
  • With the large data set in mind, I have decided to split the articles into sectors, based on their GICS (Global Industry Classification Standard) Sector labels (provided and adhered to by the ASX). We want to look at the 10 GICS Sectors across 24 industry groups:
    1. GICS Australian Real Estate Investment Trusts (A-REITs)
    2. GICS Consumer Discretionary
      • Consumer Services
      • Automobile & Components
      • Media
      • Retailing
      • Consumer Durables & Apparel
    3. GICS Consumer Staples
      • Food Beverage & Tobacco
      • Food & Staples Retailing
      • Household & Personal Products
    4. GICS Energy
    5. GICS Financials
      • Diversified Financials
      • Banks
      • Insurance
    6. GICS Health Care
      • Health Care Equipment & Services
      • Pharmaceuticals, Biotechnology & Life Sciences
    7. GICS Industrials
      • Transportation
      • Capital Goods
    8. GICS Information Technology
      • Semiconductors & Semiconductor Equipment
      • Software & Services
      • Technology Hardware & Equipment
    9. GICS Materials
      • Materials
      • Metals & Mining
    10. GICS Telecommunication Services
      • Utilities
      • Telecommunication Services
  • The reason for this sector list is that I find the language and key words used in news articles about Materials to be different from those used about Information Technology. For example, Materials articles would talk about gold or alumina prices pushing up the price of stock X, whereas IT stocks would be affected by the flow of work offshore to Bangladesh. At the same time, Consumer Discretionary covers some widely varied topics (automobiles, media, retailing etc.), and Financials includes both banks and insurance. I propose that each sector, and even each industry group, would invariably have large differences in its domain language.
  • We should use the fact that we already know what sector a company is in (just by reference to a list, i.e. not a classification task), and put higher emphasis on terms and orthographic cues we've seen in this type of article previously. Of course, there are some terms shared across all articles, such as 'profit', 'acquisition', 'merger' etc. And all of the sectors would be affected by 'government policy' changes.
  • This list is perhaps too broad in parts, and in others too specific. For instance, Materials contains many metals and mining stocks (which are not necessarily similar), and from a brief inspection, I would expect quite an overlap between Telecommunications and IT. Thus, some testing in the design phase would be beneficial.
  • The corpus does contain all 24 industry groups, so at this stage it might be appropriate to get an overall view of these highly granular groups before we collapse them down to 10+ sectors for sentiment analysis.
  • So, if we get a good overall selection of articles based on sectors, I believe we can cut the size of the corpus to be annotated by at least half, to 30,000.
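
One way the sector-based cut could be done is a stratified sample: take roughly the same fraction of articles from every sector, so that no sector disappears from the reduced corpus. The following is only a minimal sketch under my own assumptions. The `(article_id, sector)` layout, the function name `stratified_sample` and the toy data are all hypothetical, and nothing here reflects how the articles are actually stored.

```python
import random
from collections import defaultdict


def stratified_sample(articles, fraction=0.5, seed=0):
    """Pick roughly `fraction` of the articles from each sector.

    `articles` is a list of (article_id, sector) pairs. This layout is an
    assumption made for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the selection is repeatable

    # Group article ids by their sector label.
    by_sector = defaultdict(list)
    for article_id, sector in articles:
        by_sector[sector].append(article_id)

    sample = {}
    for sector, ids in by_sector.items():
        # Keep at least one article so that small sectors are not dropped.
        k = max(1, round(len(ids) * fraction))
        sample[sector] = sorted(rng.sample(ids, k))
    return sample


# Toy data (hypothetical): 6 Materials articles and 4 IT articles.
articles = [(i, "Materials") for i in range(6)]
articles += [(i, "Information Technology") for i in range(6, 10)]

sample = stratified_sample(articles)
print({sector: len(ids) for sector, ids in sample.items()})
```

With `fraction=0.5` each sector keeps about half of its articles, so the sector proportions of the full corpus carry over to the part that gets annotated.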
Also, I found that over 450 companies have either been delisted or changed names since the data was gathered, which makes their industry classification a largely manual task. I was able to automatically query AspectHuntley for 245, but this still leaves about 200 unclassified companies! :(
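
Since the sector comes from a reference list rather than a classifier, the bookkeeping is just a lookup that sets aside whatever cannot be resolved. A rough sketch, assuming the list is held as a plain dictionary from ticker to sector. The tickers, the dictionary and the function name `assign_sectors` are made up for illustration, and this is not the AspectHuntley query itself.

```python
def assign_sectors(tickers, sector_by_ticker):
    """Split tickers into those with a known sector and those without."""
    classified = {}    # ticker -> sector, found in the reference list
    unclassified = []  # tickers left over for manual classification
    for ticker in tickers:
        sector = sector_by_ticker.get(ticker)
        if sector is None:
            unclassified.append(ticker)
        else:
            classified[ticker] = sector
    return classified, unclassified


# Hypothetical reference list and tickers, for illustration only.
sector_by_ticker = {"AAA": "Materials", "BBB": "Financials"}
tickers = ["AAA", "BBB", "ZZZ"]

classified, unclassified = assign_sectors(tickers, sector_by_ticker)
print(len(classified), "classified;", len(unclassified), "left for manual lookup")
```

The `unclassified` list is then the to-do list for the delisted and renamed companies.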
