tag:blogger.com,1999:blog-70805353921070329662024-02-19T13:43:33.522+11:00Sentiment Analysis for Financial ApplicationsAndrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.comBlogger10125tag:blogger.com,1999:blog-7080535392107032966.post-57838132465037136792010-08-30T14:23:00.003+10:002010-08-30T14:28:44.210+10:00Current Results & PresentationI did a presentation last Friday on my current progress in the thesis, which can be found below.<div>Since then, I have managed to get Naive Bayes running in NLTK, and it performs particularly well, almost in line with SVM and above Maximum Entropy. I have been able to combine the three classifiers in a max-votes strategy, with the following results:</div><div><div></div><blockquote><div>weighted precision<span class="Apple-tab-span" style="white-space:pre"> </span>66.74%<span class="Apple-tab-span" style="white-space:pre"> </span>modified precision<span class="Apple-tab-span" style="white-space:pre"> </span>78.12%<span class="Apple-tab-span" style="white-space:pre"> </span></div><div>weighted recall<span class="Apple-tab-span" style="white-space:pre"> </span>67.29%<span class="Apple-tab-span" style="white-space:pre"> </span>modified recall<span class="Apple-tab-span" style="white-space:pre"> </span>75.86%<span class="Apple-tab-span" style="white-space:pre"> </span></div><div>weighted fscore<span class="Apple-tab-span" style="white-space:pre"> </span>67.01%<span class="Apple-tab-span" style="white-space:pre"> </span>modified fscore<span class="Apple-tab-span" style="white-space:pre"> </span>76.97%<span class="Apple-tab-span" style="white-space:pre"> </span></div><div></div></blockquote><div><span class="Apple-style-span" style="white-space: pre;">It's important to remember that these (near-final) results are based on single annotations. 
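The max-votes strategy described above can be sketched roughly as follows. This is a minimal illustration, not the thesis code: the `Stub` classes are hypothetical stand-ins for the trained SVM, Naive Bayes and Maximum Entropy models, which are assumed to expose an NLTK-style classify(features) method.

```python
from collections import Counter

class VoteEnsemble:
    """Combine classifiers by majority vote; a three-way tie falls back
    to the first (best-performing) classifier's label."""
    def __init__(self, classifiers):
        self.classifiers = classifiers

    def classify(self, features):
        votes = Counter(c.classify(features) for c in self.classifiers)
        label, count = votes.most_common(1)[0]
        if count > 1:  # a genuine majority: 2 or 3 of the 3 agree
            return label
        return self.classifiers[0].classify(features)  # three-way tie

# Hypothetical stand-ins for the trained SVM, Naive Bayes and MaxEnt models.
class Stub:
    def __init__(self, label):
        self.label = label
    def classify(self, features):
        return self.label

ensemble = VoteEnsemble([Stub("pos"), Stub("pos"), Stub("neg")])
print(ensemble.classify({}))  # two of three vote "pos" -> "pos"
```

In practice the tie-break rule matters: with three classifiers and three labels (pos/neg/neutral) a three-way split is possible, so some deterministic fallback is needed.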
</span></div><div><span class="Apple-style-span" style="white-space: pre;">As I'm currently collecting more annotations, I will update these results, hopefully with better performance </span></div><div><span class="Apple-style-span" style="white-space: pre;">(but on a smaller subset of articles).</span></div><div><br /></div><br /><iframe src="https://docs.google.com/present/embed?id=dcj5zb2j_094x8fcz" frameborder="0" width="410" height="342"></iframe></div>Andrewhttp://www.blogger.com/profile/15934717318250669061noreply@blogger.com2tag:blogger.com,1999:blog-7080535392107032966.post-73585654980815275362010-08-09T15:12:00.003+10:002010-08-09T15:29:27.853+10:00Progress over the holidaysI have been hard at work and finally have some machine-learning results from the corpus data.<div>I have parsed each of the 60,000 abstracts through C&C and Boxer, and have obtained some promising results. For the moment, I am just using the parts of speech, named-entity tags, and words/lemmas from the C&C XML output. I prepare each abstract by combining its information from the CSV file with the parsed XML, extracting each word's lemma, currency amounts and SentiWordNet scores, and then serialising the object. After that, I can quickly load the serialised objects and analyse them (10,000 take approx. 2 minutes to analyse). I then write the features and annotations to a file for SVM Multiclass to learn from. Using 10,000 annotated abstracts, I split the data into 20% test, 80% train. I will soon implement code for 10-fold cross-validation (splitting the data 10 times into 10%/90% parts). 
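The 10%/90% splitting scheme just described could look roughly like this (a sketch only: it assumes the serialised abstracts can be treated as a flat list, and the function name is my own):

```python
import random

def ten_fold_splits(items, seed=0):
    """Yield (train, test) pairs: each of 10 folds serves once as the
    10% test set, with the remaining 90% used for training."""
    items = list(items)
    random.Random(seed).shuffle(items)          # fixed seed for repeatability
    folds = [items[i::10] for i in range(10)]   # 10 near-equal folds
    for i in range(10):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(ten_fold_splits(range(10000)))
print(len(splits))        # 10 train/test pairs
print(len(splits[0][1]))  # 1000 abstracts in each test fold
```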
This will give me more rounded results.</div><div><div><br /></div><div>Current results are:</div><div>per-class precision: 0.577354048964<span class="Apple-tab-span" style="white-space:pre"> </span>overall precision: 0.581120943953</div><div>per-class recall: 0.558885065227<span class="Apple-tab-span" style="white-space:pre"> </span>overall recall: 0.581120943953</div><div>per-class fscore: 0.542291224417<span class="Apple-tab-span" style="white-space:pre"> </span>overall fscore: 0.581120943953</div><div><br /></div>Confusion matrix:<br /><br /><table border="0"><tbody><tr><td>annotation-></td><td>pos</td><td>neg</td><td>neutral</td></tr><tr><td>pos</td><td>483</td><td>90</td><td>71</td></tr><tr><td>neg</td><td>48</td><td>164</td><td>28</td></tr><tr><td>neutral</td><td>210</td><td>121</td><td>141</td></tr></tbody></table><br /></div><div>So, with an F-score of 54%, there is still a lot of improvement needed to obtain more substantial results. More experimenting with the features is needed - at the moment I'm using the top 1000 unigram, bigram and trigram words, and finance, economics and accounting gazetteers for feature inclusion. More work is needed on the contextual features around these important terms. </div>Andrewhttp://www.blogger.com/profile/15934717318250669061noreply@blogger.com2tag:blogger.com,1999:blog-7080535392107032966.post-9046764490754404262010-06-04T15:02:00.000+10:002010-06-04T15:02:28.557+10:00Progress ReportI have completed a progress report on the treatise, which includes approx. <a href="http://docs.google.com/fileview?id=0B2LZ9iqIP4Z-OGM5MjVlNmEtNTkwMi00ZmI4LTk5Y2EtMjE1YzdhMmZiMTA3&hl=en">20 pages of introductory and background material,</a> as well as a 'work carried out thus far' <a href="http://docs.google.com/Doc?docid=0AWLZ9iqIP4Z-ZGRjaDVwYndfMGdtanZueGRo&hl=en">report</a>. I have decided to look into Combinatory Categorial Grammar (CCG) as a way to disambiguate the deep semantic features of the text, and use these in my sentiment analysis. 
This is with the aid of James Curran's C&C Parser and Bos' Boxer. Looking forward, I plan to start coding the system and experimenting in the first week of the holidays (28th June onwards). The system will be written in Python due to its strong support for natural language processing.Andrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.com0tag:blogger.com,1999:blog-7080535392107032966.post-31540307700637933232010-05-21T12:15:00.000+10:002010-05-21T12:15:46.643+10:00Sentence ExamplesI believe the language model of finance to be inherently different to other domains (which is rather justified, as it is a unique language domain - with its own vocabulary etc). I propose a language model based on a combination of several factors, including entities (companies, CEOs, management), financial terms (net income, profit, price etc), industry-specific terms (technology, resources - iron ore, aluminium etc), quantitative values ($ million, $ billion, per cent), directions (positive - rise, increase, outperform; negative - decrease, decline, fall) and the general sentiment of regular English words (SentiWordNet - good, bad, successful, poor). Combinations of these are expected to have a significant impact on the polarity of the (annotated) news abstracts. Direction and values are included specifically because the sentiment of these abstracts is largely based on the amounts involved (larger percentage increases, bumper profits of $100 million). Of course, as in any natural language, the general rule always comes with exceptions. In addition, each abstract is typically on a single subject (takeover, reports, government, resource etc).<br />
<br />
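The combination of factors just described can be sketched as a simple feature extractor. This is an illustration only: the word lists are tiny hypothetical samples of the proposed gazetteers, and the count-based features stand in for whatever representation the learner ultimately uses.

```python
import re

# Tiny hypothetical samples of the proposed gazetteers.
FINANCIAL_TERMS = {"net income", "profit", "price"}
DIRECTION_POS = {"rise", "increase", "outperform"}
DIRECTION_NEG = {"decrease", "decline", "fall"}
# Quantitative values: "$100 million", "$2 billion", "8 per cent" etc.
VALUE_PATTERN = re.compile(
    r"\$\d+(?:\.\d+)?\s*(?:million|billion)|\d+(?:\.\d+)?\s*per cent")

def extract_features(abstract):
    text = abstract.lower()
    tokens = re.findall(r"[a-z']+", text)
    return {
        "financial_terms": sum(term in text for term in FINANCIAL_TERMS),
        "direction_pos": sum(t in DIRECTION_POS for t in tokens),
        "direction_neg": sum(t in DIRECTION_NEG for t in tokens),
        "values": len(VALUE_PATTERN.findall(text)),
    }

feats = extract_features(
    "BHP posted a bumper profit of $100 million, a rise of 8 per cent.")
print(feats)
# -> {'financial_terms': 1, 'direction_pos': 1, 'direction_neg': 0, 'values': 2}
```

Co-occurrence of a direction word with a value ("rise ... 8 per cent") is exactly the kind of combination the model above is meant to capture.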
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">I have made a couple of sentence examples, of desired system output, which sum up the problem:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8rLieLFhC8SsPoJG2RNwLc9d2uYEBU2jVRp_bw40jfnm2Li9_b7bXPuPxYx6p7-myw_MIIBbUKVcG8GLue7kuaUlh01f3hW_9gzPqN-cvEEbEFZAtU9sZfWNZhnwn3W1Ct_FoaAyIMTpe/s1600/Sentence+Example+1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8rLieLFhC8SsPoJG2RNwLc9d2uYEBU2jVRp_bw40jfnm2Li9_b7bXPuPxYx6p7-myw_MIIBbUKVcG8GLue7kuaUlh01f3hW_9gzPqN-cvEEbEFZAtU9sZfWNZhnwn3W1Ct_FoaAyIMTpe/s640/Sentence+Example+1.jpg" width="640" /></a></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgbOX-qYjTcAes0PzqJ1lPYZR7AOrlWj5M19AJ10bu6zVkUeu0dvQAgKXGN-viZRG9iRChfjBvmcCZGVeSyVmtBtdwwiTObcScteBs1271PTczMzjO5nH2GzWhKaFYyoYn7EpdWnsb8dRL/s1600/Sentence+Example+2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgbOX-qYjTcAes0PzqJ1lPYZR7AOrlWj5M19AJ10bu6zVkUeu0dvQAgKXGN-viZRG9iRChfjBvmcCZGVeSyVmtBtdwwiTObcScteBs1271PTczMzjO5nH2GzWhKaFYyoYn7EpdWnsb8dRL/s640/Sentence+Example+2.jpg" width="640" /></a></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div>Andrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.com1tag:blogger.com,1999:blog-7080535392107032966.post-86584447797273951592010-05-19T14:02:00.002+10:002010-05-19T14:05:14.554+10:00SIRCAI have been able to contact Diccon Close at SIRCA, who is able to supply access to some much-needed ASX equities information, including full order book and tick data. Access to this will be critical for evaluating the real-world performance of the sentiment indicators produced by the system. SIRCA also has distribution rights to the Reuters corpora. I've been thinking that it may be a good idea to crawl parts of this Reuters corpus to incorporate more recent information, even if only to obtain tf-idf scores for equities news. Also, in the original corpus, time stamps have been removed, which makes it difficult to determine the exact timing of information release vs. stock price movement.<br />
<div><br />
</div><div>I have made a diagram of how news articles should be classified - this may need to be refined over time into more fine-grained document types:<br />
<div class="separator" style="clear: both; text-align: center;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ3B0b7dWiONK_rafTbXqtoC01860d-BPDlpgu62yoomLjAHP9B3rclJwwMA-r-kMdhcxmbeGhbsW4Spu59RRwH0Gt8rnQrDS-quX2BFKMjkNS4kauX53ManOWULru7es0z6sDqjELKqiG/s1600/News+types.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="312" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJ3B0b7dWiONK_rafTbXqtoC01860d-BPDlpgu62yoomLjAHP9B3rclJwwMA-r-kMdhcxmbeGhbsW4Spu59RRwH0Gt8rnQrDS-quX2BFKMjkNS4kauX53ManOWULru7es0z6sDqjELKqiG/s640/News+types.jpg" width="640" /></a></div><br />
<br />
<br />
</div>Andrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.com0tag:blogger.com,1999:blog-7080535392107032966.post-22467618394777677982010-05-18T00:01:00.000+10:002010-05-18T00:01:30.323+10:00Biomedical Named Entity RecognitionThis was an SNLP assignment which is relevant to my thesis: an in-depth look at biomedical named entity recognition. Although the task is complicated and not directly applicable to financial news, the information about POS tagging, entity recognition and machine learners will definitely be useful.<br />
<br />
<a href="https://docs.google.com/fileview?id=0B2LZ9iqIP4Z-OWVkNjAxYWItNjQ1Yi00MWFhLTk5ODEtN2RmZWRhZDU0ODE0&hl=en">Biomedical Named Entity Recognition</a>Andrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.com0tag:blogger.com,1999:blog-7080535392107032966.post-34183209138245475592010-04-21T16:17:00.003+10:002010-04-21T16:56:21.674+10:00Preliminary Corpora AnalysisAfter annotating a few hundred articles, I've noticed some unique problems inherent in the data set. <br />
<ul><li>With such a <b>large</b> data set, it may be nearly impossible to annotate all 60,000 articles <b>twice</b>. Although I can do approximately 150-200 articles an hour, that still requires between 300 and 400 man-hours (150ish days) for each annotator. Another issue is that the other annotator may be slower and 'worry' about the classification of each article.</li>
<li>I find the classification <i>'ambiguous'</i> to be too much overhead on both the annotator and the machine learner (as in one of the previously annotated corpora) - for this task, ambiguous should be equivalent to neutral, as an ambiguous article and a neutral article would both have no effect on the sentiment towards a company. </li>
<li>Much of the ambiguity came from a vague article title rather than the content of the paragraph. Where the paragraph itself was ambiguous, it was most often when company X was rumoured to be taken over by company Y, but the article had been filed under company Y while being primarily about company X. To me, this is neutral. In the case of an article about something silly, like Alcoa acquiring famous paintings for its head office, I think it should either be removed or labelled neutral.</li>
<li>With the large data set in mind, I have decided to split the articles into sectors, based on their GICS (Global Industry Classification Standard) Sector labels (provided and adhered to by the ASX). We want to look at the <b>10 GICS Sectors </b>across 24 industry groups:</li>
<ol><li>GICS Australian Real Estate Investment Trusts (A-REITs)</li>
<li>GICS Consumer Discretionary</li>
<ul><li>Consumer Services</li>
<li>Automobile & Components</li>
<li>Media</li>
<li>Retailing</li>
<li>Consumer Durables & Apparel</li>
</ul>
<li>GICS Consumer Staples</li>
<ul><li>Food Beverage & Tobacco</li>
<li>Food & Staples Retailing</li>
<li>Household & Personal Products</li>
</ul>
<li>GICS Energy</li>
<li>GICS Financials </li>
<ul><li>Diversified Financials</li>
<li>Banks</li>
<li>Insurance</li>
</ul>
<li>GICS Health Care</li>
<ul><li>Health Care Equipment & Services</li>
<li>Pharmaceuticals, Biotechnology & Life Sciences</li>
</ul>
<li>GICS Industrials</li>
<ul><li>Transportation</li>
<li>Capital Goods</li>
</ul>
<li>GICS Information Technology</li>
<ul><li>Semiconductors & Semiconductor Equipment</li>
<li>Software & Services</li>
<li>Technology Hardware & Equipment</li>
</ul>
<li>GICS Materials</li>
<ul><li>Materials</li>
<li>Metal & Mining</li>
</ul>
<li>GICS Telecommunication Services</li>
<ul><li>Utilities</li>
<li>Telecommunication Services</li>
</ul></ol>
<li>The reason for this sector list is that the language and key words used in Materials news articles would differ from those in Information Technology. For example, Materials articles would talk about gold or alumina prices driving up the price of stock X, whereas IT stocks would be affected by the flow of work offshore to Bangladesh. At the same time, Consumer Discretionary covers some widely varied topics (automobiles, media, retailing etc), and Financials includes both banks and insurance. I propose that each sector, and even each industry group, would invariably have large differences in its domain language. </li>
<li>We should use the fact that we already <b>know</b> what sector a company is in (just by a reference to a list - i.e. not a classification task), and put higher emphasis on terms and orthographic cues we've seen in this type of article previously. Of course, there are some terms shared across all articles, such as <i>'profit', 'acquisition', 'merger'</i> etc. And <b>all</b> of the sectors would be affected by <i>'government policy'</i> changes.</li>
<li>This list is perhaps <b>too broad</b> in parts, and in others too specific. For instance, Materials contains many metals and mining stocks (not necessarily similar to each other), and from a brief inspection, I would expect quite an overlap between Telecommunications and IT. Thus, some testing in the design phase would be beneficial.</li>
<li>The corpus does contain all 24 industry groups, so at this stage it might be appropriate to get an overall view of these <b>highly granular</b> groups, before we collapse them down to the 10 sectors for sentiment analysis.</li>
<li>So, if we get a good overall selection of articles based on sectors, I believe we can cut the size of the corpus to be annotated by at least half, to <b>30,000</b>. </li>
</ul>Also, I found that over 450 companies have been either delisted or changed names since the data was gathered, which poses a largely manual task for their industry classification. I was able to automatically query AspectHuntley for 245, but this still leaves about 200 unclassified companies! :( <br />
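The sector-lookup idea above (a plain table reference, not a classification task) and the even selection of articles across sectors can be sketched as follows. The ticker-to-sector table here is a tiny hypothetical stand-in for the full ASX/GICS listing, and the function names are my own.

```python
import random
from collections import defaultdict

# Tiny hypothetical stand-in for the full ASX ticker -> GICS sector table.
GICS_SECTOR = {
    "BHP": "Materials",
    "CBA": "Financials",
    "TLS": "Telecommunication Services",
}

def sector_of(ticker):
    # A straight table lookup - no classifier needed.
    return GICS_SECTOR.get(ticker)

def sample_by_sector(articles, per_sector, seed=0):
    """Pick an even selection of articles from each sector, so the
    annotation workload can be cut while still covering every sector."""
    by_sector = defaultdict(list)
    for ticker, text in articles:
        by_sector[sector_of(ticker)].append((ticker, text))
    rng = random.Random(seed)
    sample = []
    for sector, items in by_sector.items():
        rng.shuffle(items)
        sample.extend(items[:per_sector])
    return sample

articles = [("BHP", "a"), ("BHP", "b"), ("CBA", "c"), ("TLS", "d")]
print(len(sample_by_sector(articles, per_sector=1)))  # one per sector -> 3
```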
Andrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.com1tag:blogger.com,1999:blog-7080535392107032966.post-17083497596979263512010-04-16T11:40:00.002+10:002010-04-21T16:56:39.435+10:00Document Classification & Thesis ProposalA scientific research article I wrote on <a href="https://docs.google.com/fileview?id=0B2LZ9iqIP4Z-ZTdjNGFkZTgtM2YzMS00YjI4LWFhMjItYzFmNjBkMGI2ZWFm&hl=en">Document Classification</a>, evaluating the performance of different approaches to parsing, thresholding and machine learning classifiers.<br />
<br />
Also, here is my <a href="https://docs.google.com/fileview?id=0B2LZ9iqIP4Z-NDkxOWNiMTYtZDdmZS00MGEwLTk0ZGItYWVmZTgxMDg1ZWZm&hl=en">Thesis Proposal</a> including background, method of attack and plan for the year.Andrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.com2tag:blogger.com,1999:blog-7080535392107032966.post-55856802754520941692010-03-17T17:01:00.001+11:002010-04-21T16:56:52.998+10:00ReadingsI have found some very useful readings:<br />
<ul><li><a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1294472">Individual Investor Sentiment and Stock Returns, Rod Kaniel, Gideon Saar, and Sheridan Titman 2004</a></li>
<li><a href="http://www.mitpressjournals.org/doi/abs/10.1162/coli.08-012-R1-06-90">Recognizing Contextual Polarity: An Exploration of Features for Phrase-Level Sentiment Analysis, T Wilson, J Wiebe, P Hoffman 2008</a></li>
<li><a href="http://doras.dcu.ie/14830/">Topic-dependent sentiment analysis of financial blogs, O'Hare et al 2009</a></li>
<li><a href="http://www.cs.brandeis.edu/%7Emarc/misc/proceedings/lrec-2006/pdf/394_pdf.pdf">Sentiments on a Grid: Analysis of Streaming News and Views, Khurshid Ahmad, Lee Gillam, David Cheng 2006</a> </li>
<li><a href="http://www.ncess.ac.uk/events/ASW/textmining/presentaions/20060428-gillam-sentiment.pdf">Sentiment Analysis and Financial Grids, Lee Gillam Presentation</a></li>
<li><a href="http://www.lrec-conf.org/proceedings/lrec2008/pdf/276_paper.pdf">Sentiment Analysis and the Use of Extrinsic Datasets in Evaluation, Ann Devitt, Khurshid Ahmad 2008</a> </li>
<li><a href="http://pos.sissa.it/archive/conferences/026/001/GRID2006_001.pdf">Multi-lingual Sentiment Analysis of Financial News Streams, Ahmad, D Cheng, Y Almas 2006</a> </li>
<li><a href="http://acl.ldc.upenn.edu/P/P07/P07-1124.pdf">Sentiment Polarity Identification in Financial News: A Cohesion-based Approach, Ann Devitt, Khursihid Ahmad 2007</a></li>
<li><a href="http://acl.ldc.upenn.edu/P/P07/P07-1124.pdf">A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Bo Pang, Lillian Lee </a></li>
<li><a href="http://algo.scu.edu/%7Esanjivdas/chat_FINAL.pdf">Yahoo! for Amazon: Sentiment extraction from small talk on the web, SR Das, MY Chen 2007</a></li>
<li><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102.7125&rep=rep1&type=pdf">Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques, J Yi, T Nasukawa, R Bunescu, W Niblack 2003</a></li>
<li><a href="http://acl.ldc.upenn.edu/D/D07/D07-1115.pdf">Building lexicon for sentiment analysis from massive collection of HTML documents, N Kaji, M Kitsuregawa 2007</a></li>
<li><a href="http://acl.ldc.upenn.edu/acl2002/MAIN/pdfs/Main425.pdf">Thumbs up or thumbs down? Semantic orientation applied to unsupervised Classification of Reviews, PD Turney 2002</a> </li>
</ul>Andrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.com0tag:blogger.com,1999:blog-7080535392107032966.post-62251884215773614772009-12-08T22:03:00.002+11:002010-09-13T11:51:40.941+10:00Meeting MinutesMeeting with Rafael Calvo, Robert Dale<br />
<div><ul><li>Academic Paper: <a href="http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf">Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews</a>, P.D Turney, July 2002</li>
<li>Much work has been done in the past on sentiment in product/movie reviews, but these may not transfer to the domain of finance, and some finance domain knowledge is necessary in analysis.</li>
<li>AFR titles may tend to be more extreme (sensationalised) relative to their actual content</li>
<li>Necessary to find sources of positive/negative terms, and how well they correlate with the data set/annotations and vice versa.</li>
<li>May not be achievable on word frequency counts alone (taking out company names and terms such as 'the', 'a' etc).</li>
<li>Could use synonym trees to help with the large term set (lexical database such as <a href="http://wordnet.princeton.edu/">WordNet</a>)</li>
<li>Found information on some research done in this area by <a href="http://www.sirca.org.au/">Sirca</a></li>
<li>Interesting abstract and presentation given by C Robertson: "<a href="http://www.eresearch.edu.au/robertson2009">Enabling Sophisticated Financial Text Mining</a>"</li>
</ul></div>Andrew Maynehttp://www.blogger.com/profile/12657173474301903710noreply@blogger.com0