{"id":1537,"date":"2014-12-14T18:32:58","date_gmt":"2014-12-14T16:32:58","guid":{"rendered":"http:\/\/blog.idethloff.de\/wordpress\/?p=1537"},"modified":"2018-03-03T13:18:07","modified_gmt":"2018-03-03T11:18:07","slug":"learning-analytics-mooc-week-8-text-mining-nuts-and-bolts","status":"publish","type":"post","link":"https:\/\/blog.idethloff.de\/wordpress\/?p=1537","title":{"rendered":"Learning Analytics MOOC &#8211; Week 8 &#8211; Text Mining Nuts and Bolts"},"content":{"rendered":"<p>Whereas one focus of week 8 was working with LightSide, basic information about the following steps of text mining was provided in this week&#8217;s videos (see <a title=\"https:\/\/www.youtube.com\/user\/dalmooc\" href=\"https:\/\/www.youtube.com\/user\/dalmooc\" target=\"_blank\">https:\/\/www.youtube.com\/user\/dalmooc<\/a>). We had some reflection tasks, hands-on experimentation, assignments and group tasks this week &#8211; I chose to note the main aspects of week 8 and also to do something in LightSide in order to get some practice with it.<\/p>\n<p><strong>1. Text Mining: Data Preparation<\/strong><br \/>\nAs the process of preparing data for data mining \/ text mining can be very complex and requires a lot of time and thought, you should consider at the outset whether it is realistic to achieve and whether it is worth doing (e.g. when you can reuse the trained model for other studies where similar data is collected).<\/p>\n<p>1.1 Cleaning text<br \/>\nIt would be nice to have raw data already in tabular form, or at least in a structured form (xml, json, sql), so that plugins or a programming language can be used to get it into tabular form. In addition, it can be necessary to aggregate data first, because not every entry should be a unit in the dataset for machine learning (this might perhaps be done in Excel with some macros). 
Things like reformatting because of non-standard character encoding (UTF-8 would be good, LightSide can handle that format), disfluencies (perfectly formed English in data is unrealistic when doing learning analytics) or text which is in another language (LightSide is configured for English) might need attention and additional software plugins.<\/p>\n<p>1.2 Annotating data<br \/>\n\u201eTraining a predictive model requires annotated training data\u201c &#8211; a set of 1000 instances of labeled data is a good start: 200 as development data, 700 for cross-validation and 100 as final test set. A dataset that results from a simple poll already has a label given by the poll (yes\/no), but otherwise you would have to think about what you would like to detect in student interaction. Maybe there is already a coding manual for the codes you are interested in.<\/p>\n<p><strong>2. Text Mining: Getting a Sense of Data<\/strong><br \/>\n\u201eGetting a sense of data\u201c is a step in the data mining process that many people don\u2019t spend enough time on, and one that gets better with your own experience and with reading linguistics books. The qualitative analysis is an important \u201eprecursor to predictive modeling\u201c.<br \/>\nSentiment analysis is more complicated than reading text and counting positive and negative words (individual words are not enough): context matters, rhetorical strategies may appear, and sentiment might be expressed indirectly.<\/p>\n<p><strong>3. 
Text Mining: Basic Feature Extraction with LightSide<\/strong><br \/>\nFeature extraction is about deciding what we would like to include in the model, i.e. what will correlate with what we\u2019re trying to predict.<br \/>\nA noisy predictor of the class value would be a term which can be used with different meanings and might therefore signal agreement for some and disagreement for others (like in our Healthcare poll example \u201ecost for one person&#8220; or &#8222;cost for society as a whole&#8220; &#8211; more context would be needed to be sure of the meaning).<\/p>\n<p><em>LightSide provides very easy access to a broad range of simple low-level text features. <\/em><br \/>\nIn LightSide, the panel \u201eExtract Features\u201c automatically checks off the text field to extract features from &#8211; but if you\u2019ve got other variables in other columns of the dataset, the option \u201eColumn Features\u201c should also be checked off in the menu \u201eFeature Extractor Plugins\u201c, in addition to \u201eBasic Features\u201c.<\/p>\n<p><strong>In \u201eConfigure Basic features\u201c you have to choose among Unigrams (=Individual words), Bigrams, Trigrams, POS Bigrams (= Part of Speech Bigrams), POS Trigrams, Word\/POS Pairs, Line Length, Count occurences, Include Punctuation, Stem N-Grams, and other options (handling of stopwords etc.).<\/strong><\/p>\n<p>Unigrams are an easy way to try to grab the content of a sentence, but you lose the context and structure of the sentence. With bigrams, there is already a little bit of ability to disambiguate. Using a combination of unigrams and bigrams makes the feature space much larger, which leads to a higher risk of overfitting &#8211; adding richer features gives you more information but comes at a cost.<br \/>\nAnother idea is to think about words in terms of grammatical categories, i.e. parts of speech (noun, preposition, verb\u2026): which part-of-speech tags occur next to each other? 
There are standard tag sets for \u201ePart of Speech Tagging\u201c which can be used.<\/p>\n<p>Line length just counts the number of words in a text and could be meaningful depending on the kind of text.<br \/>\nStopwords are often removed in text classification (a practice that comes from information retrieval) &#8211; but in text chat it would be the other way round: \u201econtains non-stopwords\u201c would be interesting.<\/p>\n<p>Features like N-Grams, Part of Speech Bigrams and Word\/POS pairs were described as being binary features (true\/false), but another way is to think about them as count features &#8211; that happens if you check off \u201eCount Occurences\u201c &#8211; then they aren\u2019t binary encoded any more.<\/p>\n<p>You have to decide if you want punctuation as part of your feature space or not: Sometimes it just adds noise &#8211; not everybody uses it and some use it inconsistently (Yes, that would be me, when writing in a foreign language and thinking about difficult concepts&#8230; punctuation isn&#8217;t a priority).<\/p>\n<p>Another decision is: Do you want to use stemming or not? Stemming removes the endings from various forms of a word and makes the feature space a little more compact.<\/p>\n<p>These selections interact with each other, and so part of speech tagging is done first (before stemming or stopword removal).<\/p>\n<p><strong>4. Text Mining: Interpretation of Feature Weights<\/strong><br \/>\nThis starts when the model is already built. LightSide has a panel \u201eExplore Results\u201c (normally used for error analysis) with which you can also look at feature weights. 
In the confusion matrix (called \u201eCell highlights\u201c) you can select \u201efeature weight\u201c.<br \/>\nWords that are negative should have a large negative weight when you have selected negative data &amp; positive prediction in the confusion matrix (for example \u201ebad\u201c = -0.8231). At the bottom of the LightSide interface you can choose the Extractor plugin \u201eDocuments display\u201c and check off \u201eFilter documents by selected feature\u201c and \u201eDocuments from selected cell only\u201c in order to see where in the original text the words occur.<\/p>\n<p><strong>5. Text Mining: Comparing Performance of Alternative Models<\/strong><br \/>\nIn LightSide, you can compare different models by using the panel \u201eCompare Models\u201c. If you want to compare two models with different feature spaces (one with Unigrams and one with Unigrams and Bigrams) for a specific text, with the option \u201eComparison Plugin = Basic Model Comparison\u201c you can see the performance values and confusion matrices in one screen. If you switch to the LightSide option \u201eDifference Matrix\u201c, you can look at misclassifications in the text context.<\/p>\n<p><strong>6. 
Text Mining: Advanced Feature Extraction<\/strong><br \/>\n<em>\u201eAdvanced features enrich the feature space, but expand the size of the feature space \u2013 large feature spaces mean added risk of overfitting\u201c<\/em><br \/>\nI\u2019d like to keep this short, because as a beginner, I\u2019ll stay with the simpler things at first (= LightSide\u2019s \u201eBasic Features\u201c in the Feature Extractor Plugin).<br \/>\nAdvanced options would be: Stretchy Patterns (for context around a word: definition of pattern length, gap length and using categories \u2013 there are some predefined categories in the LightSide Toolkit Folder), Regular Expressions (help available in LightSide), Character N-Grams (for spelling modifications, consistent endings, ...), Parse Features (slow, produces a huge number of features, seldom used).<\/p>\n<p>&nbsp;<\/p>\n<p><strong>7. Text Mining: Working with LightSide<\/strong><\/p>\n<p>I did a lot of things in LightSide: Extracting features, building and comparing models, inspecting models and interpreting weights&#8230; I&#8217;m optimistic that I understood the technical part of how to do this and that I got an impression of the process, and that&#8217;s about it.<\/p>\n<p>My results with LightSide are in this attached pdf file:<br \/>\n<a href=\"http:\/\/blog.idethloff.de\/wordpress\/wp-content\/uploads\/2014\/12\/w8-assignment-ID.pdf\" target=\"_blank\">w8-assignment-ID.pdf<\/a><\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"https:\/\/blog.idethloff.de\/wordpress\/wp-content\/uploads\/2014\/12\/w8-assignment-ID.pdf\" target=\"_blank\"><img decoding=\"async\" src=\"https:\/\/www.idethloff.de\/images\/dalmooc-w8-lightside-kl.jpg\" alt=\"LightSide screenshot\" \/><\/a><\/p>\n<p><em>(screenshot from my pdf)<\/em><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Whereas one focus of week 8 was working with LightSide, basic information about the following steps of text mining was provided in this week&#8217;s videos (see 
https:\/\/www.youtube.com\/user\/dalmooc). We had some reflection tasks, hands-on-experimentation, assignments and group tasks this week &#8211; I made the choice to note the main aspects of week 8 and also to &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blog.idethloff.de\/wordpress\/?p=1537\" class=\"more-link\"><span class=\"screen-reader-text\">\u201eLearning Analytics MOOC &#8211; Week 8 &#8211; Text Mining Nuts and Bolts\u201c <\/span>weiterlesen<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[16],"class_list":["post-1537","post","type-post","status-publish","format-standard","hentry","category-dalmooc","tag-dalmooc"],"_links":{"self":[{"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1537","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1537"}],"version-history":[{"count":20,"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1537\/revisions"}],"predecessor-version":[{"id":1923,"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/1537\/revisions\/1923"}],"wp:attachment":[{"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1537"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1537"},{"taxonomy":"post_tag","embeddab
le":true,"href":"https:\/\/blog.idethloff.de\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1537"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}