Learning Analytics MOOC – Week 7 – Text Mining Introduction

The Google Hangout of week 7 was as interesting as ever (and for me, again in the archived version). At this point in the MOOC it's no surprise to me that Carolyn and George emphasized that analytics isn't easy (I think we all felt that in this course as well) and that handling the specialized software is actually the easy part of the process. What I understand completely by now is that learning analytics is very interdisciplinary – right now we have computational linguistics in the mixture. At Heidelberg University we have an „Institut für Computer-Linguistik“ – it might be interesting for me to contact them about e-learning & analytics some time in the future.

As I worked on this week's blog post for some days, it got longer and longer, so I divided the article into three sections. One section is about saying thank you for a great learning experience although we've still got two more weeks – but who knows, with Christmas preparations and further demanding tasks in text mining ahead, whether I'll find the time later on 😉

1.) Some thoughts about the DALMOOC structure
My observation regarding the course structure is that the segments of the MOOC are somewhat independent (I think you really could do the cMOOC thing and pick just one topic to engage with), and on the other hand I see the full picture by now – why these parts were chosen by the instructors and how well they fit together. I think it is extremely difficult to design a good MOOC for everyone – for learners experienced in the chosen topic as well as for beginners – to reduce a topic (which you would normally spend a semester on, as Ryan said) to a few weeks, combine different teachers, and do all this at a high level that includes current research. Great job so far!
In addition to the discussion forums on edX, the Google Hangouts provided an important element of continuity, live feeling and caring – it would be interesting to know how much time the facilitators spent each day on the MOOC… Twitter in its way was also motivating to get in contact with fellow students and helped me to stay on board – so thanks a lot for favs and retweets!
In the beginning I had two goals: to learn something about learning analytics and to have a closer look at the dual structure of the MOOC. I had to scale back my second goal due to limited time and to getting more engaged in the content part than I had planned / expected. So my parallel visits to ProSolo weren't as frequent as hoped, but I experienced the different structure of initially the same weekly resources. It has much charm, but I returned to edX for my learning because I know (and like) the edX interface very well from former MOOCs and from my professional job. Until now I even stayed away from Bazaar, because after the first weeks the course content was so new and difficult for me that I had the impression I wouldn't be able to contribute something meaningful in time via a synchronous text chat channel, in a foreign language, and where I might be paired with someone who expects a meaningful discussion on a higher level than I could offer. But that's a very personal assessment; I'm sure that others saw it in a totally different way. In a MOOC – and especially this one – there are so many different possibilities and learning pathways that you have to choose a combination you are happy with (and I'm happy with mine). I have stopped counting the many hours I spent on the MOOC each week, and I'm fully aware that this is an exception which I can't repeat often. A similarly valuable and demanding MOOC for me was Nellie Deutsch's (first) „Moodle MOOC on WizIQ“ in June 2013, where (besides blogging and taking part in many forum discussions) I created a lot of digital artifacts during the four-week course.
In comparison, the HarvardX Justice MOOC in 2014 was „easy“ for me because of its very predictable time commitment: with about 4 hours a week it was possible to get a very good learning experience from the consistent (and definitely not boring) video, self-test/quiz and poll structure, the exam at the end, and even without any live sessions – reflections about the topic included.

2.) Text mining methods as part of data mining – overview of the process of building and evaluating a model
An example of „collaborative learning process analysis“ illustrated that a theory-driven approach (from the fields of psychology, sociolinguistics and language technologies) is considered more effective than shallow approaches to the analysis of discussion data: if you build models from an understanding of these theories, the models will be more effective.

(Accidental) overfitting is always a risk, so you have to be aware of the important methodological issues for avoiding it – overfitting is „where you build a model that's really too particular to the data that you have trained it on, in such a way that you don't get good generalization to new data that you want to apply your model to“. Keep the data you train on and the data you test on separate, but make sure the data set you train the models on is representative of the data you later test on.
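To make the train/test separation concrete, here is a minimal sketch with made-up data (my own toy example, not from the course): a „model“ that simply memorizes its training set – the extreme case of overfitting – looks perfect on the data it was trained on and collapses to chance on held-out data.

```python
# Toy labeled instances: (document_id, label). Purely illustrative data.
data = [(f"doc{i}", "pos" if i % 2 else "neg") for i in range(100)]

# Keep the data you train on and the data you test on strictly separate.
train, test = data[:80], data[80:]

# The extreme case of overfitting: memorize every training instance.
memory = dict(train)

def predict(doc):
    # Perfect recall on seen documents, a blind guess on anything new.
    return memory.get(doc, "pos")

def accuracy(dataset):
    return sum(predict(doc) == label for doc, label in dataset) / len(dataset)

print(accuracy(train))  # 1.0 – flawless on the data it was "trained" on
print(accuracy(test))   # 0.5 – chance level on new, unseen data
```

The gap between the two numbers is exactly what evaluating only on training data would hide.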

The text mining process in simple form consists of:

  • Raw textual data ->
  • Extraction of Features (with some awareness of the structure of language and of what we try to capture)  ->
  • Building a Model from those features  (From then on it’s like other kinds of data mining)  ->
  • Classification
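The four steps above can be sketched end to end in a few lines. This is only an illustration with invented example sentences and a deliberately simple word-overlap „model“ – not the algorithms LightSide actually uses:

```python
from collections import Counter

# 1. Raw textual data: invented (text, label) pairs.
train_docs = [("the lecture was great and clear", "pos"),
              ("great videos and a clear structure", "pos"),
              ("the quiz was confusing and unclear", "neg"),
              ("confusing tasks and an unclear manual", "neg")]

# 2. Feature extraction: unigram (bag-of-words) features.
def features(text):
    return set(text.split())

# 3. Model building: count how often each word occurs per class.
model = {"pos": Counter(), "neg": Counter()}
for text, label in train_docs:
    model[label].update(features(text))

# 4. Classification: pick the class whose word counts overlap most.
def classify(text):
    words = features(text)
    return max(model, key=lambda label: sum(model[label][w] for w in words))

print(classify("a clear and great explanation"))  # pos
print(classify("the manual was confusing"))       # neg
```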

A lot of work in text mining is „representation“: „You have to know what it is about the data that you want to preserve so that those instances that should be classified the same look similar – and those instances that should be classified differently look different.“ Three sets of data are recommended: a development/exploration set, an evaluation/cross-validation set (for training and testing) and a final test set.
A starting point is qualitative analysis: set aside data for development (which you don't use later on for training and testing!) and look for examples from each of the categories you want your model to distinguish between. Then you have to think about how to extract those features from the raw text (in order to build the vectors of features that you can apply a machine learning algorithm to). You extract those features from your cross-validation data and run a cross-validation to evaluate how well the model is doing. Usually it is not good enough in the first round, so you do an error analysis: you train the model on your cross-validation data, apply it to your development data and look at where the errors are occurring on the development data. With a new set of features you cross-validate again in your cross-validation set, i.e. you work iteratively on your model, trying to improve it and testing/comparing the performance on the cross-validation set.
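The cross-validation step of this loop can be sketched as follows. The data, the majority-class „learner“ and the scoring function are my own toy stand-ins; in a real project they would be replaced by proper feature extraction and a machine learning algorithm:

```python
def train(examples):
    # Trivial stand-in for a real learning algorithm: predict the
    # majority class seen in the training folds.
    labels = [label for _, label in examples]
    return max(set(labels), key=labels.count)

def evaluate(model, examples):
    # Accuracy of the (constant) majority-class prediction.
    return sum(model == label for _, label in examples) / len(examples)

def cross_validate(data, k=5):
    # Each of the k folds is held out once for testing while the model
    # is trained on the remaining folds; the k scores are averaged.
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return sum(scores) / k

# Toy cross-validation set: (document, label) pairs.
cv_set = [(f"doc{i}", "pos" if i % 3 else "neg") for i in range(50)]
print(round(cross_validate(cv_set), 2))
```

The development set stays untouched by this loop – it is only consulted for error analysis and ideas for new features.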

  • The development set is for: Qualitative analysis before machine learning + error analysis + ideas for design of new features
  • The cross-validation set is for: evaluating your performance

When you think you are done, you apply that model to the final test set. The whole process is a partnership between the developer (brains) and the algorithms (software).

3.) Sentiment analysis and my thoughts about the risks of learning analytics
One of this week's videos was about a study on the application of sentiment analysis to MOOC discussion forum data – regarding expressed sentiment and exposure to sentiment. A set of four independent variables was used in the survival model: individual positivity, individual negativity, thread-level positivity and thread-level negativity – the dependent variable was „dropout“.
It was very convincing for me to hear that it's much more complicated to really get at a student's attitude towards a course than merely counting the number of positive or negative words (and a machine learning model might not be able to do it) – we have to look below the surface-level analysis of text. Students might simply be discussing intensively using negative terms while on the other hand being very engaged. Or they might be discussing topics in which „negative“ words appear. It doesn't automatically mean that they have a negative attitude towards the course or a negative experience with it. So „simplistic ideas of sentiment predicting dropout are not supported“.
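Here is a sketch of exactly the shallow word-counting approach the study warns about – the word lists and the example post are made up by me, not taken from the paper:

```python
# Tiny invented sentiment lexica (real lexica are far larger).
POSITIVE = {"great", "love", "clear", "helpful"}
NEGATIVE = {"hard", "confusing", "problem", "error", "difficult"}

def naive_sentiment(text):
    # Score = positive word count minus negative word count.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# A highly engaged student discussing a difficult assignment:
post = "this problem is hard and my model throws an error but i will retry"
print(naive_sentiment(post))  # -3: flagged "negative" although the student is engaged
```

The score of −3 comes entirely from the topic vocabulary („problem“, „hard“, „error“), not from the student's actual attitude – which is precisely why such surface counts mislead.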
(see http://educationaldatamining.org/EDM2014/uploads/procs2014/long%20papers/130_EDM-2014-Full.pdf)

I think that it is a nice example which shows some of the difficulties with learning analytics in other situations (obviously I don't criticize the above study): you apply LA from a worthy starting point and it seems to be so simple, but it just is not, and you could even do harm if you don't proceed carefully and don't know what you're doing. I think that a lot of people who talk about learning analytics don't really know what they are talking about but might have a strong opinion for or against it – it might be another one of those topics in e-learning where people think they know it all… In addition, specialized software could mislead those people into the wrong assumption that such software results couldn't be wrong, would be impartial and would even be easily transferable to other educational settings… In times when cost reduction and measurement are esteemed, scenarios with monitoring of student and staff performance for „business“ reasons (and not really for the improvement of teaching & studying) don't seem too unrealistic to me. Yesterday I saw an article on Mashable which clearly said that even business analytics isn't easy (http://mashable.com/2014/12/05/analytics/?utm_cid=mash-com-Tw-main-link) – and in an educational context analytics is far more complicated.

„Code of practice for learning analytics: A literature review of the ethical and legal issues“ (Niall Sclater)
http://analytics.jiscinvolve.org/wp/2014/12/04/jisc-releases-report-on-ethical-and-legal-challenges-of-learning-analytics/ I think this resource from JISC (which I saw thanks to a retweet from Dragan) is a very important one, and hopefully I'll find the time to have a closer look.

I look forward to week 8 and some more insights into text mining. In week 7 I did the LightSide task with the prepared CSV files (and got the correct result), read the LightSide manual with all these definitions of terms/acronyms I'd never heard before (e.g. unigram = „A unigram feature marks the presence or absence of a single word within a text“) and came to understand that LightSide „is divided into a series of six tabs following the entire process of machine learning“.
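As a small closing sketch of that unigram definition from the LightSide manual – each feature marks the presence or absence of a single word in a text – here is how such features could be extracted (my own illustration; the vocabulary and function are invented, not LightSide's API):

```python
def unigram_features(text, vocabulary):
    # One boolean feature per vocabulary word: present in the text or not.
    words = set(text.lower().split())
    return {word: word in words for word in vocabulary}

vocab = ["model", "data", "overfitting", "unigram"]
print(unigram_features("Train the model on your data", vocab))
# {'model': True, 'data': True, 'overfitting': False, 'unigram': False}
```

A feature vector like this, built for every document, is exactly the kind of input the machine learning step of the pipeline works on.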