Definitely not for beginners in the field of analytics: that was my first impression of week 5, and it still is. The videos were way too fast for me, both in speaking pace and in content. Nevertheless, I’ll try to note what I got out of them, and hopefully it’s correct. In this blog post I’ll also cover my experiences with the „week 5 activity“.
What’s the use of prediction modeling? Sometimes it is used to predict the future, sometimes to make inferences about the present. The result could feed automated decisions made by software, or it could inform teachers so that they can do something. The starting point is that there is something you want to predict; that’s called the „label“ (= predicted variable).
a) Regression = numerical (how much of a video a student will watch, what will be the student’s score,…)
In order to build a model, you obtain a dataset where you already know the answer (= training label). There are other variables (= features, predictor variables) which are used to predict the label. Regression means that you determine which features, in which combination, can predict the label’s value. In order to interpret and compare the weights of the features, the features have to be transformed first.
One kind of regression is linear regression (often more accurate than more complex models, particularly when you cross-validate). Another kind is regression trees (either with linear equations at each leaf of the tree or as non-linear regression trees).
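A minimal sketch of what fitting a linear regression and cross-validating it could look like, in plain Python. The data (videos watched, quiz score), the function names `fit_line` and `cross_validated_error`, and the choice of three folds are all made up for illustration; it only shows the idea and is not what RapidMiner does internally.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def cross_validated_error(xs, ys, folds=3):
    """Mean absolute error, measured only on data the model was not fitted on."""
    errors = []
    for k in range(folds):
        # Every folds-th data point is held out; the rest is used for fitting.
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors += [abs(ys[i] - (slope * xs[i] + intercept)) for i in test]
    return mean(errors)

# Hypothetical training data: feature = videos watched, label = quiz score.
videos_watched = [1, 2, 3, 4, 5, 6]
quiz_score = [52, 55, 61, 64, 70, 73]

slope, intercept = fit_line(videos_watched, quiz_score)
cv_error = cross_validated_error(videos_watched, quiz_score)
print(f"score = {slope:.2f} * videos + {intercept:.2f}, cross-validated error = {cv_error:.2f}")
```

The cross-validated error is the more honest number, because each prediction is made for a student the model has not seen while fitting.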
b) Classification = set of categories (correct/wrong, will drop out/won’t drop out,…)
You get the labels from survey data, field observations, school records etc. With each label there are some features which could be used to predict it. A classifier determines which features, in which combination, can predict the label. Software like RapidMiner, Weka etc. offers a lot of classification algorithms, but it is hard to say which works best in a certain context. Educational data has lots of systematic noise, so the advice is to use conservative classifiers and to find simple models. From experience, „Support Vector Machines“, „Genetic Algorithms“ and „Neural Networks“ are considered not so useful for educational data.
Useful for educational data are „step regression“ (for binary decisions like „will the student drop out, y/n“, using a linear regression function and rounding the result to 0 or 1); „logistic regression“ (also for binary decisions like drop out y/n, by finding out the frequency of a specific value of the dependent variable; relatively conservative); „J48/C4.5 Decision Trees“ (good at dealing with interaction effects, can handle numerical and categorical predictor variables, relatively conservative, good when the same result (drop out y/n) can be reached in different ways); „JRip Decision Rules“ (a set of if/then rules; there are many algorithms for this, and decision trees are created along the way); „K* Instance-Based Classifiers“ (predict a data point from neighboring data points, good for very divergent data without easy patterns, but you need the whole dataset); and many other algorithms.
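The difference between step regression and logistic regression can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the two features (missed assignments, forum posts), the weights and the intercepts are invented and were not fitted to any real data.

```python
import math

def linear_value(weights, intercept, features):
    """Weighted sum of the features plus an intercept."""
    return intercept + sum(w * x for w, x in zip(weights, features))

def step_predict(weights, intercept, features):
    """Step regression: linear function, then round to 0 or 1 (cut-off at 0.5)."""
    return 1 if linear_value(weights, intercept, features) >= 0.5 else 0

def logistic_probability(weights, intercept, features):
    """Logistic regression: squeeze the linear value into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-linear_value(weights, intercept, features)))

# Hypothetical students: [missed assignments, forum posts]
student_a = [3, 2]
student_b = [0, 4]

# Invented weights; label 1 = "will drop out", label 0 = "won't drop out".
step_a = step_predict([0.2, -0.05], 0.1, student_a)
step_b = step_predict([0.2, -0.05], 0.1, student_b)
prob_a = logistic_probability([1.0, -0.5], -1.0, student_a)
prob_b = logistic_probability([1.0, -0.5], -1.0, student_b)

print(step_a, step_b)                      # hard yes/no decisions
print(round(prob_a, 2), round(prob_b, 2))  # drop-out probabilities
```

Step regression only gives the yes/no answer, while logistic regression also says how sure the model is, which might matter when a teacher has to decide whom to contact first.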
Week 5 Activity – Assignment: Problems and surprises
* Walk-Through
That wasn’t so easy to do, because the walk-through (which should explain how to handle the RapidMiner software) was in Flash and didn’t have a „back“ button. In my opinion, a simple PDF file would have been more helpful for understanding how the different steps fit together. Thankfully, in the meantime we got a doc file with the content. Another problem for me was that the software obviously expected users to add an operator via drag & drop and not via „return“ like I did, which worked for some operators but not for all of them.
* External resource
The actual assignment, which was in a math tutor system, wasn’t available until you answered one question correctly. It took me a long time to find this correct answer, as I didn’t expect that I had to open the example csv file in Excel in order to check whether RapidMiner listed the correct attribute types when importing the csv. As RapidMiner identified some attributes as binominal instead of polynominal, I got a wrong result for kappa.
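The manual check in Excel could also be sketched in Python: count the distinct values of every column over the whole file and see whether a nominal column has exactly two values (binominal) or more (polynominal). The function names, the sample columns and the sample rows below are all hypothetical; the real assignment file looks different, and this is not RapidMiner’s own import logic.

```python
import csv
import io

def is_number(text):
    """True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def guess_attribute_types(csv_text):
    """Guess a type per column from ALL rows of the file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    types = {}
    for column in rows[0].keys():
        values = {row[column] for row in rows}
        if all(is_number(v) for v in values):
            types[column] = "numeric"
        elif len(values) <= 2:
            types[column] = "binominal"    # at most two distinct values
        else:
            types[column] = "polynominal"  # more than two distinct values
    return types

# Made-up sample data, only for illustration.
sample_csv = """studentid,skill,answer,attempts
s1,fractions,right,1
s2,decimals,wrong,2
s3,fractions,right,1
s4,geometry,wrong,3
"""

attribute_types = guess_attribute_types(sample_csv)
for name, kind in attribute_types.items():
    print(name, "->", kind)
```

Comparing such a list with what the import dialog shows makes it easier to spot a column that was typed as binominal although it has more than two values.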
Finally, I had the correct answer and saw the start page of the math tutor system with a login window. I tried my edX login, but got no result, as obviously it wasn’t intended that I ended up on this page. Only after I changed my browser settings to accept cookies from ALL third parties did I see the first question. These questions were hard to answer without a background in statistics, and I was relieved that there were some helpful hints in the discussion forum.
I answered a question regarding kappa without the field „studentid“, then one about kappa without some other fields, and conducted analyses with „Naive Bayes“ and „W-JRip“. Then I got stuck with „Logistic Regression“: my PC ran for a long time without producing a result, and after 50 minutes I lost my patience and stopped the process. I don’t think it was intended to take such a long time, so I tried a different approach with a subset in the operator „Nominal to Numerical“, which didn’t work either. As the questions in the math tutor are not numbered, I don’t know exactly where I am at the moment (somewhere in the middle, I think; going back is not possible, and going forward is not possible until I give the correct answer).
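For anyone wondering what the kappa value actually measures: Cohen’s kappa compares how often the model agrees with the real labels against how often it would agree just by chance. A minimal sketch in Python, with invented labels and a hypothetical function name `cohens_kappa` (RapidMiner computes this for you in its performance output):

```python
from collections import Counter

def cohens_kappa(actual, predicted):
    """Agreement between actual and predicted labels, corrected for chance."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: both happen to pick the same category independently.
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    # Note: fails with a division by zero if only one category occurs at all.
    return (observed - expected) / (1 - expected)

# Invented example: 10 students, "drop" = dropped out, "stay" = stayed.
actual = ["drop"] * 4 + ["stay"] * 6
predicted = ["drop", "drop", "drop", "stay", "stay", "stay", "stay", "stay", "stay", "drop"]

kappa = cohens_kappa(actual, predicted)
perfect = cohens_kappa(actual, actual)
print(round(kappa, 3), perfect)
```

A kappa of 1 means perfect agreement and a kappa of 0 means the model is no better than chance, so a wrongly typed attribute that changes the predictions also changes kappa.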
Also frustrating was that, in my opinion, one question included a double negative, and I wondered for some time whether I should say yes for fields to exclude or yes for fields to include. My choice was nearly 100 % wrong (so maybe my answers weren’t so wrong after all…).
I have to admit that I would have stopped working on the MOOC if this topic had come up in week 1… What I got out of this week is that predictions are very difficult, even with software which does a lot of the job. When you don’t understand what you are doing, you get very wrong results; therefore a deeper engagement with statistics is required. At the beginning of the week, I understood almost nothing, but after my practice in RapidMiner I have the feeling that in some examples I actually understood what I was doing, and that’s enough for me at the moment.
* Working with RapidMiner to get the kappa and therefore access to the Math Tutor assignment
Meanwhile I’m very good at the steps which are necessary to get the kappa with W-J48, because I needed a lot of attempts until I got it right… I’ll never forget binominal and polynominal… So maybe this 720p video helps somebody who is still trying. And on a foggy November weekend with a bad cold (therefore no audio), I had time for this 🙂