Learning Analytics MOOC – Week 5 – Prediction Modeling

Definitely not for beginners in the field of analytics. That was my first impression of week 5 and it still is. The videos were way too fast for me, both in speaking and in content, but nevertheless I’ll try to note what I got out of them; hopefully it’s correct. In this blog post I’ll also cover my experiences with the „week 5 activity“.

What’s the use of prediction modeling? Sometimes to predict the future, sometimes to make inferences about the present. The result can drive automated decisions by software or inform teachers so that they can do something. Starting point: There is something you want to predict – that’s called the „Label“ (= predicted variable).

a) Regression = numerical (how much of a video a student will watch, what will be the student’s score,…)
In order to build a model, you obtain a dataset where you already know the answer (= training label). There are other variables (= features, predictor variables) which are used to predict the label. Regression means that you determine which features in which combination can predict the label’s value. In order to interpret the weight of the features, transformation is necessary.
One kind of regression is linear regression (often more accurate than complex models, particularly when you cross-validate). Another kind would be regression trees (either with linear equations at each of the leaves of the tree or as non-linear regression trees).
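To make the idea of „features predict a numerical label“ a bit more concrete, here is a minimal sketch of linear regression with a single feature, fitted by ordinary least squares. The data (minutes of video watched, quiz score) and all names are invented for illustration and do not come from the course.

```python
# Hypothetical training data: one feature (minutes of video watched)
# and the known training label (quiz score) for five students.
minutes_watched = [10.0, 20.0, 30.0, 40.0, 50.0]
quiz_score = [52.0, 61.0, 69.0, 80.0, 88.0]


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new feature value."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(minutes_watched, quiz_score)
print("slope, intercept:", model)
print("predicted score for 35 minutes:", predict(model, 35.0))
```

The slope is the weight of the feature: it says how much the predicted score changes per additional minute watched. With several features on different scales, such weights are harder to compare directly.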

b) Classification = set of categories (correct/wrong, will drop out/won’t drop out,…)
You get the labels from survey data, field observations, school records etc. With each label there are some features, which could be used to predict the label. A classifier determines which features in which combination can predict the label. Software like RapidMiner, Weka etc. offers a lot of classification algorithms, but it is hard to say which work best in a certain context. Educational data has lots of systematic noise, so the advice is to use conservative classifiers and find simple models. From experience, „Support Vector Machines“, „Genetic Algorithms“ and „Neural Networks“ are considered not so useful for educational data.
For educational data there are, among many other algorithms:
* „Step regression“ (for binary decisions like „will the student drop out y/n“, via a linear regression function and rounding to 0 or 1)
* „Logistic regression“ (for binary decisions like „will the student drop out y/n“, via finding out the frequency of a specific value of the dependent variable; relatively conservative)
* „J48/C4.5 Decision Trees“ (good at dealing with interaction effects, can handle numerical and categorical predictor variables, relatively conservative, good when the same result (drop out y/n) can be arrived at in different ways)
* „JRip Decision Rules“ (a set of if/then rules; there are many algorithms, and decision trees are created in the process)
* „K* Instance-Based Classifiers“ (predicting a data point from a neighboring data point, good for very divergent data without easy patterns; you need the whole dataset)
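The difference between step regression and logistic regression can be sketched in a few lines. This only shows the prediction step, not how a model is trained. The feature names and weights are hypothetical and hand-picked, and using the same weights for both variants is a simplification for illustration.

```python
import math

# Hypothetical, hand-picked weights. A real model would learn them
# from a dataset with known training labels.
WEIGHTS = {"missed_deadlines": 0.2, "forum_posts": -0.05}
INTERCEPT = 0.1


def linear_value(features):
    """Weighted sum of the features plus intercept."""
    return INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())


def step_regression_predict(features):
    """Step regression: linear function, then cut off at 0.5 to get 0 or 1."""
    return 1 if linear_value(features) >= 0.5 else 0


def logistic_predict_probability(features):
    """Logistic function: squeezes the linear value into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_value(features)))


# Two invented students; label 1 stands for "will drop out".
student_a = {"missed_deadlines": 3, "forum_posts": 2}
student_b = {"missed_deadlines": 0, "forum_posts": 4}

for name, student in (("A", student_a), ("B", student_b)):
    print(name, step_regression_predict(student), logistic_predict_probability(student))
```

Step regression only gives a yes/no answer, while the logistic variant also says how confident the model is, which can matter when deciding whether a teacher should intervene.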

 

Week 5 Activity – Assignment: Problems and surprises

* Walk-Through
That wasn’t so easy to do, because the walk-through (which should explain the handling of the RapidMiner software) was in Flash and didn’t have a „back“ button – in my opinion, a simple pdf file would have been more helpful to get the context of the different steps. Thankfully, in the meantime we got a doc-file with the content. Another problem for me was that the software obviously expected users to add an operator via drag&drop and not by pressing „return“ as I did, which worked for some operators but not for all of them.

* External resource
The actual assignment, which was in a math tutor system, wasn’t available until you answered one question correctly. It took me a long time to find this correct answer, as I didn’t expect that I had to open the example csv file in Excel in order to check whether RapidMiner listed the correct attribute types when importing the csv. As RapidMiner identified some field attributes as binominal instead of polynominal, I got a wrong result for kappa.
Finally, I had the correct answer and saw the start page of the math tutor system with a login window. I tried my edx login, but got no result, as obviously it wasn’t intended that I got to this page. Only after I changed my browser settings to accept cookies from ALL third parties did I see the first question. These questions were hard to answer without a background in statistics and I was relieved that there was some helpful information in the discussion forum.
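For readers who wonder what the kappa mentioned above actually measures: Cohen’s kappa compares how often the model agrees with the true labels against the agreement you would expect by chance. The following is a minimal sketch with invented labels. It is not the RapidMiner implementation, just the textbook formula kappa = (observed - expected) / (1 - expected).

```python
from collections import Counter


def cohens_kappa(actual, predicted):
    """Cohen's kappa for two equally long lists of category labels."""
    n = len(actual)
    # Share of cases where prediction and true label agree.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Agreement expected by chance, from how often each category occurs.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    # Note: undefined (division by zero) if chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)


# Invented example: Y = correct answer, N = wrong answer.
actual = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "N"]
predicted = ["Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "Y"]

kappa = cohens_kappa(actual, predicted)
print("kappa:", kappa)
```

A kappa of 1 means perfect agreement and a kappa of 0 means the model is no better than chance, which is why wrongly imported attribute types can change the value so much.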

I answered a question regarding kappa without the field „studentid“, then one about kappa without some other fields, ran analyses with „Naive Bayes“ and „W-JRip“, and got stuck with „Logistic Regression“: my PC ran for a long time without producing a result, and after 50 minutes I lost my patience and stopped the process. I don’t think it was intended to take that long, so I tried a different approach with a subset in the operator „Nominal to Numerical“, which didn’t work either. As the questions in the math tutor are not numbered, I don’t know exactly where I am at the moment (somewhere in the middle I think; going back is not possible and going forward is not possible until I give the correct answer).

Also frustrating was that, in my opinion, one question included a double negative, and I wondered for some time whether I should say yes for fields to exclude or yes for fields to include – my choice was nearly 100 % wrong (so maybe my answers weren’t so wrong at all…).

I have to admit that I would have dropped out of the MOOC if this topic had been in week 1… What I got out of this week is that predictions are very difficult even with software which does a lot of the job. When you don’t understand what you are doing, you get very wrong results, so a deeper engagement with statistics is required. At the beginning of the week I understood almost nothing, but after my practice in RapidMiner I have the feeling that in some examples I actually understood what I was doing, and that’s enough for me at the moment.

 

* Working with RapidMiner to get the kappa and therefore access to the Math Tutor assignment

By now I’m very good at the steps which are necessary to get the kappa with W-J48, because I needed a lot of attempts until I got it right… I’ll never forget binominal and polynominal… So maybe this 720p video helps somebody who is still trying. And on a foggy November weekend with a bad cold (therefore no audio), I had time for this 🙂

Learning Analytics MOOC – Week 4 – SNA Case Studies

Upon the recommendation of a friend, this week I read the book „The Circle“ by Dave Eggers, which had quite an effect on me – especially when taking a MOOC with the topic SNA at the same time… Because of that, and because I had a heavy workload in my job, I postponed engaging with DALMOOC from the evenings to this weekend. I was really interested in this week’s hangouts and watched the recordings (http://www.youtube.com/watch?v=GUUaP39VpLI and http://www.youtube.com/watch?v=ziM0EvN9n0o), which were fascinating and made me aware of another level: „that we as participants are object of study as well“. Personally, I find it a little bit scary that my social media and learning activity is under such scrutiny. Certainly I knew before the course that it would be the case – learning analytics naturally means monitoring and drawing conclusions – but it’s different when it’s for real. However, at this moment, I don’t yet know enough about what is possible via Learning Analytics and what is commonly in use. Maybe it’s inevitable and in five years or so it will be quite normal, but I think we as educators really have to think about which tools we are using and how we use them. We have to be very, very careful with our learner data at our universities.

SNA Case Studies
This week, we got an insight into some SNA case studies and the educational constructs which were used (e.g. learning design, sense of community, creative potential, social presence, academic performance, distributed MOOC pedagogy). Thankfully the full text was available and I’ll keep these examples in mind for further reading later on. SNA can help to detect patterns of interactions in online learning environments, and instructors could start intervening depending on the intended learning design (e.g. from an instructor-centered network to an „equal distribution of student distributions“). Sense of community means the extent to which learners feel that they belong to a community and, on the other hand, benefit from participating in the community (getting information, getting feedback; also relevant for student retention in universities). The aspects creative potential („network brokers are associated with achievement and creativity“), social presence („teaching presence facilitates the development of network centrality by guiding students to establish social presence“) and academic performance („those students who were central in cross-class networks had best academic performance“) were also covered in studies, but I just watched the intro-videos.
In order to learn more about cMOOCs (where social media is an integral part), this article about Twitter use in CCK11 (socio-technical approach: nodes are persons and hashtags) is very interesting: http://www.sfu.ca/~dgasevic/papers_shared/bjet2014_cmoocs.pdf

Unfortunately I don’t have the time for further experiments with Gephi (which I would really have liked to do because I value that we are doing something concrete with tools) and week 5 with different topics is near.

Learning Analytics MOOC – Week 3 – SNA

My goal with these blog posts is to summarize and reflect a little bit about things/content I’ve learned – my blog seems to be a good way to keep this for later on after the MOOC.

Week 3 is an introduction to Social Network Analysis (SNA) and gives insights into how social processes unfold. „SNA aims to understand the determinants, structure, and consequences of relationships between actors“ (Source: http://www.lifescied.org/content/13/2/167.full.pdf+html). SNA is multidisciplinary (not only sociology and statistics) and the main analysis methods are density, centrality and modularity types of analysis. We’ll do some analysis with test data and again visualization, this time with Gephi. The interesting thing will be what the use of SNA for learning is (I’m not there yet).

Networks consist of actors (= nodes) and relationships/connections, which can represent friendship, advice, hindrance or communication. In a spreadsheet, each relationship would be represented as a row naming the two nodes it connects (weight is expressed by repeating the row as many times as needed). Data can be collected by self-reports, interviews, collection from social networks (who is following whom on Twitter etc.) and special tools which collect data from an LMS (activity in online discussion boards, …) and later on be analyzed in tools like Gephi. As these networks are seldom static, you have to decide on a time frame when collecting data. Also important: anonymizing data, obtaining consent (which may lead to incomplete networks), ethics.
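The spreadsheet idea (one row per connection, weight via repeated rows) can be sketched like this. The names and rows are invented, and the code only counts repeated rows. It is not how Gephi reads its import files internally.

```python
from collections import Counter

# Each row is one observed connection (source, target), e.g. one reply
# in a discussion forum. Repeated rows mean a stronger tie. Invented data.
rows = [
    ("Anna", "Ben"),
    ("Anna", "Ben"),
    ("Ben", "Cara"),
    ("Cara", "Anna"),
    ("Anna", "Ben"),
]

# Collapse repeated rows into one weighted edge per (source, target) pair.
edge_weights = Counter(rows)

# The node list is simply every name that appears in any row.
nodes = sorted({name for row in rows for name in row})

print("nodes:", nodes)
for (source, target), weight in edge_weights.items():
    print(source, "->", target, "weight", weight)
```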

 

Network Measures

In my understanding this very informative YouTube video from Dragan Gasevic http://www.youtube.com/watch?v=Gq-4ErYLuLA  lists network measures as follows:

a) Measures which are measuring the entire network:

* Diameter = „a measure which is determining the longest distance between any pair of two nodes in the network“

* Density = „is determining the potential of the entire network to talk to each other“ (how many connections of all the possible connections are actually happening)
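Both whole-network measures can be computed by hand for a tiny network. The sketch below assumes an undirected network without duplicate edges, uses invented node names, and measures the diameter only between nodes that can reach each other.

```python
from collections import deque

# Invented undirected network with four nodes and four connections.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]


def build_neighbors(edge_list):
    """Map every node to the set of nodes it is directly connected to."""
    neighbors = {}
    for a, b in edge_list:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def distances_from(start, neighbors):
    """Breadth-first search: shortest number of steps to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def density(edge_list, neighbors):
    """Actual connections divided by all possible connections, n*(n-1)/2."""
    n = len(neighbors)
    return 2 * len(edge_list) / (n * (n - 1))


def diameter(neighbors):
    """Longest of all shortest distances between any two reachable nodes."""
    return max(max(distances_from(node, neighbors).values()) for node in neighbors)


neighbors = build_neighbors(edges)
print("density:", density(edges, neighbors))
print("diameter:", diameter(neighbors))
```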

b) Measures which are measuring the potential of individual nodes in a network:

* Concept of Centrality: (The meaning of centrality depends on which metric is used)

** Degree centrality =  A very often used measure which „indicates the total number of connections for each actor in a network“

*** In-Degree centrality = Pointing to an actor / „how many other nodes are directly trying to establish communication or are talking  to a particular node“  (popularity, prestige)

*** Out-Degree centrality = Pointing away from an actor / „outgoing connections, may mean how many emails someone sent, generosity in conversations with others“  (gregariousness)

** Betweenness centrality = „measure which indicates the ease of connection with anybody else in the network but in particular to try to connect all these potentially small subclusters of the nodes“ (network broker)

** Closeness centrality = „used to measure the ease or the shortest distance of a node to anybody else in the network (indicates how quickly you can get to anybody else in the network, not useful for networks with many actors with no ties or groups with no connection to other groups)“
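As a small illustration of the node-level measures, the following sketch counts in-degree and out-degree from a directed list of messages and computes closeness as (number of other nodes) / (sum of shortest distances). All names and data are invented. For closeness the ties are treated as undirected and the network is assumed to be connected, and betweenness is left out because it needs a longer algorithm.

```python
from collections import Counter, deque

# Invented directed ties: (sender, receiver) of a message.
messages = [("Anna", "Ben"), ("Cara", "Ben"), ("Ben", "Anna"), ("Dan", "Cara")]

# Out-degree: how many ties point away from an actor.
out_degree = Counter(sender for sender, _ in messages)
# In-degree: how many ties point to an actor.
in_degree = Counter(receiver for _, receiver in messages)


def undirected_neighbors(ties):
    """Ignore direction: who is connected to whom at all."""
    neighbors = {}
    for a, b in ties:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def closeness(node, neighbors):
    """(n - 1) / sum of shortest distances; assumes a connected network."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return (len(neighbors) - 1) / sum(dist.values())


neighbors = undirected_neighbors(messages)
for actor in sorted(neighbors):
    print(actor, in_degree[actor], out_degree[actor], closeness(actor, neighbors))
```

In this toy network Ben has the highest in-degree (popularity) and also the highest closeness, because he sits in the middle of the chain.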

It also can be interesting to think about network modularity, e.g. smaller subgroups that are closely connected to each other (modules=communities). It is relevant for later use of modularity algorithms to identify the „giant component“ and use it as a filter.

 

About Gephi
Installation was easy, but when I played around with the test data we were given, I didn’t even find the function to zoom in, so I watched this YouTube video, which was very helpful to get an overview of how to use Gephi (17 min very well spent): http://youtu.be/L0C_D68E1Q0

As I haven’t done anything like this before, I limited my tests with the example blog dataset of week 6 to the „Average Degree“ and tried to find something useful. My results are in the attached pdf-file: w3-gephi-2

I’m looking forward to week 4 – maybe the Hangout times of day will be a little bit more convenient for Central Europe again. And I still have to try Bazaar (I really want to do that), but again, this week, I had no time for that.

Learning Analytics MOOC – Week 2

The topic of this week is the „Learning Analytics Cycle“ and conducting some basic analytics with our test access to the Tableau software. The YouTube video from George Siemens about the Data/Analytics Cycle was very helpful, as was the Google Hangout with Tony Hirst about „Data Wrangling“. I also attended the Google Hangout on Wednesday and I am very impressed by the commitment of the course facilitators – thank you!

My interpretation of the Learning Analytics Cycle consists of these steps:
1. Data collection, Data Acquisition and Storage (Data is generated by or about the learners: Sources can be LMS, Student information systems, Social Media… any interaction between Learner & Institution)
2. Dataset Cleaning (missing data, different spelling of names,…)
3. Analysis & Visualization
4. Action (Intervention, Optimization,… and back to the learners)
That means the process starts with the learners and ideally the cycle / loop closes with feedback of the intervention to the learners.
When we look at data, we can do counting and sorting and therefore get different sorts of charts from the same data. As to interpretation, these aspects are relevant: looking for outliers, looking for similarities and differences, looking for trends, looking for patterns & structure.
Certainly, you have to think about what you would like to know when you do the data collection and not only when you do the analytics & visualization.
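Steps 2 and 3 of the cycle (cleaning, then counting and sorting) can be sketched on a tiny invented activity log. The field names, student names and cleaning rules are assumptions made for this illustration and have nothing to do with Tableau’s own functions.

```python
from collections import Counter

# Invented raw log: the same student appears in different spellings
# and one record has no student name at all.
raw_log = [
    {"student": "Anna Meier", "action": "forum_post"},
    {"student": "anna meier ", "action": "video_view"},
    {"student": "Ben Schulz", "action": "video_view"},
    {"student": "", "action": "forum_post"},
    {"student": "BEN SCHULZ", "action": "forum_post"},
    {"student": "Anna Meier", "action": "video_view"},
]


def clean(records):
    """Step 2: unify the spelling of names and drop records with missing data."""
    cleaned = []
    for record in records:
        # Collapse extra whitespace and use one capitalization for every name.
        name = " ".join(record["student"].split()).title()
        if not name:
            continue  # missing student name: skip the record
        cleaned.append({"student": name, "action": record["action"]})
    return cleaned


cleaned_log = clean(raw_log)

# Step 3: counting and sorting, the raw material for a bar chart.
events_per_student = Counter(record["student"] for record in cleaned_log)
ranking = events_per_student.most_common()

print(ranking)
```

Without the cleaning step the same data would show four different „students“, which is a good reminder of why cleaning comes before analysis.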

My tests with the Tableau Software:
After having registered with Tableau a lot of times –  at first to get a test version of the software (thankfully in this MOOC we get an extended test period), then to actually start it after installation and then even to get the video tutorials – I watched the „Getting started“ video (20 minutes) and was really impressed with the variety of functions.

As I don’t have a set of educational data which I could use for testing, I had to be somewhat creative and use a different kind of dataset. At our University (and in Germany) we have very strict regulations for the use of user data and logfiles, so my example doesn’t have anything to do with educational data but with recreational data… But my goal at the moment was to play around with the Tableau Software in order to get used to working with table cells, rows and visualizations, and I am satisfied with my results:

I spent more time with the MOOC this week than I planned because it was fun and creating artifacts is really very time-consuming. I am looking forward to week 3 and hopefully, I’ll find the time to try Bazaar meetings.

 

Learning Analytics MOOC – Week 1

The Horizon Report Higher Education 2014 sees Learning Analytics in the Time-to-Adoption Horizon as „One Year or Less“ (in the Report 2013 it was „Two to Three Years“), so the topic LA is quite an interesting one for Educators.

However, Learning Analytics (LA) is a term which needs definition – The Society for Learning Analytics Research defines it as „the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs“.

The slide at 3:00 min shows that it’s a good idea to know more about LA: „Data trails reveals our sentiments, our attitudes, our social connections, our intentions, what we know, how we learn and what we might do next“.

I really liked the link to the fulltext article „Educational Data Mining and Learning Analytics“ (Baker & Siemens 2014) as it gives an introduction and overview of the field (graduate programs, journals, conferences, methods & tools, differences between Educational Data Mining, EDM, and LA research communities). The reasons of the growing use of LA are cited as „a substantial increase in data quantity, improved data formats, advances in computing, and increased sophistication of tools available for analytics“.

What about software (analytics / research tools)? I try to remember that for single functionality there are NodeXL and Gephi, whereas integrated suites would be SAS, IBM BI Analytics suite and Pentaho. Open Source tools are R and Weka, and in our course we will focus on Tableau, Gephi, RapidMiner and LightSide. I intentionally skipped doing a tool matrix because at the moment I don’t feel competent enough and other things were more important to me (I decided that it’s the kind of MOOC where I choose my learning goals).

In week 1, I spent about 5 hours with the MOOC: at first looking at the course / resources / activities in edX (plus joining one of the live hangouts until midnight local time on Tuesday) and then signing in to ProSolo. My first impression of ProSolo was that it wasn’t very intuitive, so in week 2 I’ll have a closer look at what the menus „plan, learn“, „goals, competences, activities“ mean.

I look forward to week 2 of DALMOOC and I’m curious what we will do with the Tableau software.