Personas Analysis for EdTech Firm

Client Profile

The client is an edtech startup that offers standards-aligned curriculum through an article-based engagement format. They offer both free and paid versions of the product. The overarching goal of their data strategy is to drive higher student and teacher usage of the product through targeted product enhancements; the firm uses advanced analytics to determine where to target such efforts. In this project, they wanted to create and analyze distinct groups, or “personas,” of teachers based on how the teachers had responded to a survey covering their demographic information, challenges faced in the classroom, content distributed in class, and general teaching style. These questions produced a few hundred unique variables by which to group teachers.

Solution

Our client’s goal was to group teachers into personas such that the teachers in a given persona had responded to the survey similarly. From a data scientist’s point of view, this requirement presents a strong use case for cluster analysis. Cluster analysis is exactly what it sounds like: it uses data-driven, statistical methods to group a set of observations (in this case, teachers) so that each observation is more similar to the other observations in its own cluster than to the observations in any other cluster. Thus, we chose to employ cluster analysis to identify four unique personas of teachers.

An important note on cluster analysis is that unlike regression analysis, deep learning, decision trees, and other “supervised” machine learning techniques, cluster analysis is “unsupervised” machine learning. In other words, many machine learning algorithms have a definitive right or wrong outcome that the algorithm is trying to predict for each observation, which makes an objective measurement of how well the algorithm is doing its job possible. Cluster analysis has no such known correct answer to check against, so no equally objective assessment is available; the usefulness of the resulting clusters has to be judged in context.

K-Means Clustering: What Is It?

K-Means Clustering is a “centroid-based” clustering algorithm in that it uses a number of “centroids” (center points of clusters) that is chosen in advance. The algorithm assigns every data point to its nearest centroid and iteratively repositions the centroids to minimize the total squared distance between each data point and the centroid of its cluster.
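To make the idea concrete, here is a minimal sketch of the standard K-Means loop using only the Python standard library. It is an illustration rather than the code used in this project; the function names, the toy two-dimensional points, and the random-sample initialization are all assumptions made for the example. In practice an established implementation such as scikit-learn's KMeans would normally be used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Illustrative K-Means: returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen observations.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so the loop has converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids


# Toy data: two visibly separate groups of points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = k_means(toy_points, k=2)
```

Each point ends up with exactly one label, which is the property that later makes K-Means a natural fit for mutually exclusive personas.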

Other Clustering Algorithms

There are three other main forms of clustering algorithm: Expectation-Maximization Clustering (or EM Clustering for short), Density-Based Clustering, and Agglomerative Clustering. EM Clustering is similar to K-Means except that it calculates the probability that each observation belongs to each cluster instead of concretely assigning each data point to the single cluster whose center is nearest. Density-Based Clustering identifies areas where data points are tightly packed together as the clusters, while areas where data points are sparse form the boundaries between clusters; DBSCAN is a Density-Based Clustering algorithm that you may have heard of. Lastly, Agglomerative Clustering builds a hierarchy of clusters from the bottom up: every observation starts as its own cluster, and the closest clusters are merged step by step, according to a proximity metric, into broader and broader groupings until all observations have been included.
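The difference between a hard K-Means assignment and a soft, probability-based assignment can be sketched in a few lines. The example below is a simplification and not a full EM implementation: it assumes equally weighted clusters that share one spherical variance, and the function names and the variance parameter are hypothetical choices made for illustration.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_assignment(point, centers):
    """K-Means style: index of the single nearest center."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))


def soft_assignment(point, centers, variance=1.0):
    """EM style: one membership probability per center.

    Assumes equal-weight spherical Gaussians with a shared variance,
    which is a simplification of a general mixture model.
    """
    log_weights = [-sq_dist(point, c) / (2 * variance) for c in centers]
    largest = max(log_weights)  # Subtracted for numerical stability.
    weights = [math.exp(w - largest) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


centers = [(0.0, 0.0), (2.0, 0.0)]
nearest = hard_assignment((0.0, 0.0), centers)        # one cluster index
memberships = soft_assignment((0.0, 0.0), centers)    # probabilities summing to 1
```

A point that sits exactly halfway between two centers receives equal membership in both under the soft scheme, whereas the hard scheme still has to pick one of them.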

Why K-Means Clustering Instead Of The Other Options?

  1. With K-Means clustering, each data point (teacher) is placed into exactly one cluster (persona). This feature aligns with the client’s requirement of providing distinct, mutually exclusive personas. Had we used EM clustering, a teacher could have been assigned partial membership in more than one persona.
  2. Unlike DBSCAN, K-Means allowed us to manually control the number of personas that the algorithm came up with. DBSCAN chooses the number of clusters to create for you. Thus, running DBSCAN would have risked identifying an unhelpfully large or unhelpfully small number of personas (for example, only two, or more than 10), and such a result would not have been constructive for our client. Additionally, DBSCAN has trouble identifying clusters of varying density; because the vast majority of the variables in the analysis were categorical, the data points were likely to vary considerably in proximity and density.
  3. We also preferred K-Means to agglomerative clustering as the latter does not distinctly identify clusters; it would have shown how the teachers relate to each other more than how to segregate teachers into groups. In other words, the process of converting the output of agglomerative clustering into meaningful personas would have been manual and subjective rather than scientific and objective.
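Because K-Means works on numeric distances, categorical survey answers need a numeric representation before they can be clustered. This article does not state which encoding was used, so the sketch below shows one common option, one-hot encoding, purely as an assumption. The question names and answer values are hypothetical and are not taken from the client's survey.

```python
def one_hot_encode(responses, questions):
    """Turn categorical answers into 0/1 indicator columns.

    responses: list of dicts mapping question name -> answer.
    questions: ordered list of question names to encode.
    Returns (columns, rows), where columns is a list of
    (question, answer) pairs and rows holds one numeric vector
    per respondent.
    """
    # Collect the distinct answers seen for each question.
    categories = {q: sorted({r[q] for r in responses}) for q in questions}
    columns = [(q, answer) for q in questions for answer in categories[q]]
    rows = [
        [1.0 if r[q] == answer else 0.0 for q, answer in columns]
        for r in responses
    ]
    return columns, rows


# Hypothetical survey answers, for illustration only.
sample_responses = [
    {"grade_band": "K-5", "top_challenge": "time"},
    {"grade_band": "6-8", "top_challenge": "engagement"},
    {"grade_band": "K-5", "top_challenge": "engagement"},
]
columns, rows = one_hot_encode(sample_responses, ["grade_band", "top_challenge"])
```

Each respondent's vector contains exactly one 1.0 per question, so two teachers who give the same answers end up at the same point in the encoded space.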

Avoidance Of Overfitting

The ratio of features to observations in this particular dataset was quite high; in other words, there were too few observations for the number of features. An acceptable ratio would be at least 10 observations for each feature. This fact gave rise to the risk of overfitting, a phenomenon where a machine learning model effectively memorizes individual data points rather than learning overall patterns that generalize well to data that the model has not seen. In other words, if the initial model is overfit and the same model were applied to a new set of teachers’ responses that it had not seen initially, the personas that the model identifies could be completely different.

To counteract the threat of overfitting, we asked our client to rank the questions into tiers of importance so that we could focus on only the most important ones, thereby reducing the feature-to-observation ratio. They put seven questions into tier 1, six in tier 2, nine in tier 3, and three in tier 4; the rest of the questions were removed from the analysis. We first ran the algorithm with just the tier 1 variables, and then ran it again with the tier 1 and tier 2 variables; we noticed that the clusters the algorithm identified changed significantly from the first run to the second, which suggested that the groupings were not stable once more features were added. We concluded that, to minimize the risk of overfitting, our analysis should focus exclusively on tier 1 variables.
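One way to put a number on how much the cluster assignments change between two runs is the adjusted Rand index, which compares two labelings of the same observations. The article does not say how the change was measured in this project, so the sketch below is only one possible approach, and the label lists in it are made-up illustrations. A value of 1 means the two runs grouped the teachers identically (even if the cluster numbers differ), while values near 0 mean the agreement is no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same observations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same observations.")
    n = len(labels_a)
    if n < 2:
        raise ValueError("At least two observations are required.")
    # Pairs of observations that share a cluster in both runs.
    together_in_both = sum(
        comb(count, 2) for count in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs that share a cluster within each run separately.
    together_in_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    together_in_b = sum(comb(count, 2) for count in Counter(labels_b).values())
    expected = together_in_a * together_in_b / comb(n, 2)
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        return 1.0  # Degenerate case, e.g. one cluster in both runs.
    return (together_in_both - expected) / (maximum - expected)


# Made-up labels for eight teachers from two hypothetical runs.
run_tier_1 = [0, 0, 1, 1, 2, 2, 3, 3]
run_tier_1_and_2 = [0, 1, 1, 2, 2, 3, 3, 0]
stability = adjusted_rand_index(run_tier_1, run_tier_1_and_2)
```

A low score between the tier 1 run and the tier 1 and 2 run would be consistent with the instability described above.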

Conclusions

We used the elbow method to determine that the optimal number of clusters was four, and the algorithm accordingly produced four distinct personas of teachers. In general, how teachers responded to the questions about the primary challenges they face in the classroom was the best way to differentiate those teachers into the four clusters. Our client found the resulting groups to be constructive from a business standpoint. We also completed follow-up analyses within these groups. For example, what was the average age of teachers in the Blue persona? What percentage of Orange teachers distributed content to their students using email?
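The elbow method compares the within-cluster sum of squares for different numbers of clusters and looks for the point where adding another cluster stops paying off. It is usually read off a plot by eye; the sketch below substitutes a simple heuristic (the largest change in slope) so that the idea can be shown in code. The heuristic, the function names, and the sum-of-squares values are illustrative assumptions and are not results from this project.

```python
def within_cluster_ss(points, labels):
    """Total squared distance from each point to its cluster's mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        center = [sum(coords) / len(members) for coords in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


def elbow_k(ss_by_k):
    """Pick the k where the curve bends most sharply.

    ss_by_k maps consecutive values of k to the within-cluster sum
    of squares obtained with that many clusters. Assumption: the
    bend is taken to be the largest drop in improvement from one
    step to the next.
    """
    ks = sorted(ss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_before = ss_by_k[prev_k] - ss_by_k[k]
        gain_after = ss_by_k[k] - ss_by_k[next_k]
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up values for illustration only, not the project's results.
example_ss = {1: 1000.0, 2: 600.0, 3: 350.0, 4: 150.0, 5: 130.0, 6: 115.0}
suggested_k = elbow_k(example_ss)
```

With these example values the improvement flattens out after four clusters, so the heuristic suggests k = 4. On real data the curve is often less clear-cut, which is why the plot is normally inspected directly.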

Bespoke Solutions For Your Organization

Boxplot Analytics is passionate about working with all clients, regardless of their previous level of experience in data. If your organization is looking for a solution similar to the one described in this article (or any other data-oriented capability), let us know by contacting us here.