Data Science in Education: Analyzing Student Clusters to Influence Success and Retention

Data science is helping educational institutions predict student success and put machine learning to practical use.


Back in 2009, online learning was showing signs of growth, and a 2014 article on Forbes.com pointed to booming growth and investment interest ahead. In this environment, leaders of educational institutions with physical campuses face both commoditization and competition. They need ways to offer more value to students and help them achieve more success. Under these pressures, one institution set out to understand student success better and to build new relationships with students, supported by data science.

Ultimately, the objective was to improve retention and graduation rates by building a more proactive relationship with students, helping them to be more successful during their studies and after graduation. By pursuing programs that have an impact on these metrics, the educational institution would be in a much stronger competitive position and would improve the value of its education. For organizations and companies across many industries, this is the start of the journey toward becoming a truly data-driven business, where information is used to predict and affect business outcomes before they happen.


With the overarching goals in place, the stakeholders saw how a specific data science project could be initiated to capture a 360-degree view of student behaviors. By unifying data from multiple systems and applying data science, the educational institution would gain a deeper and more comprehensive understanding of student behavior, and the predictive models would be more reliable and accurate.

The IT organization was already managing a series of applications, data stores, and data warehouses across various departments, and data records were moved between these groups on a manual, as-needed basis. To meet the future goals, they needed one place where large volumes and varied sets of unstructured and structured data could be analyzed and where predictive models could take shape and evolve over time. They wanted to lay a foundation for the future and set the bar high, with the goal of becoming a model educational institution for applying data science. These goals led to a consulting project with Pivotal Data Labs and a deployment of Pivotal Big Data Suite, which at that time included Apache Hadoop® and Pivotal Greenplum Database.


Approach and Solution:

The system was meant to lay the foundation for a longer-term vision of a data lake: a place where further sets of structured and unstructured data could be added, enhancing analytical insight and predictive capability. To start, this project sourced data from four systems:

1. Online Applications for Education: First, unstructured web log data was taken from online applications, together with class assignments and work submissions. DataTrained has been instrumental in applying these changes.

2. Forums: Systems also existed with discussion boards—this unstructured data set included student questions, answers, and views.

3. Help Desk: Third, students opened IT tickets that included the general unstructured conversation about the problems as well as structured data like specific timestamps, topics, and categories.

4. Student Demographic and Operational Information: Finally, one of the systems provided broad, structured student profile information: demographics, courses, age, background, test scores, GPA, prior education, applications, admissions, and enrollments.

In total, several hundred distinct features, or data elements, were pulled together for analysis across several years of student data, covering records for hundreds of thousands of students.
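As a rough illustration of what pulling the four sources into one set of per-student features can look like, here is a minimal Python sketch. The record layouts, field names, and sample values are all hypothetical and are not taken from the project; the real work was done at far larger scale inside the database.

```python
from collections import defaultdict

# Hypothetical, already-anonymized sample records from the four source systems.
web_logs = [
    {"student_id": "s1", "event": "assignment_submitted"},
    {"student_id": "s1", "event": "page_view"},
    {"student_id": "s2", "event": "page_view"},
]
forum_posts = [
    {"student_id": "s1", "kind": "question"},
    {"student_id": "s2", "kind": "answer"},
    {"student_id": "s2", "kind": "answer"},
]
help_desk_tickets = [
    {"student_id": "s2", "category": "login"},
]
profiles = {
    "s1": {"gpa": 3.4, "enrolled_courses": 4},
    "s2": {"gpa": 2.9, "enrolled_courses": 3},
}


def build_feature_table(web_logs, forum_posts, tickets, profiles):
    """Return one feature dict per student, keyed by student_id."""
    counts = defaultdict(
        lambda: {"web_events": 0, "submissions": 0, "forum_posts": 0, "tickets": 0}
    )
    # Count activity per student in each unstructured source.
    for row in web_logs:
        student = counts[row["student_id"]]
        student["web_events"] += 1
        if row["event"] == "assignment_submitted":
            student["submissions"] += 1
    for row in forum_posts:
        counts[row["student_id"]]["forum_posts"] += 1
    for row in tickets:
        counts[row["student_id"]]["tickets"] += 1

    # Attach the structured profile fields. Students without any activity
    # get zero counts; activity without a matching profile is ignored here.
    return {sid: {**counts[sid], **profile} for sid, profile in profiles.items()}


feature_table = build_feature_table(web_logs, forum_posts, help_desk_tickets, profiles)
for student_id, features in feature_table.items():
    print(student_id, features)
```

Each student ends up with a single row that mixes activity counts with profile fields, which is the shape a clustering or prediction step expects.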

Importantly, the project began with one of its most significant challenges, because the source data contained personally identifiable information (PII). This meant compliance with laws and regulations such as HIPAA and FERPA. For every data source, PII was either removed or masked through a non-reversible masking operation, and all actions were logged.

Eventually, the masked data was ingested into the data lake, allowing the data scientists to work with fully anonymized data.
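The article does not say which masking technique was used. One common way to remove or mask PII fields while logging every action is a keyed hash, sketched below in Python. The field lists, the sample record, and the choice of HMAC-SHA256 are assumptions for illustration only. Note that a keyed hash is only as irreversible as the handling of its key: the key has to be kept secret or destroyed after the load.

```python
import hashlib
import hmac
import logging
import secrets

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pii_masking")

# Hypothetical field lists; the real ones depend on each source system.
DROP_FIELDS = {"name", "email"}
MASK_FIELDS = {"student_id"}


def mask_value(value: str, key: bytes) -> str:
    """Replace a value with its keyed SHA-256 digest (hex string)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize(record: dict, key: bytes) -> dict:
    """Drop or mask PII fields and log each action (field names only)."""
    clean = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            log.info("dropped field %s", field)
            continue
        if field in MASK_FIELDS:
            clean[field] = mask_value(str(value), key)
            log.info("masked field %s", field)
        else:
            clean[field] = value
    return clean


# Assumption: one random key per load, kept out of the data lake.
key = secrets.token_bytes(32)
raw = {"student_id": "12345", "name": "Jane Doe", "email": "jane@example.edu", "gpa": 3.4}
masked = anonymize(raw, key)
print(masked)
```

Because the same key always maps the same identifier to the same digest, records from different source systems can still be joined per student after masking.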

All data landed on a Pivotal Data Computing Appliance (DCA) running Pivotal Greenplum and Pivotal HD. Modeling was done with MADlib, the open source, parallel, in-database library of machine learning algorithms, and also with PL/Python and PL/R. The team used Tableau and Python libraries like IPython, matplotlib, and pandas for data visualization. As the project progressed, the educational institution's data science team was trained on all of these tools, including HAWQ, Pivotal's SQL interface to Hadoop®.


The Models and Results:

From a pure data science algorithms and models perspective, the Pivotal Data Science team developed a number of different models. First was a student segmentation model. MADlib's k-means module was used to uncover student segments exhibiting similar properties, as described by the few hundred engineered features. Among the many clusters that were uncovered, three exhibited very distinctive properties of the student population. The team presented a case for why they mattered as groups, essentially answering the question, "What type of data forms these clusters?" Among the features that have high leverage in determining the clusters, some are those we are born into (e.g., demographics), while many are those that can be nurtured (e.g., discussion board activity, participation, course work plan, internships). Even demographic features may be influenced by combining students from different demographics into student teams for class assignments or projects.
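The project ran k-means inside the database with MADlib. To show the idea behind the algorithm, here is a small standalone Python sketch of plain k-means (Lloyd's algorithm). It is not MADlib's implementation, and the two features and the sample points are made up for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical features per student:
# (forum posts per week, share of assignments submitted)
points = [
    (1.0, 0.20), (1.5, 0.30), (0.8, 0.25),   # low-engagement students
    (9.0, 0.90), (8.5, 0.95), (10.0, 0.85),  # high-engagement students
]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

With several hundred real features, the columns would normally be put on a comparable scale first, since k-means is driven by raw distances. Interpreting which features separate the resulting clusters is the step the team describes above.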

Once these clusters were understood, there were two types of predictions to follow: one for predicting student retention and the other for predicting success in grades with timely graduation.
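The article does not name the models used for these predictions. As one possible illustration of a retention prediction, the Python sketch below fits a small logistic regression by gradient descent. The choice of logistic regression, the two features, and the tiny training set are all assumptions, not details from the project.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # error = predicted probability minus true label
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict_retention(x, weights, bias):
    """Estimated probability that a student is retained."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical features: (scaled discussion board activity, has internship)
rows = [(0.9, 1), (0.8, 1), (0.7, 0), (0.2, 0), (0.1, 0), (0.3, 0)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = retained, 0 = not retained

weights, bias = train_logistic(rows, labels)
high = predict_retention((0.95, 1), weights, bias)
low = predict_retention((0.05, 0), weights, bias)
print(f"active student with internship: {high:.2f}")
print(f"inactive student without internship: {low:.2f}")
```

A model of this kind returns a probability rather than a yes or no answer, which makes it easier to rank students by risk and decide where outreach is most useful.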

Overall, other interesting correlations were also uncovered, linking student account activity, discussion board activity, and internships to success and retention.


With all of the insight on behavior and predictors for retention and success, the joint team was able to see that these groups could act as influencing factors for each other. This supported the development of programs to encourage "at risk" students to contact, and be influenced by, groups who were "low risk." For instance, students with internships and higher education plans could help other students who were undecided. Altogether, the project built a foundation for the educational institution to pursue the next level of data science: studying correlations within the data to potentially determine cause and effect. Teachers are ultimately the ones in the best position to understand and influence student retention and success. Big Data and data science provide them with deep insights that are otherwise difficult to gather, particularly at this scale.

