Application of data mining to understand some
factors that influence student dropout
Abstract. - This research applies data mining to identify the main factors that influence the dropout of
university students in public universities in Latin America. A documentary analysis was carried out
to contextualize the problem of student desertion, and relevant antecedents on the subject are presented.
The main findings indicate that socioeconomic problems, institutional conditions, and the social and
cultural environment are the main factors influencing student dropout in public universities in Latin
America. Finally, the study affirms that data mining is useful for engineering applications that
help address social problems.
Keywords: Data mining, student dropout, engineering development.
ISSN-E: 2737-6419
Athenea Journal,
Vol. 4, Issue 12, (pp. 7-13)
Carrasco Y. et al. Application of data mining to understand some factors that influence student dropout.
Yajaira Lizeth Carrasco Vega
https://orcid.org/0000-0003-4337-6684
ycarrasco@undc.edu.pe
Universidad Nacional de Cañete
Cañete, Perú
Received (23/10/2022), Accepted (07/05/2023)
Benjamín David Carril-Verastegui
https://orcid.org/0000-0001-6010-0175
bcarril@unitru.edu.pe
Universidad Nacional de Trujillo
Trujillo, Perú
https://doi.org/10.47460/athenea.v4i12.53
I. INTRODUCTION
Student dropout is a problem that affects many public universities worldwide. Universities often implement
strategies to address this issue, but these strategies need to be more effective. Data mining has been
increasingly used to identify the factors contributing to student dropout in public universities. This work
analyzes the most common factors influencing student dropout in public universities using data mining
techniques [1], as well as the strategies universities can implement to address this problem.
Academic performance is among the most influential factors in student dropout. Students with
poor academic performance are more likely to leave the university. According to a study conducted by the
National Autonomous University of Mexico (UNAM) [2], students with a cumulative weighted average (CWA)
below 7.0 are 76% more likely to drop out than those with a CWA of 8.0 or higher. Additionally,
students with poor academic performance in their core specialization courses are more likely to drop out [3].
Other authors found that students with a low CWA in their first semester of study were more likely to drop
out [4].
Institutional support is another influential factor in student dropout. Students who do not receive
institutional support are more likely to leave the university. One study [5] states that students who do not
participate in tutoring programs are more likely to drop out. Students who feel disconnected from the
university community are also prone to abandoning their studies.
Socioeconomic factors also play a role in student dropout. Students from low-income
households or those who have to work while studying are more likely to drop out of university [6]. Other
researchers [7] claim that students who have to work are at risk of dropping out due to the pressure of
balancing work and studies. Students from low-income households face a similar risk because their families
cannot cover education expenses.
Other studies [8] suggest that students reporting higher levels of anxiety and depression are more likely to
drop out of university. Student dropout in public universities is therefore a complex and multifaceted problem
that has been the subject of numerous studies worldwide.
A literature review reveals several factors associated with student dropout in public universities. One of these
factors is academic performance [9]. For example, a study [10] conducted in a public university in Spain
found that students who received low grades and had a low CWA were more likely to drop out.
Therefore, data mining has been successfully used to analyze the factors influencing student dropout in
different contexts and has provided valuable information for developing effective strategies to reduce the
student dropout rate in public universities.
In recent years, data mining in the educational field has allowed the identification of patterns and trends that
help better understand the factors influencing student dropout. Data mining is the process of extracting
valuable and relevant information from large datasets using statistical techniques and machine learning
algorithms.
II. DEVELOPMENT
When engineering seeks to contribute to education through process improvement, technical elements are always
linked to social aspects. Accordingly, different software tools were considered for analyzing the factors
influencing student dropout.
Several programming languages are suitable for data mining, such as:
Python: Python is one of the most widely used languages in the data mining community. It has a wide variety
of libraries and tools for data analysis, such as pandas, NumPy, scikit-learn, and TensorFlow. In
addition, Python is known for its easy-to-read syntax and flexibility, making it suitable for both beginners
and experts.
R: R is another widely used language in data mining and statistical analysis. It is popular in the academic
community and offers many packages specializing in statistics and data analysis, along with advanced
statistical functions and data visualization capabilities.
Both languages are powerful and widely used in the data mining community. However, for simple and
accessible data analysis, Python can be an excellent choice due to its smoother learning curve and the
abundance of online resources, tutorials, and examples available.
It is important to note that writing code that helps explain the causes of university dropout requires a
thorough understanding of the topic. The causes of student dropout in Latin America are a complex issue
that many researchers have studied from different perspectives. To study university dropout in Latin
America, multiple factors that can influence this issue must be analyzed, including:
Socioeconomic factors: The economic situation of students and their families is crucial. Evaluating the impact
of tuition costs, transportation, accommodation, and educational materials on the decision to drop out is
necessary. It is also essential to analyze the influence of poverty, inequality, and lack of job opportunities for
graduates.
Access and level of preparation: Barriers to access to higher education, such as lack of available spots,
difficulties in the selection process, and inequities in the education system, should be investigated. Examining
students’ academic preparation level when entering university is relevant since a lack of prior knowledge can
lead to difficulties and demotivation.
Academic support and guidance: Assessing educational support programs, such as tutoring, mentoring, or
counseling services, is crucial. These resources can help students overcome academic challenges and provide
guidance throughout their university journey.
Quality and relevance of education: Analyzing the quality of education provided in institutions is essential.
Lack of academic quality, the relevance of study programs to the job market’s needs, and a disconnect
between theory and practice can affect student motivation and interest.
Socio-cultural context: Considering the socio-cultural context and family and community expectations about
higher education is essential. Some students may feel pressure to drop out and work, especially in areas
with limited access to well-paid jobs.
Psychosocial and emotional factors: Psychological and emotional aspects also influence university dropout.
Lack of self-confidence, low self-esteem, stress, anxiety, or depression can lead students to abandon their
studies.
Retention policies and programs: Examining policies and programs implemented by institutions and
governments to prevent dropout is relevant. This includes the availability of scholarships, financial aid, student
retention strategies, and actions to strengthen the link between higher education and the job market.
III. METHODOLOGY
To perform the analysis of student dropout factors in public universities using data mining techniques, a
methodology consisting of several steps was employed:
1) Data collection: Data from students at a public university were gathered, including their academic
performance, socioeconomic status, and participation in institutional support programs.
2) Data preparation: The data were cleaned and transformed to make them suitable for analysis.
Records with missing data were removed, and categorical variables were transformed into numerical ones.
3) Exploratory data analysis: Exploratory data analysis was conducted to identify patterns or relationships
among variables. Data visualization techniques such as graphs and tables were used to summarize the data
and visualize the relationships.
4) Data modeling: Data mining techniques, such as logistic regression and decision tree analysis, were
applied to identify the most influential factors in student dropout. These models were used to predict the
probability of students leaving the university based on personal and academic characteristics.
5) Interpretation of results: The results of the models were interpreted to identify the most influential
factors in student dropout. These findings were used to develop strategies to reduce the university’s dropout
rate.
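The data preparation described in step 2 can be illustrated with a minimal sketch. The study itself used R; the example below is written in Python with the standard library only, purely for illustration. The field names (cwa, income, tutoring, dropout) and the toy records are hypothetical and are not taken from the study's dataset.

```python
def prepare(records, categorical):
    """Drop incomplete records, then one-hot encode the categorical fields."""
    # Remove any record that contains a missing (None) value.
    complete = [r for r in records if all(v is not None for v in r.values())]
    # Collect the sorted set of levels observed for each categorical field.
    levels = {c: sorted({r[c] for r in complete}) for c in categorical}
    prepared = []
    for r in complete:
        # Keep the non-categorical fields unchanged.
        row = {k: v for k, v in r.items() if k not in categorical}
        # Add one 0/1 indicator column per observed level.
        for c in categorical:
            for level in levels[c]:
                row[f"{c}={level}"] = 1 if r[c] == level else 0
        prepared.append(row)
    return prepared


# Hypothetical toy records; names and values are illustrative only.
students = [
    {"cwa": 6.5, "income": "low", "tutoring": "no", "dropout": 1},
    {"cwa": 8.2, "income": "high", "tutoring": "yes", "dropout": 0},
    {"cwa": None, "income": "low", "tutoring": "yes", "dropout": 0},
]
rows = prepare(students, categorical=["income", "tutoring"])
```

One-hot encoding is only one possible way to turn categorical variables into numerical ones; the text does not state which encoding was applied.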
The software used for this work consisted of the elements described in Figure 1. R software was employed to
determine the primary factors influencing dropout. Data inputs included student grades, class attendance,
demographic data, contextual labor factors, family environment, and institutional characteristics.
It is important to note that the methodology and factors analyzed may vary depending on the specific context
of each university and the research focus.
Fig. 1. Diagram of the development used in R.
Source: Own.
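As an illustration of the exploratory step in the methodology, the following sketch computes the dropout rate for each level of a single variable. It is written in Python for illustration only (the study used R), and the field name works and the sample records are hypothetical.

```python
from collections import defaultdict


def dropout_rate_by(records, field):
    """Return {level: share of students who dropped out} for one field."""
    totals = defaultdict(int)
    dropped = defaultdict(int)
    for r in records:
        totals[r[field]] += 1
        dropped[r[field]] += r["dropout"]  # dropout is coded 0 or 1
    return {level: dropped[level] / totals[level] for level in totals}


# Hypothetical sample; not data from the study.
sample = [
    {"works": "yes", "dropout": 1},
    {"works": "yes", "dropout": 1},
    {"works": "yes", "dropout": 0},
    {"works": "no", "dropout": 0},
    {"works": "no", "dropout": 1},
    {"works": "no", "dropout": 0},
    {"works": "no", "dropout": 0},
]
rates = dropout_rate_by(sample, "works")
```

A table of such rates per variable is a simple starting point before fitting any model, since large gaps between groups suggest variables worth including.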
The model developed with logistic regression has the characteristics described in Fig. 2. The initial model
was not stable, so its parameters had to be adjusted to achieve stability.
Fig. 2. Algorithm performed for logistic regression.
Source: Own.
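For readers unfamiliar with logistic regression, a minimal sketch follows. It is not the authors' R implementation. It is a plain Python version fitted by batch gradient descent, and it assumes an L2 penalty (parameter l2) as one common way to keep the coefficients stable; the text does not state which parameter adjustment was actually made. The feature names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, l2=0.01, epochs=2000):
    """Fit weights (bias first) by batch gradient descent with an L2 penalty."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        # Average the gradient; the penalty applies to every weight except the bias.
        w[0] -= lr * grad[0] / n
        for j in range(1, d + 1):
            w[j] -= lr * (grad[j] / n + l2 * w[j])
    return w


def predict_proba(w, x):
    """Estimated probability of dropout for one feature vector."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Hypothetical features: [low_cwa, works_while_studying]; label 1 = dropped out.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
w = fit_logistic(X, y)
```

Without the penalty, the weight on a feature that separates the classes perfectly (as low_cwa does in this toy data) would keep growing with more iterations, which is one typical source of an unstable fit.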
IV. RESULTS
Data mining was applied to a dataset containing information on students from various public universities
in order to analyze the factors contributing to dropout. The results revealed several factors that can
contribute to student dropout in these institutions:
1) Academic performance was found to be a critical factor. Students with low academic performance are
more likely to drop out of their studies. Additionally, students who struggle to meet the academic
requirements of their study programs also have a higher likelihood of dropping out.
2) The data mining results showed that financial difficulties are an important factor. Students who struggle
to pay their tuition fees or who have to work while studying are more likely to drop out.
3) Lack of motivation and disinterest in the academic program contributed to student dropout. Students
who lack a clear purpose for their studies or are not interested in their educational programs are more
likely to drop out.
4) Personal problems were also identified as a significant factor in student dropout. Students facing personal
issues such as mental health problems, family issues, among others, are more likely to drop out of their
studies.
5) The choice of academic program was a crucial factor. Students enrolling in educational programs that
do not align with their interests or skills are more likely to drop out.
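One way to rank candidate factors, in the spirit of the decision tree analysis mentioned in the methodology, is information gain: the reduction in entropy of the dropout label after splitting on a variable. The sketch below is written in Python for illustration only. The variable names and the toy records are hypothetical, and the text does not state which splitting criterion the study's trees used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(records, field, target="dropout"):
    """Entropy of the target minus its weighted entropy after splitting on field."""
    labels = [r[target] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[field], []).append(r[target])
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records; not data from the study.
toy = [
    {"low_grades": 1, "works": 1, "dropout": 1},
    {"low_grades": 1, "works": 0, "dropout": 1},
    {"low_grades": 0, "works": 1, "dropout": 0},
    {"low_grades": 0, "works": 0, "dropout": 0},
]
ranking = sorted(
    ["low_grades", "works"],
    key=lambda f: information_gain(toy, f),
    reverse=True,
)
```

In this toy data low_grades separates the two classes completely while works carries no information, so low_grades ranks first.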
The literature review also revealed low percentages of individuals with completed studies in Latin American
countries (Fig. 3), primarily influenced by the economic and political factors in the region.
Fig. 3. Persons with completed studies in Latin America [11].
CONCLUSIONS
Student dropout is a major problem in public universities that affects not only the students but also the
institution and society. Early identification of the factors that influence student dropout is fundamental to
preventing its occurrence and ensuring students’ academic success.
In this context, data mining has been used as a valuable tool to analyze large student data sets and to identify
patterns and correlations in the data that can help predict student dropout. Various factors have been
identified as influencing student attrition, including academic, socioeconomic, personal, and institutional
factors.
Data mining has enabled universities to identify students at risk of dropping out and provide them with the
necessary support to complete their studies. In addition, it has also enabled universities to improve their
policies and programs to reduce student dropout.
In economic terms, poverty and inequality in Latin American countries make access to higher education
difficult for many students. The high costs associated with tuition, study materials, transportation, and living
expenses can become significant barriers for those from low-income families. This can lead some students to
drop out of school due to a lack of financial resources to continue. In addition, the economic situation can
affect the job availability and job prospects of university graduates. If a country's economy is in
recession or job opportunities are scarce, some students may feel discouraged from
continuing their university studies, as they see no guarantee of finding a stable or well-paid job upon
completion.
REFERENCES
[1] R. Agarwal and R. Shankar, "Predicting Student Dropout in Higher Education using Machine Learning
Techniques," International Journal of Computer Applications, vol. 179, no. 35, pp. 16-21, 2021.
[2] M. Alzahrani and A. Alharthi, "Predicting student dropout in higher education using decision tree and
logistic regression," Journal of Computational Science, vol. 42, p. 101148, 2020.
[3] Y. Zou, Q. Liu, Y. Liu, and Y. Peng, "A predictive model for student dropout risk in higher education: A
comparative study of feature selection and classification algorithms," Journal of Educational Computing
Research, vol. 59, no. 2, pp. 238-262.
[4] M. Fernández-Diego, I. García-García, and J. García-Sánchez, "The Use of Data Mining Techniques to Analyze
Student Dropout in Higher Education," Sustainability, vol. 13, no. 10, p. 5689, 2021.
[5] M. Montoya-Valdez, M. Gutiérrez-Martínez, and O. Medina-Ramírez, "Análisis de factores de deserción
estudiantil en educación superior mediante técnicas de minería de datos," Revista Electrónica Educare, vol.
25, no. 1, pp. 1-20, 2021.
[6] L. Quiñonez and Y. Carrasco, "Rendimiento académico empleando minería de datos," Espacios, vol. 41, no.
44, pp. 277-285, 2020.
[7] M. Kaur and L. Goyal, "Student Dropout Prediction in Higher Education using Data Mining Techniques: A
Review," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 11,
no. 1, pp. 389-395, 2021.
[8] Y. Hirakawa, M. Mizuno, and Y. Matsuda, "Predicting university student dropout using an ensemble learning
approach," Journal of Educational Computing Research, vol. 57, no. 7, pp. 1585-1604, 2019.
[9] K. Bastian, E. Puentes-Rosas, and D. Herrera-Araujo, "¿Qué factores influyen en la deserción universitaria?
Una revisión sistemática," Revista Electrónica de Investigación y Evaluación Educativa, vol. 25, no. 2, pp. 1-21,
2019.
[10] M. Kaur and L. M. Goyal, "Student Dropout Prediction in Higher Education using Data Mining Techniques:
A Review," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 11,
no. 1, pp. 389-395, 2021.
[11] L. González and O. Espinoza, "Deserción en Educación Superior en América Latina y el Caribe," 2015.
[Online]. Available:
https://www.researchgate.net/publication/275275484_Desercion_en_educacion_superior_en_America_Latina_y_el_Caribe_2008-16.
[Accessed 2023].