Modelling Students’ Length of Stay at University Using Coxian Phase-Type Distributions

Italian students spend remarkably longer at university than students in other European countries. For this reason, the government has recently introduced new rules for academic courses in order to reduce the number of long-term students. In addition, universities need to address the growing problem of students leaving university before completing their courses. This paper analyses the length of stay of groups of students at Milano-Bicocca University using Coxian phase-type distributions, with the groups classified according to the students' individual characteristics.


Introduction and Motivations
Economists and sociologists regard university education as an investment where the costs of the education are balanced against the future benefits of having a better educated population and employable workforce. When a student leaves the university without completing their degree, it is at a cost to the student's family, the university as an institution and society. The costs to society are due to the loss of economic output: graduates are more productive than non-graduates, and society does not benefit from the taxes that the missed graduate would have paid. Following this line of thought, the reform of Italian universities, through Law 509/1999, aims to prevent university drop outs and shorten the time taken to obtain a degree.
Moreover, graduation and drop out rates are adopted as criteria for evaluating the performance of universities. Therefore, the challenge for university managers is to make better informed policy decisions that can streamline the degree completion process, reduce the length of time it takes students to complete a degree and develop effective programmes to prevent drop outs.
The focus of this paper is on the time-to-event data where the time is the number of days elapsed from when a student first enrols until he/she experiences the event, that is, he/she graduates or drops out from that university course.
The analysis of university education has been developing systematically for a long time and a wide literature exists on this subject. Event-History analysis (DesJardin et al., 1999, 2006; Ishitani, 2003; Kalamatianou & McClean, 2003) focuses on discrete events (the student drop out or graduation) occurring over time to establish risk factors. This technique is particularly interesting for analysing the departure process because it allows both the assessment of the transition from one state to another, that is, from enrolled to not enrolled, and the identification of the factors (e.g. personal, academic, socio-economic status of family) which influence the students' decision to leave (Triventi et al., 2009).
Another strand of work makes use of Markov chain models to analyse the progression of students at university. In these models, every student occupies a state at time t and transits from state to state at time t+1 (the first and the last state represent enrolment and graduation/drop out respectively, while other states represent educational progress). Gani (1963) originally used a Markov chain for estimating the probability of Australian students completing their degree course. Shah and Burke (1999) provided estimates for the mean time a student takes to complete the course, and the mean time students spend in the higher education system in Australia. Harden and Tcheng (1971) built the transition matrices from available historical data. Other examples are given in Song and Chissom (1994) and Sah and Degtriarev (2005). Recently, Symeonaki and Kalamatianou (2011) proposed the theory of non-homogeneous Markov systems with fuzzy states for describing student educational progress in Greek universities.
The aim of this paper is to analyse student progression in the Italian reformed degree system and to estimate the influence of various factors on the probability that students, with certain characteristics, will progress successfully towards their degree or drop out. We propose to use the Coxian phase-type distribution for modelling the length of stay (in days) of the students enrolled at the University of Milano-Bicocca. Student status has been observed for six academic years, during which time the student can graduate (as of the third year), drop out, change course or university, or can still be enrolled at the end of the observation (considered right-censored in the analysis).
There has been a wealth of literature devoted to investigating the determinants of the propensity of students to drop out of the academic career or to complete their degree programme. Personal characteristics such as gender and age; individual abilities; income, education and socio-economic status of the family; academic-specific factors (services, quality in teaching, etc.); and time-varying variables such as number of passed exams or credits are found to affect student choices at university (Arulampalam et al., 2004; Arulampalam et al., 2005; Checchi & Flabbi, 2006; Johnes, 1990; Light & Strayer, 2000; Robst et al., 1998; Smith & Naylor, 2001; among others).
A recent technical report by a consortium of Italian universities (Almalaurea, 2012) showed that one out of two students enrolled at university makes a wrong decision about their own education, resulting in a high rate of drop out and a low level of satisfaction for graduated students. Motivated by this, we consider characteristics of students known on enrolment, such as individual information (age, gender) and pre-college qualification (high school, mark). In this paper, we investigate the potential of using the Coxian phase-type distributions to give insights into the risk of drop out and graduation, and into the probability to "survive" at university for groups of students sharing common characteristics. It is hoped that making students aware that, on the basis of their education to date, their background and/or their personal characteristics, they are more likely to have a particular outcome (to drop out, to complete their degree in time, or to take six years or more to graduate) will help steer them towards an appropriate course choice which meets their expectations.
A classification tree is introduced to identify the different student profiles and, for each profile, we model the student's time at university (length of stay, LoS) using different Coxian phase-type distributions. A new student, upon identifying to which of the considered groups they belong, can gain valuable perspective on his/her probability of finishing the studies or dropping out, and on how long it should take him/her to complete or give up. Within the fitted Coxian phase-type distributions, each phase could represent a specific stage in academic career or behaviour. These issues are also of interest in assessing efficiency at the system and institutional level. The Coxian models can therefore offer university leaders insights into where changes in management are needed, and help universities to develop effective retention programmes and initiatives aimed at reducing drop outs and the time taken to complete the degree. This paper is organized as follows: Section 2 introduces the Coxian phase-type distribution, Sections 3 and 4 report the analysis of length of stay at Milano-Bicocca University using Coxian phase-type distributions for groups of students classified according to their individual characteristics. Section 5 concludes the work and reports on possible future developments.

Coxian Phase-Type Distribution
Coxian phase-type distributions (Neuts, 1989) are used to describe the time to absorption of a finite Markov chain in continuous time, where there is a single absorbing state (n + 1) and n ordered transient states or phases. The process starts in the first phase, then moves through sequential phases with the choice of entering the absorbing state at any time. For example, the student career at university can be thought of as a series of transitions through latent phases until an event of leaving university occurs due to graduation, drop out or transfer. Absorption from the first phases would represent the drop out of academic programmes, while absorption from the latest phases would indicate the conclusion for those students who complete a degree.
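Let $X(t)$ denote the phase occupied at time $t$. For $i = 1, 2, \ldots, n-1$ the probability that a unit moves from one phase to the next one in the time interval $\delta t$ is

$$P\{X(t+\delta t) = i+1 \mid X(t) = i\} = \lambda_i\,\delta t + o(\delta t),$$

and for $i = 1, 2, \ldots, n$ the probability that a unit leaves the system by entering the absorbing state is

$$P\{X(t+\delta t) = n+1 \mid X(t) = i\} = \mu_i\,\delta t + o(\delta t).$$

The parameters of the Coxian phase-type distribution, $\lambda_i$ and $\mu_i$, describe the transition rates through the ordered transient states (from state $i$ to state $i+1$) and the transition rates from the transient states to the absorbing state (from state $i$ to the absorbing state $n+1$), respectively; see Figure 1. The density and survival functions of the variable $T$, the time until absorption, are given by

$$f(t) = p \exp(Qt)\, q, \qquad S(t) = p \exp(Qt)\, \mathbf{1},$$

and the hazard function is $h(t) = f(t)/S(t)$, where $p = (1, 0, 0, \ldots, 0)$ is the $1 \times n$ vector of probabilities defining the initial transient phases, $\mathbf{1}$ is the $n \times 1$ vector of ones, $q = -Q\mathbf{1} = (\mu_1, \mu_2, \ldots, \mu_n)^{\top}$ is the $n \times 1$ vector of transition rates from transient phases to the absorbing phase, and $Q$ is the matrix of transition rates restricted to the transient phases:

$$Q = \begin{pmatrix}
-(\lambda_1+\mu_1) & \lambda_1 & 0 & \cdots & 0 & 0 \\
0 & -(\lambda_2+\mu_2) & \lambda_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -(\lambda_{n-1}+\mu_{n-1}) & \lambda_{n-1} \\
0 & 0 & 0 & \cdots & 0 & -\mu_n
\end{pmatrix}.$$

The probability that the individual leaves the system at phase $i$, say $\pi_i$, is determined as a function of the estimated parameters $\mu_i$ and $\lambda_i$ for $i = 1, \ldots, n$, as follows:

$$\pi_1 = \frac{\mu_1}{\lambda_1+\mu_1}, \qquad
\pi_i = \frac{\mu_i}{\lambda_i+\mu_i} \prod_{j=1}^{i-1} \frac{\lambda_j}{\lambda_j+\mu_j} \quad (i = 2, \ldots, n-1), \qquad
\pi_n = \prod_{j=1}^{n-1} \frac{\lambda_j}{\lambda_j+\mu_j}.$$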
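As an illustration of the definitions above, the following minimal Python sketch builds the matrix Q for given rates and evaluates the density, survival and hazard functions, together with the exit probabilities per phase. It is not the code used for the analysis in this paper: the function names, the small Taylor-series matrix exponential and the numerical rates are all assumptions chosen for illustration.

```python
import numpy as np

def expm(a, terms=30):
    """Matrix exponential by scaling-and-squaring with a truncated Taylor series."""
    a = np.asarray(a, dtype=float)
    norm = np.linalg.norm(a, ord=np.inf)
    # Halve the matrix until its norm is at most 0.5, then square back up.
    squarings = 0 if norm == 0 else max(0, int(np.ceil(np.log2(norm))) + 1)
    scaled = a / (2 ** squarings)
    result = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms + 1):
        term = term @ scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

def coxian_generator(lam, mu):
    """Q restricted to the n transient phases; lam has n-1 entries, mu has n."""
    n = len(mu)
    q = np.zeros((n, n))
    for i in range(n):
        forward = lam[i] if i < n - 1 else 0.0
        q[i, i] = -(forward + mu[i])
        if i < n - 1:
            q[i, i + 1] = forward
    return q

def coxian_functions(t, lam, mu):
    """Return (f(t), S(t), h(t)) for a Coxian distribution started in phase 1."""
    n = len(mu)
    big_q = coxian_generator(lam, mu)
    p = np.zeros(n)
    p[0] = 1.0
    ones = np.ones(n)
    exit_rates = -big_q @ ones          # equals (mu_1, ..., mu_n)
    e = expm(big_q * t)
    density = p @ e @ exit_rates
    survival = p @ e @ ones
    return density, survival, density / survival

def absorption_probabilities(lam, mu):
    """pi_i: probability of being absorbed from phase i (requires mu_n > 0)."""
    n = len(mu)
    reach = 1.0                          # probability of reaching phase i
    probs = []
    for i in range(n):
        forward = lam[i] if i < n - 1 else 0.0
        total = forward + mu[i]
        probs.append(reach * mu[i] / total)
        reach *= forward / total
    return np.array(probs)

# Hypothetical daily rates for a three-phase model (illustrative values only).
lam = [0.002, 0.001]
mu = [0.003, 0.0005, 0.004]
f, s, h = coxian_functions(365.0, lam, mu)
print(f, s, h, absorption_probabilities(lam, mu))
```

With a single phase the sketch reduces to the exponential distribution, which is a convenient sanity check on the construction of Q.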
It is a usual procedure to aggregate phases sharing common characteristics to form stages; the interpretation of the stages is often more intuitive and meaningful.
Time (length of stay) may then be divided into intervals. In general, the $k$th length of stay interval (at the $k$th stage, for example) can be determined by

$$S_k = \left\{ t_{(j)} : m \sum_{i=1}^{k-1} \pi_i < j < m \sum_{i=1}^{k} \pi_i \right\},$$

where $t_{(1)}, t_{(2)}, \ldots, t_{(m)}$ represent the ordered lengths of stay data for each individual and $m$ represents the number of observations (Marshall & McClean, 2003).
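The partition of the ordered lengths of stay into stage intervals can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation: the function name and the toy data are hypothetical, and the upper bound of each interval is treated as inclusive (an assumption) so that every observation is assigned to exactly one stage.

```python
def stage_intervals(ordered_los, pi):
    """Split ordered lengths of stay into stages using the exit probabilities pi.

    Observation j (1-based rank) is assigned to stage k when
    m * sum(pi[:k-1]) < j <= m * sum(pi[:k]); the inclusive upper
    bound is an assumption made for this sketch.
    """
    m = len(ordered_los)
    bounds = [0.0]
    for prob in pi:
        bounds.append(bounds[-1] + m * prob)
    stages = []
    for k in range(len(pi)):
        lower, upper = bounds[k], bounds[k + 1]
        stages.append([t for j, t in enumerate(ordered_los, start=1)
                       if lower < j <= upper])
    return stages

# Hypothetical ordered lengths of stay (days) and exit probabilities.
los = [120, 250, 500, 640, 1100, 1150, 2000, 2190]
stages = stage_intervals(los, [0.25, 0.5, 0.25])
for k, stage in enumerate(stages, start=1):
    print(k, min(stage), max(stage))
```

The minimum and maximum of each list give the interval of lengths of stay attributed to that stage.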
Parameters $\mu_i$ and $\lambda_i$, $i = 1, \ldots, n$, are estimated by fitting Coxian phase-type distributions via the EM algorithm (Asmussen et al., 1996), appropriately modified to take censored data into account. Likelihood ratio tests are performed to determine the most suitable number of phases. The likelihood function with censored observations is:

$$L = \prod_{j=1}^{m} \left[ p \exp(Q t_j)\, q \right]^{\alpha_j} \left[ p \exp(Q t_j)\, \mathbf{1} \right]^{1-\alpha_j},$$

where $\alpha_j$ is an indicator variable which equals 1 if $t_j$ is a complete time for the $j$th unit, and $\alpha_j = 0$ if $t_j$ is censored for the $j$th unit (that is, the event does not occur before the end of the observational period).
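To see how censoring enters the likelihood, it may help to work through the simplest case of a single phase (n = 1), where the Coxian distribution reduces to an exponential distribution with rate $\mu_1$. This special case is given only as an illustration; the symbol $d$ for the number of complete observations is introduced here and is not part of the original notation.

```latex
% One-phase case: f(t) = \mu_1 e^{-\mu_1 t}, S(t) = e^{-\mu_1 t}
L(\mu_1) = \prod_{j=1}^{m} \left(\mu_1 e^{-\mu_1 t_j}\right)^{\alpha_j}
           \left(e^{-\mu_1 t_j}\right)^{1-\alpha_j}
         = \mu_1^{\,d} \exp\!\Big(-\mu_1 \sum_{j=1}^{m} t_j\Big),
\qquad d = \sum_{j=1}^{m} \alpha_j .

% Maximising \log L(\mu_1) = d \log \mu_1 - \mu_1 \sum_j t_j gives
\hat{\mu}_1 = \frac{d}{\sum_{j=1}^{m} t_j} .
```

Censored students therefore contribute their time at risk to the denominator but no event to the numerator. With more than one phase no closed form is available, which is why an iterative procedure such as the EM algorithm is used.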
Previous research has successfully used Coxian phase-type distributions to represent survival times as the length of time until a certain event occurs, where the phases are considered to be stages in the survival and the absorbing, final stage is the event that causes the individual or element to leave the system completely. For instance, this event could be a patient recovering from an illness, a patient having a relapse, an individual leaving a certain type of employment, a piece of equipment failing, or a patient dying. Faddy (1994) illustrates how useful the Coxian phase-type distributions are in representing survival times for various applications such as the length of treatment spell of control patients in a suicide study, the time prisoners spend on remand and the lifetime of rats used as controls in a study of ageing.
In particular, Faddy and McClean (1999) used the Coxian phase-type distribution to find a suitable distribution for modelling the duration of stay of a group of male geriatric patients in hospital. They found that the phase-type distributions were ideal for measuring the lengths of stay of patients in hospital and showed how it was also possible to consider other variables that may influence the duration. More recently, Marshall and McClean (2003) have demonstrated how the Coxian phase-type distribution can, unlike alternative approaches, adequately model the survival of various groups of elderly patients in hospital, uniquely capturing the typical skewed nature of such survival data in the form of a conditional phase-type model (C-Ph) which incorporates a Bayesian network of inter-related variables.

Data
The empirical analysis is conducted on administrative data of students at the Milano-Bicocca University (hereafter MIB). This university was established on June 10, 1998, to serve students from northern Italy, to relieve some of the pressure on the over-crowded University of Milan, but first of all to offer the opportunity to take a degree at a much more affordable public university than the two renowned private universities of Milan.
The analysed data refer to just over twenty thousand (20,069) students enrolled in the academic years 2000/01, 2001/02, 2002/03, 2003/04 at MIB in one of the 8 three-year degree programmes: Economics (Ec), Educational Science (ED), Law, Mathematics-Physics-Natural Sciences (MPN), Medicine (Med), Psychology (Psy), Sociology (Soc) and Statistics (Stat). Conditions required for admission to the programmes at Milano-Bicocca University vary according to the Faculty. Students who apply for Psychology, Educational Science, Sociology, or for some of the Mathematics-Physics-Natural Sciences degree programmes are selected through an entry test, while students of Economics only need to pass a mathematical and Italian language test. In addition, the undergraduate technical courses in Medicine such as Biomedical Laboratory Techniques, Dental Hygiene, Midwifery, Nursing and so on, require students to pass an aptitude test to get enrolled.
For each surveyed student, the time-to-exit from university is measured as the number of days elapsed from when a student first enrols until she/he graduates or drops out or changes institution. Drop out students never finish their degree and are academically dismissed if they provide formal renunciation at any time during the year, if they do not pay taxes or if they do not take exams for at least one year. Students who are still studying but transfer to another institution are treated as exits in the data analysis. We follow the performances of the students for six academic years; if a student is still enrolled at the end of the observation time, his/her length of stay at university is considered as right censored in the analysis.
The event of interest is whether the student leaves during their study, gets a degree, transfers to another university, or takes six or more years to finish. The life table of all the students during the six years under study is represented graphically in Figure 2. In the first 3 years, 6167 students (31%) drop out and leave their academic career, 832 (4%) change university and 4675 (23%) take the degree in the regular time. Surprisingly, out of those completing within three years, 34% did so at the beginning of the third year. It is interesting to observe that 4817 students complete the degree programme after the legal duration of the courses. They are known as fuoricorso students and represent 50.7% of graduated students, which is consistent with the other Italian universities (Miur, 2011). At the end of the observation, 1256 students (6%) are still enrolled, taking at least six years to complete their study. Table 1 shows the distribution of students by personal information and Faculty. The composition of the surveyed students is very heterogeneous among the Faculties. In particular, the courses of Mathematics and Physics attract many more male students, while Medicine, Psychology, Sociology and Educational Science have a typically female intake. Most students of Statistics and Mathematics-Physics-Natural Sciences enrol immediately after high school graduation. In general, more than half of the students live in Milan, but two out of three students enrolled on Medicine programmes come from other cities, probably because the offered courses are not commonly supplied in other Italian universities.
The preferences of students coming from different types of high school are in one case unexpected. Two-thirds of the students enrolled in Economics, Educational Sciences, Law and Medicine attended professional high schools, while half of the students who apply for the other programmes qualified at a liceo. It is unexpected that 63% and 68% of students enrolled in Law and Medicine courses, respectively, graduated from a technical high school. More or less the same percentage of students enrolled in each of the four cohorts.
The final mark at high school (Note: As the type of scale of mark at the high school has changed during the last 10 years, we considered a homogeneous version with minimum value 0.6 and maximum 1) seems not to be a discriminating factor. The average mark is quite similar for all the Faculties, except for the Psychology and Statistics programmes, which are preferred by students who graduate with a medium-high level. The smallest average high school mark is for students on Medicine courses.
Table 2 shows the percentage of students of every Faculty by the cause of exit from MIB. Overall, almost half of the students (47%) succeed in earning a degree, but a high percentage of enrolled students (42%) do not complete the degree programmes. At the end of the observation time (6 years), 6% of the students are still enrolled. As regards the Faculties, students attending courses of Law and Economics perform the worst. The rate of students who start studying Law (Economics) and never finish is dramatically high, 60% (53%). The Faculty of Law also has the highest percentage (10%) of students still "in progress" toward the degree after six years.
On the other hand, the percentage of students who graduated in Medicine programmes is substantially higher: nearly eighty percent of enrolled students complete their degree, while only 17% of students decide to drop out and less than 1% take over five years to finish. Note that these students have the smallest average mark at high school, but they do better in attaining a degree than their colleagues.
Students enrolled on Educational Sciences, Psychology, Sociology and Statistics courses perform similarly, with approximately 55% of these students completing their degree and nearly 35% dropping out.

A Classification Tree
A CART (Classification and Regression Tree) is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. Specifically, it is a non-parametric tree-structured recursive partitioning method, introduced by Breiman et al. (1984), to predict a response variable on the basis of certain predictors observed on a learning sample. The algorithm consists of two main stages: growing and pruning. In growing, the learning sample is recursively partitioned into subsets (nodes); each partition is obtained by examining all the possible binary splits along the observed data of each predictor variable and selecting the split that most reduces some measure of node impurity. The result is a sequence of nested trees, with increasing numbers of leaves (terminal nodes), until no more splits are possible and the fully grown tree is reached. The pruning operation on the fully grown tree then aims to select the best subtree and consists of declaring an internal node as terminal and deleting all its descendants; this makes the tree more general and prevents over-fitting on the training set. The aim of the classification tree is to predict the level of the response on the basis of the vector of the explanatory variables. In this paper the results for CART are obtained using the R package rpart (Therneau, 2012).
We created the classification tree reported in Figure 3 to classify the students according to their propensity for completing their degree based on individual characteristics collected at enrolment. Specifically, the categorical response variable of interest is the event that determines the exit from MIB, with four categories: degree, drop out, change, still enrolled. Each node of the tree groups together students with a common attitude towards study who behave in a similar way.
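The growing step can be illustrated with a minimal Python sketch of a single impurity-reducing split on one numeric predictor, using the Gini index as the impurity measure. It is a simplified illustration of the idea and not the rpart implementation used in this paper; the function names and the toy data (high school marks and exit causes) are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold_split(values, labels):
    """Scan all binary splits 'value <= threshold' on one numeric predictor
    and return (threshold, impurity reduction) for the best one."""
    n = len(values)
    parent = gini(labels)
    best = (None, 0.0)
    # The largest value is skipped because it would leave the right child empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical toy sample: high school mark and cause of exit.
marks = [0.62, 0.70, 0.75, 0.80, 0.90, 0.95]
outcomes = ["drop out", "drop out", "drop out", "degree", "degree", "degree"]
print(best_threshold_split(marks, outcomes))
```

A full tree repeats this search over every predictor at every node and then prunes the result; those steps are omitted here for brevity.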
The Gini index is used to evaluate the node impurity and the misclassification rate at the final stage is 0.387.The mode of the four categories is shown on the final nodes in Figure 3. Faculty (1), age at enrolment (1), mark at high school (0.7), enrolment time (0.6) and high school type (0.2) give the most relevant contribution in the classification.The normalized measure of the importance of each predictor variable in relation to the final tree is reported in brackets.
Figure 3. The classification tree for the MIB students by the risk of graduation or drop out

At the end of the pruning procedure, the terminal nodes of the tree identified 10 groups. Table 3 reports the description of the 10 groups ranked according to the likelihood of graduation, the graduation risk. For every group identified, the most frequent cause of exit is reported (column 6, Table 3). The last three columns of the table contain the mean, median and coefficient of variation of the length of stay (in days) at MIB of all the students belonging to each group.
The first group, for example, associated with a high risk of graduation, consists of Medicine students older than 19. These students are probably the most motivated: they have had to pass an admission test and mostly live outside Milan, probably deciding to come to Milan only to attend courses.
The study programmes give professional training, and students interested in such courses make job-oriented choices without wasting time. The students tend to be older than their colleagues, probably because most of them have failed the test previously. Their length of stay at MIB is 1031 days on average, so they are likely to complete their degree in a timely manner. At the end of the six years only 2 students are censored.
The last node (hereafter, the tenth) corresponds to the highest risk of drop out (67%) combined with the smallest percentage of graduated students (14%). It refers to the group of students who enrol in Economics, Mathematics-Physics-Natural Sciences, Statistics and Law at least one year after leaving school, when they are older than 19, and who drop out of the academic programmes during the first 2 years (average LoS is 775 days). They resemble the category of students who enter university without any real motivation, probably under the pressure of their family who believe in the usefulness of the degree for getting a job, but who give up along the way.
The ninth group consists of students qualified at high schools other than liceo with at most a medium mark (<=0.775), enrolled immediately after school into Mathematics-Physics-Natural Sciences, Economics and Law. They could be those students who enter college with a weak background, do not perform at the level required to meet the Faculty standards and decide to leave; in fact, half of them drop out.
Another group, the third, associated with a high risk of graduation (60%), consists of students enrolled immediately after school with the highest marks in the Faculties of Mathematics-Physics-Natural Sciences, Economics and Law. Their colleagues of the same Faculties and same age, but who graduated at liceo with low-medium marks, compose the fourth group, which stands out from the others for the longest stay at MIB: 50% of them do not complete their degree programme within the normal time and take more than 3.64 years. Another node (the sixth), corresponding to a medium risk of drop out (46%) combined with a slightly lower percentage of graduated students (36%), refers to the group of male students over 19 years old, enrolled immediately after school into the Educational Science, Sociology and Psychology Faculties. They could be those students who prefer to reconcile work with study; the programmes of these Faculties are in fact also suitable for part-time students. Students belonging to the first group are considerably more likely to graduate within the three regular years. Most of the students in the second and third groups complete their degree in a timely manner. The percentage of students who graduate in corso is higher for group five than for group six, even if the graduation rate is in the opposite order (see Table 3). The other groups can be characterized by high drop out rates and low graduation rates, and also have a high rate of fuoricorso students.

The Survival and Hazard Functions for the 10 Groups of Students
To complete the explorative study of the length of stay at MIB, the empirical survival and hazard functions for the ten groups of students can be determined. Although the general trend of the survival curves is common to all the groups, the plotted functions do not overlap and the log-rank test confirms that a significant difference exists among them (Chisq=538, df=9, p-value=0.0001).
In particular, the empirical survival curve of the first group stands out from all the other groups. It shows a sudden decline around the end of the third academic year, where most students take the degree. The curve of the second group, instead, dominates the other curves until the graduation time occurs, as the drop out/transfer rate registered for this group of students in the first three academic years is the smallest.
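An empirical survival curve that accounts for right-censored lengths of stay can be sketched with the product-limit (Kaplan-Meier) estimator. The text does not state which estimator was used, so the choice of Kaplan-Meier is an assumption here, and the function name and toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : lengths of stay in days
    events : 1 if the exit was observed, 0 if the student is right-censored
    Returns a list of (time, survival) pairs at the observed exit times.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        exits = 0
        removed = 0
        # Gather every observation tied at time t.
        while i < len(data) and data[i][0] == t:
            exits += data[i][1]
            removed += 1
            i += 1
        if exits > 0:
            survival *= 1.0 - exits / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Hypothetical lengths of stay; 0 marks students still enrolled (censored).
los = [200, 400, 400, 1100, 1200, 2190]
observed = [1, 1, 0, 1, 1, 0]
for t, s in kaplan_meier(los, observed):
    print(t, round(s, 4))
```

Computing one such curve per group gives the quantities that the log-rank test compares.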
The performance of students decreases as the group number rises.
An overall inspection of Figure 5 shows that the higher the group number, the higher the risk of drop out at the beginning of the academic career; moreover, the first groups are characterized by a greater risk of completing the degree programme than the later groups. This seems reasonable given that groups one and two have the highest percentage of students completing within three years, so their risk of completing should be greater. Likewise, it is in the earlier period of study that one would expect students to be most undecided about their course choice and most likely to drop out.

Fit of the Coxian Phase-Type Distribution
Tables 5a and 5b report the fit of the Coxian phase-type distribution parameters using the EM algorithm (Asmussen et al., 1996) adjusted for censored data. From inspection of the results, it is apparent that a 19-phase Coxian phase-type distribution is the most suitable for all the data together (Note: We use a Chi-square test for nested models, where the log-likelihood for 19 phases was -136603.4351 and the log-likelihood for 20 phases was -136602.9433, p-value=0.3884). However, it is important to note that some of the parameters $\mu_i$ associated with phase $i$ in Table 5a are equal to zero. This would suggest that no one is observed leaving this phase, which can therefore be aggregated with the neighbouring phase. Hereafter, the term stage will indicate a set of sequential phases with estimates of $\mu_i$ close to zero, aggregated together with the closest phase associated with a strictly positive $\mu_i$.
In effect, the phases with small values of the $\mu_i$ parameters are redundant and only the most dominant phases with the largest $\mu_i$ values are meaningful. This will also prevent an over-fitted model, as reported in earlier literature (Marshall et al., 2012). In each stage we calculated the probability of leaving university due to degree, drop out, or transfer. Strictly speaking, in each of the estimated phases where $\mu_i \approx 0$ it is possible to leave the process of study, but the probability is so small that students are far more likely to stay in the process than to leave. The length of stay of all the surveyed students at MIB is analysed and found to be most suitably represented by a 4-phase Coxian distribution. The estimated parameters indicate four positive absorption probabilities (Note: Actually the first value of $\pi_i$ is positive but very close to zero and it is ignored). So, the career of students seems to go through four sequential stages, which we name explorative, intermediate, outcome and tardive. The four stages appear to represent student behaviour appropriately. At the beginning of a university course (the first, explorative stage), the students face a new form of study; some of them realize that they cannot perform at the level required to meet the faculty standards and are discouraged from continuing. So, the impact of the new environment results in a peak of students leaving (the drop out students). The second, intermediate stage relates to students who have previously been unsure of their career path and, after an initial attempt to go on, decide not to pursue their studies further and either drop out or transfer; other students remain "in progress" towards a degree, proceeding step by step. The third stage, the outcome stage, comprises the motivated students who complete their degree in a timely manner; however, there will also be some students who remain in progress. The final stage (the fourth, tardive stage) comprises those students with the longest length of stay at university (fuoricorso or censored data), taking six or more years to complete their degrees.
In particular:
• The explorative stage has length of stay between 17 and 464 days. The mean time to departure from MIB is 218 days, that is, less than three quarters of the first academic year. Among students leaving university at this stage, 88% give up studying completely, while the remaining 12% decide to transfer to another university.
• The intermediate stage has length of stay between 465 and 732 days. Here we see the group of students who have been in doubt about whether to keep up the pace of study, with a mean length of stay of 585 days (about one and a half years). Unfortunately, 92% of students who leave university during this stage are drop out students, while 8% of them choose to transfer. Only 2 students take the degree in this stage (formally they graduated in the first degree session, in June).
• The outcome stage has length of stay between 733 and 1916 days. This comprises strongly motivated students who complete their degree. The average time it takes students to earn the degree in this stage is 1236 days, corresponding to about three academic years. In particular, 87 out of 100 students who exit their academic career at this stage graduate within the regular time; the other students drop out.
• The tardive stage has length of stay between 1917 and 2190 days. This final stage is regarded as the tail period where fuoricorso students graduate or remain enrolled after six years (censored LoS). Eighty percent of students complete their degree, while 20% remain enrolled up to the end of the period. Students leaving at this stage belong to the group of those who wish to graduate but succeed only after a very long stay at university. The average time to degree is equal to 2161 days, about six years, so twice the regular duration of a degree programme. Understanding the factors that lead to such a long degree completion time is one of the most crucial issues for university managers.
Table 5b reports the students leaving the university, the interval and the average LoS for each stage.So, for example, 6394 students leave during the explorative stage within 464 days and their career lasts on average 217.53 days.
In describing the stages, we focused on students who left the system, but of course there are all the motivated students who rest in progress toward a degree and proceed through these sequential stages increasing their abilities (and their human capital) by attending courses and passing exams.
Figure 6 displays how the estimated Coxian phase-type distribution fits the empirical distribution of the lengthsof-stay at MIB university.The fitted density seems to meet the characteristic shape with two peaks, the former due to the high drop out rate and the latter related to students graduating with a degree.At first glance, the results make it clear that the number of stages is quite different within the different groups.Only the distribution for the second group has the 4 stages detected in the distribution fitted on the complete sample of students.Moreover, for students in groups 3 to 6, the intermediate stage seems not to be relevant.
A case which deserves attention is the first group: here a remarkably high number of phases is registered and the algorithm for parameter estimation appears to have difficulty in reaching convergence. It is possible that a more suitable mixed (continuous-discrete) model for the performance of students in this group could avoid the convergence difficulty. This problem needs further investigation.
A comparison between the empirical and estimated distributions for each group is shown in Figure 7. The Coxian phase-type distributions differ across the student groups, but almost all the fitted densities seem to represent the empirical trends suitably, even if for some groups the fit is substantially better than for others. Recall that the groups are ordered according to the chance of graduation, so the later groups involve students whose performance is poor and who are more likely to drop out than their colleagues in the first groups, who have a marked propensity towards study. In Figure 7, the higher the group number, the better the fit of the Coxian phase-type model; thus the Coxian phase-type distribution seems to be more appropriate for modelling the length of stay at MIB of students who perform worse and have very long lengths of stay. This agrees with previous research where Coxian phase-type distributions are used to represent survival or length of stay of elderly patients in hospital. There is more heterogeneity in the earlier stages of survival, which is to be expected as there is a bigger case mix of individuals present in the first phases. In fact, Faddy and McClean (1999) and Marshall et al. (2003) both highlight that the first phase includes elderly patients who either leave the system quite quickly because they have minor problems and thus return home, or who have critical health problems and die within a short period of time in hospital. The approach is good at representing the very long stay patients who consume large amounts of hospital resources by staying in hospital for a long period of time. Likewise, the research presented in this paper is primarily concerned with those students who have very long stays at university and do not complete their course within six years.
Figure 7. The empirical and estimated density distributions for the ten groups
As expected, comparing the Coxian phase-type distributions in succession from the first to the last, it is clear that the second peak in the distributions, which relates to the lengths of stay of graduated students, gradually tails off as the group number increases, while the peak in the initial stage caused by students dropping out rises. Thus, the distributions are initially (for the first groups) bimodal and then tend to become highly skewed with only one large peak at the explorative stage (for the last groups).
In the first 4 plots, students are likely to enter the absorbing state either by dropping out or by graduating. The plots for the groups from the 5th to the 10th instead show that most of the students reach the absorbing state in the first, explorative stage: a high proportion of enrolled students never finish their degree. This represents the real challenge for university leaders, who need to put in place policies and practices to prevent this academic failure.
The percentage of students who graduated is substantially higher for the first three groups of students, and the corresponding plots exhibit the largest peak of the density at the outcome stage. In the third group, for example, students with a good background at school who decide to enrol straight into the academic programmes of Mathematics-Physics-Natural Sciences, Economics and Law are more likely to complete their degree. The Coxian phase-type distribution captures this performance. On the other hand, the over 19 year old students in the eighth group, enrolled in the Education, Sociology and Psychology Faculties at least one year after high school, do not overcome the initial difficulties and most give up within the first two academic years (approximately 460 days). The Coxian phase-type distribution is able to represent this empirical mode in the explorative stage (see Figure 9). For the first group, the fitted distribution does not capture the extremely high second peak of the empirical distribution, which is due to graduation falling on fixed days in the academic year. An alternative representation is to consider a mixture of Coxian phase-type distributions. However, doing so does not improve the fit beyond what is presented in this paper. This aspect will undergo further investigation, and we restrict the focus of this paper to the extremely long stay students.
Figure 9. The empirical and estimated hazard for the ten groups
Fitting the Coxian phase-type models enables us to offer possible insights into the risk of leaving university, through drop out or graduation, according to the group to which the student belongs. The Coxian phase-type distribution also provides estimates of the probability of "surviving" in the university system on the way to the degree, and distinguishes between different groups of students according to their survival probability.
Survival and hazard functions are estimated using the Coxian phase-type densities of each group, and are compared with the empirical curves determined by the non-parametric Kaplan-Meier procedure in Figures 8 and 9, respectively. The estimated functions approximate the empirical curves well and, as with the fitted densities (Figure 7), the fit is best for the higher-numbered groups. In particular, the estimated survival functions overlap the empirical ones for all of the later groups (particularly for groups 7-10).
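For readers unfamiliar with the Kaplan-Meier (product limit) procedure used for the empirical curves, a minimal sketch is given below. It assumes that each student contributes a length of stay in days and a flag saying whether the exit (graduation or drop out) was observed or the student was still enrolled at the end of the observation period (censored). The function name and the data are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Product-limit estimate of the survival function.

    durations: length of stay of each student (days)
    observed:  True if the student left (graduated or dropped out),
               False if still enrolled at the end of observation (censored)
    Returns a list of (time, S(time)) at each distinct exit time.
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        exits = 0
        removed = 0
        # Collect every student whose record ends at time t.
        while i < len(data) and data[i][0] == t:
            exits += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if exits:
            surv *= 1.0 - exits / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Hypothetical group of six students; two are censored at 2190 days.
los = [150, 420, 1230, 1300, 2190, 2190]
left = [True, True, True, True, False, False]
for time, s in kaplan_meier(los, left):
    print(time, round(s, 3))
```

The resulting step function is what the fitted Coxian survival curve would be compared against, time point by time point.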

Conclusion
The work presented in this paper introduces an innovative application of the Coxian phase-type distribution to the phenomena of university student progression and drop out. Two models are considered. The first introduces a classification tree to divide the students into different profiles of stay according to their characteristics known on enrolment. This produced ten groups of students with differing characteristics, whose time at university is represented using different Coxian phase-type distributions. A new student, upon identifying the group to which he/she belongs, can gain a valuable perspective on his/her probability of finishing the studies or dropping out, and on how long it should take him/her to complete or give up. Within the fitted Coxian phase-type distributions, each phase represents a specific stage in academic career or behaviour. These issues are also of interest in assessing efficiency at the system and institutional level. This enhances the usefulness of the Coxian models, which could offer university leaders insights into where changes in management are actually needed and lead universities to develop effective retention programmes and initiatives aimed at reducing drop outs and the time taken to complete a degree. Upon developing a model for the ten different student groups, the Coxian phase-type distribution was fitted again separately for each group of students. This provides further refinement of the student length of stay by modelling each student group as a sequence of phases in a separate Coxian phase-type distribution. In doing so, the survival predictions for student stay can be improved.
This second model follows a similar format to that of Harper et al. (2012) and Marshall et al. (2012), who introduce the Discrete Conditional Phase-type distribution. This uses a classification tree to model patient characteristics on admission to hospital as the first component of the model, on which the second component, the patient length of stay in hospital represented by a Coxian phase-type distribution, is conditioned. Such an approach is very applicable to student time at university and consistent with previous research. This paper extends that work to another application area and in doing so is able to use the fitted Coxian phase-type distribution to define four stages of student behaviour in university. Linked with these stages are different student characteristics and the associated likely outcome for the student in terms of graduating on time or dropping out. Student progression at university is a concern for many countries, particularly the costs incurred and the stress to the student. As further work, it is planned that the models presented in this paper will be applied to student data for other countries. One particular example that will be considered is the application of the model to university students in Greece. Another possible extension to this work is to incorporate costs into the model.

Figure 1. An illustration of the Coxian phase-type distribution

Figure 2. The progression of students during the six years of observations. The percentage is calculated with respect to students enrolled at the beginning of each academic year

Figure 4. Empirical survival curves for the ten groups of students
Figure 4 displays the empirical survival curves for each group according to the product limit estimator. The shape of the curves seems quite similar: in the first academic year, a steep decline appears overall due to the high rate of drop outs, followed by a stationary trend in the next two years, when a reduced number of drop outs and transfers is usually registered. Starting from the end of the third year, there is a gradual decrease in the survival probability as students complete their degree.

Figure 5. Empirical hazard functions for the ten groups of students

Figure 6. Empirical and estimated distribution of the length of stay at MIB university (on the complete sample of students)

Figure 8. The empirical and estimated survivals for the ten groups

Table 1. Distribution of students by faculties and information at enrolment

Table 2. Distribution of students by Faculties and exit from MIB

Table 3. Description of the groups identified by the classification tree

Table 4. Rates of in corso and fuoricorso students by groups

Table 5a. Results of fitting Coxian phase-type distribution

Table 5b. Average and interval of length of stay at each stage

Table 7. Results of fitted Coxian distribution to the length of stay in the 10 groups of students
Description: the * indicates that one more value of μ_i is positive but very close to zero, and it is ignored. The upper bound (up.bound in the table) of the intervals of LoS at each stage is indicated; the first lower bound is zero.