Innovative Data Mining Based Approaches for Life Course Analysis

Ritschard, Gilbert; Gabadinho, Alexis; Mueller, Nicolas Séverin; Studer, Matthias

This communication presents a research project, just getting under way, that explores the potential of data-mining-based methods for personal life course analysis. The project also has a socio-demographic goal: to gain new insights into how socio-demographic, familial, educational and professional events are entwined, into the characteristics of typical Swiss life trajectories, and into how these characteristics change over time. Methods for analyzing personal event histories fall into two broad classes: 1) Survival methods, which focus on a given event (e.g. first union, marriage, birth of first child, first job, end of job, moving) and analyze the hazard of experiencing the event after a duration t or, more or less equivalently, the duration until the event occurs. 2) Sequential methods, which consider for each case the whole sequence of monthly or annual states of the variables of interest (education level, marital status, professional status, number and age of children, …) and attempt to discover regularities and differences in these sequences among individuals. The former approach mainly requires data in time-stamped event form, while the latter needs them in the form of sequences. For the first approach, where the classical tools are survival curves and regression-like risk models such as the widely used Cox proportional hazards model, we will discuss the advantages of survival trees. Their principle is to partition the data recursively by means of explanatory variables so as to obtain groups whose survival curves or hazard functions differ as much as possible from one group to the other. The tree approach usefully complements classical regression-like models through its ability to detect the most significant interaction effects automatically. Sequential data analysis is less popular, though it is best suited to analyzing whole trajectories in a holistic way.
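As an illustration of the survival-tree principle described above, the following minimal sketch picks, among binary explanatory variables, the one whose two groups differ most in survival. It assumes the two-group log-rank statistic as the splitting criterion, which is one common choice and not necessarily the criterion used in the project. The function names, variable names and the eight records are invented for illustration and are not taken from the SHP data.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def logrank_statistic(durations: Sequence[float],
                      events: Sequence[bool],
                      in_left: Sequence[bool]) -> float:
    """Two-group log-rank chi-square statistic (larger = more different curves)."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Cases still under observation just before time t.
        at_risk = [i for i, d in enumerate(durations) if d >= t]
        n = len(at_risk)
        n_left = sum(1 for i in at_risk if in_left[i])
        d = sum(1 for i in at_risk if durations[i] == t and events[i])
        d_left = sum(1 for i in at_risk
                     if durations[i] == t and events[i] and in_left[i])
        # Events observed in the left group minus events expected there
        # if both groups shared the same hazard.
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            share = n_left / n
            variance += d * share * (1 - share) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def best_binary_split(rows: List[Dict[str, Any]],
                      duration_key: str,
                      event_key: str,
                      candidate_keys: Sequence[str]) -> Tuple[Optional[str], float]:
    """Return the candidate variable with the largest log-rank statistic."""
    durations = [r[duration_key] for r in rows]
    events = [bool(r[event_key]) for r in rows]
    best: Tuple[Optional[str], float] = (None, 0.0)
    for key in candidate_keys:
        in_left = [bool(r[key]) for r in rows]
        if all(in_left) or not any(in_left):
            continue  # the variable does not split the data
        stat = logrank_statistic(durations, events, in_left)
        if stat > best[1]:
            best = (key, stat)
    return best


# Invented toy records: marriage duration in years, divorced = event observed.
toy_rows = [
    {"years": 20, "divorced": False, "has_child": True, "urban": True},
    {"years": 25, "divorced": False, "has_child": True, "urban": False},
    {"years": 30, "divorced": True, "has_child": True, "urban": True},
    {"years": 28, "divorced": False, "has_child": True, "urban": False},
    {"years": 3, "divorced": True, "has_child": False, "urban": True},
    {"years": 5, "divorced": True, "has_child": False, "urban": False},
    {"years": 4, "divorced": True, "has_child": False, "urban": True},
    {"years": 6, "divorced": True, "has_child": False, "urban": False},
]

split_key, split_stat = best_binary_split(
    toy_rows, "years", "divorced", ["has_child", "urban"])
print(split_key, round(split_stat, 2))
```

A full survival tree would apply the same search recursively within each resulting group, which is how interactions such as a covariate effect that holds only in some subgroups can surface without being specified in advance.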
Discrete Markov models are sometimes used, especially in mobility analysis, to analyze the transition rates between states. Clustering of sequences using optimal matching is also a very powerful descriptive tool. In this framework, we are mainly interested in: 1) mobility trees, which are simply classification trees in which the present state serves as the target class variable and the previous states belong to the set of predictors, and 2) algorithms developed in the data mining area for finding frequent subsequences, and associations between subsequences, in for instance web logs or DNA sequences. We expect such tools to help discover relevant relationships between, for example, characteristic family life subsequences and professional subsequences. Even though the project started only at the beginning of February 2007, we are able to present our very first results. Using SHP panel data, we present mobility trees for the working status (full-time active, part-time active, unemployed, non-active). These trees reveal interesting interactions between previous states and education level. For instance, women with lower education (less than full-time vocational school) are much more likely to leave the labor force after having been employed than those with higher education. Using the 2002 biographical SHP survey, we present a first attempt at a survival tree for analyzing marriage duration until divorce. One important result of this analysis is that the role of the child as a moderator of the divorce hazard depends on the birth cohort: the child seems to cement the marriage only for those born after 1940. Regarding associations between frequent subsequences, we will investigate the relationship between selected event subsequences using, among other tools, implication graphs.
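The transition rates mentioned above can be estimated directly from state sequences by counting consecutive pairs of states. The sketch below does this separately for two groups, which is the kind of contrast a mobility tree would look for automatically when education level is offered as a predictor alongside the previous state. The function name, state labels and sequences are invented for illustration and are not SHP results.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence


def transition_rates(sequences: Iterable[Sequence[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate first-order transition rates P(current state | previous state)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        # Each pair of consecutive states contributes one observed transition.
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    return {
        prev: {curr: n / sum(row.values()) for curr, n in row.items()}
        for prev, row in counts.items()
    }


# Invented annual working-status sequences, grouped by education level.
lower_education = [
    ["full", "full", "inactive", "inactive"],
    ["full", "inactive", "inactive", "part"],
]
higher_education = [
    ["full", "full", "full", "part"],
    ["full", "full", "part", "full"],
]

rates_lower = transition_rates(lower_education)
rates_higher = transition_rates(higher_education)

# Rate of leaving the labor force after a full-time year, by group.
print(rates_lower["full"].get("inactive", 0.0))
print(rates_higher["full"].get("inactive", 0.0))
```

Comparing the two tables by hand works for one covariate with two levels. A classification tree generalizes the comparison by choosing, at each node, whichever predictor (a previous state or a covariate) best separates the distribution of the present state.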

Archive ouverte UNIGE
