Statistical models and analysis techniques for learning in relational data
Many data sets routinely captured by organizations are relational in nature---from marketing and sales transactions, to scientific observations and medical records. Relational data record characteristics of heterogeneous objects and persistent relationships among those objects (e.g., citation graphs, the World Wide Web, genomic structures). These data offer unique opportunities to improve model accuracy, and thereby decision-making, if machine learning techniques can effectively exploit the relational information.
This work focuses on how to learn accurate statistical models of complex, relational data sets and develops two novel probabilistic models to represent, learn, and reason about statistical dependencies in these data. Relational dependency networks are the first relational model capable of learning general autocorrelation dependencies, an important class of statistical dependencies that are ubiquitous in relational data. Latent group models are the first relational model to generalize about the properties of underlying group structures to improve inference accuracy and efficiency. Not only do these two models offer performance gains over current relational models, but they also offer efficiency gains that make relational modeling feasible for large relational data sets where current methods are computationally intensive, if not intractable.
We also formulate a novel framework for analyzing relational model performance and ascribing errors to model learning and inference procedures. Within this framework, we explore the effects of data characteristics and representation choices on inference accuracy and investigate the mechanisms behind model performance. In particular, we show that the inference process in relational models can be a significant source of error and that relative model performance varies significantly across different types of relational data.