Mixed Method Development of Evaluation Metrics

Abstract

Designers of online search and recommendation services often need to develop metrics to assess system performance. This tutorial focuses on mixed-methods approaches to developing user-focused evaluation metrics. Metric development starts with choosing how data is logged, or how to interpret data that is already being logged, and we discuss how qualitative insights and design decisions can restrict or enable certain types of logging. Any metric created from that logged data embeds assumptions about how users interact with the system and how they evaluate those interactions. We will cover what these assumptions look like for several traditional system evaluation metrics, and highlight quantitative and qualitative methods for making those assumptions explicit and better aligned with genuine user behavior. We discuss the role that mixed-methods teams can play at each stage of metric development, from data collection, through the design of online and offline metrics, to supervising metric selection for decision making. We describe case studies and examples of these methods applied to the evaluation of personalized search and recommendation systems. Finally, we close with practical advice for applied quantitative researchers who may be in the early stages of planning collaborations with qualitative researchers on mixed-methods metric development.
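To make the idea of embedded assumptions concrete, consider discounted cumulative gain (DCG), one traditional ranking metric of the kind the tutorial examines: it assumes users scan a result list top-down and that their attention decays logarithmically with rank. The following minimal sketch (our own illustration, not taken from the tutorial materials) shows how that behavioral assumption appears directly in the metric's definition.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k results.

    Embedded behavioral assumptions: users scan the ranked list
    top-down, and their attention decays logarithmically with rank.
    """
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-indexed, so discount is log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )

# Example: graded relevance judgments for a ranked list of six items.
print(dcg_at_k([3, 2, 3, 0, 1, 2]))
```

Replacing the logarithmic discount with a different attention model is one simple way such an assumption can be made explicit and adapted to observed user behavior.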

Presentation

Slides from the tutorial can be found here: Mixed Method Development of Evaluation Metrics (KDD 2021).

Presenters

Praveen Chandar is a researcher at Spotify working on experimentation and evaluation of search and recommender systems. His research interests are in machine learning, information retrieval, and recommender systems, with a focus on experimentation and evaluation. Praveen received his Ph.D. from the University of Delaware, working on novelty and diversity aspects of search evaluation. He has published papers at top conferences including SIGIR, KDD, WSDM, WWW, CIKM, CHI, and UAI. He has also served as a program committee member at top conferences such as SIGIR, KDD, and WSDM, and co-taught a tutorial at NeurIPS 2020 on Grounding Evaluation Metrics for Human-Machine Learning Systems.

Fernando Diaz is a research scientist at Google Research Montréal. His research focuses on the design of information access systems, including search engines, music recommendation services, and crisis response platforms. He is particularly interested in understanding and addressing the societal implications of artificial intelligence. Previously, Fernando was the assistant managing director of Microsoft Research Montréal and a director of research at Spotify, where he helped establish its research organization on recommendation, search, and personalization. Fernando's work has received awards at SIGIR, WSDM, ISCRAM, and ECIR, and he is the recipient of the 2017 British Computer Society Karen Spärck Jones Award. Fernando has co-organized workshops and tutorials at SIGIR, WSDM, and WWW. He has also co-organized several NIST TREC initiatives, as well as WSDM 2013, the Strategic Workshop on Information Retrieval (2018), FAT* 2019, SIGIR 2021, and the CIFAR Workshop on Artificial Intelligence and the Curation of Culture (2019).

Christine Hosey is a user researcher at Spotify in Boston. She obtained her PhD in Behavioral Science from the University of Chicago's Booth School of Business, where her research focused on goal pursuit, motivation, and person perception. Her current research aims to infer the meaning of user behavior in a way that is usable for metric development and recommender systems. Some of her recent work has been published at top venues including CHI, WWW, SIGIR, RecSys, and the Journal of Experimental Social Psychology. Christine co-taught a tutorial at RecSys 2018 on Mixed Methods for Evaluating User Satisfaction.

Brian St. Thomas is a data scientist at Spotify in Boston, researching online experimentation methods and metric development. He received his PhD in Statistical Science from Duke University and was previously a data scientist in TiVo's Advanced Search and Recommendations division. His primary research interests are the development and evaluation of personalized recommendation and search systems, with a focus on the statistical aspects of these problems. He has contributed research in information retrieval, dimension reduction, and manifold learning, publishing in the proceedings of SIGIR, RecSys, CHI, and WWW, and in the Journal of the American Statistical Association. He also co-taught tutorials on metric development at RecSys 2018 and NeurIPS 2020.