Projects
The rise of large language models (LLMs) has given rise to a class of prompt-based interactive systems where users primarily express their input in natural language. However, composing a prompt as a linear text string becomes unwieldy when capturing users’ multifaceted intents. We present Object-Oriented Prompting (OOPrompt), an emergent interaction paradigm that enables users to create, edit, iterate, and reuse prompts as structured, manipulable artifacts, unifying and generalizing several existing point systems. We first outlined a design space from existing work and built an early prototype, which we deployed as a probe in a formative study with 20 participants. Their feedback informed an expanded OOPrompt design space. We then developed the full OOPrompt prototype and conducted a validation study to further understand OOPrompt’s added values and trade-offs. We expect the OOPrompt design space to provide theoretical and empirical guidance to the design and engineering of prompt-based, LLM-enabled interactive systems.
LLMs are now embedded in a wide range of everyday scenarios. However, their inherent hallucinations risk hiding misinformation in fluent responses, raising concerns about overreliance on AI. Detecting overreliance is challenging, as it often arises in complex, dynamic contexts and cannot be easily captured by post-hoc task outcomes. In this work, we aim to investigate how users’ behavioral patterns correlate with overreliance. We collected interaction logs from 77 participants working with an LLM injected plausible misinformation across three real-world tasks and we assessed overreliance by whether participants detected and corrected these errors. By semantically encoding and clustering segments of user interactions, we identified five behavioral patterns linked to overreliance: users with low overreliance show careful task comprehension and fine-grained navigation; users with high overreliance show frequent copy-paste, skipping initial comprehension, repeated LLM references, coarse locating, and accepting misinformation despite hesitation. We discuss design implications for mitigation.
Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact–abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs).
AI can now generate high-fidelity UI mock-up screens from a high-level textual description, promising to support UX practitioners’ work. However, it remains unclear how UX practitioners would adopt such Generative UI (GenUI) models in a way that is integral and beneficial to their work. To answer this question, we conducted a formative study with 37 UX-related professionals that consisted of four roles: UX designers, UX researchers, software engineers, and product managers. Using a state-of-the-art GenUI tool, each participant went through a week-long, individual mini-project exercise with role-specific tasks, keeping a daily journal of their usage and experiences with GenUI, followed by a semi-structured interview. We report findings on participants’ workflow using the GenUI tool, how GenUI can support all and each specific roles, and existing gaps between GenUI and users’ needs and expectations, which lead to design implications to inform future work on GenUI development.
One of the long-standing aspirations in conversational AI is to allow them to autonomously take initiatives in conversations, i.e., being proactive. This is especially challenging for multi-party conversations. Prior NLP research focused mainly on predicting the next speaker from contexts like preceding conversations. In this paper, we demonstrate the limitations of such methods and rethink what it means for AI to be proactive in multi-party, human-AI conversations. We propose that just like humans, rather than merely reacting to turn-taking cues, a proactive AI formulates its own inner thoughts during a conversation, and seeks the right moment to contribute. Through a formative study with 24 participants and inspiration from linguistics and cognitive psychology, we introduce the Inner Thoughts framework. Our framework equips AI with a continuous, covert train of thoughts in parallel to the overt communication process, which enables it to proactively engage by modeling its intrinsic motivation to express these thoughts. We instantiated this framework into two real-time systems: an AI playground web app and a chatbot. Through a technical evaluation and user studies with human participants, our framework significantly surpasses existing baselines on aspects like anthropomorphism, coherence, intelligence, and turn-taking appropriateness.
Drug discovery (DD) has tremendously contributed to maintaining and improving public health. Hypothesizing that inhibiting protein misfolding can slow disease progression, researchers focus on target identification (Target ID) to find protein structures for drug binding. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug discovery, integrating models into cohesive workflows remains challenging. We conducted a user study with drug discovery researchers to identify the applicability of LLMs and RAGs in Target ID. We identified two main findings: 1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on an initial protein and protein candidates that have a therapeutic impact; 2) the model must provide the PPI and relevant explanations for better understanding. Based on these observations, we identified three limitations in previous approaches for Target ID: 1) semantic ambiguity, 2) lack of explainability, and 3) short retrieval units. To address these issues, we propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve agent pipeline RAG framework to support large-scale PPI signaling pathway exploration in understanding therapeutic impacts by decomposing the analysis of entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
Economic constraints on recruiting experts hinder efforts to build qualified datasets for utilizing AI in professional domains (e.g., medical diagnosis), which could provide societal benefits. To solve this issue, previous studies introduced crowdsourcing and AI to enable non-experts to perform expert-level data labeling. Yet, they encountered three challenges: 1) the limited applicability of crowdsourcing in less specialized domains (e.g., identifying animal species); 2) the chicken-and-egg problem, a paradox where high-performance AI is required to build a dataset to train such AI; and 3) over-reliance on AI, where non-experts, lacking expertise, may incorrectly label data when guided by sub-optimal AI. To address this, we introduce DANNY (Data ANnotation for Non-experts made easY), an AI-based tool designed to help non-experts label an arthritis dataset, aiming to increase labeling accuracy and mitigate over-reliance on AI. By externalizing a cognitive forcing intervention to foster critical thinking, DANNY provides two visualizations: 1) the Criteria phase, where non-experts define criteria across four arthritis features, and 2) the Correction phase, where they refine these criteria by comparing them to AI suggestions. In a study with 28 participants, DANNY users achieved higher accuracy and a more appropriate reliance on AI dependency than control groups. A follow-up study with 12 participants demonstrates how DANNY can be used to improve AI with an ensemble method. Our findings contribute new insights into using AI to support non-experts in labeling domain-specific data when expert resources are limited.
As Artificial Intelligence (AI) making advancements in medical decision-making, there is a growing need to ensure doctors develop appropriate reliance on AI to avoid adverse outcomes. However, existing methods in enabling appropriate AI reliance might encounter challenges while being applied in the medical domain. With this regard, this work employs and provides the validation of an alternative approach – majority voting – to facilitate appropriate reliance on AI in medical decision-making. This is achieved by a multi-institutional user study involving 32 medical professionals with various backgrounds, focusing on the pathology task of visually detecting a pattern, mitoses, in tumor images. Here, the majority voting process was conducted by synthesizing decisions under AI assistance from a group of pathology doctors (pathologists). Two metrics were used to evaluate the appropriateness of AI reliance: Relative AI Reliance (RAIR) and Relative Self-Reliance (RSR). Results showed that even with groups of three pathologists, majority-voted decisions significantly increased both RAIR and RSR – by approximately 9% and 31%, respectively – compared to decisions made by one pathologist collaborating with AI. This increased appropriateness resulted in better precision and recall in the detection of mitoses. While our study is centered on pathology, we believe these insights can be extended to general high-stakes decision-making processes involving similar visual tasks.
Situationally Induced Impairments and Disabilities (SIIDs) can significantly hinder user experience in contexts such as poor lighting, noise, and multi-tasking. While prior research has introduced algorithms and systems to address these impairments, they predominantly cater to specific tasks or environments and fail to accommodate the diverse and dynamic nature of SIIDs. We introduce Human I/O, a unified approach to detecting a wide range of SIIDs by gauging the availability of human input/output channels. Leveraging egocentric vision, multimodal sensing and reasoning with large language models, Human I/O achieves a 0.22 mean absolute error and a 82% accuracy in availability prediction across 60 in-the-wild egocentric video recordings in 32 different scenarios. Furthermore, while the core focus of our work is on the detection of SIIDs rather than the creation of adaptive user interfaces, we showcase the efficacy of our prototype via a user study with 10 participants. Findings suggest that Human I/O significantly reduces effort and improves user experience in the presence of SIIDs, paving the way for more adaptive and accessible interactive systems in the future.
Mitosis is a critical criterion for meningioma grading. However, pathologists’ assessment of mitoses is subject to significant inter-observer variation due to challenges in locating mitosis hotspots and accurately detecting mitotic figures. To address this issue, we leverage digital pathology and propose a computational strategy to enhance pathologists’ mitosis assessment. The strategy has two components: (1) A depth-first search algorithm that quantifies the mathematically maximum mitotic count in 10 consecutive high-power fields, which can enhance the preciseness, especially in cases with borderline mitotic count. (2) Implementing a collaborative sphere to group a set of pathologists to detect mitoses under each high-power field, which can mitigate subjective random errors in mitosis detection originating from individual detection errors. By depth-first search algorithm (1) , we analyzed 19 meningioma slides and discovered that the proposed algorithm upgraded two borderline cases verified at consensus conferences. This improvement is attributed to the algorithm’s ability to quantify the mitotic count more comprehensively compared to other conventional methods of counting mitoses. In implementing a collaborative sphere (2) , we evaluated the correctness of mitosis detection from grouped pathologists and/or pathology residents, where each member of the group annotated a set of 48 high-power field images for mitotic figures independently. We report that groups with sizes of three can achieve an average precision of 0.897 and sensitivity of 0.699 in mitosis detection, which is higher than an average pathologist in this study (precision: 0.750, sensitivity: 0.667). The proposed computational strategy can be integrated with artificial intelligence workflow, which envisions the future of achieving a rapid and robust mitosis assessment by interactive assisting algorithms that can ultimately benefit patient management.
Recent progress in Text-to-Image (T2I) models promises transformative applications in art, design, education, medicine, and entertainment. These models, exemplified by Dall-e, Imagen, and Stable Diffusion, have the potential to revolutionize various industries. However, a primary concern is their operation as a 'black-box' for many users. Without understanding the underlying mechanics, users are unable to harness the full potential of these models. This study focuses on bridging this gap by developing and evaluating explanation techniques for T2I models, targeting inexperienced end users. While prior works have delved into Explainable AI (XAI) methods for classification or regression tasks, T2I generation poses distinct challenges. Through formative studies with experts, we identified unique explanation goals and subsequently designed tailored explanation strategies. We then empirically evaluated these methods with a cohort of 473 participants from Amazon Mechanical Turk (AMT) across three tasks. Our results highlight users' ability to learn new keywords through explanations, a preference for example-based explanations, and challenges in comprehending explanations that significantly shift the image's theme. Moreover, findings suggest users benefit from a limited set of concurrent explanations. Our main contributions include a curated dataset for evaluating T2I explainability techniques, insights from a comprehensive AMT user study, and observations critical for future T2I model explainability research.
To measure how HCI papers are cited across disciplinary boundaries, we collected a citation dataset of CHI, UIST, and CSCW papers published between 2010 and 2020. Our analysis indicates that HCI papers have been more and more likely to be cited by HCI papers rather than by non-HCI papers.
We explore the design of Marvista—a human-AI collaborative tool that employs a suite of natural language processing models to provide end-to-end support for reading online news articles. Before reading an article, Marvista helps a user plan what to read by filtering text based on how much time one can spend and what questions one is interested to find out from the article. During reading, Marvista helps the user reflect on their understanding of each paragraph with AI-generated questions. After reading, Marvista generates an explainable human-AI summary that combines both AI’s processing of the text, the user’s reading behavior, and user-generated data in the reading process. In contrast to prior work that offered (content-independent) interaction techniques or devices for reading, Marvista takes a human-AI collaborative approach that contributes text-specific guidance (content-aware) to support the entire reading process.
Video conferencing solutions like Zoom, Google Meet, and Microsoft Teams are becoming increasingly popular for facilitating conversations, and recent advancements such as live captioning help people better understand each other. We believe that the addition of visuals based on the context of conversations could further improve comprehension of complex or unfamiliar concepts. To explore the potential of such capabilities, we conducted a formative study through remote interviews (N=10) and crowdsourced a dataset of over 1500 sentence-visual pairs across a wide range of contexts. These insights informed Visual Captions, a real-time system that integrates with a videoconferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest relevant visuals in open-vocabulary conversations. We present the findings from a lab study (N=26) and an in-the-wild case study (N=10), demonstrating how Visual Captions can help improve communication through visual augmentation in various scenarios.
Artificial Intelligence (AI) brings advancements to support pathologists in navigating high-resolution tumor images to search for pathology patterns of interest. However, existing AI-assisted tools have not realized this promised potential due to a lack of insight into pathology and HCI considerations for pathologists' navigation workflows in practice. We first conducted a formative study with six medical professionals in pathology to capture their navigation strategies. By incorporating our observations along with the pathologists' domain knowledge, we designed NaviPath — a human-AI collaborative navigation system. An evaluation study with 15 medical professionals in pathology indicated that: (i) compared to the manual navigation, participants saw more than twice the number of pathological patterns in unit time with NaviPath, and (ii) participants achieved higher precision and recall against the AI and the manual navigation on average. Further qualitative analysis revealed that navigation was more consistent with NaviPath, which can improve the overall examination quality.
Generative adversarial networks (GANs) have many application areas including image editing, domain translation, missing data imputation, and support for creative work. However, GANs are considered `black boxes'. Specifically, the end-users have little control over how to improve editing directions through disentanglement. Prior work focused on new GAN architectures to disentangle editing directions. Alternatively, we propose GANravel --a user-driven direction disentanglement tool that complements the existing GAN architectures and allows users to improve editing directions iteratively. In two user studies with 16 participants each, GANravel users were able to disentangle directions and outperformed the state-of-the-art direction discovery baselines in disentanglement performance. In the second user study, GANravel was used in a creative task of creating dog memes and was able to create high-quality edited images and GIFs.
Recent developments in AI have provided assisting tools to support pathologists' diagnoses. However, it remains challenging to incorporate such tools into pathologists' practice; one main concern is AI's insufficient workflow integration with medical decisions. We observed pathologists' examination and discovered that the main hindering factor to integrate AI is its incompatibility with pathologists' workflow. To bridge the gap between pathologists and AI, we developed a human-AI collaborative diagnosis tool — xPath — that shares a similar examination process to that of pathologists, which can improve AI's integration into their routine examination. The viability of xPath is confirmed by a technical evaluation and work sessions with twelve medical professionals in pathology. This work identifies and addresses the challenge of incorporating AI models into pathology, which can offer first-hand knowledge about how HCI researchers can work with medical professionals side-by-side to bring technological advances to medical tasks towards practical applications.
Authors make their videos visually accessible by adding audio descriptions (AD), and auditorily accessible by adding closed captions (CC). However, creating AD and CC is challenging and tedious, especially for non-professional describers and captioners, due to the difficulty of identifying accessibility problems in videos. A video author will have to watch the video through and manually check for inaccessible information frame-by-frame, for both visual and auditory modalities. In this paper, we present CrossA11y, a system that helps authors efficiently detect and address visual and auditory accessibility issues in videos. Using cross-modal grounding analysis, CrossA11y automatically measures accessibility of visual and audio segments in a video by checking for modality asymmetries. CrossA11y then displays these segments and surfaces visual and audio accessibility issues in a unified interface, making it intuitive to locate, review, script AD/CC in-place, and preview the described and captioned video immediately. We demonstrate the effectiveness of CrossA11y through a lab study with 11 participants, comparing to existing baseline.
Generative Adversarial Network (GAN) is widely adopted in numerous application areas, such as data preprocessing, image editing, and creativity support. However, GAN's 'black box' nature prevents non-expert users from controlling what data a model generates, spawning a plethora of prior work that focused on algorithm-driven approaches to extract editing directions to control GAN. Complementarily, we propose a GANzilla—a user-driven tool that empowers a user with the classic scatter/gather technique to iteratively discover directions to meet their editing goals. In a study with 12 participants, GANzilla users were able to discover directions that (i) edited images to match provided examples (closed-ended tasks) and that (ii) met a high-level goal, e.g., making the face happier, while showing diversity across individuals (open-ended tasks).
Often, emotional disorders are overlooked due to theirlack of awareness, resulting in potential mental issues. Recent advances in sensing and inference technology provide a viable path to wearable facial-expression-based emotion recognition. However, most prior work has explored only laboratory settings and few platforms are geared towards end-users in everyday lives or provide personalized emotional suggestions to promote self-regulation. We present EmoGlass, an end-to-end wearable platform that consists of emotion detection glasses and an accompanying mobile application. Our single-camera-mounted glasses can detect seven facial expressions based on partial face images. We conducted a three-day out-of-lab study (N=15) to evaluate the performance of EmoGlass. We iterated on the design of the EmoGlass application for efective self-monitoring and awareness of users' daily emotional states. We report quantitative and qualitative fndings, based on which we discuss design recommendations for future work on sensing and enhancing awareness of emotional health.
One important vision of robotics is to provide physical assistance by manipulating different everyday objects, e.g., hand tools, kitchen utensils. However, many objects designed for dexterous hand-control are not easily manipulable by a single robotic arm with a generic parallel gripper. Complementary to existing research on developing grippers and control algorithms, we present Roman, a suite of hardware design and software tool support for robotic engineers to create 3D printable mechanisms attached to everyday handheld objects, making them easier to be manipulated by conventional robotic arms. The Roman hardware comes with a versatile magnetic gripper that can snap on/off handheld objects and drive add-on mechanisms to perform tasks. Roman also provides software support to register and author control programs. To validate our approach, we designed and fabricated Roman mechanisms for 14 everyday objects/tasks presented within a design space and conducted expert interviews with robotic engineers indicating that Roman serves as a practical alternative for enabling robotic manipulation of everyday objects.
Despite the promises of data-driven artificial intelligence (AI), little is known about how we can bridge the gulf between traditional physician-driven diagnosis and a plausible future of medicine automated by AI. Specifically, how can we involve AI usefully in physicians' diagnosis workflow given that most AI is still nascent and error-prone (e.g., in digital pathology)? To explore this question, we first propose a series of collaborative techniques to engage human pathologists with AI given AI's capabilities and limitations, based on which we prototype Impetus — a tool where an AI takes various degrees of initiatives to provide various forms of assistance to a pathologist in detecting tumors from histological slides. We summarize observations and lessons learned from a study with eight pathologists and discuss recommendations for future work on human-centered medical AI systems.
User-generated videos are an increasingly important source of information online, yet most online videos are inaccessible to blind and visually impaired (BVI) people. To find videos that are accessible, or understandable without additional description of the visual content, BVI people in our formative studies reported that they used a time-consuming trial-and-error approach: clicking on a video, watching a portion, leaving the video, and repeating the process. BVI people also reported video accessibility heuristics that characterize accessible and inaccessible videos. We instantiate 7of the identified heuristics (2 audio-related, 2 video-related, and 3audio-visual) as automated metrics to assess video accessibility. We collected a dataset of accessibility ratings of videos by BVI people and found that our automatic video accessibility metrics correlated with the accessibility ratings (Adjusted R^2= 0.642). We augmented a video search interface with our video accessibility metrics and predictions. BVI people using our augmented video search interface selected an accessible video more efficiently than when using the original search interface. By integrating video accessibility metrics, video hosting platforms could help people surface accessible videos and encourage content creators to author more accessible products, improving video accessibility for all.
Online shopping has become a valuable modern convenience, but blind or low vision (BLV) users still face significant challenges using it, because of: 1) inadequate image descriptions and 2) the inability to filter large amounts of information using screen readers. To address those challenges, we propose Revamp, a system that leverages customer reviews for interactive information retrieval. Revamp is a browser integration that supports review-based question-answering interactions on a reconstructed product page. From our interview, we identified four main aspects (color, logo, shape, and size) that are vital for BLV users to understand the visual appearance of a product. Based on the findings, we formulated syntactic rules to extract review snippets, which were used to generate image descriptions and responses to users' queries. Evaluations with eight BLV users showed that Revamp 1) provided useful descriptive information for understanding product appearance and 2) helped the participants locate key information efficiently.
Algorithms often appear as 'black boxes' to non-expert users. While prior work focuses on explainable representations and expert-oriented exploration, we propose and study an interactive approach using question answering to explain deterministic algorithms to non-expert users who need to understand the algorithms' internal states (e.g., students learning algorithms, operators monitoring robots, admins troubleshooting network routing). We construct XAlgo---a formal model that first classifies the type of question based on a taxonomy and generates an answer based on a set of rules that extract information from representations of an algorithm's internal states, e.g., the pseudocode. A design probe in an algorithm learning scenario with 18 participants (9 for a Wizard-of-Oz XAlgo and 9 as a control group) reports findings and design implications based on what kinds of questions people ask, how well XAlgo responds, and what remain as challenges to bridge users' gulf of understanding algorithms.
Patient's understanding on forthcoming dental surgeries is required by patient-centered care and helps reduce anxiety. Due to the complexity of dental surgeries and the patient-dentist expertise gap, conventional techniques of patient education are usually not effective for explaining surgical steps. In this paper, we present OralViewer—the first interactive application that enables dentist's demonstration of dental surgeries in 3D to promote patients' understanding. OralViewer takes a single 2D panoramic dental X-ray to reconstruct patient-specific 3D teeth structures, which are then assembled with registered gum and jaw bone models for complete oral cavity modeling. During the demonstration, OralViewer enables dentists to show surgery steps with virtual dental instruments that can animate effects on a 3D model in real-time. A technical evaluation shows that our deep learning model achieves a mean Intersection over Union (IoU) of 0.771 for 3D teeth reconstruction. A patient study with 12 participants shows OralViewer can improve patients' understanding of surgeries. A preliminary expert study with 3 board-certified dentists further verifies the clinical validity of our system.
We present DualVib, a compact handheld device that simulates the haptic sensation of manipulating dynamic mass; mass that causes haptic feedback as the user's hand moves (e.g., shaking a jar and feeling coins rattling inside). Unlike other devices that require actual displacement of weight, DualVib dispenses with heavy and bulky mechanical structures and, instead, uses four vibration actuators. DualVib simulates a dynamic mass by simultaneously delivering two types of haptic feedback to the user's hand: (1) pseudo-force feedback created by asymmetric vibrations that render the kinesthetic force arising from the moving mass; and (2) texture feedback through acoustic vibrations that render the object's surface vibrations correlated with mass material properties. By means of our user study, we found out that DualVib allowed users to more effectively distinguish dynamic masses when compared to using either pseudo-force or texture feedback alone. We also report qualitative feedback from users who experienced five virtual reality applications with our device.
Supporting voice commands in applications presents significant benefits to users. However, adding such support to existing GUI-based web apps is effort-consuming with a high learning barrier, as shown in our formative study, due to the lack of unified support for creating multimodal interfaces. We present Geno---a developer tool for adding the voice input modality to existing web apps without requiring significant NLP expertise. Geno provides a high-level workflow for developers to specify functionalities to be supported by voice (intents), create language models for detecting intents and the relevant information (parameters) from user utterances, and fulfill the intents by either programmatically invoking the corresponding functions or replaying GUI actions on the web app. Geno further supports multimodal references to GUI context in voice commands (e.g. 'move this [event] to next week' while pointing at an event with the cursor). In a study, developers with little NLP expertise were able to add multimodal voice command support for two existing web apps using Geno.
Reconfiguring shapes of objects enables transforming existing passive objects with robotic functionalities, e.g., a transformable coffee cup holder can be attached to a chair's armrest, a piggy bank can reach out an arm to 'steal' coins. Despite the advance in end-user 3D design and fabrication, it remains challenging for non-experts to create such 'transformables' using existing tools due to the requirement of specific engineering knowledge such as mechanisms and robotic design. We present Romeo -- a design tool for creating transformables to robotically augment objects' default functionalities. Romeo allows users to transform an object into a robotic arm by expressing at a high level what type of task is expected. Users can select which part of the object to be transformed, specify motion points in space for the transformed part to follow and the corresponding action to be taken. Romeo then automatically generates a robotic arm embedded in the transformable part ready for fabrication. A design session validated this tool where participants used Romeo to accomplish controlled design tasks and to open-endedly create coin-stealing piggy banks by transforming 3D objects of their own choice.
The recent development of data-driven AI promises to automate medical diagnosis; however, most AI functions as 'black boxes' to physicians with limited computational knowledge. Using medical imaging as a point of departure, we conducted three research activities to formulate the design of CheXplain---a system that enables physicians to explore and understand AI-enabled chest X-ray analysis: (i) a paired survey between referring physicians and radiologists reveals whether, when, and what kinds of explanations are needed; (ii) a low-fidelity prototype co-designed with three physicians formulates eight key features; and (iii) a high-fidelity prototype evaluated by another six medical professionals provides detailed summative insights on how each feature enables the exploration and understanding of AI. We summarize by discussing recommendations for future work to design and implement explainable medical AI systems that encompass four recurring themes: motivation, constraint, explanation, and justification.
Due to a lack of resources or awareness, oral diseases are often left unexamined and untreated, affecting a large population worldwide. With the advent of low-cost, sensor-equipped smartphones, mobile apps offer a promising possibility. However, to the best of our knowledge, no mobile health (mHealth) solutions can directly support users to self-examine their oral health condition. We present OralCam, the first interactive app that enables end-users' self-examination of five common oral conditions (diseases or early disease signals) by taking smartphone photos of one's oral cavity. OralCam allows a user to annotate additional information to augment the input image, and presents the output hierarchically, probabilistically and with visual explanations to help laymen users understand examination results. We describe a deep learning backend trained on a data set that consists of 3,182 oral photos. We report a technical evaluation, a week-long in-the-wild user study and an expert interview to validate OralCam.
Users can now easily communicate information with an Internet of Things; in contrast, there remains a lack of support to automate tasks that involve legacy static objects, e.g. adjusting a desk lamp's angle for optimal brightness, turning on/off a manual faucet when washing dishes, sliding a window to maintain a preferred indoor temperature. Automating these simple physical tasks has the potential to improve people's quality of life, which is particularly important for people with a disability or in situational impairment.
We present Robiot -- a design tool for generating mechanisms that can be attached to, motorized, and actuating legacy static objects to perform simple physical tasks. Users only need to take a short video manipulating an object to demonstrate an intended physical behavior. Robiot then extracts requisite parameters and automatically generates 3D models of the enabling actuation mechanisms by performing a scene and motion analysis of the 2D video in alignment with the object's 3D model. In an hour-long design session, six participants used Robiot to actuate seven everyday objects, imbuing them with the robotic capability to automate various physical tasks.
A large number of Internet-of-Things (IoT) devices will soon populate our physical environments. Yet, IoT devices' reliance on mobile applications and voice-only assistants as the primary interface limits their scalability and expressiveness. Building off of the classic 'Put-That-There' system, we contribute an exploration of the design space of voice + gesture interaction with spatially-distributed IoT devices. Our design space decomposes users' IoT commands into two components---selection and interaction. We articulate how the permutations of voice and freehand gesture for these two components can complementarily afford interaction possibilities that go beyond current approaches. We instantiate this design space as a proof-of-concept sensing platform and demonstrate a series of novel IoT interaction scenarios, such as making 'dumb' objects smart, commanding robotic appliances, and resolving ambiguous pointing at cluttered devices.
Low-cost fabrication machines (e.g., 3D printers) offer the promise of creating custom-designed objects by a range of users. To maximize performance, generative design methods such as topology optimization can automatically optimize properties of a design based on high-level specifications. Though promising, such methods require people to map their design ideas--often unintuitively--to a small number of mathematical input parameters, and the relationship between those parameters and a generated design is often unclear, making it difficult to iterate a design. We present Forte, a sketch-based, real-time interactive tool for people to directly express and iterate on their designs via 2D topology optimization. Users can ask the system to add structures, provide a variation with better performance, or optimize internal material layouts. Users can globally control how much to `deviate' from the initial sketch, or perform local suggestive editing, which interactively prompts the system to update based on the new information. Design sessions with 10 participants demonstrate that Forte empowers designers to create and explore a range of optimized designs with custom forms and styles.
In our everyday life, we interact with and benefit from objects with a wide range of material properties. In contrast, personal fabrication machines (e.g., desktop 3D printers) currently only support a much smaller set of materials. Our goal is to close the gap between current limitations and the future of multi-material printing by enabling people to explore the reuse of material from everyday objects into their custom designs. To achieve this, we develop a library of embeddables--everyday objects that can be cut, worked and embedded into 3D printable designs. We describe a design space that characterizes the geometric and material properties of embeddables. We then develop Medley---a design tool whereby users can import a 3D model, search for embeddables with desired material properties, and interactively edit and integrate their geometry to fit into the original design. Medley also supports the final fabrication and embedding process, including instructions for carving or cutting the objects, and generating optimal paths for inserting embeddables. To validate the expressiveness of our library, we showcase numerous examples augmented by embeddables that go beyond the objects' original printed materials.
As computing devices become increasingly ubiquitous, it is now possible to combine the unique capabilities of different devices or Internet of Things to accomplish a task. However, there is currently a high technical barrier for creating cross-device interaction. This is especially challenging for end users who have limited technical expertise—end users would greatly benefit from custom cross-device interaction that best suits their needs. In this article, we present Improv, a cross-device input framework that allows a user to easily leverage the capability of additional devices to create new input methods for an existing, unmodified application, e.g., creating custom gestures on a smartphone to control a desktop presentation application. Instead of requiring developers to anticipate and program these cross-device behaviors in advance, Improv enables end users to improvise them on the fly by simple demonstration, for their particular needs and devices at hand. We showcase a range of scenarios where Improv is used to create a diverse set of useful cross-device input. Our study with 14 participants indicated that on average it took a participant 10 seconds to create a cross-device input technique. In addition, Improv achieved 93.7% accuracy in interpreting user demonstration of a target UI behavior by looking at the raw input events from a single example.
In this paper, we describe Reprise--a design tool for specifying, generating, customizing and fitting adaptations onto existing household objects. Reprise allows users to express at a high level what type of action is applied to an object. Based on this high level specification, Reprise automatically generates adaptations. Users can use simple sliders to customize the adaptations to better suit their particular needs and preferences, such as increasing the tightness for gripping, enhancing torque for rotation, or making a larger base for stability. Finally, Reprise provides a toolkit of fastening methods and support structures for fitting the adaptations onto existing objects.
To address the increasing functionality (or information) overload of smartphones, prior research has explored a variety of methods to extend the input vocabulary of mobile devices. In particular, body tapping has been previously proposed as a technique that allows the user to quickly access a target functionality by simply tapping at a specific location of the body with a smartphone. Though compelling, prior work often fell short in enabling users' unconstrained tapping locations or behaviors. To address this problem, we developed a novel recognition method that combines both offline—before the system sees any user-defined gestures—and online learning to reliably recognize arbitrary, user-defined body tapping gestures, only using a smartphone's built-in sensors. Our experiment indicates that our method significantly outperforms baseline approaches in several usage conditions. In particular, provided only with a single sample per location, our accuracy is 30.8% over an SVM baseline and 24.8% over a template matching method. Based on these findings, we discuss how our approach can be generalized to other user-defined gesture problems.
One powerful aspect of 3D printing is its ability to extend, repair, or more generally modify everyday objects. However, nearly all existing work implicitly assumes that whole objects are to be printed from scratch. This paper presents a framework for 3D printing to augment existing objects that covers a wide range of attachment options. We illustrate the framework through three exemplar attachment techniques - print-over, print-toaffix and print-through, implemented in Encore, a design tool that supports a set of analysis metrics relating to viability, durability and usability that are visualized for the user to explore design options and tradeoffs. Encore also generates 3D models for production, addressing issues such as support jigs and contact geometry between the attached part and the original object. Our validation helps to illustrate the strengths and weaknesses of each technique. For example, we characterize how surface curvature and roughness affect print-over's strength compared to the conventional print-in-one-piece.
The emergence of smart devices (e.g., smart watches and smart eyewear) is redefining mobile interaction from the solo performance of a smart phone, to a symphony of multiple devices. In this paper, we present Duet - an interactive system that explores a design space of interactions between a smart phone and a smart watch. Based on the devices' spatial configurations, Duet coordinates their motion and touch input, and extends their visual and tactile output to one another. This transforms the watch into an active element that enhances a wide range of phone-based interactive tasks, and enables a new class of multi-device gestures and sensing techniques. A technical evaluation shows the accuracy of these gestures and sensing techniques, and a subjective study on Duet provides insights, observations, and guidance for future work.
We present Air+Touch, a new class of interactions that interweave touch events with in-air gestures, offering a unified input modality with expressiveness greater than each input modality alone. We demonstrate how air and touch are highly complementary: touch is used to designate targets and segment in-air gestures, while in-air gestures add expressivity to touch events. For example, a user can draw a circle in the air and tap to trigger a context menu, do a finger 'high jump' between two touches to select a region of text, or drag and in-air 'pigtail' to copy text to the clipboard. Through an observational study, we devised a basic taxonomy of Air+Touch interactions, based on whether the in-air component occurs before, between or after touches. To illustrate the potential of our approach, we built four applications that showcase seven exemplar Air+Touch interactions we created.
Ultra-small smart devices, such as smart watches, have become increasingly popular in recent years. Most of these devices rely on touch as the primary input modality, which makes tasks such as text entry increasingly difficult as the devices continue to shrink. In the sole pursuit of entry speed, the ultimate solution is a shorthand technique (e.g., Morse code) that sequences tokens of input (e.g., key, tap, swipe) into unique representations of each character. However, learning such techniques is hard, as it often resorts to rote memory. Our technique, Swipeboard, leverages our spatial memory of a QWERTY keyboard to learn, and eventually master a shorthand, eyes-free text entry method designed for ultra-small interfaces. Characters are entered with two swipes; the first swipe specifies the region where the character is located, and the second swipe specifies the character within that region. Our study showed that with less than two hours' training, Tested on a reduced word set, Swipeboard users achieved 19.58 words per minute (WPM), 15% faster than an existing baseline technique.
The space around the body provides a large interaction vol-ume that can allow for big interactions on small mobile de-vices. However, interaction techniques making use of this opportunity are underexplored, primarily focusing on dis-tributing information in the space around the body. We demonstrate three types of around-body interaction includ-ing canvas, modal and context-aware interactions in six demonstration applications. We also present a sensing solu-tion using standard smartphone hardware: a phone's front camera, accelerometer and inertia measurement units. Our solution allows a person to interact with a mobile device by holding and positioning it between a normal field of view and its vicinity around the body. By leveraging a user's proprioceptive sense, around-body Interaction opens a new input channel that enhances conventional interaction on a mobile device without requiring additional hardware.
Modern mobile devices rely on the screen as a primary input modality. Yet the small screen real-estate limits interaction possibilities, motivating researchers to explore alternate input techniques. Within this arena, our goal is to develop Body-Centric Interaction with Mobile Devices: a class of input techniques that allow a person to position and orient her mobile device to navigate and manipulate digital content anchored in the space on and around the body. To achieve this goal, we explore such interaction in a bottomup path of prototypes and implementations. From our experiences, as well as by examining related work, we discuss and present three recurring themes that characterize how these interactions can be realized. We illustrate how these themes can inform the design of Body-Centric Interactions by applying them to the design of a novel mobile browser application. Overall, we contribute a class of mobile input techniques where interactions are extended beyond the small screen, and are instead driven by a person's movement of the device on and around the body.
Portable paper calendars (i.e., day planners and organizers) have greatly influenced the design of group electronic calendars. Both use time units (hours/days/weeks/etc.) to organize visuals, with useful information (e.g., event types, locations, attendees) usually presented as - perhaps abbreviated or even hidden - text fields within those time units. The problem is that, for a group, this visual sorting of individual events into time buckets conveys only limited information about the social network of people. For example, people's whereabouts cannot be read 'at a glance' but require examining the text. Our goal is to explore an alternate visualization that can reflect and illustrate group members' calendar events. Our main idea is to display the group's calendar events as spatiotemporal activities occurring over a geographic space animated over time, all presented on a highly interactive public display. In particular, our SPALENDAR (SPAtial CALENDAR) design animates peoples' past, present and forthcoming movements between event locations as well as their static locations. Details of people's events, their movements and their locations are progressively revealed and controlled by the viewer's proximity to the display, their identity, and their gestural interactions with it, all of which are tracked by the public display.
