CLUNCH | Price Lab for Digital Humanities

Computational Linguistics and Lunch, 2016–2017

March 23, 2017 12pm-1:20pm; Levine 307 (3300 Walnut Street)

Predicting Emotional Word Ratings using Distributional Representations and Signed Clustering

João Sedoc (University of Pennsylvania)

Inferring the emotional content of words is important for text-based sentiment analysis, dialogue systems and psycholinguistics, but word ratings are expensive to collect at scale and across languages or domains. We develop a method that automatically extends word-level ratings to unrated words using signed clustering on vector space word representations.

We use our method to determine the valence and arousal of words, the two dimensions of the most widely used dimensional model of emotional state. Our method achieves superior out-of-sample word rating prediction on both affective dimensions in three different languages when compared to state-of-the-art word similarity based methods. Our method can assist in building word ratings for new languages and improve downstream tasks such as sentiment analysis and emotion detection.

March 21, 2017 12pm-1:20pm; Levine 307 (3300 Walnut Street)

Data Driven Dialog Management

Alborz Geramifard (Amazon)

Speech-based AI assistants such as Alexa and Google Now are becoming increasingly popular as a convenient way for people to interact with machines. However, users find interactions with their assistants more natural if conducted in a conversational manner, with multiple requests made and responses provided in a given dialog session. Creating robust dialog policies for conversational bots is challenging. This talk presents a data-driven approach for dialog management through reinforcement learning. We first introduce a framework for building conversational bots and describe MovieBot as an implementation of the framework that was launched as an Alexa skill. We then describe approaches to creating the reward function based on sentiment analysis on text, using various techniques including Long Short Term Memory networks (LSTMs). The talk will end by discussing potential directions, and how all pieces of the puzzle can fit together.

March 16, 2017 12pm-1:30pm; Levine 307 (3300 Walnut Street)

Applications of Social Media in Public Health: Progress, Possibilities and Challenges

Abeed Sarker (University of Pennsylvania)

Social media has emerged as a crucial resource for obtaining population-based signals for various health monitoring and surveillance tasks. There is an abundance of health-related knowledge hidden within social media data, and the volume is growing. However, despite the advances in NLP and machine learning, real-world health systems that utilize social media data are scarce. In this talk, I will present some of our recent research on social media mining for health-related applications. I will commence the talk by introducing my research group, the Health Language Processing (HLP) lab at the Department of Biostatistics, Epidemiology and Informatics. I will discuss our work on social media mining for pharmacovigilance, along with the challenges faced and findings from the last few years of research. I will also discuss future research possibilities in this and related domains.

March 2, 2017 12pm-1:30pm; Levine 512 (3300 Walnut Street)

Multidocument Summarization for Scientific Literature

Kokil Jaidka (University of Pennsylvania)

The large scale of scholarly publications poses a challenge for scholars in their search for relevant literature, as they are inundated with thousands of results. In the case of evidence-based medicinal research, the information overload problem could gravely impact their efficiency and critical decision-making abilities.

In the first part of this talk, I will present the crux of my thesis - a literature review framework for multidocument summarization of research papers. I will show how an interdisciplinary approach involving discourse analysis can bring about a new understanding of a well-known problem: how do researchers synthesize facts into a comprehensive literature review?

In the second part of my talk, I will provide an overview of the CL-SciSumm Shared Task, my personal effort to mobilize the summarization community around the scientific summarization task. The CL-SciSumm corpus, comprising over 500 annotated computational linguistics research papers interlinked through a citation network, is sponsored by Microsoft Research Asia and is the first medium-scale shared task on scientific document summarization in the computational linguistics domain. I will be presenting the key insights and takeaways from this summarization bakeoff.

February 23, 2017 12pm-1:30pm; Levine 307 (3300 Walnut Street)

From Dialogue Systems to Social Chatbots: Reinforcement Learning, Seq2Seq, and back again

Verena Rieser (Heriot-Watt)

Spoken Dialogue Systems have experienced a revolution over the past decade: They went from being completely hand-crafted to using data-driven methods, including techniques like Reinforcement Learning (RL) and Seq2Seq modelling. In this talk, I will review this current development, including my own work on optimising dialogue strategies using RL. I will then review recent Seq2Seq models for response generation (aka social chatbots). Finally, I will discuss the opportunities and pitfalls for future dialogue research.

February 16, 2017 12pm-1:30pm; Levine 307 (3300 Walnut Street)

Studying Teamwork and Cooperation in the Virtual Lab

Andrew Mao (Microsoft)

For decades, physical behavioral labs have been a primary, yet limited, method for controlled experimental studies of human behavior. Now, software-based "virtual labs" on the Internet allow for studies of increasing complexity, size, and scope. In this talk, I highlight the potential of virtual lab experiments for studying social interaction and coordination. First, we explore collective intelligence and digital teamwork in "crisis mapping", where digital volunteers organize to assess and pinpoint damage in the aftermath of humanitarian crises. By simulating a crisis mapping scenario, we study self-organization in teams of varying size and find a tradeoff between individual effort in small groups and collective coordination in larger teams. We also conduct a study of cooperation in a social dilemma over a month of real time, using crowdsourcing participants to overcome the time constraints of behavioral labs. Our study of about 100 participants over 20 consecutive weekdays finds that a group of resilient altruists sustains a high level of cooperation across the entire population. Together, our work motivates the potential of controlled, highly instrumented studies of social interaction; the importance of behavioral experiments on longer timescales; and how open-source software can both speed up the iteration and improve the reproducibility of experimental work.

February 9, 2017 12pm-1:30pm; Levine 307 (3300 Walnut Street)

Situated Intelligent Interactive Systems

Zhou Yu (Carnegie Mellon University)

Communication is an intricate dance, an ensemble of coordinated individual actions. Imagine a future where machines interact with us like humans, waking us up in the morning, navigating us to work, or discussing our daily schedules in a coordinated and natural manner. Current interactive systems being developed by Apple, Google, Microsoft, and Amazon attempt to reach this goal by combining a large set of single-task systems. But products like Siri, Google Now, Cortana and Echo still follow pre-specified agendas and cannot transition smoothly between tasks or track and adapt naturally to different users. My research draws on recent developments in speech and natural language processing, human-computer interaction, and machine learning to work towards the goal of developing situated intelligent interactive systems. These systems can coordinate with users to achieve effective and natural interactions. I have successfully applied the proposed concepts to various tasks, such as social conversation, job interview training and movie promotion. My team's proposal on engaging social conversation systems was selected to receive $100,000 from Amazon Inc. to compete in the Amazon Alexa Prize Challenge (https://developer.amazon.com/alexaprize).

January 26, 2017 12pm-1:30pm; Levine 512 (3300 Walnut Street)

Name Variation in Community Question Answering Systems

Anietie Andy (Howard University)

Community question answering (CQA) systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer to the most similar past resolved question (PARQ) from the site. Semantically similar questions can be worded differently, making it difficult to find questions that have shared needs. For example, "Who is the best player for the Reds?" and "Who is currently the biggest star at Manchester United?" have a shared need but are worded differently; also, Reds and Manchester United are both used to refer to the soccer team Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. We show that in these categories, entity linking can be used to identify relevant PARQ that share needs with a given question by disambiguating named entities and matching these questions based on the disambiguated entities, identified entities, and anchor phrases (textual phrases that potentially link to an entity in a knowledge base (KB)). We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Qgiven, [Qpast, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity-rich categories.

November 16, 2016 2pm-3:30pm; GSE 322 (3700 Walnut Street)

Networked Learning Analytics: Studying Learning at Scale

Srećko Joksimović (University of Edinburgh)

The development of the learning analytics research field was primarily driven by advances in educational technology and the emergence of large-scale data about students' learning, along with the willingness of educational institutions and corporations to make sense of such data. Nevertheless, although learning analytics has exhibited a noticeable potential for studying learning in various (e.g., formal and non-formal) educational settings, there is a need to make a critical reflection on the current state of learning analytics research and practice. Various researchers argue for the increased importance of empirical research that is theory driven, as opposed to exploratory analyses that could potentially lead towards misuse and misinterpretation of the factors that could contribute to learning. On the other hand, while it is important to rely on commonly used educational metrics to allow for generalizability across different settings, a holistic approach is needed to understand and interpret observed learning-related constructs and their association with learning, taking into account the specific educational context. This talk will give a brief overview of the existing empirical work on the development of next-generation MOOC research and lessons learned from studying learning in networked settings. This talk will then identify critical challenges that require immediate attention in order for the study of social learning at scale to make a sustainable impact on learning and teaching. The most important takeaway from the talk will be that

- computational aspects of the study of social learning need to be integrated deeply with theory, research and practice,
- novel methods for the study of critical dimensions (discourse, structure and dynamics) that shape social interactions are necessary, and
- innovative instructional approaches are essential to address the changing conditions created by contemporary educational and technological contexts.

November 9, 2016 12pm-1:30pm; Levine 307

The Role of Discourse Units in Near-Extractive Summarization

Jessy Li (University of Pennsylvania)

Although human-written summaries of documents tend to involve significant edits to the source text, most automated summarizers are extractive and select sentences verbatim. In this work we examine how elementary discourse units (EDUs) from Rhetorical Structure Theory can be used to extend extractive summarizers to produce a wider range of human-like summaries. Our analysis demonstrates that EDU segmentation is effective in preserving human-labeled summarization concepts within sentences and also aligns with near-extractive summaries constructed by news editors. Finally, we show that using EDUs as units of content selection instead of sentences leads to stronger summarization performance in near-extractive scenarios, especially under tight budgets.

October 12, 2016 12pm-1:30pm; Levine 307

Steps Towards a Neural Model for Text Simplification

João Sedoc (University of Pennsylvania)

Sentential text simplification requires real natural language understanding in order to be able to simplify both syntactic and grammatical structure while retaining the core meaning of the sentence. Most current approaches frame the problem as either machine translation or text summarization. The simplification of text is further complicated by the lack of large-scale corpora.

I will present our preliminary results and issues regarding training neural models. I will also present a new model, the neural dictionary machine, which uses PPDB to augment learning and decrease parameter size.

October 5, 2016 12pm-1:30pm; Levine 307

Fine-Grained IsA Extraction via Modifier Composition

Ellie Pavlik (University of Pennsylvania)

We present a method for populating fine-grained classes (e.g. American jazz composers) with instances (e.g. Charles Mingus). While state-of-the-art methods tend to treat class labels as single lexical units, the method we propose works by individually considering each of the modifiers in the class label (American, jazz) relative to the head (composers). On the task of reconstructing Wikipedia category pages, we demonstrate a 4x increase in coverage over a strong baseline which relies on widely-used lexical patterns for IsA extraction.

September 28, 2016 12pm-1:30pm; Levine 307

Exploring Statistical Paradigms for Unsupervised Morphology Learning

Hongzhi Xu (University of Pennsylvania)

Morphology analysis aims to segment words into morphemes automatically. It is a fundamental task of NLP, especially for morphologically rich languages such as Turkish and Finnish where one word can have hundreds of inflected forms and contain more than ten affixes. NLP technologies may suffer from a serious data sparseness problem in such languages. Morphology analysis is thus a crucial step before other downstream processes can be conducted.

Unsupervised morphology analysis has the merit that no annotated data is needed, which is a big advantage for low resource languages for which no sophisticated linguistic analyses even exist. In this talk, I will discuss the key challenges of unsupervised morphology learning and then describe a system that we designed for the DARPA LORELEI (Low Resource Languages for Emergent Incidents) project. In particular, we exploit the notion of statistical paradigms: paradigms are generated by a statistical model, and the reliable paradigms are then used to prune the unreliable ones, thus excluding spurious morphological segmentations.

September 21, 2016 12pm-1:30pm; Levine 307

Automatic Characterization of Vowel Harmony

Jordan Kodner and Spencer Caplan (2nd Year PhD Students, Department of Linguistics, University of Pennsylvania)

Vowel harmony is a pattern of vowel alternation observed in many phonological systems. Vowels are partitioned into sets such that their linguistic features spread to neighboring vowels. While vowel harmony is well accounted for in modern phonological theory, its complex surface realization makes its learnability difficult to explain. Children as young as several months accurately acquire these diverse systems given only opaque input stimuli.

In this talk we present a computational model of vowel harmony acquisition. Our cognitive model is built on simple, psychologically motivated building blocks and accounts for the mechanism by which children identify both whether or not an input language contains productive vowel harmony and the specific realization of that harmony process.

September 14, 2016 12pm-1:30pm; Levine 307

The Relationship between Engagement and Success in Online Learning in Massive Online Open Courses

Ryan Baker (Associate Professor, Graduate School of Education at the University of Pennsylvania; Director of the Penn Center for Learning Analytics)

In this talk, I'll discuss my group's work to study the relationship between engagement and success in online learning in massive online open courses (MOOCs). We look not only at the now-standard metric of course completion but also at participation in the community of practice after completing the course. We will examine variables related to discussion forum participation and interaction with both other students and course materials as factors predictive of student achievement.

Computational Linguistics at Penn

Subscribe to the Computational Linguistics Lunch Mailing List