Michael J. Price Lab for Digital Humanities

CLUNCH

Computational Linguistics and Lunch, 2015–2016

May 31, 2016, 10am, Levine 307

Exploring a Rich Design Space for Text Summarization

Giuseppe Carenini (Associate Professor in Computer Science, University of British Columbia)

Text summarization is a critical task for dealing with the ever-increasing amount of information available online. There is abundant research in this area; however, because the inputs and outputs of the summarization process can vary substantially, it is often difficult to compare and integrate the existing methods. To tackle this challenge, in this talk I will introduce a set of dimensions to characterize a rich design space for summarization systems. I will then present two summarization frameworks we have been working on, one for evaluative text (e.g., customer reviews) and one for conversations (e.g., discussion forums), and discuss how they fit into the proposed design space.


May 5, 2016, 12:00 - 1:30pm, Levine 307

The Fist of the Sender: How Much Does Typing Reveal About Us?

Andrew Rosenberg (IBM Watson)

When we type, we engage in a complicated series of actions, beginning with cognition, progressing through sentence planning, and ending with the physiological mechanics of our fingers striking keys. In this talk, I will present results investigating what qualities of people and of our writing processes are revealed through our typing.

To some degree, how we type is unique to who we are.  Through examination of keystroke dynamics, passwords can be made more reliable.  Moreover, this talk will describe the use of typing as a continuous security measure, where the identity of a typist is regularly verified via his or her typing behavior.  While this is not as reliable as, say, a fingerprint, analysis of typing data is a very reliable authentication biometric.

Beyond uniquely identifying a typist, we also investigate which (if any) demographic qualities are revealed through a person's typing. This talk will present work recognizing whether a typist is left- or right-handed, male or female, and whether he or she is a native speaker of English.

Finally, we investigate qualities of the typing process that may reveal aspects of the cognition and mechanics of typing. In this work, we attempt to recognize what kind of cognitive process a typist is engaged in. I will also discuss typing dynamics in and around revisions and multi-word expressions (e.g., compound nouns, compound verbs, idioms) insofar as this may be informative to self-editing and lexical retrieval processes.


April 28, 2016, 12:00 - 1:30pm, Levine 307

Studying the Temporal Dynamics of Word Co-Occurrences
and
Stylistic Variation with Age and Income on Twitter

Daniel Preotiuc (Penn)

Studying the Temporal Dynamics of Word Co-Occurrences
Joint work with: Srijith PK, Mark Hepple and Trevor Cohn
This talk will analyze the volatility of word co-occurrences over time
using social media data. Our hypothesis is that co-occurrences between
words will change as a response to real world events. We develop a spectral
clustering approach to find groups of mutually informative terms occurring
in discrete time frames. Experiments on large datasets of tweets show that
these groups identify key real world events as they unfold, despite no
explicit supervision. The performance of our method rivals state-of-the-art
methods for event detection on F-score, obtaining higher recall at the
expense of precision.

Stylistic Variation with Age and Income on Twitter
Joint work with: Lucie Flekova and Lyle Ungar
In this talk, we will explore the relationship between stylistic and syntactic features and authors' age and income. Our hypothesis is that, for numerous feature types, writing style is predictive of income even beyond age. We analyze the predictive power of these features in a regression setup and, additionally, study temporal variations in the writing style of users from distinct income groups.


April 21, 2016, 12:00 - 1:30pm, Greenberg Lounge (Skirkanich 114)

Domain Adaptation using Language Difference Models 

Amittai Axelrod (University of Maryland)

Data selection uses a small set of domain-relevant data to select additional training items from a much larger, out-of-domain dataset. The goal is to filter Big Data down to Good Data: finding the best, most relevant data to use to train a much smaller model for a particular task. Data selection is common both in targeted natural language processing applications and in settings where model cost and size are concerns.

In statistical machine translation and language modeling, data selection usually results in building a system on only a fraction of the training data. Such a system incurs only a fraction of the computational training cost, and makes it easy to deploy mobile-sized and realtime systems.  

In this talk, I first describe the Moore-Lewis cross-entropy difference algorithm and an extension of mine that has emerged as the standard method of data selection for MT. I then present a new understanding of why this technique actually works, and how it leads to a new discriminative method for data selection and quantifying  textual relevance in general.  The result is a language difference model, a new representation for modeling the similarities and differences between two corpora.  

One advantage is that off-the-shelf tools can be used to produce downstream domain-adapted language models with better lexical coverage and lower perplexity than the standard. Furthermore, these improvements carry over to MT systems, despite the fact that the data selection process abstracts away entirely from the identities of individual words. Additionally, language difference models allow the  entire scoring and selection process to run in constant memory, lowering the barriers to fine-grained personalized systems. Lunch will be served at 11:45 in Greenberg Lounge.
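
As an illustration of the cross-entropy difference idea named in the abstract above, here is a minimal sketch, not the speaker's implementation. It assumes add-one-smoothed unigram language models over a shared vocabulary and trains the general-domain model on the whole pool; published systems typically use n-gram models and a sample of the pool. All function names are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences, vocab):
    """Add-one-smoothed unigram log2-probabilities over a shared vocabulary (an assumption)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    return {w: math.log2((counts[w] + 1) / (total + len(vocab))) for w in vocab}

def cross_entropy(logprob, sentence):
    """Per-token cross-entropy, in bits, of a sentence under a unigram model."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return -sum(logprob[t] for t in tokens) / len(tokens)

def rank_by_cross_entropy_difference(in_domain, general_pool):
    """Sort pool sentences by H_in(s) - H_pool(s); lower means more in-domain-like."""
    vocab = {tok for s in in_domain + general_pool for tok in s.split()}
    lm_in = train_unigram(in_domain, vocab)
    lm_pool = train_unigram(general_pool, vocab)
    scored = [
        (cross_entropy(lm_in, s) - cross_entropy(lm_pool, s), s)
        for s in general_pool
    ]
    return sorted(scored)

# Toy data, invented for illustration only.
in_domain = ["the patient received a dose", "the dose was reduced"]
pool = ["the patient was given a dose", "stocks fell on friday", "the market fell"]
ranked = rank_by_cross_entropy_difference(in_domain, pool)
for score, sentence in ranked:
    print(f"{score:+.3f}  {sentence}")
```

Selecting the top-ranked sentences up to some threshold then gives the reduced training set; how that threshold is chosen is left open here.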


April 14, 2016, 12:00 - 1:30pm, Levine 307

Clustering Paraphrases by Word Sense 

Anne Cocos (Penn)
Automatically generated databases of English paraphrases have the drawback that they return a single list of paraphrases for an input word or phrase. This means that all senses of polysemous words are grouped together, unlike WordNet which partitions different senses into separate synsets. In this talk I will present methods for clustering paraphrases by word sense, as applied to the Paraphrase Database (PPDB). I will discuss the performance of hierarchical and spectral clustering algorithms, and how they vary in response to changes in the way we define the similarity matrix used as input. These methods produce sense clusters that are qualitatively and quantitatively good, and represent a substantial improvement to the PPDB resource.


April 7, 2016, 12:00 - 1:30pm, Levine 307

Corpus-Driven Approaches to Generation Bottlenecks

Or Biran (Columbia University)
Most Natural Language Generation systems today are limited to a particular and usually fairly small subject domain. Generating text for a new domain means handling new entities and relations, creating new types of messages about them, adopting a new document style, and learning how to lexicalize the new messages. We call these the "bottlenecks" of NLG because they often have to be handled manually in a new system or for a new domain. Where they can be done automatically, it typically requires a rare resource such as an aligned corpus.

In this talk I will discuss two fully automated corpus-driven approaches that aim to solve two of these bottlenecks. First, our approach to discourse planning using a sequential model of discourse relations: we show that local discourse coherence as guided by a statistical model extracted from a discourse-annotated corpus (including an automatically annotated corpus) makes a significant difference in human preference for a generated text, and that the genre of the annotated corpus makes a difference as well. Second, I will describe our approach to paraphrasal template mining from an unaligned corpus, which can help with realization in different domains. The approach utilizes a combination of distributional and taxonomy-driven semantics to allow the extraction of paraphrasal templates from sentences in the corpus that are not in themselves paraphrases. Lunch will be served at 11:45 am in Levine 512.


March 24, 2016, 12:00 - 1:30pm, Levine 512

Natural Language Semantics using Probabilistic Logic

Islam Beltagy (University of Texas at Austin)
Being able to automatically read and understand natural language is useful because the web has a huge amount of textual knowledge that we would like to acquire automatically, and because it enables seamless interaction with human users. Current language understanding systems rely mostly on shallow representations of text, like bag of words or syntactic structure, which do not work adequately for tasks that require deeper semantic understanding, like question answering.

In this talk, I will present a framework for semantic understanding that integrates logical representation of meaning with distributional information in order to overcome the brittleness of logical approaches, and that uses probabilistic logic for inference. We applied this framework to textual similarity and textual entailment (achieving one state-of-the-art result) and are currently working on applying it to open-domain question answering. I will discuss how we learn a task-specific knowledge base, how to do efficient and scalable probabilistic logic inference, and how to do inference with sentences that have complex forms of quantification. Lunch will be served at 11:45 am in Levine 512.


March 17, 2016, 12:00 - 1:30pm, Levine 307

What NLP Can Teach Us about Contemporary American Politics

Dan Hopkins (Penn)
In this talk, I will employ topic models and other NLP tools to demonstrate a few consequential shifts in American politics in recent years. First, by analyzing more than 1,000 state party platforms, we can see the ways in which American politics has nationalized and polarized in recent decades. State party platforms are less distinctive--and more in sync with the national party platforms--than in previous eras. Second, analyzing press releases, open-ended survey responses, and other content from recent debates over the Affordable Care Act provides new insights into the limited capacity of elite rhetoric to shape public opinion. While there was much discussion of "death panels" and the role of elite rhetoric in shaping public views, analyses of open-ended survey responses show that there was little uptake of elite rhetoric as the debate unfolded. Lunch will be served at 11:45 am in Levine 512.


March 10, 2016, 12:00 - 1:30pm, Levine 307

Discourse Relations between Conjoined VPs: A New Corpus

Bonnie Webber (University of Edinburgh)
As frequently noted, discourse relations can hold within a sentence as well as between them. Interest has grown in the former 'intra-sentential' discourse relations [Joty et al, 2015] -- for example, for Statistical Machine Translation [Guzman et al, 2014]. We have therefore started to expand the density of intra-sentential discourse relations annotated in the Penn Discourse TreeBank [Prasad et al., 2014].

We know from English grammar [Huddleston & Pullum, 2002] that conjoined VPs can relate in ways other than simply Conjunction ('and'), Disjunction ('or'), and Contrast ('but'). The senses mentioned by Huddleston & Pullum include, inter alia, Temporal Sequence (where 'X and Y' means 'X and then Y'), Result (where 'X and Y' means 'X and therefore Y'), Condition (where 'X and Y' means 'if X then Y'), etc. These are among the kinds of discourse relations that can also hold between larger units of text and that are annotated in the Penn Discourse TreeBank. We took this as justification for annotating discourse relations associated with the many conjoined VPs in the Penn TreeBank.

To date, we have manually annotated over 4500 conjoined VPs -- their spans and the one or more senses taken to hold between them.  There is still some work to do, as some of the tokens that an automated parser might analyse as a conjoined VP appear in the Penn TreeBank as conjoined  zero-subject S-nodes, and we want to include them as well.

Since we plan to make this corpus available in Spring 2016 to anyone with access to the Penn TreeBank, I want to describe the corpus, including how tokens were identified and annotated, the repertoire of senses used in annotating them, and the distribution of senses over tokens. This will be the first corpus of its kind, since only a subset of conjoined clauses were annotated in the RST-corpus [Carlson et al, 2003] and the only sense associated with 'and'-conjoined  clauses is simply List. Neither VP nor clausal conjunction is annotated in PropBank. Lunch will be served at 11:45 am in Levine 512.


March 3, 2016, 12:00 - 1:30pm, Levine 307

A Graph Based Approach for Multimodal Question Answering

Yusuf Osmanlioglu (Drexel University)
In this talk, I will present our ongoing research on a graph-matching-based visual question answering system. I will first provide background on metric embedding of graphs into hierarchical trees, representing graph matching as an instance of the quadratic assignment (and also metric labeling) problem, and primal-dual approximation algorithms. Then, I will go through the vision and NLP components of the system separately. For each part, I will discuss the graph representation of data, formulation of the matching problem as a linear program, and a primal-dual approximation algorithm to speed up the process. I will conclude by presenting preliminary results. Lunch will be served at 11:45 am in Levine 512.


February 25, 2016, 12:00 - 1:30pm, Levine 512

Spectral Clustering and Multi-Scale Hidden Markov Models 

João Sedoc (Penn)
In this talk I will give a brief overview of signed spectral clustering (SSC), in particular its application to the identification of semantically similar words (or lexical units) using word embeddings and thesauri. Then I will discuss new extensions which involve real-valued side knowledge and probabilistic clustering. I will show initial results of how knowledge bases can be used to extend a lexicon.

The second part of this talk will focus on our ongoing research in multi-scale HMMs for sentence embedding. I will present our current methods and preliminary results.


February 18, 2016, 12:00 - 1:30pm, Levine 512

nlpgrid Resources at Penn 

Dan Widyono (Penn)
In this Q&A talk Dan will explain the NLP Grid resource, as a refresher for those who came last fall, and as a chance to learn about shared computing for others.


February 11, 2016, 12:00 - 1:30pm, Levine 512

Unsupervised Morphological Learning with EM and MDL

Hongzhi Xu (Penn)
Unsupervised morphological analysis is an important task from both linguistic and computational linguistic perspectives. This talk is about ongoing joint work with Mitch Marcus, Lyle Ungar and Charles Yang. In detail, I will talk about utilizing low resources, typically a list of word types without frequencies, to learn the morphological structure of a language. We combined the EM learning algorithm with MDL to learn a Bayesian model. I will show that MDL, formulated appropriately, is an effective way to prevent EM from overfitting the observed data. I will also show experiments on English demonstrating that a single list of word types can actually provide enough information about its morphological structure.


February 4, 2016, 3:00 - 4:30pm, Levine 101

2016 CIS Spring Colloquium Series
Constraints Driven Learning and Inference for Natural Language Understanding

Dan Roth (University of Illinois)
Machine Learning and Inference methods have become ubiquitous and have had a broad impact on a range of scientific advances and technologies and on our ability to make sense of large amounts of data. I will describe some of our research on developing learning and inference methods in pursuit of natural language understanding. This challenge often involves assigning values to sets of interdependent variables and thus frequently necessitates performing global inference that accounts for these interdependencies. I will focus on algorithms for training these global models using indirect supervision signals. Learning models for these structured tasks is difficult partly because generating supervision signals is costly. We show that it is often easy to obtain a related indirect supervision signal, and discuss algorithmic implications as well as options for deriving this supervision signal, including inducing it from the world's response to the model's actions.

A lot of this work is done within the unified computational framework of Constrained Conditional Models (CCMs), an Integer Linear Programming formulation that augments statistically learned models with declarative constraints as a way to support learning and reasoning.  Within this framework, I will discuss old and new results pertaining to learning and inference and how they are used to push forward our ability to understand natural language.


February 3, 2016, 11:00am - 12:30pm, Levine 512

Shortcomings of Leading Word Embeddings and their Pattern-based Solutions

Roy Schwartz (Hebrew University)
Word embeddings (e.g., word2vec, GloVe) have become increasingly popular in recent years, obtaining state-of-the-art results on numerous semantic tasks. In this talk I will show that in fact, these embeddings are limited in their ability to capture various types of semantic information. I will then present several pattern-based methods (e.g., "X and Y") that greatly alleviate some of these problems, focusing on two vector space models that replace bag-of-words contexts with pattern contexts. I will show that these models obtain substantial improvements on word similarity tasks, most notably a 20% improvement on a verb similarity task. Joint work with Roi Reichart and Ari Rappoport.
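
To make the contrast with bag-of-words contexts concrete, the following is a small sketch of collecting pattern contexts from the "X and Y" pattern quoted in the abstract. It is an illustration under assumptions, not the models presented in the talk: the regular expression, the choice to count the pattern symmetrically, and the name `pattern_contexts` are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# "X and Y" with Y captured inside a lookahead, so that chains such as
# "a and b and c" yield both (a, b) and (b, c).
X_AND_Y = re.compile(r"\b(\w+) and (?=(\w+)\b)")

def pattern_contexts(sentences):
    """Map each word to a Counter of the words it co-occurs with in 'X and Y'."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        for x, y in X_AND_Y.findall(sentence.lower()):
            # Assumption: the pattern is treated as symmetric, so each word
            # becomes a context of the other.
            contexts[x][y] += 1
            contexts[y][x] += 1
    return contexts

# Toy sentences, invented for illustration only.
contexts = pattern_contexts([
    "Cats and dogs play",
    "dogs and cats sleep",
    "salt and pepper",
])
print(dict(contexts["cats"]))
```

The resulting count vectors could stand in for window-based co-occurrence counts in a vector space model; any weighting or dimensionality reduction on top of them is left out of this sketch.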


January 28, 2016, 12:00 - 1:30pm, Greenberg Lounge, Skirkanich 114

Compositional Entailment in Adjective-Nouns

Ellie Pavlick (Penn)
A necessary component of natural language understanding is the ability to perform inference over natural language utterances. Determining whether one sentence can be reasonably assumed to entail another is a complicated task, requiring a combination of logical deduction and simple common sense. In this work, we look at the problem of adjective-noun composition in the task of recognizing textual entailment (RTE). Conventional wisdom in RTE says that adding adjectives to nouns does not preserve entailment (e.g., we cannot in general conclude that "dog" entails "little dog"). In this work, we perform an empirical analysis of adjective-noun composition in human language. Our study reveals that the composition of an adjective with a noun may produce a variety of entailment relationships, including equivalence ("boy" is often the same as "little boy") and contradiction ("sense" is rarely the same as "little sense"). The nature of the entailment tends to be highly context dependent, making it an especially challenging phenomenon for current NLP technologies. We introduce a new dataset for testing systems' ability to model compositional entailment in context, and present preliminary results in which we benchmark several state-of-the-art RTE systems on this dataset. This is work in progress, and the hope is to have an interactive presentation, so we can get feedback from the CLUNCH crowd. Lunch will be served at 11:45 am in Raisler Lounge.


January 21, 2016, 12:00 - 1:30pm, Raisler Lounge

Contextually Informed Sentence Specificity Annotation and an Analysis of the Instantiation Relation

Jessy Li (Penn)
Previously we presented our state-of-the-art system that predicts sentence specificity using co-training, with the Instantiation relation in the PDTB as seed. In this talk, I will first present our continuing work in sentence specificity analysis and discuss our guidelines and pilot corpus for the annotation of contextually informed sentence specificity, which enables the joint analysis of the degree, location and type of underspecification in text. The second part of the talk will focus on our systematic analysis of the characteristics of the Instantiation relation, which, past work has suggested, plays special roles beyond specificity in local coherence, in sentiment expression and in content selection in summarization. We show that Instantiation-specific characteristics provide substantial predictive power that improves the identification of the relation.


January 13, 2016, 1:30pm - 3:00pm, Levine 307

Searchable Television as Text

Robin Pemantle & Diana Mutz (Penn)
We discuss a data set owned by Penn consisting of closed caption transcripts of TV shows from 10/2012 through the present.  The point is not so much to present research findings as to invite colleagues to share in the use of this resource.  This talk elaborates on a 5-minute Warren Center flash-talk this past October 13.


December 16, 2015, 1:30 - 3:00pm, Levine 307

Clustering of Unsigned and Signed Graphs

Jean Gallier (Penn)
Graph clustering using the method of normalized graph cuts is nowadays a standard technique in computer vision and other domains, owing to the seminal work of Shi and Malik. This method was extended to multiple clusters (K > 2) by Yu and Shi.

This talk will focus on Yu and Shi's method, with clarification of certain details. The method proceeds in two stages. During the first stage, the clustering problem is formulated as an optimization problem which is translated into matrix form using a graph Laplacian. This problem is NP-hard, so a relaxed continuous version is considered; I will show that this reduces to an eigenvalue problem. The second stage consists of obtaining a discrete solution which approximates the continuous solution.

I will describe how to set up and solve stage 1 of the clustering problem. A way to deal with signed graphs (allowing negative weights) was proposed by Kunegis et al for two clusters. I will show that it is actually possible to extend the method of normalized cuts for signed graphs to  any number of clusters. Lunch will be served at 1:00 pm in Levine 307.
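
The two stages described above can be sketched for the simplest case of two clusters on an unsigned graph. This is an illustration under assumptions rather than the method presented in the talk: it follows the two-way relaxation in the style of Shi and Malik (not Yu and Shi's K-way version or the signed extension), it assumes a symmetric non-negative weight matrix with no isolated nodes, it finds the eigenvector by power iteration purely to stay dependency-free, and it rounds by sign. The function name is hypothetical.

```python
import math

def normalized_cut_bipartition(W, iterations=500):
    """Split a weighted graph in two via the relaxed normalized cut.

    W is a symmetric list-of-lists of non-negative weights; every node is
    assumed to have positive degree.
    """
    n = len(W)
    degree = [sum(row) for row in W]
    sqrt_d = [math.sqrt(d) for d in degree]
    # M = D^{-1/2} W D^{-1/2}. Its top eigenvector is D^{1/2} 1 (eigenvalue 1);
    # its second eigenvector z gives the relaxed solution y = D^{-1/2} z of
    # the generalized problem (D - W) y = lambda D y.
    M = [[W[i][j] / (sqrt_d[i] * sqrt_d[j]) for j in range(n)] for i in range(n)]
    norm_u = math.sqrt(sum(degree))
    u = [s / norm_u for s in sqrt_d]  # trivial eigenvector, unit length

    z = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        # Stage 1 (continuous): power iteration on (I + M) / 2, restricted
        # to the subspace orthogonal to the trivial eigenvector.
        dot = sum(zi * ui for zi, ui in zip(z, u))
        z = [zi - dot * ui for zi, ui in zip(z, u)]
        z = [0.5 * (z[i] + sum(M[i][j] * z[j] for j in range(n))) for i in range(n)]
        norm = math.sqrt(sum(zi * zi for zi in z))
        z = [zi / norm for zi in z]

    # Stage 2 (discrete): round the continuous solution by sign.
    y = [z[i] / sqrt_d[i] for i in range(n)]
    return [1 if yi >= 0 else 0 for yi in y]

# Two triangles joined by one weak edge (toy graph, invented for illustration).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = normalized_cut_bipartition(W)
print(labels)
```

Which of the two groups receives label 1 depends on the sign the iteration converges to, so only the grouping itself is meaningful.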


December 9, 2015, 1:30pm - 3:00pm, Levine 307

Verb Knowledge Bases for the Never Ending Language Learner (NELL)

Derry Wijaya (Carnegie Mellon University)
A verb is the organizational core of a sentence; it expresses relations between the noun phrases that are its arguments in the sentence. Understanding the meaning of the verb is therefore key to understanding the meaning of the sentence. Natural language understanding is the problem of mapping natural language text to its meaning representation: entities and their relations in the real world. NELL, our never ending language learner, has amassed millions of facts about entities and their relations in the world, compiling them in a knowledge base of noun phrases. For each noun phrase in its knowledge base, NELL knows the entity(s) it is referring to, its category(s), and its relations to other entities in the knowledge base. NELL knows a lot about noun phrases but has yet to learn about verbs. The work I am presenting is on building a knowledge base of verbs that will complement NELL's knowledge base of noun phrases. Since verbs express relations over noun phrases in text, a knowledge base of verbs can facilitate natural language understanding by mapping verbs in text to relations over entities referred to by the noun phrases. In this work, we study the automatic construction of a knowledge base of verbs that contains important semantics for natural language understanding. The semantics are (1) the mapping of verbs to relations in a knowledge base, and (2) the mapping of verbs to events that update (initiate or terminate) relations in a knowledge base. The mapping of verbs to relations in knowledge bases such as NELL, YAGO, or Freebase can provide a direct link between the knowledge in text and the background knowledge about the world in the knowledge bases, enabling inferences over the world knowledge to better understand the text. The mapping of verbs to events that update relations in the knowledge base can be useful for temporal scoping of relations in the knowledge base or for updating the knowledge base as events unfold.
Since NELL has knowledge bases of noun phrases in a growing number of languages (currently English, Portuguese, and Spanish), we will also present this automatic construction of a knowledge base of verbs in multiple languages.


December 3, 2015, 3:00 - 4:30pm, Wu & Chen Auditorium, 101 Levine Hall

Measuring Unconscious Bias

Dr. Kieran Snyder (CEO, Textio)
How do you measure unconscious bias? This talk will feature Kieran Snyder of Textio, who will walk through a variety of quantitative, statistical research on gender, language, and technology, including how often men and women interrupt each other in technology workplace settings; gendered language in performance reviews and job listings; and systematic differences in how men and women with similar backgrounds choose to present themselves in resumes. We'll also look at the powerful and increasing role that software plays in addressing unconscious bias in the workplace, with a specific focus on machine learning solutions that show measurable results. We'll take a particular look at how Textio is using machine learning and natural language processing to address unconscious bias in job listings, and briefly look at software that covers other aspects of employee development and gender in technology.


December 2, 2015, 1:30 - 3:00pm, Levine 307

Word Embeddings

Bryan Perozzi (Stony Brook University)
In this talk, I will present an overview of recent work with word embeddings. I’ll start with our work using word embeddings for multilingual NLP tasks like part-of-speech tagging (CoNLL'13) and named entity recognition (SDM'15). Next I’ll present our work on detecting statistically significant linguistic shifts through the dynamics of these embedding spaces (WWW'15). If time permits, I’ll finish up with some results from our current work on using embedding spaces to discover the geographic variation of language.


November 18, 2015, 1:30 - 3:00pm, Levine 307

ACCAMS: Additive Co-Clustering to Approximate Matrices Succinctly

Alex Beutel (Carnegie Mellon University)
As we rely on web services in more aspects of our everyday lives, modeling user actions and interactions has become increasingly valuable.  Research in matrix completion and approximation has enabled mining of user preferences and personalized recommendation, with recent developments in low-rank factorization offering improved accuracy.  While accuracy is important, users are more likely to follow recommendations if they are accompanied by an explanation. Thus, having simple, interpretable models is valuable. In this talk I will describe the challenges with current matrix completion techniques and offer a drastically different approach to the problem.  Instead of using low-rank factorization, we build on the insight that an additive model of co-clusterings allows one to approximate matrices efficiently. This allows us to build a concise model that, per bit of model learned, significantly beats all factorization approaches in matrix completion.

Following Occam's razor principle, the fact that our model is more concise and yet just as accurate as more complex models suggests that it better captures the latent preferences and decision making processes present in the real world. Our model, ACCAMS, achieves state-of-the-art accuracy for matrix completion on Netflix at a fraction of the model complexity. Looking forward, I will also discuss new, preliminary work using additive co-clustering to model text reviews and thus directly offer explanations of our model and its recommendations.

This is joint work with Alex Smola, Amr Ahmed, and Chao-Yuan Wu, originally published at WWW 2015.


November 11, 2015, 1:30 - 3:00pm, Levine 307

Towards a Formal Understanding of Sentiment Annotation in Sentiment Lexicons

Eduard Dragut (Temple University)

Sentiment analysis (SA) aims to detect and analyze writers' opinions, sentiments and attitudes, which are often deeply embedded in a text and not overtly expressed. Central to many SA algorithms are Sentiment Lexicons (SLs), where words are tagged as indicating positive, negative or neutral sentiment. Previous analyses have revealed substantial disagreements across different SLs, where a given word may be tagged as positive in one and as negative in another. SLs also exhibit substantial internal inaccuracies. In this talk, we will present a general framework for addressing the polarity consistency problem using linear programming (LP) theory. LP tools allow us to determine whether a given polarity assignment (i.e., a sentiment lexicon) is consistent and, in those cases when it is not consistent, to uncover polarity inconsistencies. We show that the complexity of the polarity consistency problem is dependent on the sentiment annotation scheme. For example, the problem is NP-hard when the polarity tags are categorical labels: positive, negative or neutral; and it is polynomial when polarity tags take the form of a continuous distribution in the interval [0, 1]. We present an experimental study which applies our framework to a number of publicly available sentiment lexicons.


November 6, 2015, 12:00 - 1:30pm, IRCS Conference Room

Parallel Meaning Bank

IRCS Noon Colloquium Series

Johan Bos (University of Groningen)
In this talk I will introduce semantic annotation based on Combinatory Categorial Grammar and Discourse Representation Theory. This will be illustrated with the Groningen Meaning Bank, a large corpus of English texts annotated with formal meaning representations, and a new multi-lingual resource in the making, the Parallel Meaning Bank. I will talk about the language technology used, the role of gamification and crowdsourcing, and formal considerations.


November 4, 2015, 1:30 - 3:00pm, Levine 307

Document Intelligence in the Real World

Alexy Khrabrov & Max Cantor (Nitro)
Knowledge workers spend their lives working in documents. As businesses have grown complex, individuals have had to adapt to ever-increasing complexity and volume in the document workflows they manage.

Current best of breed approaches to addressing the avalanche of document workflows have concentrated on large scale, highly customized enterprise deployments. Unfortunately, this leaves aside a huge number of individuals at smaller firms, while for the most part aggravating their enterprise end users as custom, enterprise solutions often do.

Nitro is leveraging machine learning and natural language processing techniques to deliver democratized, user- and business-friendly solutions to the workflow avalanche. Nitro has the best PDF-centric expertise in the world and we are merging this with our ML/NLP work in our Scala/Spark environment. In this talk Alexy Khrabrov, Nitro's Chief Scientist (SEAS Ph.D. '11), and Max Cantor, Director of Research Engineering (SEAS '03), will talk about document intelligence in the real world.


October 28, 2015, 1:30 - 3:00pm, Levine 307

Phrasal Translation Generation for Low-Resource, Agglutinative Languages

John Hewitt (Penn)
Statistical Machine Translation of low-resource languages suffers from data scarcity that is compounded when these same languages are agglutinative. Agglutinative languages' property of modifying nouns' and verbs' meanings through affixes decreases the probability of seeing any single type in training data (and thus increases the OOV rate). Further, highly inflected forms are unlikely to be word-aligned to the set of English words that properly expresses their meanings. We present, as a work in progress, a translation generator which uses morphological annotation to generate English phrasal translations for inflected foreign words. Our system generates sets of English translation candidate phrases, and corresponding probability distributions, for millions of inflections in over 100 languages. Using no bitext, we are able to generate k-length English phrasal translations for 1-length source inflections. These meaning-rich inflections encode up to 23 dimensions of meaning, like aspect, case, and tense, in affixes. Early tests show promising accuracy in expressing linguistic features, and modest gains in end-to-end MT for Turkish-English. In completing our work, we will expand our end-to-end MT testing to tens of languages, as well as test our system's efficacy in improving word alignment between agglutinative languages and English.


October 9, 2015, 12:00 - 1:30pm, IRCS Conference Room

A Distributional Theory of Content for NLP

IRCS Noon Colloquium Series

Mark Steedman (University of Edinburgh)
Linguists and computational linguists have come up with some quite useful theories of the semantics of function words and the corresponding logical operators such as generalized quantifiers and negation (Woods 1968; Montague, 1973; as adapted by Steedman 2012). There has been much less progress in defining a usable semantics for content words. The effects of this deficiency are very bad: linguists find themselves in the embarrassing position of saying that the meaning of "seek" is seek. Computationalists find that their wide coverage parsers, which are now fast and robust enough to parse billions of words of web text, have very low recall as question answerers because, while the answers to questions like "Who wrote 'What Makes Sammy Run?'" are out there on the web, they are not stated in the form suggested by the question, "Budd Schulberg wrote 'What Makes Sammy Run?'" but in some other form that paraphrases or entails the answer, such as, "Budd Schulberg's 'What Makes Sammy Run?'". Semantics as we know it is not provided in a form that supports practical inference over the variety of expression we see in real text. I'll discuss recent work with Mike Lewis which seeks to define a novel form of semantics for content words using semi-supervised machine learning methods over unlabeled text. True paraphrases are represented by the same semantic constant. Common-sense entailment is represented directly in the lexicon, rather than delegated to meaning postulates and theorem-proving. The method can be applied cross-linguistically, in support of machine translation. I'll discuss extensions of the method to extract an aspect-based semantics for temporal entailment, and speculate concerning the relation of this representation of content to the hidden prelinguistic language of mind that must underlie all natural language semantics, but which has so far proved resistant to discovery.


October 7, 2015, 1:30 - 3:00pm, Levine 307

Linear Algebraic Structure of Word Meanings

Sanjeev Arora (Princeton University)
In Natural Language Processing (NLP), semantic word embeddings use vectors to capture the meanings of words. They are often constructed using nonlinear/nonconvex techniques such as deep nets and energy-based models. Recently Mikolov et al (2013) showed that such embeddings exhibit linear structure that can be used to solve "word analogy tasks" such as *man*: *woman* :: *king*: *??*.  Subsequently, Levy and Goldberg (2014) and Pennington et al (2014) tried to explain why such linear structure should arise in embeddings derived from nonlinear methods.
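As a rough illustration of the linear structure mentioned above, the following Python sketch solves the analogy by vector offset: it looks for the word whose vector is closest, by cosine similarity, to *woman* - *man* + *king*. The embeddings are hypothetical, hand-picked toy vectors (not learned from data), and the names `EMBEDDINGS`, `cosine`, and `solve_analogy` are assumptions made for this example, not part of any of the cited works.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only (not learned).
# The three dimensions loosely stand for (royalty, maleness, femaleness).
EMBEDDINGS = {
    "man":   (0.0, 1.0, 0.0),
    "woman": (0.0, 0.0, 1.0),
    "king":  (1.0, 1.0, 0.0),
    "queen": (1.0, 0.0, 1.0),
    "apple": (0.1, 0.1, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings=EMBEDDINGS):
    """Solve a : b :: c : ? by finding the word closest to (b - a + c).

    The three query words are excluded from the candidates, as is
    common practice in word analogy evaluations.
    """
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


if __name__ == "__main__":
    print(solve_analogy("man", "woman", "king"))
```

With real embeddings the vocabulary is far larger and the match is only approximate, which is part of why the emergence of this linear structure calls for an explanation.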

We provide a new generative model for language that gives a different explanation for how such linear algebraic structure arises. This new model also casts new light on older methods for generating word embeddings, such as the PMI method of Church and Hanks (1990). The model has surprising predictions, which are empirically verified. It also suggests a new linear algebraic method to detect polysemy (words having multiple meanings).
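For readers unfamiliar with PMI, here is a minimal Python sketch of pointwise mutual information computed from (word, context) co-occurrence pairs, using the standard definition PMI(w, c) = log2(p(w, c) / (p(w) p(c))). The function name `pmi_table` and the toy pairs are assumptions made for illustration; the sketch shows only the basic quantity, not the generative model presented in the talk.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Return {(word, context): PMI} for every observed pair.

    Probabilities are simple relative frequencies over the given pairs:
    PMI(w, c) = log2( p(w, c) / (p(w) * p(c)) ).
    """
    pairs = list(pairs)
    total = len(pairs)
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    table = {}
    for (w, c), n in pair_counts.items():
        p_joint = n / total
        p_word = word_counts[w] / total
        p_context = context_counts[c] / total
        table[(w, c)] = math.log2(p_joint / (p_word * p_context))
    return table


if __name__ == "__main__":
    # Made-up co-occurrence observations, for illustration only.
    toy_pairs = [
        ("strong", "tea"), ("strong", "tea"),
        ("powerful", "computer"), ("powerful", "computer"),
    ]
    for pair, value in sorted(pmi_table(toy_pairs).items()):
        print(pair, round(value, 3))
```

A positive PMI means the pair co-occurs more often than it would if the word and context were independent. Unobserved pairs are simply absent from the table here, since their PMI would be negative infinity.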

We think our methodology and generative model may be useful for other NLP tasks and understanding the efficacy of other neural models. Joint work with Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Lunch will be served at 1:00 pm in Levine 307.  


September 23, 2015, 1:30 - 3:00pm, Levine 512

Transfer Learning

David Isele (Penn)
Transfer learning involves using a learned model for a different but related task. Since learning large models can often take weeks, transfer learning can save time by making use of already trained models. Additionally, when there is not sufficient data, transfer can improve performance - giving greater generalization and preventing overfitting. With the recent success of many deep net systems, researchers have started using deep nets to approach transfer learning. We review recent approaches to transfer learning using deep nets, specifically looking at the following works:

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition. Computer Vision and Pattern Recognition Workshops (CVPRW), pages 512–519, Mar. 2014.

R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng. Zero-Shot Learning Through Cross-Modal Transfer. Advances in Neural Information Processing Systems, pages 935–943, 2013.

J. Hoffman, S. Guadarrama, E. Tzeng, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large Scale Detection Through Adaptation. Advances in Neural Information Processing Systems, pages 3536–3544, 2014.


 

 

Computational Linguistics at Penn

Subscribe to the Computational Linguistics Lunch Mailing List