
 

A Stemmatic Parsing Project


 

By Per Aage Brandt, Matthew Elliott, Nicholas Davis

Center for Cognition and Culture, CWRU

 

 

     Standard automated parsing programs are hampered by two crucial problems. First, they treat the linguistic input as symbol strings. Second, they ignore the meaning of syntactic composition nodes. The result is an inability to pick up relevant semantic information from sentences in texts. The present project overcomes these two problems and is therefore in a position to develop a radically new and semantically efficient parsing system, capable of enhancing both the scientific understanding of linguistic structure and the range of language technology applications that depend on relevant and efficient parsing.

 

 

            In stemmatic parsing, phrase, clause, and sentence structure is not treated as symbol strings but instead as variable linear manifestations of canonically built constituent compositions strung together by semantically meaningful nodes, and this principle encompasses all syntactic compositions and sub-compositions. Since these compositions have semantic readings per se, sentences and component phrases can now be interpreted as semantic information, relevant for search engines, etc.

 

 

     A pilot version of the stemmatic parser is available. This project aims to develop a large-scale version, to write out the theoretical consequences, and to start a series of application designs for specific usage domains.

 

 

Stemmatic Parsing

 

                                                                 


 

Abstract

 

 

Natural language parsing can be approached from multiple standpoints. The typical approach processes the text from left to right, paying attention to the co-text; the system tries to fit each word into a given pattern. This linear approach, largely used in statistical parsing (Klein and Manning 2003, Bhagat et al. 2005), has a fast run time and consumes the fewest resources, but it does not account for the ambiguities and different interpretations language can have. This essay proposes a new approach to this problem, namely stemmatic parsing. This method strives to return all possible meanings of a sentence while still avoiding an infinite result, maximizing readable phrases while minimizing errors due to ambiguity.

1 Introduction

This article outlines current approaches to natural language parsing while highlighting important limitations, such as multiple word meanings and varying interpretations of the same sentence, the latter being referred to as the garden path problem. After these problems are stated, an alternative processing method is proposed, namely brute force: calculating all possible solutions to ensure accuracy. Frames and metaphors are considered next, as they are pertinent to the organization of linguistic knowledge once a parsing method is created. Stemmatic syntax is used to organize the frame network, utilizing eight nodes to structure the network based on domains of knowledge as well as action schemas depicted in the sentence (see Figure 1 below). In this way, the frame network will be semantically organized, rather than simply forming a frame encyclopedia.

Figure 1: The Eight Stemmatic Nodes

This parsing approach utilizes generative, construction, and stemmatic grammars. Generative grammar, founded largely by Noam Chomsky, describes language as a componential system: phonology governs sound structure, syntax dictates sentence structure, semantics comprises rules governing meaning, and a lexicon contains information from each category about individual words. On the syntactic level, lexical items are assigned a word class, such as noun or verb, and are hierarchically ordered into noun phrases, verb phrases, etc., until each component of the sentence is assigned a level. The resulting tree accounts for word classes and their relationships but not for meaning, because generative grammarians claim that syntax is not inherently meaningful. Chomsky proposes that some ‘deep structure’ exists, obtained by performing a sequence of transformations on a universal grammar, stating that the “semantic interpretation of a sentence depends only on its lexical items and the grammatical functions and relations represented in the underlying structures in which they appear” (Chomsky 1965). Stemmatic parsing uses word tags and phrasal construction resembling those of generative grammars but diverges technically, viewing syntactic nodes as semantic case meanings and dynamic schemas and proposing the five-level model of language shown in Table 1.

Table 1: Five Level Model of Language

Level    Name           Components
1        Linear Order   Morphological Instantiation; Word Classes
2        Structural     Stemmatic Syntax
3        Semantic       Frames; Closed Class Words
4        Domain         Knowledge Network; Open Class Words
5        Pragmatic      Base Space; Situational Significance

 

 

Construction grammars offer an alternative to the generative paradigm. The core of this grammar “grew out of a concern to find a place for idiomatic expressions in the speaker’s knowledge” (Croft and Cruse 2004). In generative grammars, idioms were seen as anomalies and were not accounted for, because lexical items with static meanings cannot be larger than one word. Construction grammar claims that these idiomatic phrases are actually constructions; for example, ‘the bigger they are the harder they fall’ is conceived of as an instantiation of the schematic idiom ‘the x’er the y’er’. It is a template that other words can fill, depending on the productivity of the construction. This approach claims that constructions “consist of pairings of form and meaning that are at least partially arbitrary” (Croft and Cruse 2004). The form contains syntactic, phonological, and morphological qualities, while the meaning contains semantic and pragmatic qualities. However, stemmatic grammar holds that the form of a sentence must contain a stemmatic representation accounting for the contribution that syntax makes to the meaning. Stemmatic construction grammar acknowledges that thought is three dimensional while text is only one dimensional; stemmatic syntax is a method of linking the two, literally a two dimensional model of language. The result is a funnel type effect, going either from three dimensional thought, structured in a two dimensional stemma, to one dimensional text, or vice versa.

2 Current Issues   

            The garden path problem occurs in sentences whose local structure suggests a particular pattern that is disproved later. A simple example highlights the issue: “Since Todd runs seven miles seem like a short distance to him”. Here the parser first expects “runs seven miles” to be a single phrase, but the sentence only makes sense if the initial adverbial clause ends at “runs”, with “seven miles” as the subject of the independent clause. Backtracking is sometimes employed to get around this issue: it allows the parser to back up and make a second attempt at parsing the sentence. This approach works on a surface level, but it will not solve all instances efficiently; thus a different approach to natural language parsing is needed.

            Unknown words also present a challenge to parsers. The inability to look up a word in the dictionary forces the system to guess at the different meanings of that word. Unknown words can be guessed at using context and word classification. Words can be classified into two categories, closed class and open class. Closed class words are definite and limited in the grammar, serving to structure a sentence; examples include ‘the’ and ‘and’. Open class words are numerous and provide the deeper meaning of a sentence; nouns are a good example. In order to solve the issue of unknown words, a parser can look at the context of the word, considering all the words around it to get an understanding of what kinds of phrases the unknown word could possibly fit into. This list can be limited by excluding the closed class words, and in doing so the number of possibilities drops considerably. The ability of a parser to guess at a word is crucial for tackling unknown texts without human verification of existing dictionary entries. In addition, the parser should not settle on classifying an unknown word as one specific type of word, because most words can fit into several categories. For example, the word ‘fight’ acts as a noun in the sentence ‘Cats don’t get into fights,’ but it is a verb in the sentence ‘My cat fights big scary lions.’ This problem can be solved by allowing the same word multiple grammatical classifications in the initial phase. Later, the results of the parser can be pruned based on statistical usage.
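The classification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pilot parser's code: the two small lexicons, the tag names, and the function `candidate_classes` are all hypothetical. A known word keeps every class listed for it, and an unknown word is assumed to be open class, since the closed class inventory is short and fully listed.

```python
# Hypothetical mini-lexicons for illustration; not the project's dictionary.
CLOSED_CLASS = {
    "the": {"article"},
    "and": {"conjunction"},
    "into": {"preposition"},
    "my": {"determiner"},
}
OPEN_CLASS = {
    "cat": {"noun"},
    "cats": {"noun"},
    "fights": {"noun", "verb"},   # ambiguous on purpose, as in the text
    "lions": {"noun"},
}
OPEN_CLASS_TAGS = {"noun", "verb", "adjective", "adverb"}


def candidate_classes(word):
    """Return every word class the word might belong to."""
    key = word.lower()
    if key in CLOSED_CLASS:
        return set(CLOSED_CLASS[key])
    if key in OPEN_CLASS:
        return set(OPEN_CLASS[key])
    # Closed class words form a short, fully listed inventory, so an
    # unknown word is assumed to be open class; keep all open tags.
    return set(OPEN_CLASS_TAGS)


for w in ["My", "cat", "fights", "grumbly", "lions"]:
    print(w, sorted(candidate_classes(w)))
```

Keeping a set of candidates per word, rather than one guess, leaves the pruning decision to a later statistical phase.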

            One possibility that has not received much attention in the design of parsers is that of multiple correct answers. Justification for this approach can be found in ambiguous sentences. “I bought a ticket with my credit card” can be understood as “I bought a ticket using my credit card” or as “I bought a ticket that comes with a credit card”; both are correct. The polysemy of the preposition ‘with’ is responsible for this ambiguity, because it can mean ‘instrumental’ or ‘accompanying’. Parsers that are unable to handle these multiple meanings can be said to be incomplete because they do not return all the possible results, regardless of whether one is more correct than the others.

3 Brute Force

            An intelligent brute-force approach to parsing language can be used to return all the possible correct results. This method is ideal because it ensures accuracy by returning all possible permutations of sentence components and meanings, thus solving ambiguity issues. The only drawback is the intensive computation that this approach requires. However, the IBM cluster provides the computational power necessary for this project to move forward. [computation example] In building such a parser, one can use word classes to construct a foundation of lexical rules. For example, ‘act’ is either a performative verb or a noun, depending on usage, so the parser assigns both values. In this fashion, it goes through each word and assigns a list of possible grammatical values. These categories are then combined to form potential overarching phrases: the parser creates a sentence analysis for every possible combination of grammatical categories. For example, in one permutation, the parser will assume ‘act’ is a noun and look for an article with which it can form a noun phrase. This phrase building can be modeled hierarchically, with smaller phrases, like a verb phrase and a noun phrase, combining to make an intransitive phrase. The parser recognizes what types of words or phrases it has to work with, analyzes what these elements can build up to create, and then constructs the most logical combination. This way, the program is able to see the big picture while still focusing on the micro constructions, the potential stemmatic realization of a single word or phrase. Here, ‘phrase’ refers to noun phrases, verb phrases, etc., but also applies to clauses such as ditransitive and monotransitive phrases. See Figure 2 for an explanation of this process.

Figure 2: [image not available]
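The enumeration of grammatical categories described in this section might look roughly like the sketch below. It is only an illustration under assumptions: the candidate table, the sample sentence, and the helper names `all_taggings` and `noun_phrases` are hypothetical, and the phrase-building step is reduced to spotting an article followed by a noun.

```python
from itertools import product

# Hypothetical candidate word classes for a three-word input.
CANDIDATES = {
    "the": ["article"],
    "act": ["noun", "verb"],
    "ended": ["verb"],
}


def all_taggings(words, candidates):
    """Yield every combination of word classes, one class per word."""
    options = [candidates[w] for w in words]
    for combo in product(*options):
        yield list(zip(words, combo))


def noun_phrases(tagging):
    """Return (article, noun) pairs that sit next to each other."""
    return [
        (left[0], right[0])
        for left, right in zip(tagging, tagging[1:])
        if left[1] == "article" and right[1] == "noun"
    ]


words = ["the", "act", "ended"]
taggings = list(all_taggings(words, CANDIDATES))
for tagging in taggings:
    print(tagging, noun_phrases(tagging))
```

Only the permutation in which ‘act’ is a noun yields a noun phrase; the other permutation is still produced, which is the point of the brute-force approach.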

The number of micro constructions and phrases is bounded by the limited number of construction and phrase types defined in the grammar. The parser tries each combination, yielding multiple phrasal constructions with a corresponding change in meaning, as depicted in Figure 2. However, the next step, understanding what to do with these phrases, can quickly approach infinity. Several methods of parsing language try to represent all the different types of combinations; the fatal flaw of this approach is that it returns infinite results. These approaches seem to work on the surface level because they can represent a wide and deep variety of interactions, but they ultimately fail because they lack boundaries that limit the interactions objects can have in the parsed grammar. Stemmatic syntax is Per Aage Brandt’s method of limiting the possible interactions to eight nodes. This model categorizes cognitive schemas that are prevalent in language use to construct a meaningful syntactic tree, using recursion to express depth.

4 Stemmatic Syntax

Grammatical constructions can be variably manifested by linear concatenations of words and therefore must as such be assigned a non-linear constituent structure. Additionally, this structure must reflect framed semantic representations. Besides the prominent verbal constructions, subordinate phrasal compositions of all kinds must equally be assigned a non-linear constituent structure compatible with the dominant constructions. Since these phrasal compositions are often neutral to the dominant constructions and compatible with all of them, there must be a shared structural format that can be variably linearized and that can integrate lexical information to represent semantic scenarial meaning.

            All phrase structure can thus be analyzed as part of a general phrasal format that accounts for nodal constituent compositions. This format is what we call stemmatic syntax; it turns out that a very small set of meaning-bearing nodes (eight) suffices for a meaningful parsing of verbal phrases, nominal phrases, verbal/nominal-neutral adverbial phrases, adjectival phrases, and clausal embedding (completive, relative, or adverbial). These nodes can roughly be compared to case meanings and to dynamic schemas; they are meaning-invested insofar as they imply specified mental operations of semantic integration.

            The stemmatic format of ‘trans-constructional’ phrasal compositions explains the possibility of blending in syntax. It explains the possibility of language learning – reusing the same stemmatic format for new constructions and new lexical input, while allowing projections from the first acquired language and thereby making ‘meaningful’ errors.

            The theory of stemmatic syntax, or stemmatic construction grammar, makes it possible to configure new language technology for language learning, for automatic parsing, for semantically informed data-mining, for speech synthesis, and for cognitive robotics. Many other applications are in sight, and it is of course important to continue the basic research on the format itself, which is far from being entirely or sufficiently understood.

            Nevertheless, a pilot version of an automatic parser built on stemmatic principles has been produced in our laboratory and already appears to be ready for implementation in such applications. Applied and basic research should go hand in hand in the next phase of this work.

 

5 Frames

            Dealing with verbs is another issue that leads researchers into the realm of infinity. Attempts have been made to categorize and list every possible interaction of every verb in the dictionary, creating an encyclopedia of frames. Each frame describes the components and roles of a specific interaction. For example, the buy-sell frame would describe specifically what is involved in buying and selling: who the buyer is, who the seller is, the object being transacted, the money involved, and so on (see Fillmore 2006 for an explanation of frame semantics).

5.1 FrameNet

One can build an encyclopedia of frames, and the connection patterns among them can be seen as a frame network. The University of California, Berkeley used this paradigm to create FrameNet (Baker, Fillmore, and Lowe 1998), which has a variety of interesting aspects worth mentioning. FrameNet and other encyclopedic approaches return results very quickly because retrieving knowledge is a quick lookup process that can be optimized using a variety of algorithms. The FrameNet project as a whole is far from complete and has the immense goal of categorizing everything into frames and building a complete system. The FrameNet system defines and understands meaning through other definitions, employing a circular logic. Several issues could arise when working with large corpora of unknown text: FrameNet and other encyclopedic approaches would find it very difficult to process words that are not entered in their databases, making them poor candidates for natural language processing. As a result, a different approach to frames has to be taken to solve this fundamental issue.

5.2 Stemmatic Framing

            The approach used in this parser utilizes stemmatic structure to organize the frames, as opposed to building an encyclopedia of frames based on the definitions of every word. Stemmatic organization limits the number of connection types to eight nodes based on cognitive schemas used to understand language. For example, a style or manner of doing something falls under the fifth node; thus, when a subject is acting in a certain manner, the word describing the style is connected to the verb through a five branch. The five branch has information coded into it, so the stylistic word is saved as such, because marking manner is one function of the five branch. This method focuses on modeling the interactions of verbs by understanding their individual instances through stemmatic organization. This large, expansive network would gather meaning by reading into sentences. To best explain this process, an example sentence is needed: 'My cat ate a blade of grass this morning.' This sentence already provides a vast amount of information that a naive system can pick up on. The speaker can own a cat, and currently owns a cat. This cat can eat and has eaten a blade of grass. Grass can be eaten by cats, and it can also be eaten in the morning. The verb 'eat' in the sentence is being defined by the co-text as something that cats can do, and more specifically as something this cat has done. Further refinement of the definitions and growth of the network can be achieved by simply feeding the system additional corpora. As the system reads, more information builds up a wide and deep body of knowledge, and several distinct patterns emerge.

5.3 Contradictory Input

            When the system encounters a contradiction between the current text and stored text, the parser does not reject the new information, but rather stores it along with the previously stored information without giving priority to either one. The reason behind this storage becomes evident when the system must decide the amount of confidence it has in a specific piece of information. Simply rejecting information allows for the possibility of database poisoning, that is, telling the system false information for the sole purpose of misinformation. A system that rejects contradictory information only holds the initial input as true. The current approach, storing all information with varying preference, makes it possible to persuade the system into taking a different viewpoint, providing a greater dynamic of knowledge.
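One way to picture this storage policy is the sketch below. The class `ClaimStore`, its method names, and the idea of deriving confidence from the share of supporting observations are assumptions made for illustration; the source does not specify how confidence is calculated.

```python
from collections import Counter, defaultdict


class ClaimStore:
    """Keeps every value asserted for a (subject, relation) pair."""

    def __init__(self):
        self._values = defaultdict(Counter)

    def add(self, subject, relation, value):
        # Nothing is rejected; a contradicting value is simply counted.
        self._values[(subject, relation)][value] += 1

    def confidence(self, subject, relation, value):
        """Share of observations supporting this value (assumed measure)."""
        counts = self._values.get((subject, relation), Counter())
        total = sum(counts.values())
        return counts[value] / total if total else 0.0


store = ClaimStore()
for colour in ["green", "green", "green", "blue"]:
    store.add("grass", "colour", colour)

print(store.confidence("grass", "colour", "green"))
print(store.confidence("grass", "colour", "blue"))
```

Because the contradicting value is retained, further input can shift the balance in either direction, which matches the idea of persuading the system toward a different viewpoint.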

5.4 Polysemy

            The majority of words in the dictionary have multiple meanings and definitions. In order to be complete, the program has to accept that there can be multiple definitions. The current parser accommodates this by not connecting all the instances of a word together. Instead, they are kept separate, allowing the parser to distinguish between the different meanings of the same word; through separation the system is able to understand the fundamental distinctions between the definitions. Context analysis aids in determining which definition should be used.

            How strongly a word is defined can be measured in two ways: the relative density around that word, and the number of non-unique times that word has been used. The relative density is a function of how many connections there are around a specific word. As the number of instances and different uses of the word ‘cat’ increases, the definition of that word becomes stronger. For example, suppose the system received the following text: “The cat ate grass. I petted the cat. I wish I was a cat. Cats drink water and milk.” Each instance of the word ‘cat’ adds to what a cat can do and what can be done to a cat. This creates a more robust knowledge network based on the number of connections formed. The distance between words can mark out definitive categories and domains. The non-uniqueness of usage is taken into account when a word is used in the same manner repeatedly; this recurrence strengthens the definition as well.
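The two measures can be illustrated with a small sketch. The triples below are hand-written stand-ins for parser output based on the sample text (with ‘I wish I was a cat’ simplified to a plain ‘be’ relation), the last triple is a repetition added only to show the second measure, and `definition_strength` is a hypothetical helper rather than part of the pilot parser.

```python
from collections import Counter

# Hand-written (subject, verb, object) triples standing in for parser output.
TRIPLES = [
    ("cat", "eat", "grass"),
    ("I", "pet", "cat"),
    ("I", "be", "cat"),
    ("cat", "drink", "water"),
    ("cat", "drink", "milk"),
    ("cat", "eat", "grass"),   # repeated usage, added for illustration
]


def definition_strength(word, triples):
    """Return (density, repeats) for a word.

    density: number of distinct connections the word takes part in.
    repeats: number of times an already seen connection recurs.
    """
    usages = Counter(t for t in triples if word in (t[0], t[2]))
    density = len(usages)
    repeats = sum(usages.values()) - density
    return density, repeats


for w in ["cat", "grass", "milk"]:
    print(w, definition_strength(w, TRIPLES))
```

Here ‘cat’ has many distinct connections and one repeated usage, so it counts as more strongly defined than ‘milk’, which appears in a single connection.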

5.6 Metaphor

Another important aspect of parsing language is dealing with metaphor. For the purpose of this program, metaphors will be divided into two distinct categories: idiomatic metaphors and non-idiomatic metaphors. Idiomatic metaphors require prior knowledge to be understood: 'wide awake', 'kick the bucket', 'break a leg'. In contrast, non-idiomatic metaphors do not require prior knowledge: 'that man is a shark', 'concrete jungle', 'loan shark'. Categories and domains play an important role in determining metaphor type and meaning. Non-idiomatic metaphors can be characterized as dramatic shifts in categories, topics or domains. In the example above, 'concrete jungle' can be seen as a topic shift from building materials to forests. These dramatic topic shifts highlight the possibility of a metaphor.

            When the context of a metaphor is considered, these shifts in domain become more apparent, making potential metaphors stand out in a text. For example, the metaphor ‘that man is a shark’ takes on a completely different meaning in different contexts, notably a used car dealership versus a swimming competition. In order to get a more accurate representation of the metaphoric meaning, context has to be taken into consideration. Viewed from a software perspective using the stemmatic framing described above, a greater distance between nodes suggests a stronger categorical difference. When the system analyzes text, each word of the text highlights nodes on the network of frames. The various paths and distances between the nodes used in a sentence can measure categorical shifts. Context can be taken into consideration for metaphors by analyzing the sentences around the metaphor, at a narrative level, to provide global coherence.
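The use of network distance as a signal of categorical shift can be sketched as follows. The toy network, the threshold value, and the function names are all hypothetical, and breadth-first search is used here only as one plausible way to measure the distance between two nodes; the source does not name a specific algorithm.

```python
from collections import deque

# Toy knowledge network; the edges are invented for illustration.
EDGES = [
    ("concrete", "cement"),
    ("concrete", "building"),
    ("building", "city"),
    ("city", "park"),
    ("park", "tree"),
    ("tree", "forest"),
    ("tree", "jungle"),
]


def build_graph(edges):
    """Turn an edge list into an undirected adjacency mapping."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def distance(graph, start, goal):
    """Breadth-first search; returns the edge count or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    return None


SHIFT_THRESHOLD = 3  # hypothetical cut-off for a "dramatic" shift


def possible_metaphor(graph, a, b, threshold=SHIFT_THRESHOLD):
    """Flag a word pair whose nodes lie far apart in the network."""
    d = distance(graph, a, b)
    return d is not None and d >= threshold


GRAPH = build_graph(EDGES)
print(distance(GRAPH, "concrete", "building"))
print(distance(GRAPH, "concrete", "jungle"))
print(possible_metaphor(GRAPH, "concrete", "jungle"))
```

In this toy network ‘concrete’ and ‘building’ are neighbours, while ‘concrete’ and ‘jungle’ are several steps apart, so only the second pair is flagged as a candidate metaphor.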

6 Research Objectives  

            The basic research question that we are asking in this project is whether there is a more accurate method of parsing text: an approach that will give all meanings of the sentence rather than choosing one that may or may not be correct. Our hypothesis is that there is a better method, but that it requires enormous computing power, like that of the IBM cluster computer, in order to succeed. The implications of creating such a parser are substantial in both the commercial and academic domains.

6.1 Ground Research

            The stemmatic parser, in its current state, is successful in dealing with small sentences because the permutations stay in the thousands, but as textual input grows, the brute force approach returns substantially more permutations, requiring additional resources to parse. In order to implement this parser on a large scale, a large cluster of computers is needed. The benefits of this will be twofold: evaluating the effectiveness of brute force on a large scale given sufficient resources, and testing the stemmatic organization of language.
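The growth in permutations can be made concrete with a short calculation. As a simplifying assumption for illustration, suppose each word contributes an independent number of candidate word classes; the number of combinations is then the product of those counts. The figures below are invented and `count_combinations` is a hypothetical helper.

```python
from math import prod


def count_combinations(classes_per_word):
    """Number of class assignments when each word is chosen independently."""
    return prod(classes_per_word)


short_sentence = [1, 2, 1, 2, 2]   # five words, three of them ambiguous
long_sentence = [2] * 20           # twenty words, each two ways ambiguous

print(count_combinations(short_sentence))
print(count_combinations(long_sentence))
```

Even with only two candidates per word, a twenty-word input already yields over a million combinations before any phrase building takes place, which is why the approach depends on cluster-scale resources.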

There are several key issues to focus on in the two-year development plan. The best approach to expanding this parser is to work on basic groundwork in tandem with applications. The project is ready for implementation in applications, and any issues that come up will serve to fine-tune the architecture already in place. Some practical issues that are of primary interest are:

·         Pronouns

·         Efficient storage of stemmatically organized frame network  

·         Optimizing run speed

·         Creating a de-parser

 

 

Pronouns pose a problem for the parser because they do not inherently contain information about whom they refer to. In English, there is often no gender marking or other contextual cue to help resolve this question. There may be several individuals described in the text and many pronouns used, so a primary concern in stemmatic parsing is producing a method to track pronoun creation and reference.

Since the parser has not been able to run to its full potential, a complete frame network has not been created. The components for creating this network are schematically in place, but implementing the parser on a large scale will provide feedback about the most efficient method of storing the information. This storage issue concerns level 3 of the five-level model of language, namely frames. A diagrammatic storage method has been proposed because frames are viewpoint sensitive and serve to structure the meaning of the sentence. Closed class words, especially prepositions, do a lot of the work of framing, and most of them can have a diagrammatic component; for example, ‘to’ can be thought of as a starting point with an arrow arriving at an end point. Depending on the closed class items used in a sentence, a composite diagram could be constructed to store the frame networks of a sentence. Certain nodes would have a family of diagram types, and the storage would reflect the stemmatic structure of the sentence. The primary researcher of this group, Per Aage Brandt, has diligently worked to categorize the kinds of diagrams and their relation to language in his book Spaces, Domains, and Meaning.

The run speed of the parser is quite slow on a normal computer because of the brute force method it uses. However, given the advantages this method has over the others presented in this paper, it is important to work with this method and tweak the parameters to optimize efficiency. During the course of this two-year project, additional programmers would be hired to review the current code and make adjustments toward a more efficient parser.

Creating a de-parser is also an important step in the overarching scope of this project. This is especially important for text summarization and translation. This program has to take the parsed material and construct a sentence with grammatically correct word order. In a given language, certain stemmatic nodes tend to have a usual place in the linear order, with a few instances diverging from the norm. An exhaustive rulebook for linearization is needed to move forward in the de-parsing process.  This basic research objective may be aided by simultaneously working on a semantically enabled web search engine.

6.2  Applied Research

            The commercial applications for a project of this kind lie mainly in creating a semantically enabled web search engine. This would be accomplished by instructing the software to go through every web page and parse the entirety of the contents. The program would then summarize this information and store it in a database. Storing this quantity of information is an extremely data-intensive process, achievable only through the use of the new IBM cluster, which makes this platform ideal for applied research with the current parser. Once the summarized text is stored, this information could be accessed through a web-based search engine similar to Google. In this search engine, however, the user input is parsed and understood: rather than searching for sites using a statistical word count, the search engine actually knows what the user wants and holds a summary of every site. The search engine would then return just a handful of relevant pages rather than some five hundred thousand pages that merely contain the same word.

            Once the general meaning of text has been described in terms of frames and metaphors, the natural language parser can take the next step and support a large number of tasks. Two prominent applications stand above the others, namely text translation and summarization. Historically, machine translation has used two dictionaries, one for the source language and one for the target language. Translation is conducted by looking up each word in the source dictionary and finding the corresponding word in the target dictionary. This method of translation ignores word order and the meaning that the author of the text was originally trying to express. It also runs into issues when metaphors are encountered. Summarizing text has typically meant extracting portions of the text without any real guidance other than statistics on word count and positions in the text. A more recent summarization method focuses on determining the main characters of the text using statistical methods. Regardless of which method is used, the problem is the same: neither of them actually understands what is going on in the text.

            The design of the natural language parser described here is based on a hierarchy of feedback loops, in which the system is able to train itself to read and process text better. The sentence processing component is able to encounter unknown words and add them to its array of dictionaries. The network of frames does not fall apart when it tries to process information that has not been pre-programmed into it, but instead uses the information to learn a new construction. The metaphor system is based upon the feedback loop of the frame network. This series of loops allows the natural language parser to monitor itself without human intervention: rather than requiring a trainer specifically tailored to the system and the target text, the system is flexible enough to prune itself. This is not to say that the parser cannot accept human intervention; such intervention would only involve pruning the expansive networks and tweaking the dictionaries.

7 Conclusion

            In the authors’ view, there are two basic methodologies for creating natural language parsers, namely statistical and brute force. Statistical parsers are able to handle large sentences with a fast run time, but the results do not necessarily give an accurate representation of the sentence meaning because the parser is forced to choose one answer. Using brute force along with stemmatic construction grammar will return all the possible meanings and organize this information using stemmatic connections. These connections give the system a basic knowledge of what each word is doing in the sentence, eventually leading to a semantically sensitive parser. Apart from the current parser, this approach has not yet been implemented because the computational resources did not exist. Given the power of cluster computing, the authors feel that the benefits of an accurate and semantically enabled parser using brute force far outweigh those of settling for a fast but incomplete paradigm.

References

Colin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of COLING/ACL, pages 86-90, Montreal, Canada.

Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.

           

 

 
