The Phrase Parser

The phrase parser is a component of the link grammar parser. It takes a linkage (as generated by the parser) and generates from it a constituent structure, showing conventional constituents such as noun phrases, verb phrases, and prepositional phrases.

For example, for the sentence "The quick brown fox jumped over the lazy dog", the phrase parser produces this:

(S (NP The quick brown fox)
   (VP jumped
       (PP over
           (NP the lazy dog))))

The conventions used by the phrase-parser are those of the Penn Treebank.

USING THE PHRASE-PARSER. The phrase-parser has been incorporated into the standard version of the parser, available at our ftp site. When using the parser, one can set the variable by typing "!constituents=1"; this will cause the parser to output the constituent structure of each input sentence, after the graphic display of the linkage. (The graphic display can be suppressed by typing "!graphic", so that only the constituent structure is output.) If you type "!constituents=2", the constituent representation will be displayed on a single line, like this:

[S [NP The quick brown fox NP] [VP jumped [PP over [NP the lazy dog NP] PP] VP] S]

Constituent structures will be shown for all linkages of a sentence that are displayed; initially only the most preferred linkage is displayed, but other linkages (if any) can be displayed by pressing "return" repeatedly.

The constituent structure of a linkage can also be accessed as a tree data structure. Each node in this tree consists of a pair of integers, indicating a span of words in the sentence (the first and last word of the constituent), a character string indicating the "type" of constituent (NP, PP, etc.), and pointers to the node's children.
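As an illustration only, such a node could be modeled as follows. This is a minimal Python sketch, not the parser's actual data structure: the names CNode and bracketed are hypothetical, word indices are assumed to start at 0, and children are assumed to be stored in left-to-right order.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CNode:
    """Hypothetical constituent node: a type, a word span, and children."""
    label: str      # constituent type, e.g. "NP" or "PP"
    start: int      # index of the first word of the constituent
    end: int        # index of the last word (inclusive)
    children: List["CNode"] = field(default_factory=list)

def bracketed(node, words):
    """Render a node in the single-line style, e.g. [NP the lazy dog NP]."""
    parts = []
    pos = node.start
    for child in node.children:                # assumed sorted left to right
        parts.extend(words[pos:child.start])   # words before this child
        parts.append(bracketed(child, words))
        pos = child.end + 1
    parts.extend(words[pos:node.end + 1])      # words after the last child
    return "[{} {} {}]".format(node.label, " ".join(parts), node.label)

words = "The quick brown fox jumped over the lazy dog".split()
tree = CNode("S", 0, 8, [
    CNode("NP", 0, 3),
    CNode("VP", 4, 8, [CNode("PP", 5, 8, [CNode("NP", 6, 8)])]),
])
print(bracketed(tree, words))
```

Walking the tree this way reproduces the single-line bracketing for the example sentence.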

In order to function, the phrase-parser must have a "constituent knowledge file"--the post-process knowledge file from which the constituents are generated (see below for explanation). The default constituent knowledge file is "4.0.constituents". You can specify a different constituent file with the command-line option "-c [filename]". You can also run the parser with no constituent file, with the command-line option "-coff". (If you run the parser this way but then attempt to do constituent processing with the command "!constituents=1" (or 2), an error will be produced.)

LEVEL OF PERFORMANCE. The phrase-parser allows the accuracy of the parser to be tested against other sources, such as the Penn Treebank. The Penn Treebank is a large database of newspaper text, including not only news but sports reports, movie reviews, letters to the editor, etc. On Penn Treebank text as a whole, the parser gets about 75% of constituents correct. (The parser's "precision" score--the proportion of the constituents it finds that are correct--is about the same as its "recall" score--the proportion of correct constituents that it finds.) If the input is restricted to hard news and financial news, the parser gets about 82% of constituents correct.
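The two scores can be illustrated with a small sketch. It assumes that constituents are compared as (type, first word, last word) triples; the text does not say exactly how the comparison was made, and the function name and the data below are made up for illustration.

```python
def precision_recall(found, gold):
    """Score a set of found constituents against a reference set."""
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0   # correct / found
    recall = correct / len(gold) if gold else 0.0        # correct / reference
    return precision, recall

# Reference bracketing for "The quick brown fox jumped over the lazy dog"
gold = {("S", 0, 8), ("NP", 0, 3), ("VP", 4, 8), ("PP", 5, 8), ("NP", 6, 8)}
# A hypothetical parser output that gets the last NP wrong
found = {("S", 0, 8), ("NP", 0, 3), ("VP", 4, 8), ("PP", 5, 8), ("NP", 7, 8)}

precision, recall = precision_recall(found, gold)
print(precision, recall)
```

Here four of the five found constituents are correct and four of the five reference constituents are found, so both scores are 0.8.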

The Logic of the Phrase Parser

In this section we describe the way the phrase parser generates the constituent structure for a linkage. The process involves several stages. First, the program generates a set of constituents for each sublinkage of the sentence independently; then it merges these constituent sets.

FIRST-PASS CONSTITUENT GENERATION. Before generating the constituent structure, the parser "post-processes" the linkage. This post-processing is similar in character to the post-processing that is done as a part of regular parsing (see section 6 of the Introduction for an explanation of how post-processing works); however, it differs in the details. This second post-processing stage creates a "domain structure" which essentially divides the sentence into groups of words corresponding to constituents. The post-processing rules for doing this are stored in the file "constituent.knowledge".

Once the domain structure has been generated, the parser goes through the list of domains; roughly speaking, it generates a constituent for each domain. In some cases, it might generate two constituents for a domain. In some cases, the starting link of a domain might cause a constituent to be generated, as well as (or instead of) the constituent triggered by the domain itself. The domains are either of the "normal" kind (including everything reachable from the right end of the starting link of the domain) or of the "left" kind (including everything reachable from the left end of the starting link).

In general, the limits on a constituent are the leftmost and rightmost words in the domain. However, the left boundary may not be before the left end of the starting link. If a domain contains no links, its left and right limits are the right end of the starting link (or the left end in the case of a left domain).
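The boundary rule can be sketched roughly as follows. This is an illustrative reading of the paragraph above, not the parser's code: the function name is hypothetical, and a domain is assumed to be represented simply as the set of word indices touched by its links.

```python
def constituent_limits(domain_words, link_left, link_right, left_domain=False):
    """Return (first, last) word indices for the constituent of a domain.

    domain_words: word indices touched by links in the domain (may be empty).
    link_left, link_right: word indices at the ends of the starting link.
    """
    if not domain_words:
        # A domain with no links collapses to one end of the starting link.
        edge = link_left if left_domain else link_right
        return edge, edge
    first, last = min(domain_words), max(domain_words)
    # The left boundary may not lie before the left end of the starting link.
    first = max(first, link_left)
    return first, last

# "The dog ran in the park": an MVp link from "ran" (2) to "in" (3)
# starts a p domain covering "in the park" (words 3..5).
print(constituent_limits({3, 4, 5}, link_left=2, link_right=3))
```

For the example this gives the span of "in the park", which is then labeled PP.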

The table below lists the constituents generated in this fashion. On the left is a domain type; next is a list of link-types that start it; next is the constituent type that is generated; next is an explanatory comment. (In cases where the starting link generates the constituent directly, no domain type is listed.)

Domain | Starting links | Constituent | Comments
p | MVp, Mp, MVt, MX#x, MG, OF, Pp | PP | Prepositional phrase: "The dog ran [PP in the park ]."
v | S (except S##d), Pg, Pv, I, PP, PF, SF, SX, Mv, Mg | VP | Verb phrase: "The dog [VP will [VP run in the park ] ] ."
s | Wd, Wc, Ce, Cs, Rn, Cc, Cr, CPi, CP*, Zc, Bc, COq, CX, B#w, B#d | S | Clause (main or dependent, including subordinate, embedded, and relative clauses): "[S He said [S he left when [S he saw the dog [S I bought ] ] ] ]".
z | S##d, Wi, S##w, RS, MVg, Mgp, MX#p, PF | VP,S | In certain cases, both a VP and an S must be generated. This is used for imperatives, present participles after prepositions, subject-type relative clauses, and other things. Example: "He escaped by [S [VP digging a tunnel ] ]."
c | COp | VP,S | Left version of z, used for opener participle phrases: "[S [VP Carrying the dog ] ], he left".
b | TH, R*, MVs, QI, MX*r, MVh, MX#d, Mr, Mj | SBAR | SBARs are generated in several cases: embedded clauses with "that", relative clauses with a pronoun, dependent clauses with a conjunction, and indirect questions. Example: "He said [SBAR that he was coming ]."
f | CO#s | SBAR | Left version of SBAR, used for opener clauses: "[SBAR When he arrived ], I left."
- | Ce, Rn | SBAR | Rn (used in relative clauses with no relative pronoun) and Ce* (used with embedded clauses with no "that") require SBAR as well as S: "The dog [SBAR [S we saw ] ] was black."
n | O, J (except Jw), SI, OX, MX#*, TI, BIt, IN, ON, JG, U, JT, OD | NP | Noun phrases - objects, prepositional objects, and inverted subject phrases: "I saw [NP the dog ]". (Isn't this generated twice?)
y | YS, YP, Yt, Yd, GN, OD | NP | Noun phrases on the left end of a link, such as possessives: "[NP The dog ] 's nose was black."
- | R*, MX*r | WHNP | A one-word constituent to contain relative pronouns: "The dog [WHNP who ] chased me was big".
- | Mj | WHPP,WHNP | In prepositional relatives, a two-word constituent WHPP contains the preposition plus the relative pronoun; a one-word constituent WHNP contains just the pronoun. "The park [WHPP in [WHNP which ]] we played was big."
- | Ss#d, B#d | WHNP,SBAR | With "what" sentences ("what goes up must come down"), the "what" is a WHNP and the whole clause is an SBAR: "[SBAR [WHNP what ] goes up ] must come down".
a | Pa, Ma, MX#a, L | ADJP | Predicative adjective phrase: "He is [ADJP happy ]".
u | A | ADJP | Attributive adjective phrase: "The [ADJP big ] dog ran".
e | E, EA | ADVP | Pre-verbal adverb phrase: "He [ADVP quickly ] ran".
i | MVa, MVb | ADVP | Post-verbal adverb phrase: "He ran [ADVP quickly ]".
q | CPx | SINV | Clause with s-v inversion; used in paraphrases: "He ran, [SINV said Joe ]".
k | K | PRT | Particle: "He came [PRT in]".
h | NIax | QP | Right-branching number expressions (needed mainly for "$"): "I have $ [QP 5 ]".
d | D##n, ND | QP | Left-branching number expressions: "I have [QP 500 ] dogs".
- | CP | S | With paraphrase sentences, an S is needed containing the entire sentence: "[S John ran, he said ]".
- | MVs | WHADVP | This is needed for "when", if used as conjunction: "I ran [WHADVP when ] I saw him".
t | TO, MVi | S,VP | Infinitive phrase: "I intend [S [VP to go ] ]".
- | QI, Mr, MX#d | WHADVP,WHNP | Indirect questions require a WHADVP, unless the question word is an NP, in which case it needs a WHNP ("I wonder [WHNP who ] he saw"), or a question-word determiner, in which case a WHNP is needed around the entire noun phrase ("I wonder [WHNP which movie ] he saw"). In such cases, also, the remainder of the clause needs an S: "I wonder [WHNP which movie ] [S he saw ]".
g | CO**, CO#n, COd# | PP | Left version of p: "[PP On Tuesday ], we saw him".

COMPLEMENT CONSTITUENTS. Once this first round of constituents is generated, various further constituents and modifications are needed. In some cases, a constituent is needed that includes everything in a larger constituent X but not in a subordinate constituent Y. In a typical clause, for instance, an NP is needed which contains the subject phrase - namely, everything included in the larger S but not in the VP: [S The dog [VP ran in the park ] ] --> [S [NP The dog ] [VP ran in the park ] ]

There are a number of cases where "complement constituents" are needed; they are listed below.

1. For every opener clause (SBAR) in a larger S, a complement S is added: "[S [SBAR When I saw him ], [S he ran ] ]".

2. For every opener PP in an S, a complement S is added: "[S [PP On Tuesday ], [S we saw him] ]".

3. For every participle opener (S) in an S, a complement S is added: "[S [S Carrying the dog ], [S he left ] ]".

4. For every VP (started by an S link) in an S, an NP is added for the subject phrase: "[S [NP The dog ] [VP ran ] ]".

5. For every relative clause (SBAR) in an NP, an NP is added: "[NP [NP The dog ] [SBAR we saw ] ] ran".

6. For participle modifiers (VP) in an NP, an NP is added: "[NP [NP The dog ] [VP chasing the cat ] ] was black."

7. For every PP in an NP, an NP is added: "[NP [NP The dog ] [PP in the park ] ] was big".

8. For every appositive NP in an NP, an NP is added (appositive case): "[NP [NP The dog ], [NP a terrier ] ], was black".

9. For every NP in an s-v-inverted clause (SINV), a VP is added: "He ran, [SINV [VP said ] [NP John ] ]".

(One problem here concerns conjunction sentences. When a sentence has more than one sublinkage, the complement constituent generated for a given sublinkage must only include words in the sublinkage. For example, consider the sentence "The dog and the cat ran"; for the sublinkage "The dog ran", the complement generated should include only "the dog", not "The dog and the cat".)
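The idea of a complement constituent, including the restriction to the words of one sublinkage, can be sketched as follows. This is an assumption-laden illustration: spans are (first, last) word-index pairs, the function name is hypothetical, and the complement is assumed to form a single contiguous span.

```python
def complement_span(outer, inner, sublinkage_words=None):
    """Span of the words that are in `outer` but not in `inner`.

    outer, inner: (first, last) inclusive word-index pairs.
    sublinkage_words: if given, only these words may be included.
    Returns None if nothing is left over.
    """
    words = set(range(outer[0], outer[1] + 1)) - set(range(inner[0], inner[1] + 1))
    if sublinkage_words is not None:
        words &= set(sublinkage_words)
    if not words:
        return None
    return min(words), max(words)

# "The dog ran in the park": S = (0, 5), VP = (2, 5), so the subject NP is (0, 1).
print(complement_span((0, 5), (2, 5)))

# "The dog and the cat ran", sublinkage "The dog ran" = words {0, 1, 5}:
# the subject NP must cover only "The dog", not "The dog and the cat".
print(complement_span((0, 5), (5, 5), sublinkage_words={0, 1, 5}))
```

Both calls yield the span (0, 1); without the sublinkage restriction the second would cover words 0 through 4.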

One further modification takes place at this stage: if a constituent is started by an MVs or MVg link (used with dependent clauses and post-verb participle phrases), we find any VPs or ADJPs that contain it (without going beyond a larger S or NP), and adjust them so that they end right before that constituent starts. So if we have "I [VP ran [SBAR when I saw him ] ]", this becomes "I [VP ran ] [SBAR when I saw him ]".
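A rough sketch of this adjustment is given below. It is one possible reading of the rule, not the parser's implementation: constituents are hypothetical (type, first, last) tuples, and "without going beyond a larger S or NP" is interpreted as walking outward from the dependent constituent and stopping at the first enclosing S or NP.

```python
def trim_around(constituents, dependent):
    """Shorten enclosing VPs/ADJPs so they end just before `dependent`."""
    _, dep_start, dep_end = dependent
    # Constituents that contain the dependent one, smallest (innermost) first.
    enclosing = sorted(
        (c for c in constituents
         if c != dependent and c[1] <= dep_start and c[2] >= dep_end),
        key=lambda c: c[2] - c[1],
    )
    result = list(constituents)
    for c in enclosing:
        label, start, _ = c
        if label in ("S", "NP"):
            break                      # do not go beyond a larger S or NP
        if label in ("VP", "ADJP") and start < dep_start:
            result[result.index(c)] = (label, start, dep_start - 1)
    return result

# "I ran when I saw him": the SBAR (words 2..5) is started by an MVs link.
before = [("S", 0, 5), ("VP", 1, 5), ("SBAR", 2, 5)]
after = trim_around(before, ("SBAR", 2, 5))
print(after)
```

In the example the VP shrinks to cover only "ran", while the S and the SBAR are left unchanged.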

COMMAS. Commas are problematic. Generally, a comma at the border between two constituents should not be included in either one: "[SBAR When we saw him ] , [S we left ]". Under ordinary link logic, this usually will not occur. In the sentence just given, for example, the comma will be part of the f domain containing the opener clause: "[SBAR When we saw him , ] [S we left ]". We could shift the boundary of the SBAR to exclude the comma; however, when the complement S was generated, it then would include the comma (since the complement includes everything that's not in the SBAR). So we leave the right boundary of the SBAR where it is, until after the complement S has been generated; then we adjust it to exclude the comma.
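The ordering described here can be made concrete with a small sketch. Spans are (first, last) word-index pairs with the comma counted as a word; the helper name is hypothetical.

```python
def trim_trailing_comma(span, words):
    """Move the right boundary left by one if the span ends in a comma."""
    start, end = span
    if end > start and words[end] == ",":
        end -= 1
    return start, end

words = ["When", "we", "saw", "him", ",", "we", "left"]
outer_s = (0, 6)
sbar = (0, 4)                       # the f domain still includes the comma

# Step 1: build the complement S while the SBAR still owns the comma,
# so the comma cannot end up inside the complement.
complement_s = (sbar[1] + 1, outer_s[1])

# Step 2: only now shrink the SBAR to exclude the comma.
sbar = trim_trailing_comma(sbar, words)

print(sbar, complement_s)
```

Doing the two steps in the opposite order would make the complement S start at the comma.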

MERGING THE CONSTITUENTS. The next stage of constituent generation involves merging the constituents generated for each sublinkage.

In the stages described above, domain structures and constituents are generated for each sublinkage separately. Take the sentence "The dog and the cat ran". This breaks down into two sublinkages, "The dog ran" and "the cat ran". The following constituent structures are produced:

[S [NP The dog ]               [VP ran ] S]

               [S [NP the cat] [VP ran ] S]

We can begin by simply merging the constituents in a brute force fashion; our new constituent structure simply combines all the constituents of the sublinkages. This yields:

[S [NP The dog NP] and [S [NP the cat NP] [VP [VP ran VP] VP] S] S]

There are several problems here. First, there are redundant constituents: only one VP and S are really needed. In the case of the VP's, this can be addressed simply by deleting one of them. In general, we delete all "duplicate" constituents: a duplicate constituent is one that has the same range and type as another constituent. The S's cannot be handled this way, since they do not have the same range. Thus, we say: if constituents A and B in different sublinkages X and Y have one endpoint in common, but A is larger at the other end, and B has no duplicate in X, then declare B invalid.
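The two rules can be sketched as below, under stated assumptions: constituents are hypothetical (type, first, last) tuples, one list per sublinkage, and the rule is applied literally as worded above (in particular, the sketch does not require A and B to have the same type, and does not check whether A is itself valid, since the text does not say).

```python
def overshadows(a, b):
    """True if a and b share one endpoint and a extends further at the other."""
    _, a_start, a_end = a
    _, b_start, b_end = b
    return ((a_start == b_start and a_end > b_end) or
            (a_end == b_end and a_start < b_start))

def merge_sublinkages(sublinkages):
    """Merge per-sublinkage constituent lists into one list."""
    sets = [set(cs) for cs in sublinkages]
    merged = []
    for y, constituents in enumerate(sublinkages):
        for b in constituents:
            # B is invalid if some other sublinkage X has a larger A sharing
            # an endpoint with B, and X contains no duplicate of B.
            invalid = any(
                b not in sets[x] and any(overshadows(a, b) for a in sets[x])
                for x in range(len(sublinkages)) if x != y
            )
            if not invalid and b not in merged:   # second test drops duplicates
                merged.append(b)
    return merged

# "The dog and the cat ran": words 0..5
the_dog_ran = [("S", 0, 5), ("NP", 0, 1), ("VP", 5, 5)]
the_cat_ran = [("S", 3, 5), ("NP", 3, 4), ("VP", 5, 5)]
merged = merge_sublinkages([the_dog_ran, the_cat_ran])
print(merged)
```

For the example, the smaller S and the duplicate VP disappear, leaving one S, one VP and the two NPs.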

A further problem is that, according to Treebank convention, a larger NP is needed around the entire "and list":

[S [NP [NP The dog NP] and [NP the cat NP] NP] [VP ran VP] S]

To address this, we search for all "andlists". An andlist is a list of constituents such that all constituents are of the same type, and each constituent in the list is not present in the same sublinkage as any of the others. (Moreover, each constituent has no duplicate in any of the other sublinkages containing elements of the andlist.) (If an element e1 of an andlist a1 contains an element e2 of another andlist a2, then e1 must contain the whole of a2, or else a2 is invalid. Otherwise, in the sentence "[NP The dog with [NP the spot] ] and [NP the cat ] ran", "[NP the spot ] and [NP the cat ]" could be considered an andlist.) Once a valid andlist is found, we create a new constituent around the entire andlist which is the same type as the constituents making up the andlist.
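A deliberately simplified sketch of the andlist search follows. It assumes at most one andlist per constituent type, skips the nested-andlist validity check described in the parenthesis above, and expects the per-sublinkage lists after invalid constituents have already been removed. All names are hypothetical.

```python
from collections import defaultdict

def wrap_andlists(sublinkages):
    """Return new wrapper constituents, one per andlist found."""
    # Record which sublinkages contain each (type, first, last) constituent.
    owners = defaultdict(set)
    for index, constituents in enumerate(sublinkages):
        for c in constituents:
            owners[c].add(index)
    # Candidates occur in exactly one sublinkage (no duplicate elsewhere).
    by_type = defaultdict(list)
    for c, subs in owners.items():
        if len(subs) == 1:
            by_type[c[0]].append((c, next(iter(subs))))
    wrappers = []
    for label, members in by_type.items():
        subs = [s for _, s in members]
        # An andlist needs at least two elements, each from a different sublinkage.
        if len(members) >= 2 and len(set(subs)) == len(subs):
            first = min(c[1] for c, _ in members)
            last = max(c[2] for c, _ in members)
            wrappers.append((label, first, last))
    return wrappers

# "The dog and the cat ran", after the smaller S has been declared invalid
the_dog_ran = [("S", 0, 5), ("NP", 0, 1), ("VP", 5, 5)]
the_cat_ran = [("NP", 3, 4), ("VP", 5, 5)]
print(wrap_andlists([the_dog_ran, the_cat_ran]))
```

The only andlist found is the pair of NPs, so a new NP spanning "The dog and the cat" (words 0 through 4) is created.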

OTHER FIXES. Some further modifications are needed after the merging of the constituents.

In the case of possessives, the NP must be extended to include the "'s": "[NP The dog's ] nose was black".

In the case of number expressions with "$", the QP must be extended to include the $.

In the case of paraphrase constructions ("[S [S He left ] , [S said John ] ]"), the S constituent around the paraphrasing phrase ("said John" in this case) is deleted.

Some idiomatic time expressions are treated as NP's in the treebank, though they don't really function that way: "He's coming [NP this week ]". In such cases we must convert the PP (or whatever it is) to an NP.

An S constituent started by a conjunction ("but" or "and") at the beginning of a sentence is deleted.

Adjective phrases and number expressions that are only one word long are not assigned a constituent: "The [ADJP very big ] dog ran"; "The big dog ran". We handle this by declaring such constituents invalid (and subsequently ignoring them) after they have been generated.

Finally, if there is no S constituent spanning the whole sentence (this might happen in cases where some words at either end of the sentence are null-linked and thus not included in any domain), such a constituent is generated.
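The last two fixes can be sketched together. This is an illustration only: constituents are hypothetical (type, first, last) tuples, "number expressions" are taken to be QP constituents as in the table above, and the sentence is assumed to occupy word indices 0 through n_words - 1.

```python
def final_cleanup(constituents, n_words):
    """Drop one-word ADJPs and QPs, then ensure an S spans the whole sentence."""
    kept = [c for c in constituents
            if not (c[0] in ("ADJP", "QP") and c[1] == c[2])]
    whole = ("S", 0, n_words - 1)
    if whole not in kept:
        kept.insert(0, whole)
    return kept

# "The big dog ran": the one-word ADJP around "big" is discarded.
print(final_cleanup(
    [("S", 0, 3), ("NP", 0, 2), ("ADJP", 1, 1), ("VP", 3, 3)], n_words=4))

# A five-word sentence whose first and last words were null-linked:
# no S covers the whole sentence, so one is added.
print(final_cleanup([("NP", 1, 2), ("VP", 3, 3)], n_words=5))
```

A two-word ADJP such as "very big" is left untouched, since only one-word constituents are dropped.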