Stanford Stanza sometimes splits a sentence into two sentences

Question

I am using Stanza 1.6.1 and have been experimenting with its constituency parser.

In certain cases it splits a sentence into two Sentence objects. For example, take this sentence: Pull up Field with low precision.

Internally it is split into two sentences ("Pull up" and "Field with low precision."), so the constituency parser outputs two trees, one for each sentence.

Changing "Field" to lowercase in the above sentence makes Stanza treat it as one sentence, and I get a single tree as the constituency output, as expected.

Is there some way to make Stanza treat this as one sentence, other than string manipulation such as converting to lowercase? Or is there a case-insensitive model that I could use?

Ro.oT · Accepted Answer · 2024-01-18 19:44:26Z

The issue seems to be specific to older versions, including 1.6.1, as other users have reported [1], [2]. I can reproduce it with:

import stanza
nlp = stanza.Pipeline('en') # Stanza 1.6.1; assumes the English models are already downloaded

doc = nlp("Pull up Field with low precision")
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

which prints:

====== Sentence 1 tokens =======
id: (1,)    text: Pull
id: (2,)    text: up
====== Sentence 2 tokens =======
id: (1,)    text: Field
id: (2,)    text: with
id: (3,)    text: low
id: (4,)    text: precision

Solution: The newer release, 1.7.0, does not seem to have this problem. Install it with:

pip install stanza==1.7.0 # pins the release tested here

and then test it via:

import stanza
stanza.download('en') # to download the default English language package
nlp = stanza.Pipeline('en') # to initialize the pipeline

doc = nlp("Pull up Field with low precision.")
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

which prints out:

====== Sentence 1 tokens =======
id: (1,)    text: Pull
id: (2,)    text: up
id: (3,)    text: Field
id: (4,)    text: with
id: (5,)    text: low
id: (6,)    text: precision
id: (7,)    text: .

You can also visualize the constituency parse at stanza.run.
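If upgrading does not help for a particular input, one possible workaround is to turn off sentence splitting altogether. The sketch below relies on Stanza's documented `tokenize_no_ssplit=True` pipeline option, under which only blank lines ("\n\n") separate sentences. The helper names `build_single_sentence_pipeline` and `parse_as_one_sentence` are hypothetical, chosen for illustration, and this has not been verified against every Stanza version.

```python
def build_single_sentence_pipeline(lang="en"):
    """Build a pipeline with sentence splitting disabled.

    Requires stanza and the downloaded models for `lang`.
    """
    import stanza  # imported lazily so the helper below can be used without stanza

    # tokenize_no_ssplit=True: the tokenizer no longer decides sentence
    # boundaries; only blank lines ("\n\n") in the input separate sentences.
    return stanza.Pipeline(lang, tokenize_no_ssplit=True)


def parse_as_one_sentence(nlp, text):
    """Run `nlp` on `text` after collapsing all whitespace.

    Collapsing whitespace removes any blank lines, which would otherwise
    still be treated as sentence breaks when splitting is disabled.
    """
    cleaned = " ".join(text.split())
    doc = nlp(cleaned)
    return doc.sentences
```

With the models downloaded, `parse_as_one_sentence(build_single_sentence_pipeline(), "Pull up Field with low precision.")` should return a single Sentence, whose tree is available as `sentence.constituency`. The trade-off is that genuinely multi-sentence input is then also parsed as one sentence, so this only fits cases where each input is known to be a single sentence.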

I tried Stanza 1.7.0. The above sentence works, but some sentences are still being split. For example:
1. Get tables with rating > 3 in snowflake
2. datasets categorized under oracle Resource type that have been created in the last two months
(zaki41, commented Jan 19, 2024 at 16:16)
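To see which inputs a given Stanza version still splits, a small check along these lines may help. It is a minimal sketch: `find_split_inputs` is a hypothetical helper name, and `nlp` is assumed to be an already initialized `stanza.Pipeline` (or any callable returning an object with a `sentences` list whose items have a `text` attribute).

```python
def find_split_inputs(nlp, texts):
    """Return (text, sentence_texts) pairs for inputs parsed into more than one sentence."""
    split = []
    for text in texts:
        doc = nlp(text)
        if len(doc.sentences) > 1:
            # Record how the input was divided so the break point is visible.
            split.append((text, [sentence.text for sentence in doc.sentences]))
    return split
```

Calling `find_split_inputs(nlp, queries)` on a list of queries returns only those that came back as several sentences, together with the pieces, which makes it easier to compare versions or report specific failing inputs.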
