Python regex - get contents in between

Question

I have a Word/text file containing:

1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.
(A)15kJ
(B)23kJ
(C)32kJ
(D)50kJ

[Answer]:(B)

[QuestionType]:single_correct

2. Which of the following statement is correct

(A)Li is hander than the other alkali metals.
(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.
(C)Na2CO3 is pearl ash.
(D)Berylium and Aluminium ions do not have strong tendency to form complexes like 

[Answer]:(C)

[QuestionType]:single_correct

I need to get each question in a separate list, starting at the question number and ending at the [QuestionType] line.

( 1. to [QuestionType])

Output :

[[1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.,(A)15kJ,(B)23kJ,(C)32kJ,(D)50kJ,[Answer]:(B),[QuestionType]:single_correct],
[2. Which of the following statement is correct,(A)Li is hander than the other alkali metals.,(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.,(C)Na2CO3 is pearl ash.,(D)Berylium and Aluminium ions do not have strong tendency to form complexes like ,[Answer]:(C),[QuestionType]:single_correct]]

I tried a for loop, but I am not able to get the contents in between:

import docx
import re
doc = docx.Document("QnA.docx")
for i in doc.paragraphs:
    if re.match(r"^[0-9]+[.]+",i.text):
        print(i.text) # matched number condition
    if re.match(r"(^\[QuestionType\])",i.text):
        print(i.text) # matched QuestionType condition

The fourth bird · Accepted Answer · 2020-10-19 13:28:00Z

You might use a single pattern, starting the match with 1 or more digits and a dot.

Then continue matching all the lines that do not start with [QuestionType] and finally match that line.

^\d+\..*(?:\r?\n(?!\[QuestionType]).*)*\r?\n\[QuestionType]:.*
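The same pattern can be written in verbose mode to make each part easier to read. This is only an annotated sketch of the pattern above; the names `pattern` and `sample` and the shortened sample text are made up for illustration.

```python
import re

# Annotated form of the pattern above. re.VERBOSE ignores whitespace and
# "#" comments inside the pattern; re.MULTILINE lets ^ match at each line start.
pattern = re.compile(r"""
    ^\d+\..*                 # a line starting with digits and a dot (question number)
    (?:                      # then, for every following line:
        \r?\n                #   a newline
        (?!\[QuestionType])  #   the line must not start with [QuestionType]
        .*                   #   consume the rest of that line
    )*
    \r?\n\[QuestionType]:.*  # finally, the [QuestionType] line itself
""", re.MULTILINE | re.VERBOSE)

# Shortened, hypothetical sample in the same layout as the file.
sample = (
    "1. First question\n"
    "(A)15kJ\n\n"
    "[Answer]:(A)\n\n"
    "[QuestionType]:single_correct\n\n"
    "2. Second question\n"
    "[QuestionType]:single_correct"
)

matches = pattern.findall(sample)
print(len(matches))
```

Because the lookahead is checked at the start of every line, the match cannot run past the first [QuestionType] line it meets.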

See a regex demo and a Python demo

For example

import re

regex = r"^\d+\..*(?:\r?\n(?!\[QuestionType]).*)*\r?\n\[QuestionType]:.*"

s = ("1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.\n"
    "(A)15kJ\n"
    "(B)23kJ\n"
    "(C)32kJ\n"
    "(D)50kJ\n\n"
    "[Answer]:(B)\n\n"
    "[QuestionType]:single_correct\n\n"
    "2. Which of the following statement is correct\n\n"
    "(A)Li is hander than the other alkali metals.\n"
    "(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.\n"
    "(C)Na2CO3 is pearl ash.\n"
    "(D)Berylium and Aluminium ions do not have strong tendency to form complexes like \n\n"
    "[Answer]:(C)\n\n"
    "[QuestionType]:single_correct")
    
print(re.findall(regex, s, re.M))

Output

['1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.\n(A)15kJ\n(B)23kJ\n(C)32kJ\n(D)50kJ\n\n[Answer]:(B)\n\n[QuestionType]:single_correct', '2. Which of the following statement is correct\n\n(A)Li is hander than the other alkali metals.\n(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.\n(C)Na2CO3 is pearl ash.\n(D)Berylium and Aluminium ions do not have strong tendency to form complexes like \n\n[Answer]:(C)\n\n[QuestionType]:single_correct']
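To get the nested list asked for in the question, each match can be split into lines and the blank lines dropped. A minimal sketch, assuming blank lines carry no meaning in the file; `split_questions`, `QUESTION_RE` and the shortened sample are hypothetical names and data, not part of the original answer.

```python
import re

QUESTION_RE = re.compile(
    r"^\d+\..*(?:\r?\n(?!\[QuestionType]).*)*\r?\n\[QuestionType]:.*",
    re.MULTILINE,
)

def split_questions(text):
    """Return one list per matched question, keeping only non-blank lines."""
    return [
        [line for line in block.splitlines() if line.strip()]
        for block in QUESTION_RE.findall(text)
    ]

# Shortened, hypothetical sample in the same layout as the file.
sample = (
    "1. First question\n"
    "(A)15kJ\n"
    "(B)23kJ\n\n"
    "[Answer]:(B)\n\n"
    "[QuestionType]:single_correct\n\n"
    "2. Second question\n\n"
    "(A)yes\n"
    "(B)no\n\n"
    "[Answer]:(A)\n\n"
    "[QuestionType]:single_correct"
)

for question in split_questions(sample):
    print(question)
```

`splitlines()` is used instead of `split("\n")` so that files with `\r\n` line endings do not leave a trailing `\r` on each item.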

Thân LƯƠNG Đình · Accepted Answer · 2020-10-19 13:42:43Z

First, extract the content of each question with a regex. Then split each question's content on \n.

You could try the following regex.

\d+\.[\s\S]+?QuestionType.*

I also tested it in Python.

import re
content = '''1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.
(A)15kJ
(B)23kJ
(C)32kJ
(D)50kJ

[Answer]:(B)

[QuestionType]:single_correct

2. Which of the following statement is correct

(A)Li is hander than the other alkali metals.
(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.
(C)Na2CO3 is pearl ash.
(D)Berylium and Aluminium ions do not have strong tendency to form complexes like 

[Answer]:(C)

[QuestionType]:single_correct
'''

splitQuestion = re.findall(r"\d+\.[\s\S]+?QuestionType.*", content)

result = []
for eachQuestion in splitQuestion:
    result.append(eachQuestion.split("\n"))

print(result)

Result:

[['1. 10 Liter sample of an ideal gas is expanded reversibly and isothermally at 300k from initial pressure of 10atm to final pressure of 1atm. The heat absorbed by gas during the process is approximately.', '(A)15kJ', '(B)23kJ', '(C)32kJ', '(D)50kJ', '', '[Answer]:(B)', '', '[QuestionType]:single_correct'], ['2. Which of the following statement is correct', '', '(A)Li is hander than the other alkali metals.', '(B)In solvay process NH3 is recovered when the solution containing NH4Cl is treated with H2O.', '(C)Na2CO3 is pearl ash.', '(D)Berylium and Aluminium ions do not have strong tendency to form complexes like ', '', '[Answer]:(C)', '', '[QuestionType]:single_correct']]
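If the text is read paragraph by paragraph, as in the question's python-docx loop, the grouping could also be done without first joining everything into one string. The following is a rough sketch under that assumption; `group_questions`, `START_RE`, `END_PREFIX` and the sample list are hypothetical, and it assumes every question ends with a paragraph that starts with [QuestionType]. It also skips empty paragraphs, so the result has no '' items.

```python
import re

START_RE = re.compile(r"^\d+\.")   # a paragraph that opens a question, e.g. "1."
END_PREFIX = "[QuestionType]"      # a paragraph that closes a question

def group_questions(paragraphs):
    """Group an iterable of paragraph texts into one list per question."""
    questions = []
    current = None  # None while we are outside a question
    for text in paragraphs:
        if current is None:
            if START_RE.match(text):
                current = [text]
        else:
            if text.strip():          # skip empty paragraphs
                current.append(text)
            if text.startswith(END_PREFIX):
                questions.append(current)
                current = None
    return questions

# Hypothetical paragraph texts standing in for the document contents.
sample_paragraphs = [
    "1. First question", "(A)15kJ", "(B)23kJ", "",
    "[Answer]:(B)", "", "[QuestionType]:single_correct", "",
    "2. Second question", "", "(A)yes", "(B)no", "",
    "[Answer]:(A)", "", "[QuestionType]:single_correct",
]

print(group_questions(sample_paragraphs))
```

With python-docx, the input would presumably be `(p.text for p in doc.paragraphs)`, using the same `doc` object as in the question.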
