
I am trying to extract a substring from a string in Python.

My data file contains lines of the Quran, each marked with the chapter and verse number at the beginning of the string. I want to extract the first number and the second number and write them to a line in another text file. Here is an example of a few lines of the text file:

2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.
2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.

As you can see, the chapter and verse can each contain multiple digits, so just counting a fixed number of characters from the start of the string would not be adequate. Is there a way of using regular expressions to extract, as strings, the first number (chapter) and the second number (verse)?

The code I am writing this for will write the chapter and verse strings to an ARFF file. An example of a line in the ARFF file would be:

1,0,0,0,0,0,0,0,0,2,12

where the last two values are the chapter and verse.

Here is the for loop that writes, for each verse, the attributes I am interested in. I then want to write the chapter and verse to the end of each line, using regular expressions to extract the relevant substrings.

for line in verses:
    for item in topten:
        count = line.count(item)
        ARFF_FILE.write(str(count) + ",")
    # Here is where i could use regular expressions to extract the desired substring 
    # verse and chapter then write these to the end of a line in the arff file.
    ARFF_FILE.write("\n")

I think the regular expression should be something like this, with one capturing group for each number:

"^(\d+)\|(\d+)\|" 

The chapter number (the first number, before the first pipe) would then be group(1) and the verse number group(2),

but I don't know how to implement this in Python. Does anyone have any ideas?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Response to a question:

I have just tried to implement your technique but am getting "IndexError: list index out of range". My code is:

for line in verses:
 for item in topten:
     parts = line.split('|')

     count = line.count(item)
     ARFF_FILE.write(str(count) + ",")
 ARFF_FILE.write(parts[0] + ",")
 ARFF_FILE.write(parts[1])  
 ARFF_FILE.write("\n")
    What is topten? You don't instantiate it anywhere in the code you posted. In general it's unclear what your input is and what your desired output is. Commented Mar 28, 2011 at 18:15

3 Answers

4

If all your lines are formatted like A|B|C, then you don't need any regex, just split it.

for line in fp:
    parts = line.split('|') # or line.split('|', 2) if the last part can contain |
    # use parts[0], parts[1]

3 Comments

But I'm already using a for loop over the lines in the corpus, so splitting it is not really an option.
@user680466: I'm afraid I don't understand you. I'm not saying to throw another loop in there, I'm saying that your loop should do the split.
@user680466: Move parts = line.split(...) out of the inner loop, to just before or after it. It only needs to run once per line, and inside the inner loop parts is never assigned if topten is empty.
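
One possible cause of the IndexError, assuming the input contains a blank or otherwise pipe-free line (the question does not show the whole file, so this is a guess): str.split always returns at least one element, so parts[1] fails whenever the line has no pipe in it.

```python
# A line without any '|' splits into a single-element list.
blank = "\n".split("|")
good = "2|12|some text".split("|")

try:
    blank[1]  # there is no second element
    failed = False
except IndexError:
    failed = True
```

Checking len(parts) before indexing, or skipping empty lines, would avoid the error in that case.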
0

I think the easiest way would be to use re.split() to get the text of the verses and re.findall() to get the chapter and verse numbers. The results are stored in lists that can be used later. Here is an example of the code:

#!/usr/bin/env python

import re

# string to be parsed
Quran = '''2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.
2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.'''

# list containing the text of all the verses
# (re.split leaves an empty string before the first match, and each verse
# keeps its trailing newline, so drop the former and strip the latter)
verses = [text.strip() for text in re.split(r'[0-9]+\|[0-9]+\|', Quran) if text]

# list containing the chapter and verse numbers:
#
#   the pattern is the same as above (r'[0-9]+\|[0-9]+\|') except that the
#   last pipe character is omitted, so that splitting the match on "|"
#   later does not leave an empty string at the end of the list
#
chapter_verse = re.findall(r'[0-9]+\|[0-9]+', Quran)


# loop over both lists in step, assuming len(verses) == len(chapter_verse)
for numbers, text in zip(chapter_verse, verses):
    chapterNumber, verseNumber = numbers.split("|")
    print("Chapter :", chapterNumber, "\tVerse :", verseNumber)
    print(text)

Comments

-1

With parentheses? Isn't that how all regular expressions work?
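
For completeness, a sketch of the regex route the question asks about, using re.match from the standard library. Capturing groups are numbered from 1, and group(0) is the entire match. The helper name chapter_and_verse is made up for this example.

```python
import re

# One capturing group per number; \d+ allows multi-digit values.
PREFIX = re.compile(r"^(\d+)\|(\d+)\|")


def chapter_and_verse(line):
    """Return (chapter, verse) as strings, or None if the prefix is missing."""
    m = PREFIX.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)


result = chapter_and_verse("2|242|Thus doth Allah Make clear His Signs to you")
whole = PREFIX.match("2|242|Thus doth").group(0)  # the full matched prefix
```

Returning None for lines that do not match lets the caller skip blank or malformed lines instead of raising.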

1 Comment

Forget what I previously said. I have just tried to implement your technique but am getting "IndexError: list index out of range". My code is: for line in verses: for item in topten: parts = line.split('|') count = line.count(item) ARFF_FILE.write(str(count) + ",") ARFF_FILE.write(parts[0]) ARFF_FILE.write(parts[1]) ARFF_FILE.write("\n")
