I am trying to extract a substring from a string in Python.
My data file contains lines of the Quran, each marked with its chapter and verse number at the beginning of the string. I want to extract the first number and the second number and write them to a line in another text file. Here is an example of a few lines of the text file:
2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.
2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.
As you can see, the chapter and verse numbers can contain multiple digits, so just counting a fixed number of characters from the start of the string would not be adequate. Is there a way of using regular expressions to extract, as strings, the first number (chapter) and the second number (verse)?
The code I am writing this for will write the chapter and verse strings to an ARFF file. An example of a line in the ARFF file would be:
1,0,0,0,0,0,0,0,0,2,12
where the last two values are the chapter and verse.
Here is the for loop that writes, for each verse, the attributes that I am interested in. I then want to write the chapter and verse at the end of the line by using regular expressions to extract the relevant substring from each line.
for line in verses:
    for item in topten:
        count = line.count(item)
        ARFF_FILE.write(str(count) + ",")
    # Here is where I could use regular expressions to extract the desired substrings
    # (chapter and verse), then write these to the end of a line in the ARFF file.
    ARFF_FILE.write("\n")
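One possible way to fill in the placeholder step in the loop above is `re.match` with two capture groups. This is only a minimal sketch, assuming every data line starts with `chapter|verse|`; the names `PREFIX` and `chapter_and_verse` are hypothetical and not part of the code above.

```python
import re

# Two capture groups: the digits before the first pipe and the digits
# before the second pipe. PREFIX is a hypothetical name for this sketch.
PREFIX = re.compile(r"^(\d+)\|(\d+)\|")


def chapter_and_verse(line):
    """Return (chapter, verse) as strings, or None if the prefix is missing."""
    match = PREFIX.match(line)
    if match is None:
        return None
    # group(0) would be the whole matched prefix; the numbers are groups 1 and 2.
    return match.group(1), match.group(2)


sample = "2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand."
print(chapter_and_verse(sample))  # ('2', '242')
```

Inside the loop, the two strings could then be written with something like `ARFF_FILE.write(chapter + "," + verse)` before the newline.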
I think a regular expression like the following should work, using group(1) to get the chapter number (the first number, before the first pipe):
"^(\d+)\|(\d+)\|"
The verse number (the second number) should then be available from group(2),
but I don't know how to implement this in Python. Does anyone have any ideas?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Response to a question:
I have just tried to implement your technique but am getting an "IndexError: list index out of range". My code is:
for line in verses:
    for item in topten:
        parts = line.split('|')
        count = line.count(item)
        ARFF_FILE.write(str(count) + ",")
    ARFF_FILE.write(parts[0] + ",")
    ARFF_FILE.write(parts[1])
    ARFF_FILE.write("\n")
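An error like this would be expected if some line contains no `|` at all (for example a blank line at the end of the file), because `split('|')` then returns a single-element list and `parts[1]` does not exist. That is a guess about the data, not something confirmed here. A minimal sketch of a guarded version follows, assuming malformed lines can simply be skipped; the function name `write_arff_rows` and the sample inputs are hypothetical.

```python
import io


def write_arff_rows(verses, topten, out):
    """Write one comma-separated row per well-formed line: counts, chapter, verse."""
    for line in verses:
        # Split off at most two fields so pipes inside the verse text are left alone.
        parts = line.rstrip("\n").split("|", 2)
        if len(parts) < 3:
            continue  # assumption: blank or malformed lines are skipped
        chapter, verse = parts[0], parts[1]
        counts = [str(line.count(item)) for item in topten]
        out.write(",".join(counts + [chapter, verse]) + "\n")


# Usage with an in-memory file standing in for ARFF_FILE.
buf = io.StringIO()
write_arff_rows(
    ["2|12|they are the ones who make mischief\n", "\n"],
    ["they", "mischief"],
    buf,
)
print(buf.getvalue(), end="")  # 1,1,2,12
```

Splitting once per line, outside the inner loop, also avoids repeating the same work for every item in `topten`.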