I have a Python program that searches through a file for valid phone numbers according to a regex pattern. If it finds a match, it parses the number out and prints it on the screen. I want to modify it to recognize an extension if there is one. I added a second pattern (patStringExten), but I am unsure how to make it parse out the extension. Any help with this would be greatly appreciated!

import sys
import re

DEF_A_CODE = "None"

def usage() :
        print "Usage:"
        print "\t" + sys.argv[0] + " [<file>]"

def searchFile( fileName, pattern ) :

        fh = open( fileName, "r" )

        for l in fh :
                l = l.strip()

                        # Here's the actual search
                match = pattern.search( l )

                if match :
                        nr = match.groups()
                                # Note, from the pattern, that 0 may be null, but 1 and 2 must exist
                        if not nr[0] :
                                aCode = DEF_A_CODE
                        else :
                                aCode = nr[0]
                        print "area code: " + aCode + \
                                        ", exchange: " + nr[1] + ", trunk: " + nr[2]+ ", extension: " + nr[3]
                else :
                        print "NO MATCH: " + l

        fh.close()

def main() :

                # stick filename
        if len( sys.argv ) < 2 :  # no file name
           # assume telNrs.txt
                fileName = "telNrs.txt"
        else :
                fileName = sys.argv[1]


                # for legibility, Python supplies a 'verbose' pattern
                #               requires a special flag
        #patString = '(\d{3})*[ .\-)]*(\d{3})[ .\-]*(\d{4})'

        patString = r'''
                                                                # don't match beginning of string (takes care of 1-)
                (\d{3})?                # area code (3 digits) (optional)
                [ .\-)]*                # optional separator (any # of space, dash, or dot,
                                                                #   or closing ')' )
                (\d{3})                 # exchange, 3 digits
                [ .\-]*                 # optional separator (any # of space, dash, or dot)
                (\d{4})                 # number, 4 digits
                '''
        patStringExten = r'''
                                                                # don't match beginning of string (takes care of 1-)
                (\d{3})?                # area code (3 digits) (optional)
                [ .\-)]*                # optional separator (any # of space, dash, or dot,
                                                                #   or closing ')' )
                (\d{3})                 # exchange, 3 digits
                [ .\-]*                 # optional separator (any # of space, dash, or dot)
                (\d{4})                 # number, 4 digits
                [ .\-x]*
                [0-9]{1,4}
                '''




        # Here is what the pattern would look like as a regular pattern:
        #patString = r'(\d{3})\D*(\d{3})\D*(\d{4})'


        # Instead of creating a temporary object each time, we will compile this
        #               regexp once, and store this object

        pattern = re.compile( patString, re.VERBOSE )

        searchFile( fileName, pattern )

main()
  • What are you asking here? How to call searchFile with patStringExten instead of patString? How to call it twice, once with each? How to merge the two into a single pattern that accepts either version? How to break the matches into groups that you can pull out by name or number? Commented Apr 27, 2015 at 7:59
  • What would be one pattern that accepts either version? And how would I print out an extension if there is one? Commented Apr 27, 2015 at 8:01

1 Answer

I'm not sure what you're asking, but I'm going to take a guess.

First, your code is ignoring the new pattern you created. If you want to actually use that patStringExten pattern instead of the patString pattern, you have to pass it to the compile call:

pattern = re.compile(patStringExten, re.VERBOSE)

But if you do that, the matches still only have 3 groups, not 4. Why? Because you didn't put grouping parentheses around the extension. To fix that, just put them in: change [0-9]{1,4} to ([0-9]{1,4}).
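As a rough illustration of the grouping point, the sketch below compiles the question's extension pattern with and without parentheses around the final digits and compares the number of capture groups. The pattern is written on one line instead of in verbose form, and the names PREFIX, without_parens, with_parens and the sample string are made up for this example.

```python
import re

# The question's pattern up to (and including) the separator before the
# extension, collapsed to a single line for brevity.
PREFIX = r'(\d{3})?[ .\-)]*(\d{3})[ .\-]*(\d{4})[ .\-x]*'

without_parens = re.compile(PREFIX + r'[0-9]{1,4}')    # as in the question
with_parens = re.compile(PREFIX + r'([0-9]{1,4})')     # extension captured

sample = "800-555-1212 x34"   # hypothetical input line

# A compiled pattern reports how many capture groups it has.
print(without_parens.groups)                  # 3
print(with_parens.groups)                     # 4

# Both match the sample, but only the second exposes the extension.
print(without_parens.search(sample).groups())
print(with_parens.search(sample).groups())
```

Both patterns match the same text; the only difference is whether the extension digits are available through groups().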

And meanwhile, now you're only matching phone numbers with extensions, not both with and without. You could of course fix that by looping over the two patterns and doing the same thing for each, but it's probably better to merge them into one pattern, by making the last group optional. (You might want to make the last two lines, not just the last group, optional… but since the penultimate line is already a 0-or-more match, it's the same either way.) So, change that ([0-9]{1,4}) to ([0-9]{1,4})?.
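To see what the optional last group does, here is a minimal sketch that runs one merged pattern over numbers with and without an extension. The sample strings and the name merged are assumptions for illustration, not part of the original program.

```python
import re

merged = re.compile(r'''
    (\d{3})?        # area code (optional)
    [ .\-)]*        # optional separator
    (\d{3})         # exchange, 3 digits
    [ .\-]*         # optional separator
    (\d{4})         # number, 4 digits
    [ .\-x]*        # optional separator before an extension
    ([0-9]{1,4})?   # extension, 1 to 4 digits (optional)
    ''', re.VERBOSE)

samples = ["800-555-1212", "800-555-1212 x34", "555-1212"]  # made-up lines

# Groups that did not take part in the match come back as None.
results = [merged.search(s).groups() for s in samples]
for s, groups in zip(samples, results):
    print(s, groups)
```

Every sample yields a 4-tuple, so nr[3] always exists; it is simply None when there is no extension.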

Now your groups will have 4 elements instead of 3, so your existing code that reads nr[3] will get the extension instead of raising an IndexError. Be aware that nr[3] is None when the optional part was missing, and concatenating None to a string with + raises a TypeError rather than printing None.

But really, it's probably cleaner to rewrite the output with string formatting. For example:

if nr[3]:
    print "area code: {}, exchange: {}, trunk: {}, ext: {}".format(
        aCode, nr[1], nr[2], nr[3])
else:
    print "area code: {}, exchange: {}, trunk: {}".format(
        aCode, nr[1], nr[2])
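One practical reason to prefer formatting: a group that did not match is None, and + refuses to join a string with None, whereas format converts it. The small sketch below shows the difference; the variable names are made up, and the values stand in for what match.groups() would return.

```python
ext = None  # what nr[3] holds when the optional extension did not match

# Concatenation with + requires both operands to be strings.
try:
    text = "trunk: 1212, extension: " + ext
    concat_ok = True
except TypeError:
    concat_ok = False

# str.format converts each argument to a string first.
formatted = "trunk: {}, extension: {}".format("1212", ext)

print(concat_ok)   # False
print(formatted)   # trunk: 1212, extension: None
```

This is why the if nr[3]: check is still worth keeping: formatting avoids the exception, but without the check it would print the word None as the extension.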

Rather than show the whole thing put together in code, seeing the pattern on Debuggex seems more useful, so you can see how it works visually (try it against different strings, to make sure it matches everything you want the way you want it):

                        # don't match beginning of string (takes care of 1-)
(\d{3})?                # area code (3 digits) (optional)
[ .\-)]*                # optional separator (any # of space, dash, or dot,
                                                #   or closing ')' )
(\d{3})                 # exchange, 3 digits
[ .\-]*                 # optional separator (any # of space, dash, or dot)
(\d{4})                 # number, 4 digits
[ .\-x]*
([0-9]{1,4})?


Debuggex Demo
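For anyone who does want the pieces combined, here is one possible sketch. It is not the original program: the helper describe and the sample lines are hypothetical, it returns a string instead of printing directly, and it uses print(...) with a single argument so it runs the same way on Python 2.7 and Python 3.

```python
import re

DEF_A_CODE = "None"  # same default area code as in the question

PHONE = re.compile(r'''
    (\d{3})?        # area code (optional)
    [ .\-)]*        # optional separator
    (\d{3})         # exchange, 3 digits
    [ .\-]*         # optional separator
    (\d{4})         # number, 4 digits
    [ .\-x]*        # optional separator before an extension
    ([0-9]{1,4})?   # extension (optional)
    ''', re.VERBOSE)


def describe(line):
    """Describe the first phone number found in line, or return None."""
    match = PHONE.search(line.strip())
    if not match:
        return None
    a_code, exchange, trunk, ext = match.groups()
    text = "area code: {}, exchange: {}, trunk: {}".format(
        a_code or DEF_A_CODE, exchange, trunk)
    if ext:  # only mention the extension when one was matched
        text += ", ext: {}".format(ext)
    return text


for line in ["(800) 555-1212 x34", "555-1212", "no number here"]:
    print(describe(line) or "NO MATCH: " + line)
```

Note that the separator class before the extension only allows space, dot, dash and x, so a spelled-out "ext" would not be recognized by this pattern as written.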


4 Comments

This is exactly what I needed! Thank you! One more thing, in the print "area code: " + aCode + \ ", exchange: " + nr[1] + ", trunk: " + nr[2]+ ", extension: " + nr[3] line, how would I make it print the extension if there is one?
@Albert: You already wrote that part: ", extension: " + nr[3]. If you want to not print it if it's not there, instead of printing None, you can do if nr[3]: (that whole line) else: (that without the last part).
Thank you! I made the adjustments but for some reason it says it can't concatenate on this line: else : aCode = nr[0] print "area code: " + aCode + \ ", exchange: " + nr[1] + ", trunk: " + nr[2] Is there any reason why?
@Albert: I'd need to see the actual exception, but I'm guessing the problem is that one of your other values is None, and you're trying to concatenate a string and a None. Or you may have missed a quote or a comma or a plus somewhere. String formatting is better: print "area code: {}, exchange: {}, trunk: {}".format(aCode, nr[1], nr[2]). Hard to get wrong, and if one of those values isn't a string, it'll be converted to a string.
