0

I have a bunch of text files with tabular data. It looks like this:

 1. BRISTOL CITY             42  16  4  1  43  13   8  7  6  23  27   59
 2. Plymouth Argyle          42  18  3  0  47   6   5  4 12  14  23   53
 3. Swansea City             42  13  6  2  46  14   9  3  9  32  31   53
 4. Brighton & Hove Albion   42  15  3  3  39  13   5  8  8  13  21   51
 5. Luton Town               42  14  4  3  47  18   7  3 11  21  31   49
 6. Millwall                 42   9 10  2  27  13   5  8  8  18  27   46
 7. Portsmouth               42  10  5  6  34  20   9  3  9  24  32   46
 8. Northampton              42  13  6  2  40  17   4  5 12  14  27   45
 9. Swindon Town             42  14  4  3  41  17   3  7 11  21  39   45
10. Watford                  42  10  6  5  35  23   7  4 10  22  31   44
11. Queen's Park Rangers     42  10  4  7  34  24   6  6  9  20  25   42
12. Charlton Athletic        42  11  6  4  33  14   3  8 10  22  37   42
13. Bristol Rovers           42   7  9  5  25  19   6  7  8  10  17   42
14. Brentford                42   9  4  8  27  23   4  8  9  14  28   38
15. Southend United          42  10  6  5  35  18   2  7 12  14  36   37
16. Gillingham               42  13  4  4  38  18   2  3 16  13  41   37
17. Merthyr Town             42  10  4  7  27  17   1 10 10  12  31   36
18. Norwich City             42   8  7  6  29  26   5  3 13  22  45   36
19. Reading                  42   9  8  4  24  15   1  6 14  12  40   34
20. Exeter City              42  10  4  7  27  18   3  3 15  20  66   33

It's very regular, but there's no standard separator, and the column widths vary from table to table (even within the same file). Spaces alone aren't a sufficient delimiter: many of the names contain spaces, and in some places columns are separated by only a single space.

I want to parse this into Python objects, but it's not clear what the best way to do that is. Is there a way to use the csv module to parse it? Do I need to use regex? Has someone written an awesome Python library for parsing tabular text files?

5 Comments

  • What happens when you try to use the csv module? Is it not working? Commented Dec 12, 2013 at 3:21
  • Is '\t' the delimiter? Commented Dec 12, 2013 at 3:23
  • You could use regex to match each element per line. Commented Dec 12, 2013 at 3:23
  • Is the second column the only one that can contain letters and spaces? Or can other columns be non-numeric as well? Can the second column contain numbers? A couple more complicated sample rows might be useful. Commented Dec 12, 2013 at 4:03
  • I've added some more complicated data rows here. Except for the name, the columns contain only numbers. When I use the CSV module, I have to set a delimiter; one space breaks up the names and two spaces occasionally grafts a number onto the name. Commented Dec 12, 2013 at 5:10

3 Answers

1

Made a working regex. Look it up here for explanation/modifying.

The name (like Accrington) is captured by [\D]+?, which means "take as few non-digit characters as are needed for the rest of the line to match" (+? is the non-greedy form of +). Because each numeric column that follows must be preceded by whitespace, the group ends up holding the letters and inner spaces of the name, without the trailing padding.

import re
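# \. matches a literal dot after the rank; the leading \s* allows right-aligned ranks like " 1."
pattern = re.compile(r"^\s*(\d+\.)\s*([\D]+?)" + r"\s+(\d+)"*12 + r"\s*$")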

Test

match = pattern.match("7. Accrington               22   5  3  3  26  17   1  5  5  22  31   20")
print match.groups()
Out[133]: 
('7.',
 'Accrington',
 '22',
 '5',
 '3',
 '3',
 '26',
 '17',
 '1',
 '5',
 '5',
 '22',
 '31',
 '20')

match2 = pattern.match("91. Accrington Bay              22   5  3  3  26  17   1  5  5  22  31   20")
print match2.groups()
Out[134]: 
('91.',
 'Accrington Bay',
 '22',
 '5',
 '3',
 '3',
 '26',
 '17',
 '1',
 '5',
 '5',
 '22',
 '31',
 '20')
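
To get from a tuple of strings to actual Python objects, one option is to feed the groups into a namedtuple and convert the numeric fields to int. This is only a sketch, written for Python 3: the names `Row` and `parse_line` and the fields `rank`, `name`, and `stats` are made up for illustration, the meaning of the 12 numeric columns is not stated in the question (so they are kept as one tuple), and it assumes names never contain digits.

```python
import re
from collections import namedtuple

# Hypothetical container: the 12 numeric columns are kept together as a
# tuple of ints because the question does not say what each one means.
Row = namedtuple("Row", ["rank", "name", "stats"])

# Rank, literal dot, lazily matched name, then exactly 12 integer columns.
PATTERN = re.compile(r"^\s*(\d+)\.\s*(\D+?)" + r"\s+(\d+)" * 12 + r"\s*$")


def parse_line(line):
    """Return a Row for a table line, or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    rank, name, *numbers = match.groups()
    return Row(int(rank), name, tuple(int(n) for n in numbers))


row = parse_line(" 4. Brighton & Hove Albion   42  15  3  3  39  13   5  8  8  13  21   51")
print(row.name, row.stats[-1])
```

Returning None for non-matching lines makes it easy to skip headers or blank lines when looping over a whole file.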

1 Comment

Yep, this works, and is better than the regex I was working on, thanks! I was hoping there's a non-regex way to do it, but I'll probably just use this.
0

The simplest solution would be to use regular expressions.

You can use the split() function (part of Python's built-in re module) to split the data at every run of consecutive whitespace.

import re

data = '7. Accrington               22   5  3  3  26  17   1  5  5  22  31   20'
for line in re.split('\n+', data):
    print(re.split(r'\s+', line))  # raw string so \s is passed to the regex engine as-is

which will print the following:

['7.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '20']

Note that the above example also handles multiple lines of data (assuming the lines are separated by one or more newlines).

3 Comments

If the OP's just splitting on consecutive whitespace, then there's no need for re: just line.split() will do it. But the OP would need to do at least a little more work, because the name part may contain multiple words. If the final terms are all numeric, though, he could recombine.
Good point; I have modified my example to only split at sequences of two or more whitespace characters.
That won't do it either: the OP warned that sometimes columns are only separated by a single space. :^)
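
Following the first comment, a regex-free variant seems possible if the last 12 fields are always numeric: split from the right, so that whatever remains on the left is the rank plus the name. A rough sketch in Python 3 (the helper name `split_row` and the fixed count of 12 numeric columns are assumptions based on the sample data):

```python
def split_row(line, n_numeric=12):
    """Split a table line into (rank, name, numbers) without regular expressions."""
    # rsplit with sep=None splits on runs of whitespace, working from the right,
    # so spaces inside the team name are left alone.
    head, *numbers = line.rsplit(None, n_numeric)
    if len(numbers) != n_numeric:
        raise ValueError("expected %d numeric columns: %r" % (n_numeric, line))
    # head looks like "11. Queen's Park Rangers"; the rank ends at the first dot.
    rank, _, name = head.partition(".")
    return int(rank), name.strip(), [int(n) for n in numbers]


print(split_row("11. Queen's Park Rangers     42  10  4  7  34  24   6  6  9  20  25   42"))
```

Because the split works on any run of whitespace, it does not matter whether two numeric columns are separated by one space or several.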
0

To use the csv module for this one, set skipinitialspace=True so that the extra spaces following a delimiter are ignored.

$ cat << EOF > /tmp/sample.csv
> 7. Accrington               22   5  3  3  26  17   1  5  5  22  31   20
> 7. Accrington               22   5  3  3  26  17   1  5  5  22  31   20
> 8. Accrington               22   5  3  3  26  17   1  5  5  22  31   22
> 7. Accrington               22   5  3  3  26  17   1  5  5  22  31   21
> EOF
$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> reader = csv.reader(open('/tmp/sample.csv'), skipinitialspace=True, quoting=csv.QUOTE_NONE, delimiter=' ')
>>> for row in reader: 
...     print row
... 
['7.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '20']
['7.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '20']
['8.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '22']
['7.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '21']

Don't forget you can unpack the results for each row, like so:

>>> for pk, name, a, b, c, d, e, f, g, h, i, j, k, l in reader: 

1 Comment

According to OP, many of the names (I assume this means the second field) contain spaces.
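
One way to reconcile the csv approach with multi-word names, sketched here under the assumption that every row ends in exactly 12 numeric columns, is to let csv.reader over-split the name and then join the middle fields back together. The helper name `read_rows` is hypothetical, the code targets Python 3, and joining with a single space would collapse any run of spaces inside a name.

```python
import csv
import io


def read_rows(lines, n_numeric=12):
    """Yield (rank, name, numbers) tuples from an iterable of table lines."""
    reader = csv.reader(lines, delimiter=" ", skipinitialspace=True,
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        # Drop any empty fields, e.g. ones produced by trailing spaces.
        row = [field for field in row if field]
        if len(row) < n_numeric + 2:
            continue  # blank or malformed line
        rank = row[0]
        name = " ".join(row[1:-n_numeric])  # re-join the over-split name
        numbers = [int(n) for n in row[-n_numeric:]]
        yield rank, name, numbers


sample = io.StringIO(
    "12. Charlton Athletic        42  11  6  4  33  14   3  8 10  22  37   42\n"
    "13. Bristol Rovers           42   7  9  5  25  19   6  7  8  10  17   42\n"
)
for rank, name, numbers in read_rows(sample):
    print(rank, name, numbers[-1])
```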
