Reading tabular data with Python

Question

I have a bunch of text files with tabular data. It looks like this:

 1. BRISTOL CITY             42  16  4  1  43  13   8  7  6  23  27   59
 2. Plymouth Argyle          42  18  3  0  47   6   5  4 12  14  23   53
 3. Swansea City             42  13  6  2  46  14   9  3  9  32  31   53
 4. Brighton & Hove Albion   42  15  3  3  39  13   5  8  8  13  21   51
 5. Luton Town               42  14  4  3  47  18   7  3 11  21  31   49
 6. Millwall                 42   9 10  2  27  13   5  8  8  18  27   46
 7. Portsmouth               42  10  5  6  34  20   9  3  9  24  32   46
 8. Northampton              42  13  6  2  40  17   4  5 12  14  27   45
 9. Swindon Town             42  14  4  3  41  17   3  7 11  21  39   45
10. Watford                  42  10  6  5  35  23   7  4 10  22  31   44
11. Queen's Park Rangers     42  10  4  7  34  24   6  6  9  20  25   42
12. Charlton Athletic        42  11  6  4  33  14   3  8 10  22  37   42
13. Bristol Rovers           42   7  9  5  25  19   6  7  8  10  17   42
14. Brentford                42   9  4  8  27  23   4  8  9  14  28   38
15. Southend United          42  10  6  5  35  18   2  7 12  14  36   37
16. Gillingham               42  13  4  4  38  18   2  3 16  13  41   37
17. Merthyr Town             42  10  4  7  27  17   1 10 10  12  31   36
18. Norwich City             42   8  7  6  29  26   5  3 13  22  45   36
19. Reading                  42   9  8  4  24  15   1  6 14  12  40   34
20. Exeter City              42  10  4  7  27  18   3  3 15  20  66   33

It's very regular, but there's no standard separator and the column widths are not standard from table to table (even within the same files). (Spaces alone aren't a sufficient delimiter, as many of the names contain spaces and in some places, columns are separated by only a single space.)

I want to parse this into Python objects, but it's not clear what the best way to do that is. Is there a way to use the csv module to parse it? Do I need a regex? Has someone written an awesome Python library for parsing tabular text files?

What happens when you try to use the csv module? Is it not working?
– monkut, Commented Dec 12, 2013 at 3:21
Is the second column the only one that can contain letters and spaces? Or can other columns be non-numeric as well? Can the second column contain numbers? A couple more complicated sample rows might be useful.
– jpmc26, Commented Dec 12, 2013 at 4:03
I've added some more complicated data rows here. Except for the name, the columns contain only numbers. When I use the csv module, I have to set a delimiter; a single space breaks up the names, and a two-space delimiter occasionally grafts a number onto the name.
– futuraprime, Commented Dec 12, 2013 at 5:10

koffein · Accepted Answer · 2013-12-12 04:48:39Z

1

I made a working regex. Look it up here for an explanation or to modify it.

The name of the line (like Accrington) is extracted with [\D]+?. That means "take as few non-digit characters as needed for the rest of the line to match" (+? is non-greedy). So you get the letters and only the whitespace inside the name, not the padding after it, and that is the name of your line.

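import re
# Optional leading spaces, the rank with a literal dot, a lazily matched name,
# then twelve whitespace-separated numeric columns.
pattern = re.compile(r"^\s*(\d+\.)\s*([\D]+?)" + r"\s+(\d+)"*12 + r"\s*$")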

Test

match = pattern.match("7. Accrington               22   5  3  3  26  17   1  5  5  22  31   20")
print(match.groups())
Out[133]: 
('7.',
 'Accrington',
 '22',
 '5',
 '3',
 '3',
 '26',
 '17',
 '1',
 '5',
 '5',
 '22',
 '31',
 '20')

match2 = pattern.match("91. Accrington Bay              22   5  3  3  26  17   1  5  5  22  31   20")
print(match2.groups())
Out[134]: 
('91.',
 'Accrington Bay',
 '22',
 '5',
 '3',
 '3',
 '26',
 '17',
 '1',
 '5',
 '5',
 '22',
 '31',
 '20')
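
For anyone who wants typed objects rather than tuples of strings, here is a minimal Python 3 sketch that applies the same idea to several rows at once. The names `ROW`, `Row`, and `parse_row` are made up for illustration, and it assumes team names never contain digits and that every row has exactly twelve numeric columns.

```python
import re
from typing import NamedTuple, Optional

# Rank with a literal dot, a lazily matched name, then twelve integer columns.
ROW = re.compile(r"^\s*(\d+)\.\s*(\D+?)" + r"\s+(\d+)" * 12 + r"\s*$")


class Row(NamedTuple):
    rank: int
    name: str
    stats: tuple


def parse_row(line: str) -> Optional[Row]:
    """Return a Row for a table line, or None if the line does not match."""
    m = ROW.match(line)
    if m is None:
        return None
    rank, name, *numbers = m.groups()
    return Row(int(rank), name, tuple(int(n) for n in numbers))


sample = """\
 1. BRISTOL CITY             42  16  4  1  43  13   8  7  6  23  27   59
 4. Brighton & Hove Albion   42  15  3  3  39  13   5  8  8  13  21   51
 6. Millwall                 42   9 10  2  27  13   5  8  8  18  27   46
"""

# Lines that do not look like table rows (headers, blanks) are skipped.
rows = [r for r in map(parse_row, sample.splitlines()) if r is not None]
```

Returning None for non-matching lines makes it easy to run this over a whole file that also contains headers or blank lines.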

edited Dec 12, 2013 at 4:48

answered Dec 12, 2013 at 4:01

1 Comment

futuraprime Over a year ago

Yep, this works, and is better than the regex I was working on, thanks! I was hoping there's a non-regex way to do it, but I'll probably just use this.

caleb531 · Accepted Answer · 2013-12-12 03:58:44Z

0

The simplest solution would be to use regular expressions.

You can use the split() function (part of Python's built-in re module) to split the data at every run of consecutive whitespace.

import re

data = '7. Accrington               22   5  3  3  26  17   1  5  5  22  31   20'
# Raw strings keep the backslashes in the patterns from being treated as string escapes.
for line in re.split(r'\n+', data):
    print(re.split(r'\s+', line))

which will print the following:

['7.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '20']

Note that the above example also handles multiple lines of data (assuming the lines are separated by one or more newlines).

edited Dec 12, 2013 at 3:58

answered Dec 12, 2013 at 3:52

3 Comments

DSM Over a year ago

If the OP's just splitting on consecutive whitespace, then there's no need for re: just line.split() will do it. But the OP would need to do at least a little more work, because the name part may contain multiple words. If the final terms are all numeric, though, he could recombine.
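
A rough sketch of the split-and-recombine idea from the comment above, in Python 3. The helper name `split_row` and its `n_numeric` parameter are hypothetical, and it assumes the last twelve whitespace-separated tokens on every row are the numeric columns.

```python
def split_row(line, n_numeric=12):
    """Split a table line into (rank, name, numbers) without a regex."""
    # Peel the numeric columns off the right; what remains is "rank. name".
    head, *numbers = line.rsplit(None, n_numeric)
    rank, name = head.split(None, 1)
    # Collapse any repeated spaces inside the name.
    name = " ".join(name.split())
    return int(rank.rstrip(".")), name, [int(n) for n in numbers]


parsed = split_row(
    "11. Queen's Park Rangers     42  10  4  7  34  24   6  6  9  20  25   42"
)
```

Because the split runs from the right, a single space between two numeric columns is enough, and spaces inside the name never matter.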

caleb531 Over a year ago

Good point; I have modified my example to only split at sequences of two or more whitespace characters.

DSM Over a year ago

That won't do it either: the OP warned that sometimes columns are only separated by a single space. :^)

Matt Williamson · Accepted Answer · 2013-12-12 04:12:40Z

0

skipinitialspace is what you need to use the csv module for this one.

$ cat << EOF > /tmp/sample.csv
> 7. Accrington               22   5  3  3  26  17   1  5  5  22  31   20
> 7. Accrington               22   5  3  3  26  17   1  5  5  22  31   20
> 8. Accrington               22   5  3  3  26  17   1  5  5  22  31   22
> 7. Accrington               22   5  3  3  26  17   1  5  5  22  31   21
> EOF
$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> reader = csv.reader(open('/tmp/sample.csv'), skipinitialspace=True, quoting=csv.QUOTE_NONE, delimiter=' ')
>>> for row in reader: 
...     print row
... 
['7.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '20']
['7.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '20']
['8.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '22']
['7.', 'Accrington', '22', '5', '3', '3', '26', '17', '1', '5', '5', '22', '31', '21']

Don't forget you can unpack the results for each row, like so:

>>> for pk, name, a, b, c, d, e, f, g, h, i, j, k, l in reader:

answered Dec 12, 2013 at 4:12

1 Comment

Tim Pierce Over a year ago

According to OP, many of the names (I assume this means the second field) contain spaces.
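
To illustrate the point in the comment above, here is a small Python 3 sketch, assuming the same reader settings as in the answer and twelve trailing numeric columns. A two-word name comes back as two fields, but the name can be recovered by slicing from the right.

```python
import csv
import io

text = "13. Bristol Rovers           42   7  9  5  25  19   6  7  8  10  17   42\n"
reader = csv.reader(io.StringIO(text), delimiter=" ",
                    skipinitialspace=True, quoting=csv.QUOTE_NONE)
row = next(reader)

# 'Bristol' and 'Rovers' arrive as separate fields, so unpacking into
# a fixed list of 14 variables would fail for this row.
# Take the twelve numeric columns from the right and rejoin the rest.
rank, name, stats = row[0], " ".join(row[1:-12]), row[-12:]
```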
