0

I have a large file from a proprietary archive format. Unzipping this archive gives a file that has no extension, but the data inside is comma-delimited. Adding a .csv extension or simply opening the file with Excel will work.

I have about 375-400 of these files, and I'm trying to extract a chunk of rows (about 13,500 out of 1.2M+ rows) between a keyword "Point A" and another keyword "Point B".

I found some code on this site that I think is extracting the data correctly, but I'm getting an error:

AttributeError: 'list' object has no attribute 'rows' when trying to save out the file. Can somebody help me get this data to save into a csv?

import re
import csv
import time

print(time.ctime())

file = open('C:/Users/User/Desktop/File with No Extension That\'s Very Similar to CSV', 'r')
data = file.read()
x = re.findall(r'Point A(.*?)Point B', data,re.DOTALL)

name = "C:/Users/User/Desktop/testoutput.csv"
with open(name, 'w', newline='') as file2:
    savefile = csv.writer(file2)
    for i in x.rows:
        savefile.writerow([cell.value for cell in i])

print(time.ctime())

Thanks in advance, any help would be much appreciated.

2
  • 2
    re.findall() returns a list. The error tells you that list objects have no rows attribute. You should look at what x contains and see what you can do from there Commented Jul 7, 2015 at 19:20
  • 2
    Can you please edit your question and fix the quotation marks? I'm sorry, I'm nitpicking, but man having the coloring wrong is confusing. Commented Jul 7, 2015 at 19:21

3 Answers 3

3

The following should work nicely. As mentioned, your regex usage was almost correct. It is possible to still use the Python CSV library to do the CSV processing by converting the found text into a StringIO object and passing that to the CSV reader:

import re
import csv
import time
import StringIO

print(time.ctime())

input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"

with open(input_name, 'r') as f_input, open(output_name, 'wb') as f_output:
    # Read whole file in
    all_input = f_input.read()  

    # Extract interesting lines
    ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]

    # Convert into a file object and parse using the CSV reader
    fab_input = StringIO.StringIO(ab_input)
    csv_input = csv.reader(fab_input)
    csv_output = csv.writer(f_output)

    # Iterate a row at a time from the input
    for input_row in csv_input:
        # Skip any empty rows
        if input_row:
            # Write row at a time to the output
            csv_output.writerow(input_row)

print(time.ctime())

You have not given us an example from your CSV file, so if there are problems, you might need to configure the CSV 'dialect' to process it better.

Tested using Python 2.7

Sign up to request clarification or add additional context in comments.

3 Comments

Hi Martin, I had a quick follow-up if you don't mind me asking. If I were to modify the re.findall command to go from "Point A" to the end of the file, how would I go about doing that? I've been using this quite a bit, but the file formats have changed and there no longer is a "Point B" for the end position. Thanks!
You could try changing it to r'Point A(.*)'
Thanks again! Tried (.*?) but not (.*).
1

You have 2 problems here: the first is related to the regular expression and the other is about the list syntax.

  1. Getting what you want

    The way you are using the regular expression will return to you a list with a single value (all lines into an unique string).

    Probably there is a better way of doing this but I would go now with something like this:

    with open('bla', 'r') as input:
        data = input.read()
        x = re.findall(r'Point A(.*?)Point B', data, re.DOTALL)[0]
        x = x.splitlines(False)[1:]
    

    That's not pretty but will return a list with all values between those two points.

  2. Working with lists

    There is no rows attribute inside lists. You just have to iterate over it:

    for i in x:
        do what you have to do
    

    See, I'm not familiar to the csv library but it looks that you will have to perform some manipulations to the i value before adding it to the library.

IMHO, I would avoid using CSV format since it is kind of "locale dependent" so it may not work as expected depending the settings your end-users may have on OS.

Comments

0

Updating the code so that @Martin Evans answer works on the latest Python version.

import re
import csv
import time
import io

print(time.ctime())

input_name = "C:/Users/User/Desktop/File with No Extension That's Very Similar to CSV"
output_name = "C:/Users/User/Desktop/testoutput.csv"

with open(input_name, 'r') as f_input, open(output_name, 'wt') as f_output:
    # Read whole file in
    all_input = f_input.read()  

    # Extract interesting lines
    ab_input = re.findall(r'Point A(.*?)Point B', all_input, re.DOTALL)[0]

    # Convert into a file object and parse using the CSV reader
    fab_input = io.StringIO(ab_input)
    csv_input = csv.reader(fab_input)
    csv_output = csv.writer(f_output)

    # Iterate a row at a time from the input
    for input_row in csv_input:
        # Skip any empty rows
        if input_row:
            # Write row at a time to the output
            csv_output.writerow(input_row)

print(time.ctime())

Also, by using 'wt' instead of 'wb' one can avoid

"TypeError: a bytes-like object is required, not 'str'"

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.