
I have a csv file containing links to webpages. I'm collecting data from each link and saving it to a separate csv file.
When I have to resume from the point where I left off, I currently have to manually delete the already-processed entries from the csv file and then run the code again.
I went through the documentation for the csv module, but couldn't find any function that serves this purpose.
I also went through other questions on Stack Overflow and elsewhere, but none of them helps.
Is there a way to delete rows the way I want to?

Here is what I have right now:

import pandas as pd

df = pd.read_csv("All_Links.csv")

for i in df.index:
    try:
        url = df.loc[i, 'MatchLink']  # .ix is deprecated; .loc does the same lookup

        # code to process the data in the link

        # made sure that processing has finished
        # Now need to delete that row
    except Exception:
        break
  • Deleting content from the middle of a file can only be accomplished by reading the file and writing back everything except the line(s) you want to skip. You can read in all of the lines of a CSV, splice the array, and then write the array back out to a file, but that accomplishes the same thing with greater memory requirements. Commented Aug 17, 2013 at 8:06
  • Have you considered using df.drop(i)? Look at the api doc: pandas.pydata.org/pandas-docs/stable/generated/… Commented Aug 20, 2013 at 18:17
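Following the second comment, a row can be removed from the in-memory DataFrame with drop (a minimal sketch with made-up data; note that dropping from the DataFrame does not by itself change the csv file on disk):

import pandas as pd

# a tiny stand-in frame (hypothetical data)
df = pd.DataFrame({'MatchLink': ['http://a', 'http://b', 'http://c']})

# drop the row whose index label is 0; only the in-memory
# DataFrame changes, not the file it was read from
df = df.drop(0)

print(list(df['MatchLink']))

To make the deletion stick, the remaining frame still has to be written back with to_csv.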

2 Answers


If you want to write back to the csv file only the data that hasn't been processed yet (that is, delete only the data that has been processed), you can modify your algorithm to:

import pandas as pd

df = pd.read_csv("All_Links.csv")

for i in df.index:
    try:
        url = df.loc[i, 'MatchLink']
        # code to process the data in the link
        # made sure that processing has finished
        # row i is done, so keep only the rows after it
        df.iloc[i + 1:].to_csv('All_Links.csv', index=False)
    except Exception:
        break

But this rewrites the file on every iteration; it may be better to remember where you stopped and write once, after the loop finishes:

import pandas as pd

df = pd.read_csv("All_Links.csv")

unprocessed_from = len(df)  # if the loop finishes, nothing is left over
for i in df.index:
    try:
        url = df.loc[i, 'MatchLink']
        # code to process the data in the link
        # made sure that processing has finished
    except Exception:
        # something broke, so row i was not processed
        unprocessed_from = i
        break

# Now write the rest of the unprocessed lines back to the csv file
df.iloc[unprocessed_from:].to_csv('All_Links.csv', index=False)



Since you are already reading the whole file into the dataframe, you can just start iterating from the point where you left off. Let's say you left off at i=23; you can do:

import pandas as pd

df = pd.read_csv("All_Links.csv")

last_line_number = 23
for i in df.index[last_line_number:]:
    try:
        url = df.loc[i, 'MatchLink']
        # code to process the data in the link
        # made sure that processing has finished
    except Exception:
        break

This is the simplest way. Something more robust would be to use two files: one for lines still to be processed and one for lines already processed.
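The two-file idea could be sketched like this (filenames are hypothetical, and the demo input is created in-line; on restart, rows already logged in the "done" file are skipped, and a row is logged only after it succeeds):

import csv
import os

TODO = 'to_process.csv'    # hypothetical input: one link per row
DONE = 'processed.csv'     # hypothetical log of finished links

# demo input (in practice this file already exists)
with open(TODO, 'w', newline='') as f:
    csv.writer(f).writerows([['http://a'], ['http://b'], ['http://c']])

# count rows finished in a previous run, so we can skip them
done = 0
if os.path.exists(DONE):
    with open(DONE, newline='') as f:
        done = sum(1 for _ in csv.reader(f))

with open(TODO, newline='') as src, open(DONE, 'a', newline='') as out:
    writer = csv.writer(out)
    for n, row in enumerate(csv.reader(src)):
        if n < done:
            continue            # already handled in an earlier run
        # ... fetch and process row[0] here ...
        writer.writerow(row)    # mark as done only after success
        out.flush()             # so a crash mid-loop loses nothing

This never rewrites the large input file; it only appends one small row per processed link.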

4 Comments

Thanks for the answer; yes, that's one way to do it. But I'd wait to see if someone can answer the original question, i.e. "How could I delete the row", which would be best for my application.
Unfortunately, with text files the only way is to write a new file, or to overwrite the existing one with the lines you want each time. This is expensive. There is no way to delete just one line.
:-/ yeah, you are right. There are about 100,000 rows, and processing takes place in a single loop; any file handling inside the loop makes it super expensive. So I think @viktor's method is the best I can do.
yes, that's a practical solution which should be performant enough, and it is more complete than mine.
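Given the cost concern raised above, one alternative (not from the thread; filenames and helpers are hypothetical) is to persist only the index of the last processed row in a tiny checkpoint file, so the 100,000-row csv is never rewritten at all:

import os

CHECKPOINT = 'last_index.txt'   # hypothetical one-line checkpoint file

def load_checkpoint():
    """Return the index to resume from (0 on a fresh start)."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip()) + 1
    return 0

def save_checkpoint(i):
    """Record that row i finished; rewriting one tiny file is cheap."""
    with open(CHECKPOINT, 'w') as f:
        f.write(str(i))

start = load_checkpoint()
for i in range(start, 5):       # stand-in for df.index[start:]
    # ... process row i here ...
    save_checkpoint(i)

On restart the loop picks up right after the last row that was checkpointed, and the original csv stays untouched.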
