
I have some very big files (more than 100 million lines),
and I need to read their last line.
As I'm a Linux user, in a shell script I use 'tail' for that.

Is there a way to quickly read the last line of a file in Python?
Perhaps using 'seek', but I'm not familiar with it.

The best I have come up with is this:

from subprocess import run as srun

file = "/my_file"
proc = srun(['/usr/bin/tail', '-1', file], capture_output=True)
last_line = proc.stdout

All the other Python code I tried is slower than calling the external /usr/bin/tail.

I also read these threads, but they do not meet my needs, because I want fast execution without excessive memory use:
How to implement a pythonic equivalent of tail -F?
Head and tail in one line

Edit: I tried what I understood from the comments, and I get some strange behavior:

>>> with open("./Python/nombres_premiers", "r") as f:
...     a = f.seek(0,2)
...     l = ""
...     for i in range(a-2,0,-1):
...        f.seek(i)
...        l = f.readline() + l
...        if l[0]=="\n":
...           break
... 
1023648626
1023648625
1023648624
1023648623
1023648622
1023648621
1023648620
1023648619
1023648618
1023648617
1023648616
>>> l
'\n2001098251\n001098251\n01098251\n1098251\n098251\n98251\n8251\n251\n51\n1\n'
>>> with open("./Python/nombres_premiers", "r") as f:
...     a = f.seek(0,2)
...     l = ""
...     for i in range(a-2,0,-1):
...        f.seek(i)
...        l = f.readline()
...        if l[0]=="\n":
...           break
... 
1023648626
1023648625
1023648624
1023648623
1023648622
1023648621
1023648620
1023648619
1023648618
1023648617
1023648616
>>> l
'\n'

How can I get l = 2001098251?

  • The file object's seek() is your friend -- that's the same facility that tail itself uses. Commented Dec 6, 2024 at 23:35
  • When you say "How to implement a pythonic equivalent of tail -F" doesn't solve your problem, you're wrong -- some of the answers there do use seek() with the correct arguments to skip directly to the end, and so are just as efficient as tail itself. Just ignore any answer that doesn't refer to seek() and os.SEEK_END. Commented Dec 6, 2024 at 23:36
  • Well, I can reach the end of the file with f.seek(0,2), which returns an integer (the offset of the very end of the file). How do I get the last line? I don't know its length. Commented Dec 7, 2024 at 0:02
  • @Tawal You should seek(-2, os.SEEK_END) and then something like while f.read(1) != b'\n': f.seek(-2, os.SEEK_CUR) to get to the beginning of the last line. Commented Dec 7, 2024 at 0:18
  • The way tail does it is to rewind a bit from the end (1-4kb typically) and read a line at a time from there. If you want to get fancy you can rewind more until you find at least one newline between your location and the end of the file. Commented Dec 7, 2024 at 0:20
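
The seek-backwards loop suggested in the comments above can be sketched as follows. This is only a minimal sketch: it assumes the file is opened in binary mode (text mode does not allow non-zero seeks relative to the current position or the end), and the function name last_line_by_seek is hypothetical. The except clause covers files that are too short, or that contain a single line, where stepping backwards would move before offset 0.

```python
import os


def last_line_by_seek(path):
    """Return the last line of `path` as bytes, without its line feed."""
    with open(path, "rb") as f:
        try:
            # Start on the byte just before the (expected) final line feed.
            f.seek(-2, os.SEEK_END)
            # read(1) advances by one byte, so seeking -2 steps back by one.
            while f.read(1) != b"\n":
                f.seek(-2, os.SEEK_CUR)
        except OSError:
            # Moved before the start of the file: there is no earlier
            # line feed, so the last line starts at offset 0.
            f.seek(0)
        # The cursor now sits on the first byte of the last line.
        return f.readline().rstrip(b"\n")
```

The result is a bytes object; calling .decode() on it gives a str if the file's encoding is known. Only the last line is ever read, so memory use does not depend on the size of the file.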

2 Answers


tail doesn't support arbitrarily long lines -- it takes the last chunk of the file and iterates from there. Doing the same thing yourself could look like:

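def last_line(f, bufsize=4096):
    end_off = f.seek(0, 2)               # 2 == os.SEEK_END; returns the file size
    start = max(end_off - bufsize, 0)
    f.seek(start, 0)                     # rewind at most bufsize from the end
    if start > 0:
        f.readline()                     # we probably landed mid-line; skip the fragment
    lastline = None
    while (line := f.readline()):
        if line[-1] == '\n':
            lastline = line
        else:
            break # last line is not yet completely written; ignore it
    return lastline[:-1] if lastline is not None else None

import sys
# errors='ignore': the seek can land in the middle of a multi-byte character
with open(sys.argv[1], 'r', errors='ignore') as f:
    print(last_line(f))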

Note that if you want to continue to read new content as the file is edited over time, you should use inotify to watch for changes. https://stackoverflow.com/a/78969468/14122 demonstrates this.


5 Comments

Seeking to arbitrary offsets is undefined behavior in text mode, though. (There's no way to implement sensible, efficient arbitrary-offset seek for arbitrary character encodings.)
Truth, that. Should probably be ignoring encoding failures, at least on the very first read.
I get this error: Traceback (most recent call last): File "<stdin>", line 2, in <module> File "<stdin>", line 6, in last_line TypeError: 'NoneType' object is not subscriptable
@Tawal, I just added a guard so that if we find no valid lines inside the last 4kb we return None instead of failing with that error. If you choose, you can of course return the incomplete content instead, or you could turn the buffer size up and allow the last line to be more than 4kb. However, this certainly shouldn't be something that can happen with the file you showed where the lines are all quite short. I'd need a reproducer (inclusive of the data file or code that creates a data file with which the problem takes place) to speak further.
(If your real file weren't line-oriented at all but instead were NUL-delimited, of course, that's an easy way to get into this state -- you'd need to replace readline() appropriately; it also could probably happen with the prior code revision and a completely empty file -- but neither of those corner cases fit the scenario in the question).
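
The points raised in these comments (text-mode seeking, and a last line longer than the buffer) can be addressed together by reading in binary mode and growing the buffer until a line feed is found. The following is only a sketch under those assumptions, not the answer's own code: the name last_line_chunked is hypothetical, and unlike the function above it returns an unterminated final line instead of ignoring it.

```python
import os


def last_line_chunked(path, bufsize=4096):
    """Return the last line of `path` as bytes, without its line feed."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        while True:
            # Read the last `bufsize` bytes (or the whole file if it is smaller).
            start = max(size - bufsize, 0)
            f.seek(start)
            chunk = f.read(size - start)
            # Drop one trailing line feed so that rfind() locates the line
            # feed that ends the previous line, not the one ending the file.
            body = chunk[:-1] if chunk.endswith(b"\n") else chunk
            cut = body.rfind(b"\n")
            if cut != -1 or start == 0:
                # Either the start of the last line is inside the chunk, or
                # the chunk already covers the whole file (cut == -1 -> 0).
                return body[cut + 1:]
            # The last line is longer than the chunk: retry with twice as much.
            bufsize *= 2
```

Working on bytes sidesteps the encoding problem, because a line feed byte can be searched for without decoding anything; the returned bytes can be decoded afterwards if the encoding is known.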
0

Using seek(), read() and readline(),
I can quickly retrieve the last line of a text file:

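with open("My_File", "r") as f:
    n = f.seek(0, 2)                  # offset of the end of the file
    for i in range(n-2, 0, -1):       # n-2 skips the final line feed
        f.seek(i)
        if f.read(1) == "\n":         # found the line feed ending the previous line
            s = f.readline().replace("\n", "")
            break
    else:
        # no line feed found after the first character: the file is a single line
        f.seek(0)
        s = f.read().replace("\n", "")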

Edit: changed range(n-2, 1, -1) to range(n-2, 0, -1) in case the file has only one line.
Edit 2: replaced s = f.readline()[:-1] with s = f.readline().replace("\n", "") in case there is no line feed character.
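
The reason for the second edit can be shown with a short example. This is just an illustration: the two sample strings and the helper name strip_newline are made up for the purpose, and the helper is one possible alternative to replace("\n", "").

```python
def strip_newline(line: str) -> str:
    """Drop a single trailing line feed if there is one (hypothetical helper)."""
    return line[:-1] if line.endswith("\n") else line


complete = "2001098251\n"  # last line terminated by a line feed
partial = "2001098251"     # last line without a terminating line feed

# Unconditional slicing is fine when the line feed is present...
print(repr(complete[:-1]))
# ...but silently drops a real character when it is absent.
print(repr(partial[:-1]))
# The guarded helper leaves an unterminated line intact.
print(repr(strip_newline(complete)), repr(strip_newline(partial)))
```

Both replace("\n", "") and the guarded slice avoid losing the final digit when the file does not end with a line feed.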
