I need to process 100 gigabytes of logs in a weird format, then do some analysis on the results.
The initial parsing and CLI parts are done. I tried them on 1 GB of test data and it took around a minute. I ran a sanity check that just copies *standard-input* to *standard-output*, and it showed that most of the time is spent in the reading part. Python did the same thing in a couple of seconds, if that.
To generate sample data:
    yes "$(printf 'A%.0s' {1..10})" | head -c 1G > sample.txt
Common Lisp code:

    #!/usr/bin/env -S sbcl --script
    (defun main ()
      (loop for line = (read-line *standard-input* nil)
            while line
            do (write-string line)
               (write-char #\Newline)))

    (eval-when (:execute)
      (main))
Python code:
    #!/usr/bin/env python3
    import sys

    def main():
        for line in sys.stdin:
            sys.stdout.write(line)

    if __name__ == "__main__":
        main()
To run:
    time cat data/sample.txt | ./test.lisp > result_lisp.txt
    # real 1m59.719s
    # user 0m47.231s
    # sys  1m13.008s

    time cat data/sample.txt | ./test.py > result_python.txt
    # real 0m9.557s
    # user 0m7.688s
    # sys  0m2.144s
SBCL version: SBCL 2.2.9.debian.
Python Version: Python 3.13.3.
Is there a workaround or fix for this? So far CL has never let me down on the performance side; even for heavy number crunching it was usually faster than Python.
I decided to profile a snippet similar to the original; it now does literally nothing but call read-line.
    #!/usr/bin/env -S sbcl --script
    (require :sb-sprof)

    (defun main ()
      (loop for line = (read-line *standard-input* nil)
            while line))

    (eval-when (:execute)
      (sb-sprof:with-profiling (:max-samples 100000 :sample-interval 0.00001 :report :graph)
        (main)))
It seems most of the time is spent dealing with UTF conversion.
              Self         Total         Cumul
      Nr  Count      %  Count      %  Count      %  Calls  Function
    ------------------------------------------------------------------------
       1  12930   49.2  13191   50.2  12930   49.2      -  SB-IMPL::INPUT-CHAR/UTF-8
       2   9724   37.0  22523   85.7  22654   86.2      -  (LAMBDA (&REST REST) :IN SB-IMPL::GET-EXTERNAL-FORMAT)
       3   2517    9.6  25834   98.3  25171   95.8      -  READ-LINE
       4    146    0.6    146    0.6  25317   96.3      -  foreign function pthread_sigmask
       5     26    0.1  26257   99.9  25343   96.4      -  MAIN
       6     26    0.1     26    0.1  25369   96.5      -  RESTORE-YMM
       7      6    0.0    262    1.0  25375   96.5      -  SB-IMPL::REFILL-INPUT-BUFFER
       8      5    0.0    214    0.8  25380   96.6      -  (FLET "WITHOUT-INTERRUPTS-BODY-2" :IN SB-IMPL::REFILL-INPUT-BUFFER)
       9      5    0.0      5    0.0  25385   96.6      -  SAVE-YMM
      10      4    0.0    617    2.3  25389   96.6      -  ALLOC-TRAMP
Reading from a file stream opened with with-open-file instead of *standard-input* (a real drawback for shell-style scripts) seems to improve speed a lot, and the profile shows different UTF-8-related functions being called.
    (require :sb-sprof)

    (defun main (str)
      (loop for line = (read-line str nil)
            while line))

    (eval-when (:execute)
      (sb-sprof:with-profiling (:max-samples 100000 :sample-interval 0.00001 :report :graph)
        (with-open-file (str "sample.txt")
          (main str))))
              Self         Total         Cumul
      Nr  Count      %  Count      %  Count      %  Calls  Function
    ------------------------------------------------------------------------
       1   2133   38.7   2395   43.4   2133   38.7      -  SB-IMPL::FD-STREAM-READ-N-CHARACTERS/UTF-8
       2    763   13.8   5212   94.5   2896   52.5      -  SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER
       3    725   13.1    725   13.1   3621   65.6      -  SB-KERNEL:UB32-BASH-COPY
       4    435    7.9   1882   34.1   4056   73.5      -  (LABELS SB-IMPL::BUILD-RESULT :IN SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER)
       5    261    4.7    261    4.7   4317   78.2      -  READ-LINE
       6    119    2.2    119    2.2   4436   80.4      -  foreign function pthread_sigmask
       7     33    0.6   5516  100.0   4469   81.0      -  MAIN
       8     16    0.3   2413   43.7   4485   81.3      -  SB-INT:FAST-READ-CHAR-REFILL
       9     15    0.3     15    0.3   4500   81.6      -  RESTORE-YMM
      10     11    0.2     11    0.2   4511   81.8      -  SAVE-YMM
(declare (type fixnum a b)) is enough to arrange for efficient compiled code. I wonder if the trouble here is UTF-8 parsing? Could we maybe read in N binary bytes, or read in a Latin-1 or binary record terminated by a newline?

Does the corresponding Racket (Scheme) program perform similarly poorly?

read-line allocates a string every time it reads a new line; this probably has a detrimental impact on performance. You might be able to work around this by using read-sequence to read blocks of data into an array, then searching through the array for newline characters to locate the lines.
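
To make the block-reading suggestion concrete, here is a minimal sketch of the idea in Python (used only because the thread already has a Python script; it is not a drop-in SBCL fix). It reads fixed-size binary blocks, scans each block for newline bytes, and carries an unfinished line over to the next block. The names `iter_lines_chunked` and `copy_stream` and the 64 KiB block size are my own assumptions for illustration. In SBCL the analogous approach would presumably use read-sequence on a stream opened with :element-type '(unsigned-byte 8), but I have not measured that.

```python
import io


def iter_lines_chunked(stream, chunk_size=1 << 16):
    """Yield lines as bytes (without the trailing newline) from a binary stream.

    Hypothetical helper: reads fixed-size blocks and searches them for b"\n"
    instead of asking the stream for one line at a time.
    """
    pending = b""  # unfinished line carried over from the previous block
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        start = 0
        while True:
            end = block.find(b"\n", start)
            if end == -1:
                break
            # The carried-over prefix only applies to the first line of a block.
            yield pending + block[start:end]
            pending = b""
            start = end + 1
        # Whatever follows the last newline belongs to the next line.
        pending += block[start:]
    if pending:
        yield pending  # final line that had no trailing newline


def copy_stream(src, dst, chunk_size=1 << 16):
    """Copy src to dst line by line, mirroring the sanity check in the post."""
    for line in iter_lines_chunked(src, chunk_size):
        dst.write(line)
        dst.write(b"\n")
```

For the stdin-to-stdout check this would be called as `copy_stream(sys.stdin.buffer, sys.stdout.buffer)` after `import sys`. Lines stay as raw bytes, so no UTF-8 decoding happens at all. If text is needed, `line.decode("latin-1")` maps each byte to exactly one character, which is the Latin-1 idea mentioned above.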