Python script to read and parse a text file into CSV format

Question

I looked all through the related questions and could not find a solution. I'm pretty new with Python. Here's what I've got.

I set up a honeypot on an Ubuntu VM that watches for access attempts to my server, blocks them, and writes the details of each attempt to a plain text file. Each entry looks like this:

INTRUSION ATTEMPT DETECTED! from 10.0.0.1:80 (2022-06-06 13:17:24)
--------------------------
GET / HTTP/1.1 
HOST: 10.0.0.1 
X-FORWARDED-SCHEME http 
X-FORWARDED-PROTO: http 
x-FORWARDED-For: 139.162.191.89 
X-Real-IP: 139.162.191.89 
Connection: close 
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X)
Accept: */*
Accept-Encoding: gzip

The text file just grows and grows with access attempts, but it's not in a format such as CSV that I can use in other programs. What I'd like to do is read this file, parse the information, write it in CSV format to a separate file, and then delete the contents of the original file to avoid duplicates.

Removing the contents after each read may not be necessary, since duplicates could be detected and skipped when writing the CSV file. However, I'm seeing multiple log entries with the same IP address, meaning one host is attempting access several times, so deleting the original contents each time may be the better option.
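Both options from the two paragraphs above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper names `append_new_rows` and `truncate_file` are made up for this example, and treating `ip` plus `timestamp` as the identity of an entry is an assumption (two different attempts logged in the same second would be collapsed into one).

```python
import csv
import os


def append_new_rows(csv_path, rows, fieldnames, key_fields=("ip", "timestamp")):
    """Append rows to csv_path, skipping rows whose key is already in the file.

    Returns the number of rows actually written.
    """
    seen = set()
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0

    # Collect the keys of rows that are already in the CSV file.
    if not needs_header:
        with open(csv_path, newline="") as f:
            for existing in csv.DictReader(f):
                seen.add(tuple(existing.get(k, "") for k in key_fields))

    written = 0
    with open(csv_path, "a", newline="") as f:
        # extrasaction="ignore" drops keys that are not listed in fieldnames.
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        if needs_header:
            writer.writeheader()
        for row in rows:
            key = tuple(row.get(k, "") for k in key_fields)
            if key in seen:
                continue
            seen.add(key)
            writer.writerow(row)
            written += 1
    return written


def truncate_file(path):
    """Empty the file in place without deleting it."""
    with open(path, "w"):
        pass
```

If the honeypot can write to the log between the read and the truncate, those entries would be lost, so the duplicate check alone may be the safer of the two approaches.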

How would you want to convert it to CSV? As in, what should be the pattern to convert it into columns and rows?
– Robo Mop, Commented Jun 6, 2022 at 13:45
Yes. I'd prefer it be converted into columns: Date, Time, X-forwarded for, X-forwarded-proto, x-forwarded for, x-real ip. Each row would then hold the values for one attempt under those column names, if that makes sense. The idea is that I can easily read the date and time of each attempted access, where it came from, and so on. Right now, the honeypot just outputs one large, growing txt file in the format I put in the question. Each new attempt starts with "Intrusion attempt detected".
– johnnyBdemented, Commented Jun 6, 2022 at 14:01
Hmm, I see. That can be challenging, considering different error messages can have different formats and different numbers of headers. If you'd like, I can write a rudimentary answer that assumes all error messages are similar to the one you provided. If you can, please update the question to show different types of error logs as well.
– Robo Mop, Commented Jun 6, 2022 at 14:07
From what I'm seeing in the log file, all the entries are almost identical. Each starts with the same header and then contains 11-15 lines, all organized the same way. Realistically, I only need to parse and organize the first 7 lines. Those are the lines with the information I'd like separated into an easy-to-read format. Something that reads the lines from "Intrusion attempt detected" through "Connection: close" and organizes them would be optimal, with the Python program set to run each time the txt file is populated.
– johnnyBdemented, Commented Jun 6, 2022 at 14:19
It's quite the task for sure. The header itself will take some ungodly regex to extract, but it's definitely doable. I hope it's not terribly urgent. I'll try it out in a while?
– Robo Mop, Commented Jun 6, 2022 at 14:35
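For what it's worth, the header line in the sample entry can be matched with a fairly short pattern. The sketch below assumes every header looks exactly like the one in the question (same wording, a `YYYY-MM-DD HH:MM:SS` timestamp in parentheses); the name `HEADER_RE` and the group names are chosen here for illustration only.

```python
import re

# Assumed layout: "INTRUSION ATTEMPT DETECTED! from <ip:port> (<date> <time>)"
HEADER_RE = re.compile(
    r"^INTRUSION ATTEMPT DETECTED! from (?P<ip>\S+) "
    r"\((?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2})\)\s*$"
)

line = "INTRUSION ATTEMPT DETECTED! from 10.0.0.1:80 (2022-06-06 13:17:24)"
match = HEADER_RE.match(line)

# Fall back to an empty dict when a line does not fit the assumed layout.
fields = match.groupdict() if match else {}
print(fields)
# {'ip': '10.0.0.1:80', 'date': '2022-06-06', 'time': '13:17:24'}
```

Capturing the date and time as separate groups gives the Date and Time columns asked for above without a second parsing step.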

Nimantha · Accepted Answer · 2022-06-11 07:20:50Z

This is rough code that needs to be tweaked and tested on your log file.

It reads the log file, parses the data, adds it to a DataFrame, and finally writes it to a CSV file.

import re
# NOTE: pandas must be installed; if it is not, run "python -m pip install -U pandas"
import pandas as pd

# open and read the log file
# NOTE: replace 'log_file.txt' with the path to your log file
with open('log_file.txt', 'r') as f:
    log_txt = f.read()

# list that collects one dict (one CSV row) per attempt
df_list = []

# split the log into attempts on this marker text
for msg in log_txt.split('INTRUSION ATTEMPT DETECTED! '):
    # skip empty or whitespace-only chunks
    if not msg.strip():
        continue

    # counter used to name header lines that have no key
    unnamed_count = 0

    # split on the dashed line to separate the "from ip (timestamp)" line from the headers
    from_when, headers = msg.split('\n--------------------------\n', 1)

    # regex to extract the ip and timestamp
    # NOTE: you can rename the columns by changing the names inside the <>
    row_dict = re.match(r'^from (?P<ip>\S+) \((?P<timestamp>.+)\)$', from_when.strip()).groupdict()

    # split the headers on the newline character
    for head in headers.split('\n'):
        # if the line contains ":", add it to the dictionary
        if ':' in head:
            # split on the first ":" and add the key and value to the dict
            key, val = head.split(':', 1)
            row_dict[key.strip()] = val.strip()

        # known header without the ":"
        # NOTE: you can handle any other known header key the same way
        elif 'X-FORWARDED-SCHEME ' in head.strip():
            # clean and add
            row_dict['X-FORWARDED-SCHEME'] = head.replace('X-FORWARDED-SCHEME ', '').strip()
        
        # unknown header without the ":"
        elif head.strip():
            row_dict[f'unnamed:{unnamed_count}'] = head.strip()
            unnamed_count+=1
    
    # add the row to the list after sorting its keys: unnamed keys first, then alphabetically
    df_list.append(dict(sorted(row_dict.items(), key=lambda x: (-x[0].startswith('unnamed'), x))))

# convert the list of rows to a DataFrame, then write it to a CSV file
df = pd.DataFrame(df_list)
# NOTE: replace 'out.csv' with the output file name/path
df.to_csv('out.csv', index=False)

Sample output:

unnamed:0	Accept	Accept-Encoding	Connection	HOST	User-Agent	X-FORWARDED-PROTO	X-FORWARDED-SCHEME	X-Real-IP	ip	timestamp	x-FORWARDED-For
GET / HTTP/1.1	*/*	gzip	close	10.0.0.1	Mozilla/5.0 (Macintosh; Intel Mac OS X)	http	http	139.162.191.89	10.0.0.1:80	2022-06-06 13:17:24	139.162.191.89
GET / HTTP/1.1	*/*	gzip	close	10.0.0.1	Mozilla/5.0 (Macintosh; Intel Mac OS X)	http	http	139.162.191.89	10.0.0.1:80	2022-06-06 13:17:24	139.162.191.89
GET / HTTP/1.1	*/*	gzip	close	10.0.0.1	Mozilla/5.0 (Macintosh; Intel Mac OS X)	http	http	139.162.191.89	10.0.0.1:80	2022-06-06 13:17:24	139.162.191.89
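The output above keeps every header and a single `timestamp` column, whereas the comments under the question ask for separate Date and Time columns and only a few headers. One possible way to trim each row before it is collected is sketched below. The function name `trim_row` and the `WANTED` list are assumptions made for this example, and header names are matched case-insensitively because the log mixes casings such as `x-FORWARDED-For`.

```python
from datetime import datetime

# Headers to keep; adjust this list to taste (it is an assumption, not a fixed spec).
WANTED = ["X-Forwarded-For", "X-Forwarded-Proto", "X-Real-IP"]


def trim_row(row_dict):
    """Return a smaller row with Date and Time columns plus the wanted headers."""
    # Assumes the timestamp format shown in the question: 2022-06-06 13:17:24
    stamp = datetime.strptime(row_dict["timestamp"], "%Y-%m-%d %H:%M:%S")
    trimmed = {
        "Date": stamp.date().isoformat(),
        "Time": stamp.time().isoformat(),
    }

    # HTTP header names are case-insensitive, so compare them in lowercase.
    lowered = {key.lower(): value for key, value in row_dict.items()}
    for name in WANTED:
        trimmed[name] = lowered.get(name.lower(), "")
    return trimmed


sample = {
    "ip": "10.0.0.1:80",
    "timestamp": "2022-06-06 13:17:24",
    "x-FORWARDED-For": "139.162.191.89",
    "X-FORWARDED-PROTO": "http",
    "X-Real-IP": "139.162.191.89",
    "Connection": "close",
}
print(trim_row(sample))
```

A list of such trimmed dicts can be passed to `pd.DataFrame` in the same way as the full rows, so the rest of the script would not need to change.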
