How to clean up some content from a text file

Question

I have the following data in a CSV.

"ID","OTHER_FIELDS_2"
"87","25 R160  22  13  E"
"87","25 R165  22  08  E"
"77",""
"18","20 BA-10  12  06  2  30  S"
"18","20 BA-20  12  06  2  30  S"
"88","20 TH-42  02  02  5  30  MT"
"66","20 AD-38  12  06  B"
"66","20 AD-38  30  07  B"
"70","20 OL-45  19  11  B"
"70","20 EM-45  19  08  B"

After running my Python code, I got the following output:

18,"
","20 BA-10  12  06  2  30  S
20 BA-20  12  06  2  30  S","
",**********
66,"
","20 AD-38  12  06  B
20 AD-38  30  07  B","
",**********
70,"
","20 OL-45  19  11  B
20 EM-45  19  08  B","
",**********
77,"
",,"
",**********
87,"
","25 R160  22  13  E
25 R165  22  08  E","
",**********
88,"
",20 TH-42  02  02  5  30  MT,"
",**********

But I have to generate the following output in txt format:

18
20 BA-10  12  06  2  30  S
20 BA-20  12  06  2  30  S
**********
66
20 AD-38  12  06  B
20 AD-38  30  07  B
**********
70
20 OL-45  19  11  B
20 EM-45  19  08  B
**********
77
**********
87
25 R160  22  13  E
25 R165  22  08  E
**********
88
20 TH-42  02  02  5  30  MT
**********

Here is my code:

import pandas as pd
import csv

df = pd.read_csv('idDetails.csv')
data_rows = [] 

group_column = 'ID'
selected_columns = ['OTHER_FIELDS_2'] 
grouped_data = df.groupby(group_column)[selected_columns]

for group_name, group_df in grouped_data:
    #print(f"{group_name}")
    other_data = group_df.to_string(header=False,index=False)
    other_data_a = group_df.fillna('').dropna(axis = 0, how = 'all') 
    other_data_b = other_data_a.to_string(header=False,index=False)
    #print(other_data_b) 
    other_data_c = '**********'
    #print(other_data_c)
    data_rows.append([group_name, '\n', other_data_b, '\n', other_data_c])  
dfo = pd.DataFrame(data_rows)
dfo.to_csv('idDetailsoutput.txt', header=False, index=False)

A possible solution to get the desired output is appreciated.

Maybe you should write it as a normal text file, without using to_csv, because your output doesn't look like CSV. CSV needs a value for every column in each row, but you have empty places. You would need to convert every row to one string using "".join(), then combine all the strings into one text using "\n".join(), and finally write it with open(), write(), close().
– furas, Commented Jul 22 at 20:21

jackal · Accepted Answer · 2025-07-23 07:22:41Z


The built-in csv module can handle your source data.

You don't need anything as "heavyweight" as pandas for this as the processing is trivial.

import csv

FILEIN = "foo.csv"
FILEOUT = "foo.txt"
STARS = "**********"

with open(FILEIN, newline="") as datain, open(FILEOUT, "w") as dataout:
    reader = csv.reader(datain)
    next(reader)  # skip column headers
    prev = None
    for key, value in sorted(reader, key=lambda e: int(e[0])):
        if key != prev:
            if prev is not None:
                print(STARS, file=dataout)
            print(prev := key, file=dataout)
        if value:
            print(value, file=dataout)
    print(STARS, file=dataout)
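
For comparison, here is a minimal sketch of a different technique that also uses only the standard library: collecting the values in a dictionary keyed by ID instead of sorting and tracking the previous key. The helper name format_groups and the shortened sample string are made up for illustration, and the input is read from an in-memory string rather than from foo.csv so the snippet can be run as-is.

```python
import csv
import io
from collections import defaultdict

STARS = "**********"


def format_groups(csv_text):
    """Return the grouped report as one string (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    next(reader)  # skip column headers
    groups = defaultdict(list)
    for key, value in reader:
        values = groups[int(key)]  # registers the ID even when it has no values
        if value:
            values.append(value)
    lines = []
    for key in sorted(groups):  # numeric order, because the keys are ints
        lines.append(str(key))
        lines.extend(groups[key])
        lines.append(STARS)
    return "\n".join(lines) + "\n"


# Shortened sample in the same shape as the question's CSV.
sample = '''"ID","OTHER_FIELDS_2"
"87","25 R160  22  13  E"
"77",""
"18","20 BA-10  12  06  2  30  S"
"18","20 BA-20  12  06  2  30  S"
'''

report = format_groups(sample)
print(report, end="")
```

Whether this reads better than the sorted loop above is a matter of taste; both keep an ID such as 77 in the output even though it has no values.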

edited Jul 23 at 7:22

answered Jul 23 at 7:09


codewithpurpose · Accepted Answer · 2025-07-24 01:09:45Z

The problem with your current code is that you're putting each group into a row of a DataFrame and then saving it with to_csv(). Because the fields contain newline characters, the CSV writer wraps them in quotes, which produces the extra quotes and odd line breaks in your output. Instead, write everything directly to a text file using normal file writing; there is no need for DataFrame.to_csv here.


import pandas as pd

# Read the CSV
df = pd.read_csv('idDetails.csv')

# Group by ID
grouped = df.groupby('ID')['OTHER_FIELDS_2']

# Open the file in write mode
with open('idDetailsoutput.txt', 'w') as f:
    for group_id, items in grouped:
        f.write(f"{group_id}\n")  # Write the ID
        for item in items:
            if pd.notna(item) and item.strip() != "":
                f.write(f"{item.strip()}\n")  # Write each non-empty value
        f.write("**********\n")  # Separator line

Output:

18
20 BA-10  12  06  2  30  S
20 BA-20  12  06  2  30  S
**********
66
20 AD-38  12  06  B
20 AD-38  30  07  B
**********
70
20 OL-45  19  11  B
20 EM-45  19  08  B
**********
77
**********
87
25 R160  22  13  E
25 R165  22  08  E
**********
88
20 TH-42  02  02  5  30  MT
**********
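
The quoting described in this answer can be reproduced without pandas. The sketch below uses the standard csv module, whose default quoting mode is csv.QUOTE_MINIMAL; it assumes that DataFrame.to_csv applies the same rule (its documented default for the quoting parameter is also csv.QUOTE_MINIMAL). The shortened values are placeholders for illustration.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting is csv.QUOTE_MINIMAL

# Same shape as one entry of data_rows in the question: the '\n' strings
# are separate fields, so each one gets quoted.
writer.writerow([18, "\n", "20 BA-10\n20 BA-20", "\n", "**********"])
quoted = buffer.getvalue()
print(repr(quoted))

# Writing plain text instead: join the pieces with newlines yourself.
plain = "\n".join(["18", "20 BA-10\n20 BA-20", "**********"]) + "\n"
print(repr(plain))
```

Any field containing a line break is wrapped in double quotes so that a CSV reader can recover it, which is exactly what is not wanted in a plain-text report.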

furas · Accepted Answer · 2025-07-23 09:28:02Z

Your output doesn't look like CSV but like normal text, so I wouldn't use to_csv() (nor the csv module). Instead I would convert every row to a string and write it like any text file using the standard open(), write(), close().

First I would create rows without '\n'

data_rows.append([group_name, other_data_b, other_data_c])

Later I would use a for-loop to convert every row to text using "\n".join(row) and write it to the file.

Because some values may not be strings (e.g. group_name is an integer, and in the future you may have other values), I would add map(str, row) to convert every element to a string.

This gives the following code:

with open('idDetailsoutput.txt', 'w') as output:
    for row in data_rows:
        text = "\n".join(map(str, row))
        output.write(text + "\n")
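
A small sketch of why map(str, row) is needed, using one row in the shape produced above (the literal values are copied from the sample data): str.join accepts only strings, so the integer ID has to be converted first.

```python
row = [18, "20 BA-10  12  06  2  30  S", "**********"]

# Joining directly fails because the first element is an int.
try:
    "\n".join(row)
    join_failed = False
except TypeError:
    join_failed = True

# Converting every element to str first makes the join work.
text = "\n".join(map(str, row))
print(join_failed)
print(text)
```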

Full working code with example data directly in the code.

I use io.StringIO() only to create a file-like object in memory, so everyone can simply copy and run it for tests.

data = '''"ID","OTHER_FIELDS_2"
"87","25 R160  22  13  E"
"87","25 R165  22  08  E"
"77",""
"18","20 BA-10  12  06  2  30  S"
"18","20 BA-20  12  06  2  30  S"
"88","20 TH-42  02  02  5  30  MT"
"66","20 AD-38  12  06  B"
"66","20 AD-38  30  07  B"
"70","20 OL-45  19  11  B"
"70","20 EM-45  19  08  B"
'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(data))
#df = pd.read_csv('idDetails.csv')
data_rows = [] 

group_column = 'ID'
selected_columns = ['OTHER_FIELDS_2'] 

grouped_data = df.groupby(group_column)[selected_columns]

for group_name, group_df in grouped_data:
    #print(f"{group_name}")
    other_data = group_df.to_string(header=False,index=False)
    other_data_a = group_df.fillna('').dropna(axis = 0, how = 'all') 
    other_data_b = other_data_a.to_string(header=False,index=False)
    #print(other_data_b) 
    other_data_c = '**********'
    #print(other_data_c)
    data_rows.append([group_name, other_data_b, other_data_c])  

print(data_rows)

with open('idDetailsoutput.txt', 'w') as output:
    for row in data_rows:
        text = "\n".join(map(str, row))
        output.write(text + "\n")

#dfo = pd.DataFrame(data_rows)
#dfo.to_csv('idDetailsoutput.txt', header=False, index=False)

Result:

18
20 BA-10  12  06  2  30  S
20 BA-20  12  06  2  30  S
**********
66
20 AD-38  12  06  B
20 AD-38  30  07  B
**********
70
20 OL-45  19  11  B
20 EM-45  19  08  B
**********
77

**********
87
25 R160  22  13  E
25 R165  22  08  E
**********
88
20 TH-42  02  02  5  30  MT
**********

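Because 77 has an empty OTHER_FIELDS_2 column, the result has an empty line.

If you would like to remove it, you would have to filter the elements in the row before converting it to text, or skip the empty element when you append() to data_rows.

For an empty string (or False, None) you could use filter(None, row). Note that in the code below map(str, row) runs first, so only empty strings are dropped; None or False would already have become the strings 'None' and 'False'.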

with open('idDetailsoutput.txt', 'w') as output:
    for row in data_rows:
        text = "\n".join(filter(None, map(str, row)))
        output.write(text + "\n")
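
To make the order of filter and map concrete, here is a short sketch. The row containing None is hypothetical; in the code above the empty group produces an empty string, not None.

```python
# Row for ID 77 as built above: the middle element is an empty string.
row = [77, "", "**********"]
kept = list(filter(None, map(str, row)))
print(kept)

# Hypothetical row holding None instead of an empty string.
row_with_none = [77, None, "**********"]

# str() first: None becomes the non-empty string 'None' and survives.
after_str = list(filter(None, map(str, row_with_none)))

# filter first: None is dropped before the conversion.
# Caveat: this order would also drop an ID equal to 0.
before_str = list(map(str, filter(None, row_with_none)))

print(after_str)
print(before_str)
```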

Result (without empty line)

18
20 BA-10  12  06  2  30  S
20 BA-20  12  06  2  30  S
**********
66
20 AD-38  12  06  B
20 AD-38  30  07  B
**********
70
20 OL-45  19  11  B
20 EM-45  19  08  B
**********
77
**********
87
25 R160  22  13  E
25 R165  22  08  E
**********
88
20 TH-42  02  02  5  30  MT
**********

EDIT:

As @Ramrab suggests in a comment, it can be done without pandas, which is used only for groupby. Python also has itertools.groupby, which could be useful.

Full working code with itertools.groupby.
I split the code into small parts to better show every step. Writing is the same as before.

data = '''"ID","OTHER_FIELDS_2"
"87","25 R160  22  13  E"
"87","25 R165  22  08  E"
"77",""
"18","20 BA-10  12  06  2  30  S"
"18","20 BA-20  12  06  2  30  S"
"88","20 TH-42  02  02  5  30  MT"
"66","20 AD-38  12  06  B"
"66","20 AD-38  30  07  B"
"70","20 OL-45  19  11  B"
"70","20 EM-45  19  08  B"
'''

import csv
import itertools
import io

# --- load data as CSV
#with open('idDetails.csv') as data_in:
with io.StringIO(data) as data_in:
    reader = csv.reader(data_in)
    next(reader)  # skip header
    data = list(reader)  # get all rows at once

print('--- data ---\n')
print(*data, sep='\n')

# --- convert first column to int
# converting to integer because sorting strings like `"2","11"` would give the wrong order `"11","2"`
for row in data:
    row[0] = int(row[0])

print('\n--- data with integers ---\n')
print(*data, sep='\n')

# --- sorted only by first column
data = sorted(data, key=lambda row:row[0])   # sorted only by first column
#data = sorted(data)                         # sorted by first and second column

print('\n--- data sorted ---\n')
print(*data, sep='\n')

# --- group by first column
data_rows = []

grouped_data = itertools.groupby(data, key=lambda row:row[0])

print()
for group_name, group_rows in grouped_data:
    # copy all items from the iterator into a list in case the rows are needed
    # more than once, because an iterator can be consumed only once
    group_rows = list(group_rows) 

    print('- group -')
    print(f"{group_name = }")
    print(f"{group_rows = }")

    other_data_b = "\n".join(row[1] for row in group_rows)
    other_data_c = '**********'

    data_rows.append([group_name, other_data_b, other_data_c])

print('\n--- data_rows ---\n')
print(*data_rows, sep='\n')

# --- write it 
with open('idDetailsoutput.txt', 'w') as output:
    for row in data_rows:
        text = "\n".join(filter(None, map(str, row)))
        output.write(text + "\n")

# --- show file content (for test)
print('\n--- result ---\n')
with open('idDetailsoutput.txt') as input_file:
    print(input_file.read())

Output from all print()

--- data ---

['87', '25 R160  22  13  E']
['87', '25 R165  22  08  E']
['77', '']
['18', '20 BA-10  12  06  2  30  S']
['18', '20 BA-20  12  06  2  30  S']
['88', '20 TH-42  02  02  5  30  MT']
['66', '20 AD-38  12  06  B']
['66', '20 AD-38  30  07  B']
['70', '20 OL-45  19  11  B']
['70', '20 EM-45  19  08  B']

--- data with integers ---

[87, '25 R160  22  13  E']
[87, '25 R165  22  08  E']
[77, '']
[18, '20 BA-10  12  06  2  30  S']
[18, '20 BA-20  12  06  2  30  S']
[88, '20 TH-42  02  02  5  30  MT']
[66, '20 AD-38  12  06  B']
[66, '20 AD-38  30  07  B']
[70, '20 OL-45  19  11  B']
[70, '20 EM-45  19  08  B']

--- data sorted ---

[18, '20 BA-10  12  06  2  30  S']
[18, '20 BA-20  12  06  2  30  S']
[66, '20 AD-38  12  06  B']
[66, '20 AD-38  30  07  B']
[70, '20 OL-45  19  11  B']
[70, '20 EM-45  19  08  B']
[77, '']
[87, '25 R160  22  13  E']
[87, '25 R165  22  08  E']
[88, '20 TH-42  02  02  5  30  MT']

- group -
group_name = 18
group_rows = [[18, '20 BA-10  12  06  2  30  S'], [18, '20 BA-20  12  06  2  30  S']]
- group -
group_name = 66
group_rows = [[66, '20 AD-38  12  06  B'], [66, '20 AD-38  30  07  B']]
- group -
group_name = 70
group_rows = [[70, '20 OL-45  19  11  B'], [70, '20 EM-45  19  08  B']]
- group -
group_name = 77
group_rows = [[77, '']]
- group -
group_name = 87
group_rows = [[87, '25 R160  22  13  E'], [87, '25 R165  22  08  E']]
- group -
group_name = 88
group_rows = [[88, '20 TH-42  02  02  5  30  MT']]

--- data_rows ---

[18, '20 BA-10  12  06  2  30  S\n20 BA-20  12  06  2  30  S', '**********']
[66, '20 AD-38  12  06  B\n20 AD-38  30  07  B', '**********']
[70, '20 OL-45  19  11  B\n20 EM-45  19  08  B', '**********']
[77, '', '**********']
[87, '25 R160  22  13  E\n25 R165  22  08  E', '**********']
[88, '20 TH-42  02  02  5  30  MT', '**********']

--- result ---
18
20 BA-10  12  06  2  30  S
20 BA-20  12  06  2  30  S
**********
66
20 AD-38  12  06  B
20 AD-38  30  07  B
**********
70
20 OL-45  19  11  B
20 EM-45  19  08  B
**********
77
**********
87
25 R160  22  13  E
25 R165  22  08  E
**********
88
20 TH-42  02  02  5  30  MT
**********
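
Two details the listing above depends on can be checked in isolation: itertools.groupby only merges consecutive items with the same key, so the data has to be sorted first, and sorting the IDs as strings gives a different order than sorting them as numbers. In this sketch the ID "9" is made up for illustration; it does not appear in the sample data.

```python
import itertools

ids = ["87", "77", "87", "9"]  # "9" is a made-up ID added for illustration

# Without sorting, the two "87" entries are not adjacent, so they form two groups.
unsorted_keys = [key for key, _ in itertools.groupby(ids)]

# After sorting numerically, each ID forms exactly one group.
sorted_keys = [key for key, _ in itertools.groupby(sorted(ids, key=int))]

# String order compares character by character, so "9" sorts after "87".
text_order = sorted(ids)
numeric_order = sorted(ids, key=int)

print(unsorted_keys)
print(sorted_keys)
print(text_order)
print(numeric_order)
```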

Introducing pandas to this solution is overcomplicated and totally unnecessary for the dataset as described by OP. Also, if performance is important, this implementation is remarkably slow.
I used the original code and only changed the last part, but I agree that all this should be done without pandas, which is used only for groupby. I was thinking about using itertools.groupby, but in the end I only showed how to write the file correctly.
To check the idea with itertools.groupby, I finally created working code.

PaulS · Accepted Answer · 2025-07-23 09:17:45Z

Another possible solution:

import pandas as pd

df = pd.read_csv('idDetails.csv')  # file name taken from the question
col = 'OTHER_FIELDS_2'

print('\n******\n'.join(
    (df.groupby('ID', as_index=False, sort=True)[col]
     .apply(lambda x:
         f'{df.loc[x.index[0], "ID"]}' + 
         (f'\n{x.to_string(index=False)}' if ~x.isna().any() 
          else '')))[col].to_list()) + '\n******')

This solution reformats the grouped data from the dataframe into the desired TXT-style output by using groupby to organize entries by ID, and apply with a custom lambda function to format each group. Inside the lambda, ~x.isna().any() is true when the group contains no NaN values: in that case the output is the ID followed by the values rendered with to_string (without the index); otherwise only the ID is output. The formatted strings are collected using .to_list() and joined using '\n******\n'.join(...) to produce a continuous block of text. Note that the separator here is six asterisks, whereas the question asks for ten; adjust the two string literals if an exact match is required.

Output:

18
20 BA-10  12  06  2  30  S
20 BA-20  12  06  2  30  S
******
66
20 AD-38  12  06  B
20 AD-38  30  07  B
******
70
20 OL-45  19  11  B
20 EM-45  19  08  B
******
77
******
87
25 R160  22  13  E
25 R165  22  08  E
******
88
20 TH-42  02  02  5  30  MT
******
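
One caveat about the condition ~x.isna().any(): it gives the intended result only if .any() returns a NumPy boolean, for which ~ means logical negation. That is an assumption about the pandas and NumPy versions in use. On a plain Python bool, ~ is a bitwise operation whose result is always truthy, as the sketch below shows; recent Python versions also emit a DeprecationWarning for it. Writing not x.isna().any() avoids the dependency. The variable has_nan is a stand-in for the pandas expression.

```python
has_nan = False  # plain Python bool standing in for x.isna().any()

bitwise = ~has_nan     # bitwise inversion of 0, which is -1 (truthy)
logical = not has_nan  # logical negation, which is True

has_nan = True
bitwise_true = ~has_nan     # bitwise inversion of 1, which is -2 (still truthy)
logical_true = not has_nan  # logical negation, which is False

print(bitwise, logical, bitwise_true, logical_true)
```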
