I need to convert a folder of around 4,000 .txt files into a single .csv with two columns: (1) 'File Name' (the name of the file as it appears in the original folder); (2) 'Content' (all the text in the corresponding .txt file).

Here you can see some of the files I am working with.

The most similar question to mine is "Combine a folder of text files into a CSV with each content in a cell", but I could not get any of the solutions presented there to work.

The last one I tried was the Python code proposed in the aforementioned question by Nathaniel Verhaaren but I got the exact same error as the question's author (even after implementing some suggestions):

import os
import csv

dirpath = 'path_of_directory'
output = 'output_file.csv'
with open(output, 'w') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(['FileName', 'Content'])

    files = os.listdir(dirpath)

    for filename in files:
        with open(dirpath + '/' + filename) as afile:
            csvout.writerow([filename, afile.read()])
            afile.close()

    outfile.close()

Other questions which seemed similar to mine (for example, Python: Parsing Multiple .txt Files into a Single .csv File?, Merging multiple .txt files into a csv, and Converting 1000 text files into a single csv file) do not solve this exact problem I presented (and I could not adapt the solutions presented to my case).

1 Answer

I had a similar requirement, so I wrote the following class:

import os
import pathlib
import glob
import csv
from collections import defaultdict

class FileCsvExport:
    """Generate a CSV file containing the name and contents of all files found"""
    def __init__(self, directory: str, output: str, header = None, file_mask = None, walk_sub_dirs = True, remove_file_extension = True):
        self.directory = directory
        self.output = output
        self.header = header
        self.pattern = '**/*' if walk_sub_dirs else '*'
        if isinstance(file_mask, str):
            self.pattern = self.pattern + file_mask
        self.remove_file_extension = remove_file_extension
        self.rows = 0

    def export(self) -> bool:
        """Return True if the CSV was created"""
        return self.__make(self.__generate_dict())

    def __generate_dict(self) -> defaultdict:
        """Finds all files recursively based on the specified parameters and returns a defaultdict"""
        csv_data = defaultdict(list)
        for file_path in glob.glob(os.path.join(self.directory, self.pattern),  recursive = True):
            path = pathlib.Path(file_path)
            if not path.is_file():
                continue
            content = self.__get_content(path)
            name = path.stem if self.remove_file_extension else path.name
            csv_data[name].append(content)
        return csv_data

    @staticmethod
    def __get_content(file_path: pathlib.Path) -> str:
        with open(file_path) as file_object:
            return file_object.read()

    def __make(self, csv_data: defaultdict) -> bool:
        """
        Takes a defaultdict of {k, [v]} where k is the file name and v is a list of file contents.
        Writes out these values to a CSV and returns True when complete.
        """
        with open(self.output, 'w', newline = '') as csv_file:
            writer = csv.writer(csv_file, quoting = csv.QUOTE_ALL)
            if isinstance(self.header, list):
                writer.writerow(self.header)
            for key, values in csv_data.items():
                for duplicate in values:
                    writer.writerow([key, duplicate])
                    self.rows = self.rows + 1
        return True

It can be used like so:

...
myFiles = r'path/to/files/'
outputFile = r'path/to/output.csv'

exporter = FileCsvExport(directory = myFiles, output = outputFile, header = ['File Name', 'Content'], file_mask = '.txt')
if exporter.export():
    print(f"Export complete. Total rows: {exporter.rows}.")

In my example directory, this prints:

Export complete. Total rows: 6.

Note: rows does not count the header row, if one is written.

This generated the following CSV file:

"File Name","Content"
"Test1","This is from Test1"
"Test2","This is from Test2"
"Test3","This is from Test3"
"Test4","This is from Test4"
"Test5","This is from Test5"
"Test5","This is in a sub-directory"
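
The two "Test5" rows come from files that share a name but live in different directories. If that ambiguity matters, one option (not part of the class above) is to record each file's path relative to the root folder instead of the bare name. A minimal sketch, assuming a hypothetical helper called relative_name and made-up directory names "files" and "sub":

```python
from pathlib import Path


def relative_name(file_path, root, remove_file_extension=True):
    """Return file_path relative to root, with forward slashes, optionally without its suffix."""
    relative = Path(file_path).relative_to(root)
    if remove_file_extension:
        # Drop only the final suffix, e.g. "sub/Test5.txt" -> "sub/Test5".
        relative = relative.with_suffix("")
    return relative.as_posix()


# Hypothetical paths, used only for illustration.
print(relative_name("files/Test5.txt", "files"))      # Test5
print(relative_name("files/sub/Test5.txt", "files"))  # sub/Test5
```

Using this value as the first column would keep the two rows distinguishable, at the cost of the column no longer being a plain file name.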

Optional parameters:

  • header: Takes a list of strings that will be written as the first line in the CSV. Default None.
  • file_mask: Takes a string that can be used to specify the file type; for example, .txt will cause it to only match .txt files. Default None.
  • walk_sub_dirs: If set to False, it will not search in sub-directories. Default True.
  • remove_file_extension: If set to False, it will cause the file name to be written with the file extension included; for example, File.txt instead of just File. Default True.
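
If the class is more than you need, the same idea can probably be reduced to a single function. The following is only a sketch under stated assumptions: the name folder_to_csv is hypothetical, the files are assumed to be UTF-8, and errors="replace" is my choice for files that are not (undecodable bytes become replacement characters instead of raising an error).

```python
import csv
from pathlib import Path


def folder_to_csv(directory, output, pattern="*.txt", encoding="utf-8"):
    """Write one CSV row per matching file (file name, full text) and return the row count."""
    rows = 0
    # newline="" lets the csv module control line endings itself.
    with open(output, "w", newline="", encoding=encoding) as csv_file:
        writer = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
        writer.writerow(["File Name", "Content"])
        # rglob also descends into sub-directories; sorted() makes the row order deterministic.
        for path in sorted(Path(directory).rglob(pattern)):
            if not path.is_file():
                continue
            # Assumption: UTF-8 input; undecodable bytes are replaced rather than raising.
            text = path.read_text(encoding=encoding, errors="replace")
            writer.writerow([path.name, text])
            rows += 1
    return rows
```

Calling folder_to_csv('path/to/files/', 'path/to/output.csv') would write the header plus one row per .txt file. Multi-line file contents end up inside a single quoted cell, which csv.reader reads back as one field.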