Instead of defining documents like this ...

documents = ["the mayor of new york was there", "machine learning can be useful sometimes","new york mayor was present"]

... I want to read the same three sentences from two different txt files with the first sentence in the first file, and sentence 2 and 3 in the second file.

I have come up with this code:

import glob
import os

# read txt documents
os.chdir('text_data')
documents = []
for file in glob.glob("*.txt"):  # read all txt files in the working directory
    with open(file, "r") as file_content:  # the file is closed when the block ends
        lines = file_content.read().splitlines()
    for line in lines:
        documents.append(line)

But the documents resulting from the two strategies seem to be in different formats. I want the second strategy to produce the same output as the first.

  • ... what is wrong? Please try to be specific with your problem statements. Commented Mar 25, 2017 at 23:38
  • Edited for clarity. Commented Mar 25, 2017 at 23:43
  • My point was that instead of writing "the documents resulting from the two strategies seem to be in different format" you should show the output. Commented Mar 25, 2017 at 23:45
  • Also, lines = file_content.read().splitlines() is not necessary. You can iterate directly over the file handle, which yields one line at a time, so for line in file_content: would be sufficient (although you'll get the trailing newlines). Likely you just want documents.append(file_content.read()), in which case you don't have to iterate over the file at all. Commented Mar 25, 2017 at 23:48
  • Possible duplicate of "combine multiple text files into one text file using python". Commented Mar 26, 2017 at 0:35
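
One way to show the output the comments ask for is to build the list from the files and compare it with the hard-coded list. The following is only a sketch under stated assumptions: the temporary directory, the file names a.txt and b.txt, and the helper read_documents are hypothetical, and sorted() is added because glob.glob does not guarantee any particular file order.

```python
import glob
import os
import tempfile

# The hard-coded list from the question.
expected = [
    "the mayor of new york was there",
    "machine learning can be useful sometimes",
    "new york mayor was present",
]


def read_documents(directory):
    """Return every line of every .txt file in `directory`, without newlines."""
    documents = []
    # sorted() makes the file order deterministic; glob.glob promises none.
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        with open(path, "r") as handle:
            documents.extend(handle.read().splitlines())
    return documents


# Hypothetical setup: a.txt holds sentence 1, b.txt holds sentences 2 and 3.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w") as f:
        f.write(expected[0] + "\n")
    with open(os.path.join(tmp, "b.txt"), "w") as f:
        f.write(expected[1] + "\n" + expected[2] + "\n")
    documents = read_documents(tmp)

print(repr(documents))        # repr() makes stray "\n" characters visible
print(documents == expected)  # compares the two strategies directly
```

Printing with repr() is useful here because any leftover newline or whitespace shows up explicitly instead of being hidden by the terminal.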

3 Answers

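If I understand your code correctly, this is nearly equivalent and more efficient: it avoids reading the entire file into a string and then splitting it into a list. Note that, unlike splitlines(), iterating over a file keeps each line's trailing newline.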

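import glob
import os

os.chdir('text_data')
documents = []
for file in glob.glob("*.txt"):  # read all txt files in the working directory
    with open(file) as f:  # the file is closed when the block ends
        documents.extend(f)  # each line keeps its trailing newline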

Or even as a one-liner:

documents = [line for file in glob.glob("*.txt") for line in open(file)]

1 Comment

The for clauses in the list comprehension have to be in this order, with the loop over files first; if they are reversed, file is used before it is defined.

Instead of .read().splitlines(), you can use .readlines(). This returns the file's lines as a list, which you can then add to documents with extend(). Each line keeps its trailing newline.

1 Comment

I am new to Stack Overflow, @juanpa.arrivillaga. What I meant was that the contents of the list that .readlines() creates could then be appended to documents, but I see that your most recent comment already covers what I was trying to explain. Thank you.
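
The difference between the two calls is easiest to see side by side. The following is a minimal sketch, assuming a throwaway file named sample.txt (a hypothetical name) that holds two of the sentences from the question: .readlines() keeps the newline at the end of each line, while .read().splitlines() drops it.

```python
import os
import tempfile

# Hypothetical sample file with two lines.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w") as f:
        f.write("machine learning can be useful sometimes\n")
        f.write("new york mayor was present\n")

    with open(path) as f:
        with_newlines = f.readlines()  # every item still ends with "\n"
    with open(path) as f:
        without_newlines = f.read().splitlines()  # newlines are removed

# Strip the newlines from the readlines() result before collecting it.
documents = []
documents.extend(line.rstrip("\n") for line in with_newlines)

print(repr(with_newlines))
print(documents == without_newlines)
```

If the goal is a list identical to the hard-coded one in the question, the trailing newlines have to be removed in one of these two ways.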
... I want to read the same three sentences from two different txt files with the first sentence in the first file, and sentence 2 and 3 in the second file.

Translating the requirements directly gives:

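with open('somefile1.txt') as f1:
    lines_file1 = f1.readlines()
with open('somefile2.txt') as f2:
    lines_file2 = f2.readlines()
# Sentence 1 from the first file, sentences 2 and 3 from the second file.
# Assumes the second file contains only those two sentences, so they are at indices 0 and 1.
selected = lines_file1[0:1] + lines_file2[0:2]
documents = [line.rstrip('\n') for line in selected]  # drop the trailing newlines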

FWIW, given the kind of work you're doing, the fileinput module from the standard library may be helpful.
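
As a rough illustration of the fileinput suggestion, the sketch below chains two files into one stream of lines. The temporary directory and the file contents are assumptions made for the example; fileinput.input() and its files parameter are part of the standard library.

```python
import fileinput
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files: sentence 1 in the first, sentences 2 and 3 in the second.
    first = os.path.join(tmp, "somefile1.txt")
    second = os.path.join(tmp, "somefile2.txt")
    with open(first, "w") as f:
        f.write("the mayor of new york was there\n")
    with open(second, "w") as f:
        f.write("machine learning can be useful sometimes\n")
        f.write("new york mayor was present\n")

    # fileinput.input() yields the lines of each file in the order given.
    with fileinput.input(files=[first, second]) as stream:
        documents = [line.rstrip("\n") for line in stream]

print(documents)
```

Because the file names are passed explicitly, the order of the resulting list is under your control, which is not the case with an unsorted glob.glob() call.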

Hope this gets you back in business :-)
