
For a single large text (~4GB) I need to search for ~1 million phrases and replace them with complementary phrases. Both the raw text and the replacements can easily fit in memory. The naive solution would literally take years to finish, as a single replacement takes about a minute.

Naive solution:

for search, replace in replacements.iteritems():
    text = text.replace(search, replace)

The regex method using re.sub is about 10x slower:

for search, replace in replacements.iteritems():
    text = re.sub(search, replace, text)

At any rate, this seems like a great place to use Boyer-Moore string search or Aho-Corasick; but these methods, as they are generally implemented, only work for searching the string and not also replacing it.

Alternatively, any tool (outside of Python) that can do this quickly would also be appreciated.

Thanks!

  • Could you show the equivalent regex you're using and possibly some timings? Commented Sep 5, 2013 at 20:49
  • once you found all matches with something like Aho-Corasick you could then perform your replacements (checking for overlap of course) Commented Sep 5, 2013 at 20:53
  • The timings are about 60s for each text.replace() call, which for 1million calls takes about two years. @JonClements Commented Sep 5, 2013 at 20:55
  • For Aho-Corasick, the search is fast, but the replacements are then what I imagine holding you back. The replacement string is not necessarily the same length as the matching string, which means that the new whole string of roughly the same size must be copied wholesale, or at least split into before and after pieces for every match. Either way, this is very slow in Python. Perhaps not the case if you drop down into C, where you can avoid string splits, joins, etc. @cmd Commented Sep 5, 2013 at 20:57
  • @Chris some of the Python string building drops down to C for you, like str.join. Commented Sep 5, 2013 at 21:00

4 Answers


There's probably a better way than this:

re.sub('|'.join(map(re.escape, replacements)), lambda match: replacements[match.group()], text)

This does one search pass, but it's not a very efficient search. The re2 module may speed this up dramatically.
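A minimal sketch of this approach, expanded from the answer's one-liner (the longest-first ordering and the tiny sample data are my additions, not part of the original answer): build one compiled pattern from all escaped phrases, then let a single sub() pass look each match up in the dict.

import re

# Tiny sample; `replacements` stands in for the ~1M-entry dict and `text` for the ~4GB string.
replacements = {"colour": "color", "favourite": "favorite"}
text = "my favourite colour"

# Escape each phrase so regex metacharacters are treated literally, and order
# longest-first so a longer phrase wins when one phrase is a prefix of another.
pattern = re.compile(
    "|".join(re.escape(s) for s in sorted(replacements, key=len, reverse=True))
)

# One pass over the text; each match is mapped to its replacement via the dict.
new_text = pattern.sub(lambda m: replacements[m.group(0)], text)
print(new_text)  # -> "my favorite color"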


1 Comment

I think this is much closer to an optimal solution in that it does everything in a single pass. The lambda mapping to the correct replacement looked great, but it's turning out to be terribly slow. I'll look into re2. Thanks!

Outside of python, sed is usually used for this sort of thing.

For example (taken from here), to replace the word ugly with beautiful in the file sue.txt:

sed -i 's/ugly/beautiful/g' /home/bruno/old-friends/sue.txt

You haven't posted any profiling of your code; you should get some timings before you do any premature optimization. Searching and replacing text in a 4GB file is a computationally intensive operation.

ALTERNATIVE: ask whether you should be doing this at all.

You discuss below doing an entire search and replace of the Wikipedia corpus in under 10ms. This rings some alarm bells, as it doesn't sound like great design. Unless there's an obvious reason not to, you should modify whatever code you use to present and/or load the data so that the search and replace is done on the subset being loaded or viewed. It's unlikely you'll be doing many operations on the entire 4GB of data, so restrict your search and replace operations to what you're actually working on. Additionally, your timing is still very imprecise because you don't know how big the file you're working on is.

On a final point, you note that:

the speedup has to be algorithmic not chaining millions of sed calls

But you indicated that the data you're working with was a "single large text (~4GB)", so there shouldn't be any chaining involved if I understand what you mean by that correctly.

UPDATE: Below you indicate that doing the operation on a ~4KB file (I'm assuming) takes 90s, which seems very strange to me - sed operations don't normally take anywhere close to that. If the file is actually 4MB (I'm hoping), then it should take 24 hours to evaluate (not ideal, but probably acceptable?).
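If sed is to be used at all, one way to avoid launching it once per phrase is to write every substitution into a single sed script and run sed once with -f. A hedged sketch in Python (the file names are placeholders, and it assumes the phrases contain no characters special to sed, such as the / delimiter, which would otherwise need escaping); note that sed still tries each command against each line, so this saves process startups and file passes but not matching work:

import subprocess

# Tiny sample; `replacements` stands in for the ~1M-entry dict from the question.
replacements = {"ugly": "beautiful", "colour": "color"}

# Emit one s/.../.../g command per phrase into a single sed script.
with open("replacements.sed", "w") as script:
    for search, replace in replacements.items():
        script.write(f"s/{search}/{replace}/g\n")

# One sed invocation and one pass over the file, applying every command to each line.
with open("big_corpus.replaced.txt", "w") as out:
    subprocess.run(["sed", "-f", "replacements.sed", "big_corpus.txt"],
                   stdout=out, check=True)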

6 Comments

This doesn't really help; the point is that I have many, multiple replacements. This solution would involve ~1million calls to sed, and ~1million passes through the 4GB file. Both the sed and the naive implementation above are O(n * m), which is just too slow for the given data size.
Yes. time sed 's/Amistad/amistad/g' full_03_wikipedia_news_big > /dev/null runs in 90s. This means that ~1million * 90s = 2.8 years.
Something weird is up if that operation is taking 90s on a 4KB file (I estimate 4KB from you saying * 1 million ~ 4GB) - are those files definitely on your filesystem? Also try the test with sed -i.
Would it be possible to post that full_03_wikipedia_news_big file on pastebin or gist.github? I want to see why it takes so long.
It's just a cleaned text dump of a subset of the Wikipedia corpus. I can't put it up online. But the speed-up has to be algorithmic, not in chaining millions of sed calls. Unless you can get a single sed call to complete a 4GB search and replace in under 10ms, doing it a million times will take at least a few hours.

I had this use case as well, where I needed to do ~100,000 search and replace operations on the full text of Wikipedia. Using sed, awk, or perl would take years. I wasn't able to find any implementation of Aho-Corasick that did search-and-replace, so I wrote my own: fsed. This tool happens to be written in Python (so you can hack into the code if you like), but it's packaged up as a command line utility that runs like sed.

You can get it with:

pip install fsed

1 Comment

There isn't any easy-to-understand example for your library, which is keeping me from using it. Also, it lacks regex support.

they are generally implemented only work for searching the string and not also replacing it

Perfect, that's exactly what you need. Searching a 4GB text with an inefficient algorithm is bad enough, but doing several replacements is probably even worse... you potentially have to move gigabytes of text to make room for the expansion/shrinking caused by the size difference between the source and target text.

Just find the locations, then join the pieces with the replacement parts.

So a dumb analogy would be "_".join( "a b c".split(" ") ), but of course you don't want to create copies the way split does.
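A hedged sketch of that find-then-join idea, using the third-party pyahocorasick package as the matcher (my choice for illustration; the answer doesn't name a library): one automaton pass collects the match positions, overlaps are skipped greedily, and the output is assembled with a single join instead of repeated whole-string replaces.

import ahocorasick  # pip install pyahocorasick (illustrative choice, not named in the answer)

# Tiny sample; `replacements` stands in for the ~1M-entry dict and `text` for the ~4GB string.
replacements = {"colour": "color", "favourite": "favorite"}
text = "my favourite colour"

# Build the automaton once from all search phrases.
automaton = ahocorasick.Automaton()
for phrase, replacement in replacements.items():
    automaton.add_word(phrase, (len(phrase), replacement))
automaton.make_automaton()

# Single pass: collect pieces of untouched text plus replacements, skipping
# matches that overlap an already-accepted one (a simple greedy policy).
pieces, last_end = [], 0
for end_index, (length, replacement) in automaton.iter(text):
    start = end_index - length + 1
    if start < last_end:
        continue
    pieces.append(text[last_end:start])  # untouched text before the match
    pieces.append(replacement)
    last_end = end_index + 1
pieces.append(text[last_end:])           # tail after the final match

new_text = "".join(pieces)  # one join, no per-replacement copy of the whole string
print(new_text)             # -> "my favorite color"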

Note: any reason to do this in python?

2 Comments

It doesn't have to be in Python; an off-the-shelf tool would be ideal since it's a one-time operation that I'm doing. As you've pointed out, it's not just the searching that's the issue, but also that the replacements are costly because of the string splitting.
This should make it fast: 1) record positions, don't split; 2) join, don't do replace; 3) if your OS supports it, you can use sendfile, in which case you don't even have to store the original content. And sorry, no, I don't know of any off-the-shelf software which can do this quickly. Quick to write the app: sure... quick to execute: nope...
