
I am extending my previous question, python efficient substring search.

I am interested in improving the performance of a substring search implementation.

Some of the answers to my previous question pointed out that substring search is implemented using fastsearch, which is based on an algorithm inspired by Boyer–Moore; here is the source code.

Other answers pointed me to Python implementations of the Boyer–Moore and Rabin–Karp algorithms.

Would it be efficient to embed C code providing a good implementation of substring search using those algorithms (Boyer–Moore, Rabin–Karp)?

5 Comments
  • @MartijnPieters: that's an answer, post it! Commented Sep 4, 2012 at 7:48
  • Usually you extend Python using C code (not embed), e.g., using Cython (you don't need permission to create an extension module). Also, 100 times seems too good to be true. Have you tested the code? Have you tried to measure performance for a word at the start/middle/end of, or missing from, a large string (assuming you target your algorithm at large strings)? Here's an example of how Python can be optimized: Simple Python Challenge: Fastest Bitwise XOR on Data Buffers Commented Sep 5, 2012 at 5:40
  • I've read your code for RabinKarp and it is incorrect. You return true if h == h256, where h and h256 have only 256 distinct values. The nested loop does nothing. Commented Sep 5, 2012 at 5:52
  • Read the code carefully. continue advances the internal loop. It does nothing for the outer loop. Commented Sep 5, 2012 at 6:01
  • @J.F. Sebastian: Thanks, you are right, I will fix it now and edit everything again. Commented Sep 5, 2012 at 6:03
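
For reference, a minimal sketch of a Rabin–Karp search of the kind discussed in these comments, with a rolling hash over a larger space and an explicit verification step so that equal hashes alone never count as a match (the names and the modulus here are illustrative, not the original code):

    def rabin_karp(text, pattern, base=256, mod=1000003):
        """Sketch: first index of pattern in text, or -1.

        Rolling hash modulo a largish prime; every hash match is
        verified by direct comparison, so collisions are harmless.
        """
        n, m = len(text), len(pattern)
        if m == 0:
            return 0
        if m > n:
            return -1
        high = pow(base, m - 1, mod)       # factor of the char that rolls out
        p_hash = t_hash = 0
        for i in range(m):                 # O(m) setup
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            t_hash = (t_hash * base + ord(text[i])) % mod
        for i in range(n - m + 1):
            # the explicit slice comparison is the fix: a hash hit
            # alone is never treated as a match
            if p_hash == t_hash and text[i:i + m] == pattern:
                return i
            if i < n - m:                  # roll the window one character
                t_hash = ((t_hash - ord(text[i]) * high) * base
                          + ord(text[i + m])) % mod
        return -1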

1 Answer


You haven't specified what you mean by 'efficient'. What trade-offs are you willing to make? Would you be prepared to pay a performance price when initializing a new string? When starting the search? Would you trade more memory for more speed?

The Python developers set clear goals when they developed the Python string library:

  • should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test
  • small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)
  • sublinear search behaviour in good cases (O(n/m))
  • no worse than the current algorithm in worst case (O(nm))
  • should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)
  • many real-life searches should be good, very few should be worst case
  • reasonably simple implementation

So the devs set limits on performance for both the search case and the setup case, on storage requirements, and on maintainability. Those boundaries ruled out Boyer–Moore (it requires preprocessing the searched-for string, which carries a startup cost and a storage cost), and although I see no evidence that the devs considered Rabin–Karp, it can be ruled out on the same grounds (you need to create the hashes and store them).
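
To make that startup cost concrete, here is a rough sketch of the per-pattern preprocessing a Boyer–Moore-style search needs, using the simpler Horspool bad-character table (an illustrative sketch, not CPython's code):

    def horspool_find(text, pattern):
        """Sketch of Boyer-Moore-Horspool search: first index or -1.

        The shift table below is the per-pattern setup the fastsearch
        goals rule out: O(m) build time plus per-pattern storage.
        """
        m = len(pattern)
        if m == 0:
            return 0
        # bad-character shift table; characters absent from the
        # pattern get the full pattern length as their shift
        table = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
        i = m - 1
        while i < len(text):
            j, k = m - 1, i                # compare backwards from the end
            while j >= 0 and text[k] == pattern[j]:
                j -= 1
                k -= 1
            if j < 0:
                return k + 1               # full match
            i += table.get(text[i], m)     # bad-character shift
        return -1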

The boundaries were set based on a great deal of experience with Python internals and real-world usage. The above summary wasn't pulled out of thin air; it is merely a distillation of that experience.

Now, if you have a specific case where your trade-offs can be set differently, then sure, a C implementation of a different algorithm could well beat the standard Python implementation. But it'll be more efficient according to a different set of criteria.
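
If you do explore that route, measure first, along the lines suggested in the comments above. A minimal timeit sketch (the corpus and needles are made up for illustration), comparing a needle that matches immediately against one that forces a full scan, might look like:

    import timeit

    # made-up corpus: 10 MB of 'a'; one needle matches immediately,
    # the other forces a scan of the whole haystack
    haystack = "a" * 10_000_000
    needle_hit = haystack[:100]            # matches at index 0
    needle_miss = "a" * 99 + "b"           # never matches

    for name, needle in [("hit at start", needle_hit),
                         ("missing", needle_miss)]:
        t = timeit.timeit(lambda: haystack.find(needle), number=10)
        print(f"{name}: {t / 10:.4f}s per search")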

In any case, the Python search algorithm is aimed at the small-strings case. If you apply it to a large body of text, it cannot perform as well as an algorithm whose trade-offs are chosen for large texts. And if you had to search for text across 10,000,000 documents, you'd want some kind of indexing solution instead of the puny little Python string search.

Compare that to sorting a list of 100 items with the default sort implementation, vs. sorting 10,000,000,000 integers. In the latter case there are sorting implementations that can easily beat the default Python offering.

It should also be noted that Python has a history of algorithm innovation; the standard sort algorithm in Python is Timsort, a new algorithm invented by Tim Peters to fit the pragmatic real-life circumstances the Python interpreter has to deal with. That algorithm has since been made the default in Java and on the Android platform as well. Thus, I tend to trust the Python core devs' decisions.

As far as I know, no one has embedded a different implementation, as replacing the default is not going to work without patching the Python C code. You can easily create a specialized string type that implements a different search algorithm, of course. There may well be libraries out there that use C for specialized search algorithms based on Boyer–Moore, Rabin–Karp or any other algorithm, as that might well be the better choice for their specific problem domain.
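
For illustration, a minimal sketch of such a specialized string type: a hypothetical str subclass with a pluggable search routine. The _search stand-in below merely defers to the built-in fastsearch, but it is the place where a C extension or ctypes-wrapped routine would plug in:

    class SearchableStr(str):
        """Hypothetical str subclass with a pluggable search routine."""

        def find(self, sub, start=0, end=None):
            end = len(self) if end is None else end
            return self._search(str(self)[start:end], str(sub), start)

        @staticmethod
        def _search(text, pattern, offset):
            # stand-in: defer to the built-in fastsearch; a real
            # version could call a C extension or ctypes-wrapped
            # Boyer-Moore / Rabin-Karp routine here instead
            pos = str.find(text, pattern)
            return -1 if pos == -1 else pos + offset

    s = SearchableStr("some large body of text")
    print(s.find("body"))                  # -> 11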


6 Comments

Really good answer. It's good to know that Python wasn't born out of the blue sky and that its creators/developers know what they are doing.
+1, very good answer, although if everyone trusted others, no innovation would exist ;-). As regards Rabin–Karp, storing the hashes is not a problem, since you use a single rolling hash function, so that is no ground for ruling it out. I am interested mainly in time performance. Today, memory is cheap and less important.
@Michael: Not sure that invalidates my answer at all; you did not specify what efficiency means, and are you certain you can quantify the costs in all cases for general Python usage? If so, then take that to the Python dev list! Remember that you need to justify more than just pure speed here.
@Michael: Also, you keep changing the premises of your question. That makes it damn hard to provide answers. Remember that posts on Stack Overflow are not just for your own benefit, but for future visitors too. Changing the question every time you discover something new means you should perhaps be on a forum or mailing list, not SO. Please leave the meaning of a question alone and create new posts, or move it to the Py dev list instead.
@Martijn Pieters: I haven't changed the question; I only argued, in response to your answer, that pointing to the Python developers' current implementation is not enough. I am still interested in whether it is worthwhile to implement the algorithms in C and embed them in Python. Your answer is good and I voted it up; I am interested in whether someone has implemented what I suggested and embedded it.
