I use the polars, urllib and tldextract packages in Python to parse 2 columns of URL strings in zstd-compressed parquet files (averaging 8 GB and 40 million rows per file). The parsed output includes the scheme, netloc, subdomain, domain, suffix, path, query, and fragment. On my PC (64 GB RAM, 32 logical cores), processing a single file takes about 16 minutes. Input data is read from one SSD (x:/) and output is written to a separate SSD (y:/).

My code doesn't use multiprocessing; it relies on polars' streaming and vectorization for efficiency. RAM usage is near the limit while processing. I don't get the full benefit of polars' vectorization because urllib and tldextract are not native to polars' Rust engine.

Are there alternative approaches or modifications that would speed up processing? Is a 16-minute processing time reasonable? I've considered writing a Rust extension for polars with the same functionality as urllib and tldextract, but that seems like reinventing the wheel.

The method below does the heavy lifting (Python 3.12, polars 1.31):

from pathlib import Path

import polars as pl


def build_parsed_url_for_date(
    event_date: str,
    silver_root: Path,
    gold_root: Path,
    compression: str = "zstd",
) -> None:
    """
    For a given event_date (YYYY-MM-DD):

    - Read SILVER parquet from: silver_root / f"event_date={event_date}" / *.parquet
    - Parse `url` and `referrer` columns into components (PSL-based).
    - Write GOLD/parsed_url parquet to:
          gold_root / "parsed_url" / f"event_date={event_date}" / "part.parquet"
    """
    # input paths
    silver_partition = silver_root / f"event_date={event_date}"
    if not silver_partition.exists():
        raise FileNotFoundError(f"SILVER partition not found: {silver_partition}")

    silver_files = sorted(silver_partition.glob("*.parquet"))
    if not silver_files:
        raise FileNotFoundError(f"No parquet files in {silver_partition}")


    # polars lazy frame
    lf = pl.scan_parquet([str(f) for f in silver_files])

    # wrappers for urllib and tldextract parsers
    parser_url = make_polars_parser("url")  # returns dict of parsed components
    parser_ref = make_polars_parser("ref")

    # Projection pushdown: read only the columns we need, then parse
    lf_parsed = (
        lf.select(["id", "referrer", "url"])
        .with_columns(
            pl.col("url").map_elements(parser_url, return_dtype=url_struct_dtype("url")).alias("url_parsed"),
            pl.col("referrer").map_elements(parser_ref, return_dtype=url_struct_dtype("ref")).alias("ref_parsed"),
        )
        .unnest("url_parsed")
        .unnest("ref_parsed")
    )

    # Output path
    gold_parsed_root = gold_root / "parsed_url"
    out_dir = gold_parsed_root / f"event_date={event_date}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "part.parquet"

    # Stream/write to parquet
    lf_parsed.sink_parquet(str(out_path), compression=compression)
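
For context, the wrappers are roughly shaped like the sketch below (simplified, not the exact code; the per-column field-name prefix is an assumption so the two unnested structs don't collide):

from urllib.parse import urlparse

import tldextract


def make_polars_parser(prefix: str):
    """Return a per-row parser whose output matches url_struct_dtype(prefix)."""
    def parse(value: str | None) -> dict | None:
        if value is None:
            return None
        parts = urlparse(value)          # scheme, netloc, path, query, fragment
        ext = tldextract.extract(value)  # PSL-based subdomain, domain, suffix
        return {
            f"{prefix}_scheme": parts.scheme,
            f"{prefix}_netloc": parts.netloc,
            f"{prefix}_subdomain": ext.subdomain,
            f"{prefix}_domain": ext.domain,
            f"{prefix}_suffix": ext.suffix,
            f"{prefix}_path": parts.path,
            f"{prefix}_query": parts.query,
            f"{prefix}_fragment": parts.fragment,
        }
    return parse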

7 Replies

You're processing 40 million rows in roughly 960 seconds, which is more than 41 K rows/sec. So waiting a thousand seconds for the result seems reasonable to me.

I guess one could grovel over more than 42 rows in a millisecond? But we'd have to know exactly which code segments a profiling run reports as "hot", and the OP omits such details.

Measure single-core performance and profile first. Then worry about efficiently sending work to dozens of cores.
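
For example, profiling a plain parse loop over a sample of rows (the input path below is illustrative) would show whether the time goes to urlparse, to tldextract, or to per-element dispatch overhead:

import cProfile
from urllib.parse import urlparse

import polars as pl
import tldextract

# Pull a sample of URLs from one input partition; adjust the path to a real one.
sample = (
    pl.scan_parquet("x:/silver/event_date=2024-01-01/*.parquet")
    .select("url")
    .head(100_000)
    .collect()["url"]
    .to_list()
)

def parse_one(u: str) -> None:
    urlparse(u)
    tldextract.extract(u)

cProfile.run("for u in sample: parse_one(u)", sort="cumulative")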


same functionality as urllib and tldextract

Find a relevant Rust crate, and focus on making it easy to call from python.

Or write a Rust app linked against an existing parser crate, which consumes an input file and writes to a parsed output file. Then python apps can read results from that file.
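
The handoff could be as simple as this on the Python side (the CLI name and flags are made up for illustration):

import subprocess

import polars as pl

# Hypothetical Rust CLI: reads a partition, writes parsed parquet to the output dir.
subprocess.run(
    [
        "parse_urls",
        "--input", "x:/silver/event_date=2024-01-01",
        "--output", "y:/gold/parsed_url/event_date=2024-01-01",
    ],
    check=True,
)

# Python only reads the finished output.
parsed = pl.scan_parquet("y:/gold/parsed_url/event_date=2024-01-01/*.parquet")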

Are you using Polars' map_elements, map_batches, or ...? It'd help if you showed a code example.

Added the primary method I'm currently using to parse URLs.

You didn't mention a GitHub repo URL, and I don't have a reprex to run or profile. The code you posted makes it look like you have 31 idle cores.

I don't know exactly what these do:

    ... = make_polars_parser("url")  # returns dict of parsed components
    ... = make_polars_parser("ref")

They suggest we're parsing with interpreted bytecode, non-vectorized, limited by the GIL.

The string-to-string parsing you're doing is not very fancy; any language environment could do it. But you say "I need speed", which tends to rule out interpreted python. Use crates such as tld, url, or others, then funnel millions of URLs through Rust code which keeps dozens of cores busy.

How often do you need to run this script? It sounds to me like a one-time job. In that case I don't see a reason to optimize anything.

To speed it up from a Polars perspective, you would have to emulate parser_url and parser_ref with Polars expressions.

It's unclear what exactly they do, but I assume emulating them would involve .str.* and possibly struct operations.
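
A rough sketch of the idea for the non-PSL components (the regex and field names are illustrative; the subdomain/domain/suffix split would still need the public-suffix list):

import polars as pl

# One regex pass that yields a struct of URL components via named groups.
url_re = (
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://"
    r"(?P<netloc>[^/?#]*)"
    r"(?P<path>[^?#]*)"
    r"(?:\?(?P<query>[^#]*))?"
    r"(?:#(?P<fragment>.*))?"
)

parsed = pl.col("url").str.extract_groups(url_re).alias("url_parsed")

df = pl.DataFrame({"url": ["https://a.b.example.co.uk/p/q?x=1#frag"]})
print(df.with_columns(parsed).unnest("url_parsed"))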

Or wrap Rust crates as a Polars plugin.

I have not checked the code, but a quick search shows a potential existing example.

You could post a follow-up question that shows the actual parsing code if you need help porting it to expressions.

I need to increase processing speed so that backfilling daily histories takes hours, not days or weeks.

At this time, I've taken up the suggestion to build my own URL parsing engine in Rust (using the url and publicsuffix crates) for Python. It's been successful so far: the Rust library builds, I can import it into Python and run it, and the results look reasonable. I have a bit more pipeline work to do; I'll share the results when I'm done.
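
Roughly, the wiring will look like this (assuming a pyo3/maturin build; the module and function names below are placeholders until I finalize the library):

import polars as pl

import parsed_url_rs  # placeholder name for the compiled extension module

def parse_batch(urls: pl.Series) -> pl.Series:
    # Placeholder API: takes a list of strings, returns a list of dicts of
    # components, which polars turns into a struct Series.
    return pl.Series(parsed_url_rs.parse_many(urls.to_list()))

lf_parsed = (
    pl.scan_parquet("x:/silver/event_date=2024-01-01/*.parquet")
    .with_columns(pl.col("url").map_batches(parse_batch).alias("url_parsed"))
    .unnest("url_parsed")
)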
