5

I received a data dump of the SQL database.

The data is formatted in an .sql file and is quite large (3.3 GB). I have no access to the actual database and I don't know how to handle this .sql file in Python.

I am looking for specific steps to take so I can use this SQL file in Python and analyze the data.

2
  • What kind of database? If it's MySQL, you need to install MySQL on your local machine, create a new database, load the dump, and then query it like you would the original database (using PyMySQL, for example). Similar steps exist for any other brand. Just google "restore <brand> from dump" and that should get you started. Commented Jan 10, 2019 at 15:37
  • 2
    Possible duplicate of How to parse MySQL built-code from within Python? Commented Jan 10, 2019 at 15:39

4 Answers

2

It would be extraordinarily difficult to write a Python program that can parse the SQL syntax of a dump file like this and do anything useful with it.

"No. Absolutely not. Absolute nonsense." (And I have over 30 years of experience, including senior management.) You need to go back to your team, and/or to your manager, and look for a credible way to achieve your business objective ... because, "this isn't it."

The only credible thing that you can do with this file is to load it into another MySQL database ... and, well, "couldn't you have just accessed the database from which this dump came?" Maybe so, maybe not, but "one wonders."

Anyhow – your team and its management need to "circle the wagons" and talk about your credible options. Because, the task that you've been given, in my professional opinion, "isn't one." Don't waste time – yours, or theirs.


1 Comment

Their argument for not giving me access to the database is that it holds data from multiple companies, including competitors. My management has no idea what I'm doing or how this all works. I am on my own in my company on this project and I've been given carte blanche to figure out how to do this. Is there any other way to handle this? I want to analyze the data that they have from us, but I can't access the SQL database directly due to their "policy".
2

Eventually I had to install MAMP to create a local MySQL server. I imported the SQL dump with a program like SQLyog that lets you edit SQL databases.
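If a GUI client is not an option, the import step can probably be scripted as well. The following is only a sketch: it assumes the `mysql` command-line client is on the PATH, that the target database has already been created, and that the dump is a plain-text dump. The function names and the default host, port, and user are placeholders, not anything from the setup described above.

```python
import subprocess


def build_mysql_command(database, user="root", host="127.0.0.1", port=3306):
    """Build the argument list for the `mysql` command-line client.

    All defaults are placeholders; adjust them to the local server.
    """
    return [
        "mysql",
        f"--host={host}",
        f"--port={port}",
        f"--user={user}",
        "--password",  # no value given, so the client prompts for it
        database,
    ]


def load_dump(dump_path, database, **connection):
    """Stream the dump into `database`, like `mysql ... database < dump.sql`.

    The file is handed to the client as stdin, so it is never read
    into Python's memory, which matters for a multi-GB dump.
    """
    with open(dump_path, "rb") as dump:
        subprocess.run(
            build_mysql_command(database, **connection),
            stdin=dump,
            check=True,  # raise CalledProcessError if the import fails
        )
```

A call would look like `load_dump("dump.sql", "dumpdb", user="root")`, where both the file name and the database name are hypothetical.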

This made it possible to load the database into Python using SQLAlchemy, MySQL Connector, and pandas.
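A minimal sketch of what the Python side might look like once the dump is restored on a local server. It assumes SQLAlchemy, pandas, and the `mysql-connector-python` driver are installed; the helper names, credentials, database name, and table name are all hypothetical.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, database, host="127.0.0.1", port=3306):
    """Build a SQLAlchemy URL for the MySQL Connector driver.

    User and password are URL-escaped so special characters survive.
    """
    return (
        f"mysql+mysqlconnector://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def iter_table_chunks(url, table, chunksize=100_000):
    """Yield one table as a sequence of pandas DataFrames.

    Reading in chunks keeps memory bounded for large tables.
    """
    # Imported here so build_mysql_url works without these packages.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(url)
    # `table` is trusted input here; do not pass user-supplied names.
    yield from pd.read_sql(f"SELECT * FROM `{table}`", engine, chunksize=chunksize)
```

For example, `for chunk in iter_table_chunks(build_mysql_url("root", "secret", "dumpdb"), "orders"): ...` would process a hypothetical `orders` table piece by piece.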


2

The module sqlparse does a pretty good job at this. For example:

import sqlparse
import collections
import pandas as pd

with open('dump.sql', 'r') as sqldump:

    parser = sqlparse.parsestream(sqldump)
    headers = {}
    contents = collections.defaultdict(list)

    for statement in parser:

        if statement.get_type() == 'INSERT':

            # First sublist: the table name with its column list.
            sublists = statement.get_sublists()
            table_info = next(sublists)
            table_name = table_info.get_name()

            headers[table_name] = [
                col.get_name()
                for col in table_info.get_parameters()
            ]

            # Next sublist: the VALUES clause; one tuple per row, quotes stripped.
            contents[table_name].extend(
                tuple(
                    s.value.strip('"\'')
                    for s in next(rec.get_sublists()).get_identifiers()
                )
                for rec in next(sublists).get_sublists()
            )

data = {
    name: pd.DataFrame.from_records(table, columns = headers[name])
    for name, table in contents.items()
}

It is slow but does the job, probably up to a file size of a few GB. It works even better if you extract the tables one by one (lower memory use) and seek in the file object to the first INSERT statement of the table of interest, so that the sqlparse lexer does not have to process the other huge statements.
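The seeking idea could be sketched roughly as follows, using only the standard library. It assumes the dump looks like typical mysqldump output: every INSERT statement sits on a single line and all INSERT lines of one table are contiguous. The function names are made up for this example.

```python
def insert_prefixes(table):
    """Byte prefixes an INSERT line for `table` may start with."""
    return (
        f"INSERT INTO `{table}`".encode(),  # backtick-quoted name
        f"INSERT INTO {table} ".encode(),   # unquoted name
    )


def find_first_insert(path, table):
    """Return the byte offset of the first INSERT line for `table`, or None."""
    prefixes = insert_prefixes(table)
    offset = 0
    with open(path, "rb") as dump:
        for line in dump:
            if line.startswith(prefixes):
                return offset
            offset += len(line)  # exact, because the file is read as bytes
    return None


def iter_table_inserts(path, table, encoding="utf-8"):
    """Yield the INSERT lines of one table, starting at its first INSERT."""
    offset = find_first_insert(path, table)
    if offset is None:
        return
    prefixes = insert_prefixes(table)
    with open(path, "rb") as dump:
        dump.seek(offset)
        for line in dump:
            # Assumption: the table's INSERT lines are contiguous.
            if not line.startswith(prefixes):
                break
            yield line.decode(encoding)
```

Only the yielded lines then need to go through the parser, one statement at a time, instead of the whole dump.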


0

SqlDumpReader can help with this:

from sql_dump_parser import SqlSimpleDumpParser

sample_lines = [
    'create table TBL1 (id1 int, id2 int, id3 int);',
    'insert into TBL1 (id2, id1) values (1, 2)',
    'insert into TBL1 values (3, 4, 5)'
    ]

sql_parser = SqlSimpleDumpParser()
data = sql_parser.parse_tables(sample_lines)
print(data)
print(sql_parser.table_descriptions)

OUTPUT:

{'TBL1': [[2, 1, None], [3, 4, 5]]}
{'TBL1': {'id1': int, 'id2': int, 'id3': int}}

Read files:

from sql_dump_parser import SqlSimpleDumpParser
sql_parser = SqlSimpleDumpParser()
with open("sample_data\\dump01.sql", "r", encoding='UTF-8') as file_in:
    data = sql_parser.parse_tables(file_in)
