People often ask questions on Stack Overflow that include the output of print(dataframe). It would be convenient to have a quick way of loading that printed data into a pandas.DataFrame object.

What are the recommended ways of loading a dataframe from a dataframe-string (which may or may not be properly formatted)?

Example-1

If you want to load the following string as a dataframe what would you do?

# Dummy Data
s1 = """
Client NumberOfProducts ID
A      1                2
A      5                1
B      1                2
B      6                1
C      9                1
"""

Example-2

This type is closer to what you would find in a CSV file.

# Dummy Data
s2 = """
Client, NumberOfProducts, ID
 A, 1, 2
 A, 5, 1
 B, 1, 2
 B, 6, 1
 C, 9, 1
"""

Expected Output

[image: expected DataFrame]

References

Note: The following two links do not address the specific situation presented in Example-1. The reason I think my question is not a duplicate is that I think one cannot load the string in Example-1 using any of the solutions already posted on those links (at the time of writing).

  1. Create Pandas DataFrame from a string. Note that pd.read_csv(StringIO(s1), sep), as suggested here, doesn't really work for Example-1. You get the following output.
    [image: output of that read_csv call]

  2. This question was marked as a duplicate of two Stack Overflow links. One of them is the one above, which fails in addressing the case presented in Example-1. And the second one is . Among all the answers presented there, only one looked like it might work for Example-1, but it did not work.

# could not read the clipboard and raised an error
pd.read_clipboard(sep=r'\s\s+')

Error Thrown:

PyperclipException: 
    Pyperclip could not find a copy/paste mechanism for your system.
    For more information, please visit https://pyperclip.readthedocs.org
  • @yatu I would prefer to use Method-2 like you mentioned, however, it fails to load the data properly from Example-1. Which is why I opened this question and left a reference to a similar question, but not the same in the content of the question. Commented Oct 25, 2019 at 9:07
  • The first was tested with df = pd.read_clipboard(sep='\s+') and works nicely for me. Commented Oct 25, 2019 at 9:31
  • I tried pd.read_clipboard(sep='\s+') and got the same error as pd.read_clipboard(sep='\s\s+'). I think it is system-configuration specific. Commented Oct 25, 2019 at 9:37
  • Yes, I agree, it seems the problem is with the clipboard on your pc/nb. Commented Oct 25, 2019 at 10:11
  • OK. My concern was whether this question, being marked as a duplicate, would get deleted in the future. I just checked meta, and it looks like that won't happen, so I am okay with it as well. meta.stackoverflow.com/questions/320522/… Commented Oct 25, 2019 at 10:30

1 Answer

I can suggest two methods to approach this problem.

Method-1

Process the string with regex and numpy to build the dataframe. In my experience this works most of the time, and it works for the case presented in "Example-1". Note that the number of columns must be known in advance and that no value may contain whitespace.

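# Make Dataframe
import pandas as pd
import numpy as np
import re

s = s1  # the string from Example-1
ncols = 3  # number of columns
# Collapse every run of whitespace (spaces and newlines) into a single comma
ss = re.sub(r'\s+', ',', s.strip())
# Split into cells and reshape to (rows, ncols); the first row is the header
sa = np.array(ss.split(',')).reshape(-1, ncols)
# Map each header name to its column of values
df = pd.DataFrame(dict(zip(sa[0], sa[1:].T)))
df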
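
One caveat with Method-1 is that every column comes out as strings, because the values are produced by splitting text. Below is a minimal sketch of converting the numeric-looking columns afterwards. It assumes the s1 string from Example-1, and the helper name to_numeric_where_possible is hypothetical (it is not part of pandas).

```python
import re

import numpy as np
import pandas as pd

# Example-1 string from the question
s1 = """
Client NumberOfProducts ID
A      1                2
A      5                1
B      1                2
B      6                1
C      9                1
"""

# Method-1: after the split, every cell is a string
ncols = 3
cells = re.sub(r'\s+', ',', s1.strip()).split(',')
sa = np.array(cells).reshape(-1, ncols)
df = pd.DataFrame(sa[1:], columns=sa[0])


def to_numeric_where_possible(frame):
    """Hypothetical helper: convert each column that parses fully as numbers."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns (here: Client) as strings
    return out


df_typed = to_numeric_where_possible(df)
print(df_typed.dtypes)
```

With this in place, NumberOfProducts and ID can be used in arithmetic, while Client stays as text.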

Method-2

Use io.StringIO to feed the string into pandas.read_csv(). This only works if the separator is well defined, for instance if your data looks similar to "Example-2". Source credit

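import pandas as pd
from io import StringIO

# Make Dataframe
s = s2  # the string from Example-2
# skipinitialspace drops the space that follows each comma
df = pd.read_csv(StringIO(s), sep=',', skipinitialspace=True)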

Output

[image: resulting DataFrame]
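
As a possible alternative that covers both examples, read_csv can also split on whitespace. The sketch below assumes the s1 and s2 strings from the question: sep=r'\s+' treats any run of whitespace as the separator for Example-1, and skipinitialspace=True skips the spaces after each comma in Example-2. It would not cope with values that themselves contain spaces, and the final strip calls are only a defensive assumption in case some padding survives parsing.

```python
from io import StringIO

import pandas as pd

# Example-1: whitespace-aligned columns, as printed by print(dataframe)
s1 = """
Client NumberOfProducts ID
A      1                2
A      5                1
B      1                2
B      6                1
C      9                1
"""

# Example-2: comma-separated with stray spaces
s2 = """
Client, NumberOfProducts, ID
 A, 1, 2
 A, 5, 1
 B, 1, 2
 B, 6, 1
 C, 9, 1
"""

# Any run of whitespace acts as the column separator
df1 = pd.read_csv(StringIO(s1), sep=r'\s+')

# Comma separator; spaces right after each comma are skipped
df2 = pd.read_csv(StringIO(s2), sep=',', skipinitialspace=True)

# Defensive cleanup of any padding left in the header or the text column
df2.columns = df2.columns.str.strip()
df2['Client'] = df2['Client'].str.strip()

print(df1)
print(df2)
```

Unlike a manual split, read_csv infers the column types, so the numeric columns do not need a separate conversion step.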

7 Comments

I think the first solution is problematic, because you always get strings. The second is good, but unfortunately it is a dupe.
@jezrael But if you try using the second one for the data in Example-1, it does not work. So, this question sheds light on how to handle such scenarios. Is it still a duplicate?
I think yes, it is a dupe. Maybe it is possible to find a better dupe too.
Maybe a more general solution is added to the dupes.
I totally support not cluttering the question stack. But the questions that this one has been marked as a duplicate of do not address the issue I raised for Example-1. Am I missing something?