
When loading a CSV file in pandas, I encountered the below error message:

DtypeWarning: Columns have mixed types. Specify dtype option on import  
or set low_memory=False

Reading online, I found a few solutions.

The first is to set low_memory=False, but I understand that this isn't good practice and doesn't really resolve the problem.
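A minimal sketch of that option, assuming csv_path_name is the same path variable used in the examples below:

import pandas as pd

# Read the file in a single pass instead of in chunks, so pandas infers
# one dtype per whole column; this silences the warning but can use
# noticeably more memory on large files
df = pd.read_csv(csv_path_name, low_memory=False)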

The second solution is to set a data type for each column (or for each column with mixed data types):

pd.read_csv(csv_path_name, dtype={'first_column': 'str', 'second_column': 'str'})

Again, from what I read, this isn't the ideal solution if we have a big dataset.

The third solution is to create a converter function. To my understanding, this might be the most appropriate solution. I found code that works for me, but I'm trying to better understand what this function is actually doing:

def convert_dtype(x):
    if not x:
        return ''
    try:
        return str(x)
    except:
        return ''

df = pd.read_csv(csv_path_name, converters={'first_col':convert_dtype, 'second_col':convert_dtype, etc.... } )

Can someone please explain the function code to me?

Thanks

  • Hey, I don't feel like this was exactly what I wanted to understand. Nevertheless it is a useful thread to read, thanks! The breakdown Bending Rodriguez provided helped me understand the function. Commented Apr 25 at 14:50

1 Answer


if not x checks whether x is falsy, which for the values pandas passes to a converter effectively means an empty string. If it is empty, the function returns '', an empty string without any content.

def convert_dtype(x):
    if not x:
        return ''

try: return str(x) tries to convert and return x as a string.

    try:
        return str(x)

If converting x to a string doesn't work, it returns '' instead.

    except:
        return ''

Basically, if the content of a cell is empty from the start or can't be converted to a string, it's discarded and replaced with an empty string. I can't judge whether this is a good approach; that depends on what you are trying to accomplish with your application. Either way, your column will only contain strings afterwards.
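As a small illustration of that behaviour, here is the function from the question called on a few hypothetical sample values:

def convert_dtype(x):
    if not x:
        return ''
    try:
        return str(x)
    except:
        return ''

print(convert_dtype(''))     # ''    -> empty field stays an empty string
print(convert_dtype('abc'))  # 'abc' -> strings pass through unchanged
print(convert_dtype(123))    # '123' -> other values are converted with str()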


1 Comment

Thanks a lot for breaking it down and explaining. I agree, that doesn't seem to be the best approach, because I have a lot of columns that should be integers, floats, etc. I'm not sure what the best approach is or what the best practices are for mapping column types in big datasets. Any advice or links where I could read more?
