5

I am trying to do some simple analyses on the Kenneth French industry portfolios (first time with Pandas/Python), data is in txt format (see link in the code). Before I can do computations, first want to load it in a Pandas dataframe properly, but I've been struggling with this for hours:

import urllib.request
import os.path
import zipfile
import pandas as pd
import numpy as np

# paths
url = 'http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/48_Industry_Portfolios_CSV.zip'
csv_name = '48_Industry_Portfolios.CSV'
local_zipfile = '{0}/data.zip'.format(os.getcwd())
local_file = '{0}/{1}'.format(os.getcwd(), csv_name)

# download data
if not os.path.isfile(local_file):
    print('Downloading and unzipping file!')
    urllib.request.urlretrieve(url, local_zipfile)
    zipfile.ZipFile(local_zipfile).extract(csv_name, os.path.dirname(local_file))

# read from file
df = pd.read_csv(local_file,skiprows=11)
df.rename(columns={'Unnamed: 0' : 'dates'}, inplace=True)

# build new dataframe
first_stop = df['dates'][df['dates']=='201412'].index[0]
df2 = df[:first_stop]

# convert date to datetime object
pd.to_datetime(df2['dates'], format = '%Y%m')
df2.index = df2.dates

All the columns, except dates, represent financial returns. However, due to the file formatting, these are now strings. According to Pandas docs, this should do the trick:

df2.convert_objects(convert_numeric=True)

But the columns remain strings. Other suggestions are to loop over the columns (see for example pandas convert strings to float for multiple columns in dataframe):

for d in df2.columns:
if d is not 'dates':
    df2[d] = df2[d].map(lambda x: float(x)/100)

But this gives me the following warning:

 home/<xxxx>/Downloads/pycharm-community-4.5/helpers/pydev/pydevconsole.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  try:

I have read the documentation on views vs copies, but having difficulty to understand why it is a problem in my case, but not in the code snippets in the question I linked to. Thanks

Edit:

df2=df2.convert_objects(convert_numeric=True)

Does the trick, although I receive a depreciation warning (strangely enough that is not in the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html)

Some of df2:

     dates    Agric    Food     Soda     Beer     Smoke    Toys     Fun    \
dates                                                                           
192607  192607     2.37     0.12   -99.99    -5.19     1.29     8.65     2.50   
192608  192608     2.23     2.68   -99.99    27.03     6.50    16.81    -0.76   
192609  192609    -0.57     1.58   -99.99     4.02     1.26     8.33     6.42   
192610  192610    -0.46    -3.68   -99.99    -3.31     1.06    -1.40    -5.09   
192611  192611     6.75     6.26   -99.99     7.29     4.55     0.00     1.82   

Edit2: the solution is actually more simple than I thought:

df2.index = pd.to_datetime(df2['dates'], format = '%Y%m')
df2 = df2.astype(float)/100
3
  • silly question did you assign back the result of convert_objects? e.g. df = df.convert_objects(convert_numeric=True)? Commented Oct 29, 2015 at 21:11
  • Can you add what your df2 looks like? Edit your question with at least a partial view of the data frame. Commented Oct 29, 2015 at 21:12
  • @EdChum: that works, although I do not understand why I would need to assign the result, isn't that stored? Renaming columns is stored as well. Commented Oct 29, 2015 at 21:24

3 Answers 3

2

I would try the following to force convert everything into floats:

df2=df2.astype(float)
Sign up to request clarification or add additional context in comments.

Comments

1

You can convert specific column to float(or any numerical type for that matter) by

df["column_name"] = pd.to_numeric(df["column_name"])

Posting this because pandas.convert_objects is deprecated in pandas 0.20.1

Comments

0

You need to assign the result of convert_objects as there is no inplace param:

df2=df2.convert_objects(convert_numeric=True)

you refer to the rename method but that one has an inplace param which you set to True.

Most operations in pandas return a copy and some have inplace param, convert_objects is one that does not. This is probably because if the conversion fails then you don't want to blat over your data with NaNs.

Also the deprecation warning is to split out the different conversion routines, presumably so you can specialise the params e.g. format string for datetime etc..

3 Comments

Thanks for your help! I've added the solution, just realized I had the same issue with converting the dates. Maybe a bit off-topic, but suppose I would only want to convert a single column from strings to floats (as I tried to do in my original question), how can I avoid the SettingWithCopyWarning?
You should filter the df for the object dtypes and then cast these using astype so something like cols = list(df) cols.pop('dates' for col in cols: df[cols] = df[cols].astype(float) / 100 should work but really convert_objects is the way to go
Unfortunately convert_objects is Deprecated since version 0.21.0. so another methods should be preferred nowadays

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.