I have a dataframe pd1 that I read with pandas:

pd1 = pd.read_csv(r'c:\am\wiki_stats\topandas.txt',sep=':',
                  header=None, names  = ['date-time','domain','requests-qty','response-bytes'],
                   parse_dates=[1], converters={'date-time': to_datetime}, index_col = 'date-time')

with index

>> pd1.index:  

 DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 00:00:00',
                '2016-01-01 00:00:00', '2016-01-01 00:00:00',
                '2016-01-01 00:00:00', '2016-01-01 00:00:00',
                '2016-01-01 00:00:00', '2016-01-01 00:00:00',
                '2016-01-01 00:00:00', '2016-01-01 00:00:00',
                ...
                '2016-08-05 12:00:00', '2016-08-05 12:00:00',
                '2016-08-05 12:00:00', '2016-08-05 12:00:00',
                '2016-08-05 12:00:00', '2016-08-05 12:00:00',
                '2016-08-05 12:00:00', '2016-08-05 12:00:00',
                '2016-08-05 12:00:00', '2016-08-05 12:00:00'],
               dtype='datetime64[ns]', name='date-time', length=6084158, freq=None)

But when I try to set the index to that column, I get the error below. I initially wanted to set an index on multiple columns, and that is when the error first appeared. I then tried to create another dataframe from it with other columns as the index, pd_new_index = pd1.set_index(['requests-qty','domain']), which worked, and then to make a new frame by setting the index back to the 'date-time' column, pd_new_2 = pd_new_index.set_index(['date-time']), which gave the same error. 'date-time' does not look like a special keyword, and that column is the index now. Why the error?

KeyError                                  Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'date-time'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
----> 1 pd_new_2 = pd_new_index.set_index(['date-time'])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   4176                 names.append(None)
   4177             else:
-> 4178                 level = frame[col]._values
   4179                 names.append(col)
   4180             if drop:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2657                 return self._engine.get_loc(key)
   2658             except KeyError:
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2661         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'date-time'

1 Answer

The reason is that date-time is already the index (here a DatetimeIndex), so it is not possible to select it by name like a column.

It became the index because of the index_col parameter:

pd1 = pd.read_csv(r'c:\am\wiki_stats\topandas.txt',
                  sep=':',
                  header=None, 
                  names  = ['date-time','domain','requests-qty','response-bytes'],
                  parse_dates=[1], 
                  converters={'date-time': to_datetime}, 
                  index_col = 'date-time')
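
To make the cause visible, here is a minimal sketch on a small made-up sample (the three sample rows and the variable names sample, in_columns, index_name and raised are assumptions for illustration, not the real file). After index_col='date-time', the name is stored in df.index.name and is no longer among df.columns, which is why set_index cannot find it:

```python
import pandas as pd
from io import StringIO

# Hypothetical sample standing in for the real file
sample = """2016-01-01:d1:0:0
2016-01-02:d2:0:1
2016-01-03:d3:1:0"""

df = pd.read_csv(StringIO(sample),
                 sep=':',
                 header=None,
                 names=['date-time', 'domain', 'requests-qty', 'response-bytes'],
                 parse_dates=['date-time'],
                 index_col='date-time')

# 'date-time' is the index name, not a column label
in_columns = 'date-time' in df.columns
index_name = df.index.name
print(in_columns, index_name)

# set_index looks keys up among the columns only, so this fails
try:
    df.set_index(['date-time'])
    raised = False
except KeyError as err:
    raised = True
    print('KeyError:', err)
```

Checking df.columns and df.index.name like this is a quick way to tell whether a name belongs to a column or to the index before calling set_index.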

For a MultiIndex, pass a list of column names to index_col, remove converters, and specify the column name in the parse_dates parameter:

import pandas as pd
from io import StringIO

temp=u"""2016-01-01:d1:0:0
2016-01-02:d2:0:1
2016-01-03:d3:1:0"""
#after testing, replace StringIO(temp) with r'c:\am\wiki_stats\topandas.txt'
df = pd.read_csv(StringIO(temp), 
                 sep=':',
                 header=None, 
                 names  = ['date-time','domain','requests-qty','response-bytes'],
                 parse_dates=['date-time'], 
                 index_col = ['date-time','domain'])

print (df)
                   requests-qty  response-bytes
date-time  domain                              
2016-01-01 d1                 0               0
2016-01-02 d2                 0               1
2016-01-03 d3                 1               0

print (df.index)
MultiIndex([('2016-01-01', 'd1'),
            ('2016-01-02', 'd2'),
            ('2016-01-03', 'd3')],
           names=['date-time', 'domain'])

EDIT1: A solution using the append parameter of set_index:

import pandas as pd
from io import StringIO


temp=u"""2016-01-01:d1:0:0
2016-01-02:d2:0:1
2016-01-03:d3:1:0"""
#after testing, replace StringIO(temp) with r'c:\am\wiki_stats\topandas.txt'
df = pd.read_csv(StringIO(temp), 
                 sep=':',
                 header=None, 
                 names  = ['date-time','domain','requests-qty','response-bytes'],
                 parse_dates=['date-time'], 
                 index_col = 'date-time')

print (df)
           domain  requests-qty  response-bytes
date-time                                      
2016-01-01     d1             0               0
2016-01-02     d2             0               1
2016-01-03     d3             1               0

print (df.index)
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03'], 
              dtype='datetime64[ns]', name='date-time', freq=None)

df1 = df.set_index(['domain'], append = True)
print (df1)
                   requests-qty  response-bytes
date-time  domain                              
2016-01-01 d1                 0               0
2016-01-02 d2                 0               1
2016-01-03 d3                 1               0

print (df1.index)
MultiIndex([('2016-01-01', 'd1'),
            ('2016-01-02', 'd2'),
            ('2016-01-03', 'd3')],
           names=['date-time', 'domain'])
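
Another possible route, sketched here on the same kind of made-up sample (the sample rows and the names via_append, via_reset and same_index are assumptions for illustration): reset_index moves date-time back into the columns, after which set_index can take both names by label. It should produce the same index as the append version:

```python
import pandas as pd
from io import StringIO

# Hypothetical sample standing in for the real file
sample = """2016-01-01:d1:0:0
2016-01-02:d2:0:1
2016-01-03:d3:1:0"""

df = pd.read_csv(StringIO(sample),
                 sep=':',
                 header=None,
                 names=['date-time', 'domain', 'requests-qty', 'response-bytes'],
                 parse_dates=['date-time'],
                 index_col='date-time')

# Keep the existing index and add 'domain' as a second level
via_append = df.set_index(['domain'], append=True)

# Turn the index back into a column, then set both columns as the index
via_reset = df.reset_index().set_index(['date-time', 'domain'])

same_index = via_reset.index.equals(via_append.index)
print(via_reset.index.names, same_index)
```

Note that set_index without append=True replaces the existing index, so a frame built as pd1.set_index(['requests-qty','domain']) no longer carries date-time at all; it has to be rebuilt from pd1 rather than recovered from that frame.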

9 Comments

How do I add another column to the index, to get an index like pd1.set_index(['date-time','domain'])?
I understand I can append, can't I? pd_new_index4 = pd1.set_index(['domain'], append = True). After that command, pd_new_index_v4.head(5) shows the first two column names on a line below the others, as it did for just the first one before. But print (pd_new_index_v4.index) prints nothing, and after a few more clicks Jupyter shows something like an "insufficient memory to display page" error. That is another issue, I suppose. But append should work?
@AlexeiMartianov - I think pd_new_index4 = pd1.set_index(['domain'], append = True) is a good solution. What does print (pd_new_index_v4.index) return? Nothing? That is weird.
I guess it's a low-memory issue; my dataset could be considered large (a 200 MB text file). Or is that not so large? How can I tell whether Jupyter is just lagging?
@AlexeiMartianov - Sure, it is called a multi-line string, and the u prefix marks a unicode string in Python 2. It can be removed now, because Python 3 strings are unicode by default.