I appended information from several Excel files into a single data frame. Each Excel file has the same structure but corresponds to a different city. The city name is always located in the same cell (C2).

How can I extract the city name in each file so that it appears as a column for the corresponding rows in my newly created data frame?

My appended data frame looks like this:

Col1   Col2
40     34
104    108
23     1
43     21

Hence, I can't tell which rows belong to file X or file Y. Ideally, I'd like to have a data frame such as:

Col1     Col2   Col3
City A   40     34
City A   104    108
City B   23     1
City B   43     21

I'm not sure whether I should edit/write directly to the Excel files before I append them in order to add the corresponding city column, or whether I should do this after, or in the process of, appending to my data frame.

Any guidance would be great.

Edit: This is my best attempt at reproducing the structure of my Excel sheets. Note that column A and rows 5, 6 and 7 are blank. The city name is located in row 2, column C.

I want to extract the information in rows 8 through 11 and add the city name in cell C2 as a column next to these rows.

        ColA  ColB      ColC  ColD  ColE  ColF  ColG
Row1          Type      XYZ
Row2          CityName  XXX
Row3          CityCode  10
Row4          RYear     13
Row5
Row6
Row7
Row8          Rank      Cat.  88    89    90    91
Row9          11        A     111   106   102   101
Row10         12        B     121   144   126   121
Row11         13        C     100   107   100   101

Edit2: Following ALollz's advice, I tried the following code unsuccessfully. I get an error " 'DataFrame' object has no attribute 'ColC' ". Note that files_xlsx is a list that includes all Excel files.

all_data = pd.DataFrame()

for f in files_xlsx:
    city_name = pd.read_excel(f, "SheetA", nrows=2).ColC[1]
    data = pd.read_excel(f, "SheetA", parse_cols="B:J")
    data['col_city'] = city_name
    all_data = all_data.append(data, ignore_index=True)

Edit3: Kept trying and finally found something that works. The only issue is that cityname is only set to one row and not the entire column, which is what I want. Any help?

df = pd.DataFrame()

for f in files_xlsx:
    city_name = pd.read_excel(f, "Sheet1", nrows=2, parse_cols="C", header=None, skiprows=1, skip_footer=264)
    data = pd.read_excel(f, "Sheet1", parse_cols="B:J", header=None, skiprows=8)
    data['City'] = city_name
    df = df.append(data)
  • Can you post a bit about what the excel files look like? Commented Aug 17, 2018 at 19:57
  • It would be helpful if you could just post the head (first 10 lines) of your excel file because we don't know what it looks like. Commented Aug 17, 2018 at 20:01
  • How are you reading the files? Are you manually specifying a list, or do the names have any information about the city? Commented Aug 17, 2018 at 20:18
  • @ALollz, I first pick out the Excel files in my directory into a list. Then I loop over the list of Excel files to append to an empty data frame. Commented Aug 17, 2018 at 20:21
  • Probably just easiest to read the file twice. The first time, read just the first few lines you need to determine the name: city_name = pd.read_excel('your_file', nrows=2).ColC[1]. Then you can read it again, skipping the first 8 rows, and assign that value to a column. Commented Aug 17, 2018 at 20:26

1 Answer

You can use nrows=1 to read a single value into a one-element DataFrame, and then select that value with DataFrame.iat:

f = 'file.xlsx'
city_name = pd.read_excel(f, "Sheet1", nrows=1, parse_cols="C", header=None, skiprows=1)    
print (city_name)
     0
0  XXX

data = pd.read_excel(f, "Sheet1", parse_cols="B:J", header=None, skiprows=8) 
data['City'] = city_name.iat[0,0]
print (data)
    0  1    2    3    4    5 City
0  11  A  111  106  102  101  XXX
1  12  B  121  144  126  121  XXX
2  13  C  100  107  100  101  XXX
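
Why the scalar from .iat matters can be sketched with small in-memory frames. The frames below are hypothetical stand-ins for the two reads, not the real sheets. Assigning a one-cell DataFrame to a column aligns it on the row index, which is consistent with the single filled row described in the question, whereas assigning a scalar broadcasts it to every row:

```python
import pandas as pd

# Hypothetical stand-ins: a 1x1 frame holding the city, and a 3-row data block.
city_name = pd.DataFrame([["XXX"]])
data = pd.DataFrame([[11, "A"], [12, "B"], [13, "C"]])

# Assigning the 1x1 frame aligns on the row index, so only the row whose
# index label matches (label 0) receives a value; the rest are missing.
aligned = data.copy()
aligned["City"] = city_name

# Assigning the scalar pulled out with .iat fills the whole column.
broadcast = data.copy()
broadcast["City"] = city_name.iat[0, 0]

print(aligned)
print(broadcast)
```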

In loop:

dfs = []
for f in files_xlsx:
    city_name = pd.read_excel(f, "Sheet1", nrows=1, parse_cols="C", header=None, skiprows=1)
    data = pd.read_excel(f, "Sheet1", parse_cols="B:J", header=None, skiprows=8)
    data['City'] = city_name.iat[0,0]
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)
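
Later pandas releases dropped the parse_cols keyword in favor of usecols, so the loop above may need adapting on a current install. The following is a minimal sketch of the same two-read approach under that assumption. The function name combine_city_files and its parameters are hypothetical, and the "B:J" range is simply taken from the code above, so adjust it to the columns your sheets actually contain:

```python
import pandas as pd


def combine_city_files(files_xlsx, sheet_name="Sheet1", data_cols="B:J"):
    """Stack the data block of each workbook, tagging rows with the city in C2.

    Hypothetical helper: the name, defaults, and layout assumptions
    (city in C2, data starting after the first 8 sheet rows) come from
    the question, not from any library.
    """
    dfs = []
    for f in files_xlsx:
        # First read: skip sheet row 1, take one row, keep only column C -> cell C2.
        city_cell = pd.read_excel(
            f, sheet_name=sheet_name, header=None,
            skiprows=1, nrows=1, usecols="C",
        )
        # Second read: skip the first 8 sheet rows and keep the data columns.
        data = pd.read_excel(
            f, sheet_name=sheet_name, header=None,
            skiprows=8, usecols=data_cols,
        )
        # .iat[0, 0] is positional, so it returns the single cell as a scalar
        # regardless of how the column happens to be labelled.
        data["City"] = city_cell.iat[0, 0]
        dfs.append(data)
    # One concatenation at the end instead of growing a frame inside the loop.
    return pd.concat(dfs, ignore_index=True)
```

It would be called as df = combine_city_files(files_xlsx), with files_xlsx being the list of workbook paths from the question.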

4 Comments

Thanks so much. This worked great! If you don't mind, can I ask why you opted for (a) dfs = [] rather than dfs = pd.DataFrame() and (b) dfs.append(data) rather than dfs = dfs.append(data)? Lastly, I know it works, but why is the concatenation in the last line needed? Clearly appending the data is not enough, but why not?
@StatsScared Hmm, good question. It may just be an alternative solution; I guess it should be better for performance.
Can you expand on why the need to concatenate? My other two questions (a and b) seem to be more stylistic, or so I gather from your response.
@StatsScared - Sure. They are two different approaches. With dfs = [] and dfs.append(data), pure Python list.append adds each new DataFrame to a list, so the output is a list of DataFrames, which pd.concat then joins into one DataFrame. With dfs = pd.DataFrame() and dfs = dfs.append(data), DataFrame.append adds the values to the DataFrame on every iteration of the loop. That should be slower: list.append followed by a single pd.concat is faster than calling DataFrame.append in each iteration.
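
The list-then-concat pattern from this comment thread can be sketched without any Excel files. The frames and city names below are hypothetical stand-ins for the per-file data. Note that DataFrame.append was removed in pandas 2.0, so collecting frames in a list and calling pd.concat once is also the variant that still runs on current releases:

```python
import pandas as pd

# Hypothetical stand-ins for the data block read from each workbook.
frames_by_city = {
    "City A": pd.DataFrame({"Col2": [40, 104], "Col3": [34, 108]}),
    "City B": pd.DataFrame({"Col2": [23, 43], "Col3": [1, 21]}),
}

dfs = []  # a plain Python list; list.append does not copy any frame data
for city, data in frames_by_city.items():
    tagged = data.copy()
    tagged.insert(0, "Col1", city)  # scalar is broadcast to every row
    dfs.append(tagged)

# The list is still just a list of DataFrames, so one concat is needed
# to turn it into a single frame with a fresh 0..n-1 index.
df = pd.concat(dfs, ignore_index=True)
print(df)
```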
