
I have a dataframe in which one column holds the GPA of first-year students. I want to loop through this column and build a list of lists that groups all values falling within 0.4 units of each other. For example, if I have the values (0.4, 0.6, 0.8, 3, 3.4), then I want my list to be [[0.4, 0.6, 0.8], [3, 3.4]].

This is the code I have tried.

averages = [[] for w in range(len(df['GPA_year1'])//4)]

small = min(df['GPA_year1']) + 0.4

for i in range(len(averages)):
    for y in range(len(df['GPA_year1'])):
        if small - 0.4 <= df['GPA_year1'][y] <= (small + 0.4):
            averages[i].append(df['GPA_year1'][y])
    small = small + 0.4

However, when I run this code in Jupyter Notebook, it seems to run forever, which makes me think that there may be an infinite loop somewhere (?) but I'm not sure where the infinite loop might be.

Here is the dataframe

(image of the dataframe, not reproduced here)

  • Do you want to keep the sequence as it is, or do you also want to sort the numbers in GPA_year1? Commented Nov 30, 2019 at 18:45

2 Answers


From your expected result I see that:

  • The first bin contains elements in the range [0.4 - 0.8].
  • The next bin starts from 3.0.

So you:

  • Don't want one-side-open bins (the first bin is closed on both sides).
  • Want neither "empty bins" nor "adjacent ranges" (e.g. [0.4 - 0.8), then [0.8 - 1.2), and so on).

Rather, you want something like this:

  • Set the upper limit to the lowest element in the source list + 0.4.
  • Put the elements <= limit in the first "bin" (append this list to averages).
  • Drop these elements from the list.
  • Repeat the above procedure while the list is not empty.

I also assume that the result should be a plain Python list of lists.

To get this result, try the following code:

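averages = []
# sort_values() returns a new Series, so the original column is left untouched
src = df['GPA_year1'].sort_values()
while not src.empty:
    # Upper limit of the current bin: smallest remaining value + 0.4
    limit = src.min() + 0.4
    currBin = src[src <= limit]
    averages.append(currBin.to_list())
    # Remove the values just binned before the next iteration
    src.drop(currBin.index, inplace=True)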

This code should run quicker, because:

  • Due to sort_values() there is no need for the inner loop.
  • All values for the current bin are selected in a single instruction.
  • The "used" values are also dropped in a single instruction.

For the GPA_year1 column from your DataFrame, this code generates:

[[0.74], [1.95, 2.18, 2.34], [3.23, 3.23, 3.44, 3.49], [3.64, 3.78, 3.82]]
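The same binning rule can also be sketched without pandas, in a single pass over the sorted values. This is only an illustration under a few assumptions: the function name bin_from_minimum and its width parameter are made up here, and the sample values are the ones listed in the other answer rather than read from your DataFrame.

```python
def bin_from_minimum(values, width=0.4):
    """Group values so that each bin spans at most `width` above its smallest member."""
    bins = []
    for value in sorted(values):
        # Start a new bin when there is none yet, or when the value lies
        # more than `width` above the first (smallest) element of the current bin.
        if not bins or value > bins[-1][0] + width:
            bins.append([value])
        else:
            bins[-1].append(value)
    return bins


# Sample values copied from the other answer (an assumption about the real data).
gpa = [3.82, 3.64, 1.95, 3.44, 2.18, 3.49, 3.78, 3.23, 0.74, 3.23, 0.74, 3.23, 2.34]
print(bin_from_minimum(gpa))
# [[0.74, 0.74], [1.95, 2.18, 2.34], [3.23, 3.23, 3.23, 3.44, 3.49], [3.64, 3.78, 3.82]]
```

Because the list is sorted first, comparing each value with the first element of the current bin is enough. Values that sit exactly on a bin limit may land on either side because of floating-point rounding.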

One more remark concerning your code:

averages = [[] for w in range(len(df['GPA_year1'])//4)]

looks strange. It preallocates len(df['GPA_year1']) // 4 empty lists, but how do you know that the output will contain exactly that many lists? The number of bins depends on the values, not on the number of rows. Consider a case where:

  • One part of the values is "very bad" (all around some lower limit).
  • The other part of the values is "very good" (all around some upper limit).

Then the number of "bins" will be just 2, however many rows there are.
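To make this concrete, here is a small sketch with made-up numbers (the values list is hypothetical, not taken from your DataFrame). It compares the number of lists the preallocation would create with the number of bins the procedure described earlier actually produces.

```python
# Hypothetical data: six low and six high GPAs (12 values in total).
values = [0.5, 0.6, 0.7, 0.55, 0.65, 0.8, 3.6, 3.7, 3.8, 3.9, 3.75, 3.85]

# Number of empty lists the preallocation in the question would create.
preallocated = len(values) // 4

# Same procedure as described earlier: repeatedly take everything
# within 0.4 of the smallest remaining value.
remaining = sorted(values)
bins = []
while remaining:
    limit = remaining[0] + 0.4
    bins.append([v for v in remaining if v <= limit])
    remaining = [v for v in remaining if v > limit]

print(preallocated, len(bins))
# 3 2
```

Starting from an empty list and appending one bin at a time avoids having to guess the count in advance.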


This is my approach, assuming df is your DataFrame:

GPA_year1 = df['GPA_year1'].tolist()
# Sample values, hard-coded here so the example also runs without the DataFrame
GPA_year1 = [3.82, 3.64, 1.95, 3.44, 2.18, 3.49, 3.78, 3.23, 0.74, 3.23, 0.74, 3.23, 2.34]

Sort the list:

GPA_year1.sort()

Initialize the averages list with the first element:

averages = [[GPA_year1[0]]]

Loop through your list:

for x, y in zip(GPA_year1, GPA_year1[1:]):
    if y - x <= 0.4:
        averages[-1].append(y)
    else:
        averages.append([y])  # otherwise, start a new sublist
print(averages)
# [[0.74, 0.74], [1.95, 2.18, 2.34], [3.23, 3.23, 3.23, 3.44, 3.49, 3.64, 3.78, 3.82]]
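Note that this loop compares each value with its immediate neighbour, so a group can span more than 0.4 from its first to its last element. The sketch below wraps the same loop in a function (the name group_by_gap and the max_gap parameter are made up for illustration) and prints the span of each group; it also guards against an empty list, which the snippet above would not handle.

```python
def group_by_gap(values, max_gap=0.4):
    """Group sorted values, starting a new group when the gap to the previous value exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev <= max_gap:
            groups[-1].append(curr)   # close enough to its neighbour: same group
        else:
            groups.append([curr])     # gap too large: start a new group
    return groups


gpa = [3.82, 3.64, 1.95, 3.44, 2.18, 3.49, 3.78, 3.23, 0.74, 3.23, 0.74, 3.23, 2.34]
groups = group_by_gap(gpa)
for g in groups:
    # The span is the distance between the smallest and largest member of a group.
    print(g, "span:", round(g[-1] - g[0], 2))
```

For the sample values, the last group runs from 3.23 to 3.82, a span of about 0.59. Whether that is acceptable depends on what "within 0.4 units of each other" is meant to cover.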
