How to parse and access columns based on headers in file? - Python

Question

I believe this is a three-step process, so please bear with me. I'm reading shell output that has been saved to a file, and the output looks like this:

Current Output:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 123.345.789:1234        0.0.0.0:*               LISTEN      23044/test          
tcp        0      0 0.0.0.0:5915            0.0.0.0:*               LISTEN      99800/./serv    
tcp        0      0 0.0.0.0:1501            0.0.0.0:*                           -

I'm trying to access each column's data by its header value. This is something I was able to do in PowerShell, but I'm not sure how to achieve it in Python.

Expected Output:

Proto,Recv-Q,Send-Q,Local Address,Foreign Address,State,PID/Program name
tcp,0,0,123.345.789:1234,0.0.0.0:*,LISTEN,23044/test          
tcp,0,0,0.0.0.0:5915,0.0.0.0:*,LISTEN,99800/./serv    
tcp,0,0,0.0.0.0:1501,0.0.0.0:*,,-

proto = data["Proto"]
for p in proto:
    print(p)

Output: tcp tcp tcp

What I've tried:

Where do I begin... I've tried splitting, replacing, and translate. I also tried regex but couldn't quite figure it out :/

Proto,Recv-Q,Send-Q,Local,Address,,,,,,,,,,,Foreign Address,,,,,,,,,State,,,,,, PID/Program,name    
tcp,,,,,,,,0,,,,,,0 123.345.789:1234,,,,,,,,0.0.0.0:*,,,,,,,,,,,,,,,LISTEN,,,,,,23021/java,,,,,,,,  
tcp,,,,,,,,0,,,,,,0 0.0.0.0:5915,,,,,,,,,,,,0.0.0.0:*,,,,,,,,,,,,,,,LISTEN,,,,,,99859/./statserv    
tcp,,,,,,,,0,,,,,,0 0.0.0.0:1501,,,,,,,,,,,,0.0.0.0:*,,,,,,,,,,,,,,,LISTEN,,,,,,-

Since some of the headers contain a space, it is difficult to map the columns to them correctly.

Looking for the best way to approach this.

Thank you.

sitting_duck · Accepted Answer · 2022-06-20 18:00:49Z

2

Answer updated to handle missing State value

Skip the first row, indicate that there is no header, assign header names and then split on one or more spaces.

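import pandas as pd

# sim_txt is assumed to be the file (path or file-like object) holding the netstat output.
# row_fixer is defined further down and must exist before this call runs.
df = pd.read_csv(sim_txt, skiprows=1, header=None, sep=r'\s+',
                 names=['Proto','cv-Q','Send-Q','Local Address','Foreign Address','State','PID/Program name']
                ).apply(row_fixer, axis=1)
print(df)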

  Proto  cv-Q  Send-Q     Local Address Foreign Address   State  PID/Program name
0   tcp     0       0  123.345.789:1234       0.0.0.0:*  LISTEN        23044/test
1   tcp     0       0      0.0.0.0:5915       0.0.0.0:*  LISTEN      99800/./serv
2   tcp     0       0      0.0.0.0:5916       0.0.0.0:*     NaN      99801/./serv
3   tcp     0       0      0.0.0.0:1501       0.0.0.0:*  LISTEN                 -

df.to_csv('output.csv', index=False)

The above depends on the following function. It looks for a NaN in the last column of the row, which indicates that the State value is missing. When that happens, the last two values are swapped. (Note: this function detects NaNs by leveraging the fact that NaN != NaN.)

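import numpy as np

def row_fixer(x):
    # NaN is the only value that is not equal to itself
    if x.iat[-1] != x.iat[-1]:
        xc = x.copy()
        xc.iat[-1] = xc.iat[-2]   # move the PID/Program value into the last column
        xc.iat[-2] = np.nan       # mark State as missing (np.NaN was removed in NumPy 2.0)
        return xc
    return x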
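
The NaN != NaN property that row_fixer relies on can be checked without pandas. This is a minimal sketch using only the standard library; is_missing is a hypothetical helper name used for illustration and is not part of the answer's code.

```python
import math

def is_missing(value):
    """Return True only for NaN, the one value that is not equal to itself."""
    return value != value

nan = float("nan")
print(is_missing(nan))        # NaN compares unequal to itself
print(is_missing("LISTEN"))   # ordinary strings equal themselves
print(math.isnan(nan))        # explicit alternative, valid for floats only
```

When pandas is available, pd.isna is a more explicit way to express the same check.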

The example above is based on the following example data:

Proto  cv-Q  Send-Q     Local Address Foreign Address   State  PID/Program name
  tcp     0       0  123.345.789:1234       0.0.0.0:*  LISTEN        23044/test
  tcp     0       0      0.0.0.0:5915       0.0.0.0:*  LISTEN      99800/./serv
  tcp     0       0      0.0.0.0:5916       0.0.0.0:*              99801/./serv
  tcp     0       0      0.0.0.0:1501       0.0.0.0:*  LISTEN                 -

edited Jun 20, 2022 at 18:00

answered Jun 20, 2022 at 6:11

sitting_duck


2 Comments

Jona Over a year ago

Thanks for the answer. This works well; however, I forgot to mention that the "State" column sometimes has a missing value, meaning "LISTEN" is sometimes absent. This causes the data in the neighbouring columns to be shifted.

sitting_duck Over a year ago

@Jona Updated my answer to handle that situation

Tim Roberts · Accepted Answer · 2022-06-20 03:30:37Z

2

You are post-processing the output of the netstat command. netstat itself is just reformatting the information in /proc/net/tcp, which you can also read. As with the netstat output, you may need to make your own header line, but the data lines are all space separated. A simple line.split() should do it.

If you still want to use netstat, as I said, just throw away the header line and use split. You know what the columns are.

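# output: an iterable of data lines, with the header line already thrown away
for ln in output:
    fields = ln.split()   # no argument: a run of whitespace counts as one separator
    print(','.join(fields))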
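
A fuller sketch of the same split-based idea, using only the standard library. It assumes that State is the only field that can be empty and that no field contains whitespace; the names HEADERS, SAMPLE and parse_netstat are made up for this illustration, and the sample text is taken from the question.

```python
import csv
import sys

HEADERS = ["Proto", "Recv-Q", "Send-Q", "Local Address",
           "Foreign Address", "State", "PID/Program name"]
STATE_INDEX = HEADERS.index("State")

SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 123.345.789:1234        0.0.0.0:*               LISTEN      23044/test
tcp        0      0 0.0.0.0:5915            0.0.0.0:*               LISTEN      99800/./serv
tcp        0      0 0.0.0.0:1501            0.0.0.0:*                           -
"""

def parse_netstat(lines):
    """Return one dict per data line, keyed by the known header names."""
    rows = []
    for ln in list(lines)[1:]:              # throw away the header line
        fields = ln.split()                 # split on runs of whitespace
        if not fields:
            continue                        # skip blank lines
        if len(fields) == len(HEADERS) - 1:
            # Assumption: a row that is one field short is missing its State.
            fields.insert(STATE_INDEX, "")
        if len(fields) != len(HEADERS):
            raise ValueError(f"unexpected field count: {ln!r}")
        rows.append(dict(zip(HEADERS, fields)))
    return rows

rows = parse_netstat(SAMPLE.splitlines())

# Column access by header name, as asked for in the question.
data = {name: [row[name] for row in rows] for name in HEADERS}
for p in data["Proto"]:
    print(p)

# CSV output that keeps the multi-word header names intact.
writer = csv.DictWriter(sys.stdout, fieldnames=HEADERS)
writer.writeheader()
writer.writerows(rows)
```

Counting fields is a heuristic: if another column could also be empty, the field count alone would not reveal which one is missing.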

answered Jun 20, 2022 at 3:30

Tim Roberts


2 Comments

Jona Over a year ago

Hey Tim, appreciate the reply. Believe it or not, after trying to follow this I am still unable to match up the data with the headers. Even after ignoring the headers, I am still left with countless spaces in between. Is there any other way of attempting this?

Tim Roberts Over a year ago

Remember that ln.split(' ') and ln.split() do two VERY different things. My guess is you are doing the first, and that would produce the results you describe. Passing no parameters to split treats a SERIES of whitespace as a single unit.
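
A short sketch of the difference described in this comment; the sample line is a shortened version of one row from the question, and the variable names are illustrative.

```python
line = "tcp        0      0 0.0.0.0:5915"

by_single_space = line.split(' ')   # every individual space is a separator
by_whitespace = line.split()        # a run of whitespace is one separator

print(by_single_space)   # contains an empty string for each extra space
print(by_whitespace)     # only the actual fields
```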

Freeman · Accepted Answer · 2022-06-20 05:06:31Z

1

Use a regex to split on runs of two or more whitespace characters.

import re

# testset: an iterable of lines read from the file
for ln in testset:
    splitted = re.split(r'\s{2,}', ln.rstrip("\n"))
    print(splitted)
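
Splitting on two or more spaces only works when every column gap is at least two spaces wide. In the question's sample, Send-Q and Local Address are separated by a single space, so they would stay joined in one field. A possible alternative, sketched here under assumptions, is to slice each line at the offsets where the known header names start. ROW_FMT is an assumed format string that reproduces the column widths seen in the question so the sample lines are aligned; HEADERS and parse_by_offsets are hypothetical names.

```python
HEADERS = ["Proto", "Recv-Q", "Send-Q", "Local Address",
           "Foreign Address", "State", "PID/Program name"]

# Assumption: these widths mirror the layout of the question's sample output.
ROW_FMT = "{:<5} {:>6} {:>6} {:<23} {:<23} {:<11} {}"

sample = [
    ROW_FMT.format(*HEADERS),
    ROW_FMT.format("tcp", 0, 0, "123.345.789:1234", "0.0.0.0:*", "LISTEN", "23044/test"),
    ROW_FMT.format("tcp", 0, 0, "0.0.0.0:1501", "0.0.0.0:*", "", "-"),
]

def parse_by_offsets(lines, headers=HEADERS):
    """Slice each data line at the positions where the header names begin."""
    header_line, *data_lines = lines
    starts = [header_line.index(name) for name in headers]
    ends = starts[1:] + [None]              # the last column runs to end of line
    rows = []
    for line in data_lines:
        if line.strip():                    # ignore blank lines
            rows.append({name: line[start:end].strip()
                         for name, start, end in zip(headers, starts, ends)})
    return rows

rows = parse_by_offsets(sample)
print([row["Proto"] for row in rows])
print([row["State"] for row in rows])       # an empty State stays in its own column
```

This keeps an empty State in place without guessing from the field count, but it depends on the data being aligned with the header: a value wider than its column would push later fields out of their slices.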

answered Jun 20, 2022 at 5:06

Freeman


1 Comment

Jona Over a year ago

Thank you for taking the time and trying to help out! Appreciate it, Freeman.
