I'm trying to write a program to scan videos, find what languages the audio and subtitles are available in, and then use those findings for input.

Currently, I'm generating the output with this:

import subprocess
with open('output.txt', 'wt') as output_f:
    p = subprocess.Popen(command, stdout=output_f, stderr=output_f)
    p.wait()  # wait for the scan to finish so output.txt is complete

Here's the bit of text from my scan that I need.

  + audio tracks:
    + 1, Japanese (aac) (2.0 ch) (iso639-2: jpn)
  + subtitle tracks:
    + 1, English (iso639-2: eng) (Text)(SSA)

So I need to find the number in front of Japanese, but only when it appears under "audio tracks".

Similarly, I need to find the number in front of English, but only when it appears under "subtitle tracks".

I'm pretty sure I need to use regular expressions to do this, but I'm lost on where to begin.

  • Why the subprocess call? Commented Apr 24, 2013 at 6:41
  • You need to do this in 2 steps: pick out the part of the text that shows the audio/subtitle tracks with a regex, then do a second pass on that smaller part of the text to extract the information. Commented Apr 24, 2013 at 6:43
  • Japanese and English are just examples, right? You actually want to find the number in front of the language but after "audio tracks:" and "subtitle tracks:". This shouldn't be a problem; you simply have to do a lookbehind for "audio tracks" or "subtitle tracks", or use some groups. Commented Apr 24, 2013 at 6:51
  • Subprocess is called because of the way I'm executing the command. No, I need the Japanese language for audio (or Undefined, as is sometimes the case) and I need the English subtitles. The problem stems from some videos having dual audio and multiple subtitles. Commented Apr 24, 2013 at 7:06

3 Answers

1

You don't really need a regex here; using one seems overly complicated to me too.

Here's some regular parsing:

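# Open for reading; 'wt' would raise io.UnsupportedOperation: not readable
with open('output.txt', 'r') as output_f:
    parseTracks = False
    theNumber = None
    for line in output_f:
        if 'audio tracks' in line:
            parseTracks = True    # entering the audio section
        elif 'subtitle tracks' in line:
            parseTracks = False   # leaving the audio section
        elif parseTracks and 'Japanese' in line:
            # The line looks like "+ 1, Japanese (aac) ...": keep only the
            # leading track number, not every digit on the line
            theNumber = int(line.split(',')[0].strip(' +\t'))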

Same thing with the subtitles.
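
To make "same thing with the subtitles" concrete, here is a minimal sketch of one helper that handles either section. It is not part of the original answer: `find_track_number` is a hypothetical name, and the code assumes that every heading line ends with a colon and that every track line starts with "+ N, " as in the sample scan.

```python
def find_track_number(lines, section, language):
    """Return the number of the first `language` track under `section`, or None."""
    in_section = False
    for line in lines:
        stripped = line.strip()
        if stripped.endswith(':'):
            # Assumption: only heading lines end with a colon.
            in_section = section in stripped
        elif in_section and language in stripped:
            # Track lines look like "+ 1, Japanese (aac) ...";
            # keep the part before the first comma and drop the "+ ".
            return int(stripped.split(',')[0].lstrip('+ '))
    return None


sample = """\
  + audio tracks:
    + 1, Japanese (aac) (2.0 ch) (iso639-2: jpn)
  + subtitle tracks:
    + 1, English (iso639-2: eng) (Text)(SSA)
""".splitlines()

audio_number = find_track_number(sample, 'audio tracks', 'Japanese')
subtitle_number = find_track_number(sample, 'subtitle tracks', 'English')
print(audio_number, subtitle_number)
```

Returning None when nothing matches lets the caller decide what to do with a video that lacks the wanted track.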

2 Comments

Replace char in '123456789' with char.isdigit(). Also, you will pick up too many digits, so it's still wrong.
When I run this code, I get the following error at lines = tuple(output_f): io.UnsupportedOperation: not readable
0

You could do something like this:

>>> import re
>>> audio_regex = re.compile(r'\+ audio tracks:\n\s*\+ (?P<number>\d+), (?P<lang>\w+)')
>>> subtitle_regex = re.compile(r'\+ subtitle tracks:\n\s*\+ (?P<number>\d+), (?P<lang>\w+)')
>>> text = '''
...   + audio tracks:
...     + 1, Japanese (aac) (2.0 ch) (iso639-2: jpn)
...   + subtitle tracks:
...     + 1, English (iso639-2: eng) (Text)(SSA)
... '''
>>> match = audio_regex.search(text)  #find the first match
>>> match.group('number')
'1'
>>> match.group('lang')
'Japanese'
>>> audio_regex.findall(text)   #find all matches
[('1', 'Japanese')]
>>> subtitle_regex.findall(text)
[('1', 'English')]

Tweak the regexes above to be more or less flexible depending on the format of your file (e.g. if there could be several spaces where the pattern has a single space, replace that space with \s+ to match one or more whitespace characters).
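
Each of these patterns matches a heading plus only the first track line under it, so a file with dual audio would need something more. One possible approach, sketched here as an illustration rather than a tested solution, is to capture each section first and then scan its track lines. The names `section_regex`, `track_regex` and `list_tracks` are hypothetical, and the second audio track in the sample is invented to show the multi-track case.

```python
import re

# Step 1: a heading followed by all of the "+ N, ..." lines beneath it.
section_regex = re.compile(
    r'\+ (?P<kind>audio|subtitle) tracks:\n'
    r'(?P<body>(?:[ \t]*\+ \d+, .*\n?)*)'
)
# Step 2: the number and language of a single track line.
track_regex = re.compile(r'\+ (?P<number>\d+), (?P<lang>\w+)')


def list_tracks(text):
    """Map 'audio' / 'subtitle' to a list of (number, language) tuples."""
    tracks = {}
    for section in section_regex.finditer(text):
        found = track_regex.findall(section.group('body'))
        tracks[section.group('kind')] = [(int(n), lang) for n, lang in found]
    return tracks


# Made-up sample: the second audio line is not from the original scan.
text = '''
  + audio tracks:
    + 1, Japanese (aac) (2.0 ch) (iso639-2: jpn)
    + 2, English (aac) (2.0 ch) (iso639-2: eng)
  + subtitle tracks:
    + 1, English (iso639-2: eng) (Text)(SSA)
'''
tracks = list_tracks(text)
print(tracks)
```

With the result in a dictionary, picking the Japanese audio track becomes a matter of searching `tracks['audio']` for the wanted language.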

0

This will work (use with .findall()):

(?<=subtitle tracks:\n)\s+\+\s(\d+)
(?<=audio tracks:\n)\s+\+\s(\d+)

Check for a certain prefix (including the newline), then consume the whitespace and capture the number after a '+'.
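
As a quick check, the two patterns can be run against the sample scan from the question. This is only a sketch; the variable names are illustrative.

```python
import re

# Lookbehind patterns from above: the heading must immediately precede the match.
subtitle_regex = re.compile(r'(?<=subtitle tracks:\n)\s+\+\s(\d+)')
audio_regex = re.compile(r'(?<=audio tracks:\n)\s+\+\s(\d+)')

text = '''
  + audio tracks:
    + 1, Japanese (aac) (2.0 ch) (iso639-2: jpn)
  + subtitle tracks:
    + 1, English (iso639-2: eng) (Text)(SSA)
'''

# With a single capture group, findall() returns a list of the captured strings.
audio_numbers = audio_regex.findall(text)
subtitle_numbers = subtitle_regex.findall(text)
print(audio_numbers, subtitle_numbers)
```

Both lists come back as `['1']`. The numbers are strings, only the track directly after each heading is matched, and the language is not captured, so the result still has to be paired with the language by other means.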
