I'm trying to parse a .DAT file that contains transaction data. Below is a sample line from it:

CABCDE123456000000000000000ABCD12345678XY PAYMENTS ABCD ELECTRONIC 12345678 AUTH CANCELLED 2025050800000000000000180000812345678 20250508ABCXXXXXXXXX 202505091234567ABCDEF BBB ABC

I want to parse this data as a CSV and store it in a file every minute. However, I keep running into a roadblock with the following error:

Skipping line 0: missing AUTH CANCELLED → timestamp+XXXXXXXX segment
Skipping line 1: missing AUTH CANCELLED → timestamp+XXXXXXXX segment
Skipping line 2: missing AUTH CANCELLED → timestamp+XXXXXXXX segment

In my command, I have a method called parseFileContent, which is meant to get the data between the text AUTH CANCELLED and 2025050800000000000000180000812345678, but it keeps giving me the error above even though the data is present.

Here is a snippet of my code. For context, I'm interested in getting the following info: mobile number, amount, transaction ID and transaction date.

The mobile number is in the text 2025050800000000000000180000812345678: it is the last 10 digits, 0812345678. The transaction date is the first 8 digits, 20250508, and the amount is 18000, which should be written as 180.00. The transaction ID is in the text 20250508ABCXXXXXXXXX, where the relevant part is ABCXXXXXXXXX. What am I missing?

Your assistance will be highly appreciated. Thanks!

private function parseFileContent(string $content, string $fileName): Collection
{
    $results = collect();
    $lines = explode("\n", $content);


    foreach ($lines as $lineNumber => $line) {
        $line = trim($line);


        // Match everything between "AUTH CANCELLED" and the next 14-digit timestamp + XXXXXXXX
        if (preg_match('/AUTH\s+CANCELLED\s+(.*?)\s+\d{14}XXXXXXXX/i', $line, $segmentMatch)) {
            $segment = trim($segmentMatch[1]);


            // Match: 8-digit date, 20-digit amount, 10-digit mobile number (may have varying spaces in between)
            if (preg_match('/(\d{8})\s*(\d{20})\s*(\d{10})/', $segment, $matches)) {
                $dateRaw = $matches[1];
                $amountRaw = $matches[2];
                $mobileRaw = $matches[3];


                $cleanAmount = ltrim($amountRaw, '0');
                $amount = number_format(((int)($cleanAmount ?: '0')) / 100, 2, '.', '');


                // Try to extract transaction ID that follows the date
                $transactionId = '';
                if (preg_match('/' . preg_quote($dateRaw, '/') . '\s*([A-Z0-9]{5,})/', $segment, $tm)) {
                    $transactionId = trim($tm[1]);
                }


                $results->push([
                    'file' => $fileName,
                    'line' => $lineNumber + 1,
                    'date' => $this->parseDate($dateRaw),
                    'amount' => $amount,
                    'mobile_number' => $mobileRaw,
                    'transaction_id' => $transactionId,
                    'raw_line' => $line
                ]);
            } else {
                $this->warn("No transaction match in scoped segment on line {$lineNumber}: {$segment}");
            }
        } else {
            $this->warn("Skipping line {$lineNumber}: missing AUTH CANCELLED → timestamp+XXXXXXXX segment");
        }
    }


    return $results;
}
  • Your first regular expression does not match your example. It seems that it's not \d{14}XXXXXXXX, but \d{16}XXXXXXXX. Haven't you got more precise information about the syntax of the data format? Is it something standard or related to a custom application log or data format? We need to know all the possible cases you could potentially have in your *.dat files. Commented Jul 28 at 13:54
  • I personally think that it's not worth re-inventing the wheel with some custom regular expressions, creating some bugs, if someone already wrote a PHP library to read these *.dat files. If you can't find any parser for this specific file format, then ok, code it yourself. How big are these files? I presume you read a file completely and put it in memory, passed via the $content parameter. If the file is 500 MB, you might get into trouble. Seeking line by line with fgets() might be less memory consuming. Commented Jul 28 at 14:04
  • Are the XXXXXXXX really in the *.dat file? or is just data impersonalization for the question? If yes, please replace it by 12345678 or something similar, so that we can understand and help. I ask this because 08XXXXXXXX isn't a valid mobile number. Commented Jul 28 at 14:11
  • @PatrickJanser yes, I shielded the data with XXX because it is sensitive, but the data is present in the .DAT file. Secondly, the data in the .DAT file is written in text form with lots of white space, so the current regex is missing most of the data because of the huge amount of white space. I've tried looking for a parser but still no success. Do you know of any? Commented Jul 29 at 12:00
  • In this case, can you please edit your question and replace XXXXXXXX by 12345678 and add a few more examples? Actually, you only have a single line of content, so this won't really help us find out the possible variations of your file format. And no, I don't know any parser as we don't have any specification about the format. *.dat is kind of any binary content, for a huge quantity of applications. If you know from where they come, you probably could find some specs and give us more precisions. Commented Jul 29 at 13:28
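
The fgets() suggestion from the comments above can be sketched roughly as follows. This is only an illustration: the helper name readLines is hypothetical, and the temporary file stands in for the real .DAT file.

```php
<?php
/**
 * Hypothetical helper: yield the lines of a file one at a time,
 * so the whole file never has to sit in memory.
 */
function readLines(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }
    try {
        $lineNumber = 0;
        while (($line = fgets($handle)) !== false) {
            // Key is the 1-based line number, value is the line without its line ending.
            yield ++$lineNumber => rtrim($line, "\r\n");
        }
    } finally {
        fclose($handle);
    }
}

// Demo on a small temporary file standing in for the real .DAT file.
$path = tempnam(sys_get_temp_dir(), 'dat');
file_put_contents($path, "first line\nsecond line\n");

foreach (readLines($path) as $lineNumber => $line) {
    echo $lineNumber, ': ', $line, PHP_EOL;
}
unlink($path);
```

Each yielded line could then be passed to a single preg_match() call instead of exploding the whole content first.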

1 Answer

You don't necessarily need several regular expressions to get what you want out of the lines of your *.dat content.

I used named groups instead of indexed groups. Written on a single line, the pattern is rather long and hard to read:

\bAUTH\s+CANCELLED\s+(?<date>\d{8})(?<amount>\d{20})(?<phone>\d{10})\s+(?<transaction_date>\d{8})(?<transaction_id>\w{10,20})

Using named capturing groups is a bit more convenient than indexes, because if you change the pattern later, it won't shift your indexes and force you to update the code reading the matched groups.
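
As a small illustration of that point, here is a sketch with two made-up patterns (the shortened input and the status group are assumptions for the example, not part of the question's data). Adding a group in front of an existing one shifts the numeric index but not the name:

```php
<?php
// Hypothetical sample input, shortened for the illustration.
$input = 'AUTH CANCELLED 20250508';

// First version of the pattern: the date is group 1.
preg_match('/AUTH\s+CANCELLED\s+(?<date>\d{8})/', $input, $before);

// Later version: a new group is added in front, so the date becomes group 2.
preg_match('/(?<status>AUTH\s+CANCELLED)\s+(?<date>\d{8})/', $input, $after);

// Code that reads the named group keeps working with both versions.
echo $before['date'], ' ', $after['date'], PHP_EOL;

// Code that reads index 1 now gets a different value.
echo $before[1], ' vs ', $after[1], PHP_EOL;
```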

In the PCRE regex flavor, you can use the x flag to spread your regex over several lines, with optional whitespace and comments, for clarity:

# Match a cancelled payment (\b will match a word boundary):
\b AUTH \s+ CANCELLED \s+

# Capture the date, amount and phone out of "20250508000000000000000180000812345678":
(?<date>\d{8}) (?<amount>\d{20}) (?<phone>\d{10})

# Some spaces before the transaction information:
\s+

# Capture the transaction date and id (between 10 and 20 word chars):
(?<transaction_date>\d{8}) (?<transaction_id>\w{10,20})

See it in action here: https://regex101.com/r/axhKJp/2

PHP code

<?php
// Nowdoc (quoted identifier): backslashes and $ are never interpreted by PHP.
const CANCELLED_PAYMENT_PATTERN = <<<'END_OF_REGEX'
/
# Match a cancelled payment only, where \b will match a word boundary:
\b AUTH \s+ CANCELLED \s+
# Capture the date, amount and phone out of "20250508000000000000000180000812345678":
(?<date>\d{8}) (?<amount>\d{20}) (?<phone>\d{10})
\s+
# Capture the transaction date and id (between 10 and 20 word chars):
(?<transaction_date>\d{8}) (?<transaction_id>\w{10,20})
/x
END_OF_REGEX;

const DATA = <<<END_OF_DATA
CABCDE123456000000000000000ABCD12345678XY PAYMENTS ABCD ELECTRONIC 12345678 AUTH CANCELLED 20250508000000000000000180000812345678 20250508ABCXXXXXXXXX 202505091234567ABCDEF BBB ABC
END_OF_DATA;

if (preg_match(CANCELLED_PAYMENT_PATTERN, DATA, $match)) {
    print 'Found a cancelled payment. $match = ';
    var_export($match);
}
else {
    die('No cancelled payment found');
}

PHP execution result

$match contains both the indexed and named groups:

Found a cancelled payment. $match = array (
  0 => 'AUTH CANCELLED 20250508000000000000000180000812345678 20250508ABCXXXXXXXXX',
  'date' => '20250508',
  1 => '20250508',
  'amount' => '00000000000000018000',
  2 => '00000000000000018000',
  'phone' => '0812345678',
  3 => '0812345678',
  'transaction_date' => '20250508',
  4 => '20250508',
  'transaction_id' => 'ABCXXXXXXXXX',
  5 => 'ABCXXXXXXXXX',
)

You can also run it here: https://onlinephp.io/c/f5446
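
To get from these captured groups to the values asked for in the question, a minimal sketch could look like the following. It assumes the amount is expressed in cents (as in the question, where 18000 means 180.00) and that the date is always a valid YYYYMMDD value. The function name toCsvRow and the column order are hypothetical choices for the example.

```php
<?php
/**
 * Hypothetical helper: turn the named groups into a CSV-ready row.
 * Assumes the amount is in cents and the date is a valid YYYYMMDD value.
 */
function toCsvRow(array $match): array
{
    // "00000000000000018000" -> 18000 cents -> "180.00"
    $amount = number_format(((int) $match['amount']) / 100, 2, '.', '');

    // "20250508" -> "2025-05-08"
    $date = DateTimeImmutable::createFromFormat('!Ymd', $match['transaction_date']);

    return [
        $match['phone'],           // kept as a string to preserve the leading zero
        $amount,
        $match['transaction_id'],
        $date->format('Y-m-d'),
    ];
}

// Values copied from the execution result above.
$row = toCsvRow([
    'amount' => '00000000000000018000',
    'phone' => '0812345678',
    'transaction_date' => '20250508',
    'transaction_id' => 'ABCXXXXXXXXX',
]);

// Write the row as CSV to standard output, with explicit delimiter,
// enclosure and (disabled) escape character; a real job would open its own file.
$out = fopen('php://output', 'w');
fputcsv($out, $row, ',', '"', '');
fclose($out);
```

The phone number is deliberately never cast to an integer, since that would drop its leading zero.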

Remarks

As you can see, I removed the case-insensitive flag, because I think AUTH CANCELLED will probably always be in uppercase. If that's not the case, re-enable the i flag.

About the date: you could also capture the year, month and day directly, so that you don't need to call another parsing function, which is probably doing the same work again with a second regex or some string splitting.
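
A minimal sketch of that idea, using only the date part of the pattern. The group names year, month and day are assumptions for the example, and checkdate() is added as a sanity check because \d{2} alone would also accept values such as month 13:

```php
<?php
// Split the 8-digit date into three named groups instead of one.
const DATE_PATTERN = '/\b AUTH \s+ CANCELLED \s+ (?<year>\d{4}) (?<month>\d{2}) (?<day>\d{2})/x';

$line = 'AUTH CANCELLED 20250508000000000000000180000812345678';

if (preg_match(DATE_PATTERN, $line, $m)
    && checkdate((int) $m['month'], (int) $m['day'], (int) $m['year'])) {
    // Build an ISO date straight from the groups, no second parsing step.
    $isoDate = "{$m['year']}-{$m['month']}-{$m['day']}";
    echo $isoDate, PHP_EOL;
}
```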

1 Comment

Hello @PatrickJansen, yes, I managed to get it up and running, guided by your advice. Thank you for your assistance. I'm sorry I did not provide feedback on time.