
I need to extract a substring from dynamic input. I already get the output I need, but only by hard-coding every item name, so the code is neither dynamic nor reliable. Is there another way to extract the item description ("B1003 = Engineering Business Card") and the quantity ("2")? Both are dynamic: an entirely different item could appear, such as "O1003 = Pencil" or "O1004 = Sticky Notes". Can this be written as a regex that makes the code more reliable?

The input is text extracted from an image with Tesseract OCR. I need to pull out the required information and pass it to another service.

var requisition = `Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 B1003 = Engineering Business Card Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
`;

//rule 1 - Gets all Items + Quantity
//rule 2 - Gets all Items
//rule 3 - Gets all Quantity
//resultArray - Contains Quantity + Item e.g. 2 B1003 Engineering Business Cards

var rule1 = /(B1002 = Accountant Business Card|B1003 = Engineering Business Card|B1001 = Sales and Marketing Business Card|O1001 = Black Ballpen Branded Panda Regular with Eraser|O1002 = Notebook|O1003 = Pencil|O1004 = Stick Notes) (.*) ([0-9]|[0-9][0-9]|[0-9][0-9][0-9])/
var rule2 = /(B1002 = Accountant Business Card|B1003 = Engineering Business Card|B1001 = Sales and Marketing Business Card|O1001 = Black Ballpen Branded Panda Regular with Eraser|O1002 = Notebook|O1003 = Pencil|O1004 = Stick Notes)/
var rule3 = /([0-9]|[0-9][0-9]|[0-9][0-9][0-9])/

var resultarray = []

var stringarray = requisition.split("\n")
stringarray.forEach(element => {
    var result = element.match(rule1)
    if (result!=null){
        var itemName = result[0].match(rule2)
        var quantity = result[0].match(rule3)
        resultarray.push (quantity[0]+ " " + itemName[0])
    }
});

console.log (resultarray.join(", "))

Note: To make things clearer, this is the image I extracted the text from. Legend: blue = static; unboxed = dynamic; yellow = text that needs to be extracted (also dynamic).

- In the extracted image, the first line is static and the second line is dynamic.

The expected result is "2 B1003 = Engineering Business Card" (followed by ", B1002 = Accountant Business Card" and so on if other matching items are present). Please check the comments in the requisition variable.

Again, I can already get the desired output. I just need to know how the code can be written differently, with RegEx, so that it is more dynamic and reliable. Please bear with me, as I don't know much about RegEx. Thanks!

  • FYI, ([0-9]|[0-9][0-9]|[0-9][0-9][0-9]) should probably be written as (\d{1,3}) or (\d+). The rest is too vague. Please provide exact requirements for the pattern you need to match. Commented Nov 7, 2019 at 9:27
  • Use new RegExp(variable). Commented Nov 7, 2019 at 9:29
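The two suggestions in these comments can be combined. The sketch below is only an illustration, assuming the item names are available as a list at run time; `itemNames`, `escapeRegExp` and `buildRule` are hypothetical names that do not appear in the question. It builds the alternation with `new RegExp` instead of typing it by hand and uses `\d{1,3}` for the quantity.

```javascript
// Hypothetical catalogue; in practice it might be loaded from a config file or database.
const itemNames = [
  "B1003 = Engineering Business Card",
  "O1003 = Pencil",
  "O1004 = Sticky Notes",
];

// Escape characters that have a special meaning inside a regular expression.
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Build /(name1|name2|...)\D+(\d{1,3})/ from the list of names.
const buildRule = (names) =>
  new RegExp("(" + names.map(escapeRegExp).join("|") + ")\\D+(\\d{1,3})");

const rule = buildRule(itemNames);
const line = "1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved";
const match = line.match(rule);
const extracted = match ? match[2] + " " + match[1] : null;
console.log(extracted); // 5 O1003 = Pencil
```

This still depends on knowing the item names in advance; it only moves them out of the pattern and into data.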

2 Answers


Short answer:

var requisition = `Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 B1003 = Engineering Business Card Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
`;

//rule 1 - Gets all Items + Quantity
//rule 2 - Gets all Items
//rule 3 - Gets all Quantity
//resultArray - Contains Quantity + Item e.g. 2 B1003 Engineering Business Cards

var rule1 = /(B1002 = Accountant Business Card|B1003 = Engineering Business Card|B1001 = Sales and Marketing Business Card|O1001 = Black Ballpen Branded Panda Regular with Eraser|O1002 = Notebook|O1003 = Pencil|O1004 = Stick Notes)[^\d]+(\d+) .*/

var resultarray = []

var stringarray = requisition.split("\n")
stringarray.forEach(element => {
    var result = element.match(rule1)
    if (result!=null){
        var itemName = result[1]
        var quantity = result[2]
        resultarray.push (quantity + " " + itemName)
    }
});

console.log (resultarray.join(", "))

Output:

2 B1003 = Engineering Business Card, 5 O1003 = Pencil

Long answer:

There are many things to fix:

  1. Use only rule 1 (after some modifications) to match everything (item name and quantity) using (\d+)
  2. Get rid of rule 2 and 3
  3. Use result[1] as item name and result[2] as quantity

Please note that all your fields are separated by spaces, and the fields themselves can contain spaces, so your data is not structured. It would be a lot more reliable if you had, for instance, a tab-delimited file. The rule I used to find the quantity is "ignore everything after the product name until there is a number". If some day you have a category that contains a number, this rule will pick up the wrong value, and there is nothing you can do about it without a structured file.
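To make that limitation concrete, here is a small sketch. The second input line is invented for illustration (a hypothetical category name containing a digit); it is not from the question.

```javascript
// Same idea as the rule above: item name, then non-digits, then the first number.
const rule = /(B1003 = Engineering Business Card)\D+(\d+)/;

// Normal line: the first number after the item name is the quantity.
const ok = "1 B1003 = Engineering Business Card Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved";
// Hypothetical line whose category ("A4 Business Cards") contains a digit.
const tricky = "1 B1003 = Engineering Business Card A4 Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved";

// Returns the captured "quantity" or null when the line does not match.
const quantityOf = (line) => {
  const match = line.match(rule);
  return match ? match[2] : null;
};

console.log(quantityOf(ok));     // 2
console.log(quantityOf(tricky)); // 4 (taken from "A4", not the real quantity)
```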


7 Comments

The "Business Cards" after "Engineering Business Card" is its category. Please check the attached image; I edited it.
OK, got it. I modified my answer and it now matches both items: the output of the script is now "2 B1003 = Engineering Business Card, 5 O1003 = Pencil".
Yes, but you need a structured input file, for instance with tabs or semicolons between the fields instead of spaces. If you had "1;B1003 = Engineering Business Card;Business Cards;2;Ea;50.00USD;100;Pending Approval;Not Reserved;" then it would be easy: the item is field number 2 and the quantity is field number 4, which is the regex ^[^;]*;([^;]*);[^;]*;([^;]*);.* and you would not need the product names anymore.
But with the current format, if one line is "1 B1234 = New Product Cool New Furniture 2 Ea 50.00USD 100 Pending Approval Not Reserved", is it item "New Product" in category "Cool New Furniture", or item "New Product Cool" in category "New Furniture"? There is no way you can know, there is no way I can know, so there is no way a computer can know :)
That's right. With the current input file (spaces between the fields and spaces inside the fields), it's not possible to do it.
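As a minimal sketch of the delimited format suggested in these comments, assuming the OCR output could somehow be produced with semicolons between the fields (which the question does not confirm), the regex from the comment and a plain split both recover the two fields:

```javascript
// Sample line in the semicolon-delimited format proposed in the comment.
const line = "1;B1003 = Engineering Business Card;Business Cards;2;Ea;50.00USD;100;Pending Approval;Not Reserved;";

// Regex from the comment: capture field 2 (item) and field 4 (quantity).
const rule = /^[^;]*;([^;]*);[^;]*;([^;]*);.*/;
const [, item, quantity] = line.match(rule);
console.log(quantity + " " + item); // 2 B1003 = Engineering Business Card

// Equivalent without a regex: split on the delimiter and index the fields.
const fields = line.split(";");
console.log(fields[3] + " " + fields[1]); // 2 B1003 = Engineering Business Card
```

With a real delimiter, no list of product names is needed, and a category containing digits no longer matters.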

You can put it all into a single regular expression, and capture the quantity in one group, and the itemName in the other. Then extract those groups from the match (if there's a match):

var requisition = `Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 B1003 = Engineering Business Card Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status //this line is static
1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved //this line is dynamic
Requester Jay Doe Supplier ABC Corp //this line is static
`;

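// \D+ skips the non-digit category text, so the first number after the item name (the quantity) is captured.
// A greedy .* here would backtrack from the end of the line and capture a later digit instead.
var rule = /(B1002 = Accountant Business Card|B1003 = Engineering Business Card|B1001 = Sales and Marketing Business Card|O1001 = Black Ballpen Branded Panda Regular with Eraser|O1002 = Notebook|O1003 = Pencil|O1004 = Stick Notes)\D+(\d{1,3})/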

var resultarray = []

var stringarray = requisition.split("\n")
stringarray.forEach(element => {
  const match = element.match(rule);
  if (match) {
    const [, itemName, quantity] = match;
    resultarray.push(quantity + ' ' + itemName);
  }
});

console.log(resultarray)

For a more minimal example:

const input = `Lines
foo 1
bar 2
baz don't match`;
const pattern = /(foo|bar) (\d+)/;
const output = [];
input
  .split('\n')
  .forEach((line) => {  
    const match = line.match(pattern);
    if (match) {
      const [, itemName, quantity] = match;
      output.push(quantity + ' ' + itemName);
    }
  });
console.log(output);

9 Comments

Hi. Thanks for taking the time to analyze my code, but I'm a bit confused: when I run your snippet, the output is [ "0 B1003 = Engineering Business Card" ], when it should take the quantity, which is "2", the number before "Ea".
You edited your question after I posted the answer, so my answer's snippet didn't include your edited input. After putting your new input in, it looks to run as desired.
I have one question to clarify: is it possible to get my desired output without using the specific words, e.g. "Engineering Business Card"? Is it possible to use pure RegEx combinations, or is this the only way to do it?
Not unless you can identify a pattern to it. E.g. from "Engineering Business Card Business Cards", how would you be sure that you want "Engineering Business Card" and not "Engineering Business Card Business" or "Engineering Business"?
Yeah, I've edited my post once again. Can you check the image once more? This is the only pattern I have: the first line is static, as is the word "Ea", and the rest is dynamic.
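One possible direction, given that "Ea" is static, is sketched below. It rests on assumptions that the thread does not confirm: that every item code is one capital letter followed by four digits, that the quantity is the number directly before "Ea", and that the category names are known in advance (the `categories` list is hypothetical). Without such a list, the description and category still cannot be separated, as the comments above explain.

```javascript
// ASSUMPTION: hypothetical list of category names known in advance.
const categories = ["Business Cards", "Office Supplies"];

// ASSUMPTIONS: item codes look like one capital letter plus four digits ("B1003"),
// and the quantity is the 1-3 digit number right before the unit "Ea".
const rule = /([A-Z]\d{4} = .+?) (\d{1,3}) Ea\b/;

const parseLine = (line) => {
  const match = line.match(rule);
  if (!match) return null;
  const [, descriptionAndCategory, quantity] = match;
  // Strip a known category from the end to recover the item description.
  const category = categories.find((name) => descriptionAndCategory.endsWith(" " + name));
  const item = category
    ? descriptionAndCategory.slice(0, -(category.length + 1))
    : descriptionAndCategory; // unknown category: cannot be split reliably
  return quantity + " " + item;
};

const requisition = `Lines
Line Item Description Category Name Quantity UOM Price Amount (USD) Status Funds Status
1 B1003 = Engineering Business Card Business Cards 2 Ea 50.00USD 100 Pending Approval Not Reserved
Requester Jay Doe Supplier ABC Corp
1 O1003 = Pencil Office Supplies 5 Ea 50.00USD 100 Pending Approval Not Reserved`;

const results = requisition.split("\n").map(parseLine).filter(Boolean);
console.log(results.join(", ")); // 2 B1003 = Engineering Business Card, 5 O1003 = Pencil
```

The trade-off is that the hard-coded list moves from item names to category names, which are presumably fewer and change less often.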