To push unique elements read from file using regex into array-Perl

Question

Here is my file:

  heaven
  heavenly
  heavenns
  abc
  heavenns
  heavennly

According to my code, only heavenns and heavennly should be pushed into @myarr, and each should appear in the array only once. How do I do that?

my $regx = "heavenn\+";
my $tmp=$regx;

$tmp=~ s/[\\]//g;

$regx=$tmp;
print("\nNow regex:", $regx);

my $file  = "myfilename.txt";

my @myarr;
open my $fh, "<", $file;  
while ( my $line = <$fh> ) {
    if ($line =~ /$regx/) {
        print $line;
        push(@myarr, $line);
    }
}

print ("\nMylist:", @myarr); #printing 2 times heavenns and heavennly

Why the rigmarole with the backslash before the plus, and then carefully removing it?
– Jonathan Leffler, Commented Jul 22, 2013 at 4:01
Note: "heavenn\+" and "heavenn+" produce the same string.
– ikegami, Commented Jul 22, 2013 at 4:07
Note: heavenn and heavenn+ are the same if you don't capture what is matched.
– ikegami, Commented Jul 22, 2013 at 4:07
I don't buy that the use of csh has anything to do with the behaviour of your Perl script (unless you're doing something insane like creating the script as a string on the command line; then anything is possible given the metasyntactic zoo of operators in the csh). Of course, there are those (including me) who'd argue that you shouldn't be using csh in the first place, but that's your funeral, not mine.
– Jonathan Leffler, Commented Jul 22, 2013 at 4:12

Jonathan Leffler · Accepted Answer · 2013-07-22 05:47:36Z

1

This is Perl, so There's More Than One Way To Do It (TMTOWTDI). Here's one of them:

#!/usr/bin/env perl
use strict;
use warnings;

my $regex = "heavenn+";
my $rx = qr/$regex/;
print "Regex: $regex\n";

my $file  = "myfilename.txt";
my %list;
my @myarr;
open my $fh, "<", $file or die "Failed to open $file: $!";

while ( my $line = <$fh> )
{
    if ($line =~ $rx)
    {
        print $line;
        $list{$line}++;
    }
}

push @myarr, sort keys %list;

print "Mylist: @myarr\n";

Sample output:

Regex: heavenn+
heavenns
heavenns
heavennly
Mylist: heavennly
 heavenns

The sort isn't necessary (but it presents the data in a sane order). You could add items to the array when the count in $list{$line} is 0. You could chomp the input lines to remove the newline. Etc.
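
As a minimal sketch of those last two suggestions (chomp the input, and push a line only when its count is still 0), assuming the lines are passed in as a list rather than read from a file. The helper name unique_matches is hypothetical and not part of the code above:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: returns the lines matching $rx, each one once,
# in the order in which they were first seen.
sub unique_matches {
    my ($rx, @lines) = @_;
    my (%count, @unique);
    for my $line (@lines) {
        chomp $line;                 # drop the trailing newline, if any
        next unless $line =~ $rx;    # skip lines that do not match
        # The post-increment returns the old count, so this is true
        # only the first time a given line is seen.
        push @unique, $line if $count{$line}++ == 0;
    }
    return @unique;
}

my @input = ("heaven\n", "heavenly\n", "heavenns\n", "abc\n", "heavenns\n", "heavennly\n");
my @myarr = unique_matches(qr/heavenn+/, @input);
print "Mylist: @myarr\n";    # prints "Mylist: heavenns heavennly"
```

Unlike sort keys %list, this keeps the matches in file order, which may or may not be what you want.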

What if I want to push only particular words. For example, if my file is, 1. "heavenns hello" 2. "heavenns hi", "3.heavennly good". What to do to print only 'heavenns' and 'heavennly'?

Then you have to arrange to capture the word only. That means refining the regex. Assuming you want heavenn at the start of the word and don't mind what alphabetic characters come after that, then:

#!/usr/bin/env perl
use strict;
use warnings;

my $regex = '\b(heavenn[A-Za-z]*)\b';  # Single quotes necessary!
my $rx = qr/$regex/;
print "Regex: $regex\n";

my $file  = "myfilename.txt";
my %list;
my @myarr;
open my $fh, "<", $file or die "Failed to open $file: $!";

while ( my $line = <$fh> )
{
    if ($line =~ $rx)
    {
        print $line;
        $list{$1}++;
    }
}

push @myarr, sort keys %list;

print "Mylist: @myarr\n";

Data file:

1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
heaven
heavenly
heavenns
abc
heavenns
heavennly

Output:

Regex: \b(heavenn[A-Za-z]*)\b
1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
heavenns
heavenns
heavennly
Mylist: heavennly heavenns

Note that the names in the list no longer include newlines.

After a chat

This version takes a regex from the command line. The script invocation is:

perl script.pl -p 'regex' [file ...]

It will read from standard input if no file is specified on the command line (better than having a fixed input file name — by a large margin). It looks for multiple occurrences of the specified regex on each line, where the regex can be preceded by or followed by (or both) 'word characters' as specified by \w.

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Std;

my %opts;
getopts('p:', \%opts) or die "Usage: $0 [-p 'regex']\n";

my $regex_base = 'heavenn';
#$regex_base = $ARGV[0] if defined $ARGV[0];
$regex_base = $opts{p} if defined $opts{p};

my $regex = '\b(\w*' . ${regex_base} . '\w*)\b';
my $rx = qr/$regex/;
print "Regex: $regex (compiled form: $rx)\n";

my %list;
my @myarr;

while (my $line = <>)
{
    while ($line =~ m/$rx/g)
    {
        print $line;
        $list{$1}++;
        #$line =~ s///;
    }
}

push @myarr, sort keys %list;

print "Matched words: @myarr\n";

Given the input file:

1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
An unheavenly host.  Good heavens! It heaves to like a yacht!
heaven
Is it heavens
heavenly
heavenns
abc
heavenns
heavennly

You can get outputs such as:

$ perl script.pl -p 'e\w*?ly' myfilename.txt
Regex: \b(\w*e\w*?ly\w*)\b (compiled form: (?^:\b(\w*e\w*?ly\w*)\b))
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
An unheavenly host.  Good heavens! It heaves to like a yacht!
heavenly
heavennly
Matched words: equally heavenly heavennly heavennnly heavennnnly unheavenly
$ perl script.pl myfilename.txt
Regex: \b(\w*heavenn\w*)\b (compiled form: (?^:\b(\w*heavenn\w*)\b))
1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
heavenns
heavenns
heavennly
Matched words: heavennly heavennnly heavennnnly heavenns heavennsy
$

edited Jul 22, 2013 at 5:47

answered Jul 22, 2013 at 4:11

Jonathan Leffler


8 Comments

karate_kid Over a year ago

Hey, thanks. But I have one more doubt now. What if I want to push only particular words? For example, if my file is 1. "heavenns hello" 2. "heavenns hi", "3.heavennly good", what do I do to print only 'heavenns' and 'heavennly'?

Jonathan Leffler Over a year ago

Have you got blanks on the ends of your lines?

karate_kid Over a year ago

No. I don't have blanks.

Jonathan Leffler Over a year ago

You'd have to show me a hex dump of the file so I can see what's up. There must be something different between the lines if 'duplicates' are appearing. What that difference is, I don't know, but there must be one.

karate_kid Over a year ago

Hey I used the below mentioned answer. That chomp one. Now it's printing unique values, but how to get only 'words'???


ikegami · Accepted Answer · 2013-07-22 04:05:10Z

1

For a given value in $_, !$seen{$_}++ is only true the first time it's executed.

my $regx = qr/heavenn/;

my @mymatches;
my %seen;
while (<>) {
   chomp;
   push(@mymatches, $_) if /$regx/ && !$seen{$_}++;
}

answered Jul 22, 2013 at 4:05

ikegami


4 Comments

ikegami Over a year ago

If so, you've changed something. Can't help you if you don't tell us what you changed.

karate_kid Over a year ago

I put <fh> in my while as in my above code and using '$line' instead of '$_'

karate_kid Over a year ago

Thanks it worked.. Now can you tell me how do I only extract words?

ikegami Over a year ago

push @mymatches, grep !$seen{$_}++, /($regx)/g;
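
A rough, self-contained sketch of how that one-liner could sit in a loop, assuming the pattern itself contains no capture groups (otherwise /($rx)/g would return the extra captures too). The helper name unique_words and the sample lines are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: collects each distinct word matching $rx,
# in the order in which the words were first seen.
sub unique_words {
    my ($rx, @lines) = @_;
    my (%seen, @words);
    for (@lines) {
        # In list context, /(...)/g returns every captured word on the
        # line in $_; grep then keeps only the words not seen before.
        push @words, grep { !$seen{$_}++ } /($rx)/g;
    }
    return @words;
}

my @lines = ('1. "heavenns hello"', '2. "heavenns hi",', '"3.heavennly good".');
my @mymatches = unique_words(qr/\bheavenn\w*/, @lines);
print "@mymatches\n";    # prints "heavenns heavennly"
```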

Community · Accepted Answer · 2017-05-23 12:03:29Z

0

If you want to push only the first occurrence of a word, you can add the following in your loop, right after the regex match:

# Assumes "my %seen;" is declared outside the loop.
next if $seen{$line}++;

More approaches to uniqueness: How do I print unique elements in Perl array?
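
One of those other approaches, sketched here under the assumption that List::Util 1.45 or later is installed (as far as I know it is bundled with Perl 5.26 and newer), is to filter with grep and let uniq drop the repeats. The sample list is made up to mirror the file in the question:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(uniq);    # uniq needs List::Util 1.45 or later

# Sample data standing in for the chomped lines of the file.
my @lines = qw(heaven heavenly heavenns abc heavenns heavennly);

# grep keeps the matching lines; uniq keeps the first copy of each.
my @matches = uniq grep { /heavenn/ } @lines;
print "@matches\n";    # prints "heavenns heavennly"
```

This reads the whole list before filtering, so the %seen idiom is probably still the better fit when processing a large file line by line.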

edited May 23, 2017 at 12:03

CommunityBot


answered Jul 22, 2013 at 4:06

DVK


2 Comments

karate_kid Over a year ago

Where to put this next if?

DVK Over a year ago

@karate_kid - after if ($line =~ /$regx/){
