
Here is my file:

  heaven
  heavenly
  heavenns
  abc
  heavenns
  heavennly

According to my code, only heavenns and heavennly should be pushed into @myarr, and each should appear in the array only once. How can I do that?

my $regx = "heavenn\+";
my $tmp=$regx;

$tmp=~ s/[\\]//g;

$regx=$tmp;
print("\nNow regex:", $regx);

my $file  = "myfilename.txt";

my @myarr;
open my $fh, "<", $file;  
while ( my $line = <$fh> ) {
 if ($line =~ /$regx/){
    print $line;
push (@myarr,$line);
}
}

print ("\nMylist:", @myarr); #printing 2 times heavenns and heavennly
  • Why the rigmarole with the backslash before the plus, and then carefully removing it? Commented Jul 22, 2013 at 4:01
  • Note: "heavenn\+" and "heavenn+" produce the same string. Commented Jul 22, 2013 at 4:07
  • Note: heavenn and heavenn+ are the same if you don't capture what is matched. Commented Jul 22, 2013 at 4:07
  • I don't buy that the use of csh has anything to do with the behaviour of your Perl script (unless you're doing something insane like creating the script as a string on the command line; then anything is possible given the metasyntactic zoo of operators in the csh). Of course, there are those (including me) who'd argue that you shouldn't be using csh in the first place, but that's your funeral, not mine. Commented Jul 22, 2013 at 4:12
  • @JonathanLeffler - some people enjoy pain :) Commented Jul 22, 2013 at 10:15

3 Answers

This is Perl, so There's More Than One Way To Do It (TMTOWTDI). Here's one of them:

#!/usr/bin/env perl
use strict;
use warnings;

my $regex = "heavenn+";
my $rx = qr/$regex/;
print "Regex: $regex\n";

my $file  = "myfilename.txt";
my %list;
my @myarr;
open my $fh, "<", $file or die "Failed to open $file: $!";

while ( my $line = <$fh> )
{
    if ($line =~ $rx)
    {
        print $line;
        $list{$line}++;
    }
}

push @myarr, sort keys %list;

print "Mylist: @myarr\n";

Sample output:

Regex: heavenn+
heavenns
heavenns
heavennly
Mylist: heavennly
 heavenns

The sort isn't necessary (but it presents the data in a sane order). You could instead push each item onto the array the first time it is seen, that is, when the count in $list{$line} is still 0. You could chomp the input lines to remove the newline. Etc.
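
Those two suggestions (push on first sight, chomp the input) can be combined in a minimal sketch. It assumes the sample words are held in an in-memory list, @lines, a hypothetical stand-in for the file so that it runs without myfilename.txt. Note that @myarr then keeps first-seen order rather than sorted order.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for the lines of myfilename.txt.
my @lines = ("heaven\n", "heavenly\n", "heavenns\n",
             "abc\n", "heavenns\n", "heavennly\n");

my $rx = qr/heavenn+/;
my %list;
my @myarr;

for my $line (@lines)
{
    chomp $line;                 # drop the trailing newline
    next unless $line =~ $rx;    # ignore lines that do not match
    # The count is 0 (undefined) only the first time a line is seen,
    # so each matching line is pushed exactly once.
    push @myarr, $line unless $list{$line}++;
}

print "Mylist: @myarr\n";
```

With this sample data the array should end up holding heavenns and heavennly, once each.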


What if I want to push only particular words. For example, if my file is, 1. "heavenns hello" 2. "heavenns hi", "3.heavennly good". What to do to print only 'heavenns' and 'heavennly'?

Then you have to arrange to capture the word only. That means refining the regex. Assuming you want heavenn at the start of the word and don't mind what alphabetic characters come after that, then:

#!/usr/bin/env perl
use strict;
use warnings;

my $regex = '\b(heavenn[A-Za-z]*)\b';  # Single quotes necessary!
my $rx = qr/$regex/;
print "Regex: $regex\n";

my $file  = "myfilename.txt";
my %list;
my @myarr;
open my $fh, "<", $file or die "Failed to open $file: $!";

while ( my $line = <$fh> )
{
    if ($line =~ $rx)
    {
        print $line;
        $list{$1}++;
    }
}

push @myarr, sort keys %list;

print "Mylist: @myarr\n";

Data file:

1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
heaven
heavenly
heavenns
abc
heavenns
heavennly

Output:

Regex: \b(heavenn[A-Za-z]*)\b
1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
heavenns
heavenns
heavennly
Mylist: heavennly heavenns

Note that the names in the list no longer include newlines.


After a chat

This version takes a regex from the command line. The script invocation is:

perl script.pl -p 'regex' [file ...]

It will read from standard input if no file is specified on the command line (better than having a fixed input file name — by a large margin). It looks for multiple occurrences of the specified regex on each line, where the regex can be preceded by or followed by (or both) 'word characters' as specified by \w.

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Std;

my %opts;
getopts('p:', \%opts) or die "Usage: $0 [-p 'regex'] [file ...]\n";

my $regex_base = 'heavenn';
#$regex_base = $ARGV[0] if defined $ARGV[0];
$regex_base = $opts{p} if defined $opts{p};

my $regex = '\b(\w*' . ${regex_base} . '\w*)\b';
my $rx = qr/$regex/;
print "Regex: $regex (compiled form: $rx)\n";

my %list;
my @myarr;

while (my $line = <>)
{
    while ($line =~ m/$rx/g)
    {
        print $line;
        $list{$1}++;
        #$line =~ s///;
    }
}

push @myarr, sort keys %list;

print "Matched words: @myarr\n";

Given the input file:

1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
An unheavenly host.  Good heavens! It heaves to like a yacht!
heaven
Is it heavens
heavenly
heavenns
abc
heavenns
heavennly

You can get outputs such as:

$ perl script.pl -p 'e\w*?ly' myfilename.txt
Regex: \b(\w*e\w*?ly\w*)\b (compiled form: (?^:\b(\w*e\w*?ly\w*)\b))
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
An unheavenly host.  Good heavens! It heaves to like a yacht!
heavenly
heavennly
Matched words: equally heavenly heavennly heavennnly heavennnnly unheavenly
$ perl script.pl myfilename.txt
Regex: \b(\w*heavenn\w*)\b (compiled form: (?^:\b(\w*heavenn\w*)\b))
1. "heavenns hello"
2. "heavenns hi",
"3.heavennly good". What to d
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
Good heavennsy! What a heavennnly output from an equally heavennnnly input!
heavenns
heavenns
heavennly
Matched words: heavennly heavennnly heavennnnly heavenns heavennsy
$

8 Comments

Hey, thanks. But I have one more question now. What if I want to push only particular words? For example, if my file is 1. "heavenns hello" 2. "heavenns hi", "3.heavennly good", what should I do to print only 'heavenns' and 'heavennly'?
Have you got blanks on the ends of your lines?
No. I don't have blanks.
You'd have to show me a hex dump of the file so I can see what's up. There must be something different between the lines if 'duplicates' are appearing. What that difference is, I don't know, but there must be one.
Hey, I used the answer below, the one with chomp. Now it's printing unique values, but how do I get only the 'words'?

For a given value in $_, !$seen{$_}++ is only true the first time it's executed.

my $regx = qr/heavenn/;

my @mymatches;
my %seen;
while (<>) {
   chomp;
   push(@mymatches, $_) if /$regx/ && !$seen{$_}++;
}
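
The same idiom can filter captured words instead of whole lines. The following is only a sketch under stated assumptions: @input is a hypothetical in-memory stand-in for the file, and $word_rx is an assumed word-capturing pattern (word boundaries around heavenn plus any adjoining word characters), not something fixed by the question.

```perl
use strict;
use warnings;

# Hypothetical sample lines; a real script would read them from a file.
my @input = ('1. "heavenns hello"', '2. "heavenns hi",',
             '"3.heavennly good"', 'heavenns');

# Assumed pattern: capture the whole word that contains "heavenn".
my $word_rx = qr/\b(\w*heavenn\w*)\b/;

my (%seen, @words);
for my $line (@input) {
    # In list context, a /g match returns every captured word on the line;
    # grep keeps only the words that have not been seen before.
    push @words, grep { !$seen{$_}++ } $line =~ /$word_rx/g;
}

print "@words\n";
```

Because the capture group holds only the word, the surrounding text and any newline never reach @words.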

4 Comments

If so, you've changed something. Can't help you if you don't tell us what you changed.
I put <$fh> in my while loop, as in my code above, and used '$line' instead of '$_'.
Thanks, it worked. Now can you tell me how to extract only the words?
push @mymatches, grep !$seen{$_}++, /($regx)/g;

If you want to push only the first occurrence of a word, you can add the following in your loop, after the regex:

# Assumes "my %seen;" is declared outside the loop.
next if $seen{$line}++;
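
In context, the line sits inside the if block, before the push. A minimal sketch, assuming a hypothetical in-memory list @lines in place of the file handle:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines read from the file.
my @lines = ("heavenns\n", "abc\n", "heavenns\n", "heavennly\n");

my $regx = qr/heavenn/;
my (%seen, @myarr);

for my $line (@lines) {
    if ($line =~ /$regx/) {
        next if $seen{$line}++;   # skip a line that was already pushed
        push @myarr, $line;
    }
}

print "Mylist:\n", @myarr;
```

The lines are not chomped here, so each element of @myarr still ends with a newline.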

More approaches to uniqueness: How do I print unique elements in Perl array?

2 Comments

Where to put this next if?
@karate_kid - after if ($line =~ /$regx/){
