3

I've tried using sed for this. I've tried putting the lines of interest in variables as well.

I have two examples I want to achieve for now. Let's say I have thousands of URLs in a file called links.txt; here are the first few:

EDIT: I added another site to show a real-world domain example.

https://site.com/category/feed/
https://site2.org/feed/
https://site3.net/science/astronomy/feed/
https://feed.site4.info/market/feed/news.xml

and I paste these variables in the terminal:

TAG='<outline type="rss" title= text= version="RSS" xmlUrl= htmlUrl=/>'
NAMES=$(sed "s/https\:\/\///g;s/[\/].*//;s/\..*//g" links.txt)
XMLS=$(sed "s/.*/xmlUrl=\"&\"/" links.txt)
HTMLS=$(sed "s/.*/htmlUrl=\"&\"/g" links.txt)

How can I use this kind of strategy: while IFS= read -r line; do echo "$line", to take the "stream" of lines in the links.txt file, combine each line with the values on the same line number in the variables above, produce the same number of lines of populated $TAG (one per link), and append them (>>) to COMBINED.txt?

This is the result I want:

<outline type="rss" title="site" text="site" version="RSS" xmlUrl="https://site.com/category/feed/" htmlUrl="https://site.com/"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="https://site2.org/feed/" htmlUrl="https://site2.org/"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="https://site3.net/science/astronomy/feed/" htmlUrl="https://site3.net/"/>
<outline type="rss" title="site4" text="site4" version="RSS" xmlUrl="https://feed.site4.info/market/feed/news.xml" htmlUrl="https://site4.info/"/>

I've tried several attempts, things like this and many others

TAG1='<outline type="rss"'
TAG2='version="RSS"'
TAG3='\/>'
echo "$XMLS" >> xmlurls.txt
sed "s/.*/$TAG1 & $TAG2 /" < xmlurls.txt |  sed "s/[ \t]*$/$TAG3/" >> COMBINED.txt

I’ve tried modifying the variables and escaping slashes, but I frequently get the error unterminated `s' command.

Here's another example of what I want to do with this kind of strategy: one file has a couple of dozen lines, here are the first few lines

llvm-cfi-verify
llvm-config
llvm-cov
llvm-cvtres
llvm-cxxdump
llvm-cxxfilt
llvm-diff
llvm-dis
llvm-dlltool
llvm-dwarfdump
llvm-dwp

I applied the following below:

while IFS= read -r line; do echo "$line" | sed "s/.*/       --slave   \/usr\/bin\/"$line"\\t\\t\\t\\t& \\t\\t\\t\\t\/usr\/bin\/"$line"-\${version}/g"; done < llvm.txt >> COMBINED

Result:

       --slave   /usr/bin/llvm-cfi-verify               llvm-cfi-verify                 /usr/bin/llvm-cfi-verify-${version}
       --slave   /usr/bin/llvm-config               llvm-config                 /usr/bin/llvm-config-${version}
       --slave   /usr/bin/llvm-cov              llvm-cov                /usr/bin/llvm-cov-${version}
       --slave   /usr/bin/llvm-cvtres               llvm-cvtres                 /usr/bin/llvm-cvtres-${version}
       --slave   /usr/bin/llvm-cxxdump              llvm-cxxdump                /usr/bin/llvm-cxxdump-${version}
       --slave   /usr/bin/llvm-cxxfilt              llvm-cxxfilt                /usr/bin/llvm-cxxfilt-${version}
       --slave   /usr/bin/llvm-diff             llvm-diff               /usr/bin/llvm-diff-${version}
       --slave   /usr/bin/llvm-dis              llvm-dis                /usr/bin/llvm-dis-${version}
       --slave   /usr/bin/llvm-dlltool              llvm-dlltool                /usr/bin/llvm-dlltool-${version}
       --slave   /usr/bin/llvm-dwarfdump                llvm-dwarfdump              /usr/bin/llvm-dwarfdump-${version}
       --slave   /usr/bin/llvm-dwp              llvm-dwp                /usr/bin/llvm-dwp-${version}

The tab spacing isn't pretty, but at least it's very close to what I want. However, here I am only applying one modification to one file, and my goal is to apply this sort of thing to chain multiple files/variable "streams" into a combined file.

I'm on Debian 13 with GNU sed, but I can also test this on Alpine, Fedora, Void, openSUSE, etc. if needed.

Thanks for reading. I tried reading the following, it was difficult to find this kind of question. Maybe the keywords I use in google are incorrect.

EDIT: @markp-fuso, I've been able to add/remove the "https://", or add an "s" to "http://", and create new text files without that prefix, but in the end, I want it like that. Thanks everyone, all useful information. I guess I need to learn Perl basics now.

4
  • It would help if you could provide the easiest way to reproduce the problem. This works for me: echo 'xmlUrl="https://site.com/category/feed/"' | sed 's/.*/<outline type="rss" & version="RSS" /' | sed 's/[ \t]*$/\/>/' Commented yesterday
  • the 3rd input line has a domain of site3.net but the expected output has a xmlUrl entry with a domain of site3.com (.net vs .com); if that's not a typo then please update the question with an explanation on when/how the domain should be modified Commented 19 hours ago
  • all 3 input lines contain domains with 2 dot delimited strings (eg, site.com); in your real world data can the domain contain more than 2 such strings (eg, a.nother.site.com)? can your real world data include ip addresses and/or ports (eg, 5.6.7.8, 4.5.6.7:260)? if any of these are possible then please update the question to a) show such examples and b) explain what should be assigned to the title and text attributes in the expected result Commented 19 hours ago
  • Your fourth entry makes the way you choose the title & text more confusing. At first I thought it was "everything except the top level domain" (left most part of the hostname) now it looks like "highest level domain except for top" (2nd from the right most part of the hostname). Commented 17 hours ago

5 Answers

7

how can I use this kind of strategy: while IFS= read -r line; do echo "$line"

Don't. That's not how shells are meant to be used.

For most text processing, I'd use perl which has made sed/awk obsolete since the late 80s.

$ perl -lne 'print qq{<outline type="rss" title="$1" text="$1" version="RSS" xmlUrl="$_" htmlUrl="$&/"/>}
               if m{^https://(?:[^/]*\.)?([^./]+)\.[^/]*}' < links.txt
<outline type="rss" title="site" text="site" version="RSS" xmlUrl="https://site.com/category/feed/" htmlUrl="https://site.com/"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="https://site2.org/feed/" htmlUrl="https://site2.org/"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="https://site3.net/science/astronomy/feed/" htmlUrl="https://site3.net/"/>
<outline type="rss" title="site4" text="site4" version="RSS" xmlUrl="https://feed.site4.info/market/feed/news.xml" htmlUrl="https://feed.site4.info/"/>

Where:

  • -ln is the sed -n mode where the expression passed to -e is evaluated for each line of input (with the current line (stripped of its delimiter with -l like in sed) in the $_ variable).
  • qq{...} is another form of "..." which like "..." allows $var expansions within (like in shells) but makes it easier to embed "s within the quoted string (see also q{...} for '...' for hard quotes).
  • m{...} similarly is like /.../ (like in awk) except it makes it easier to embed /s within. That's to match $_ against a regexp (it's short for $_ =~ m{...} like awk's /.../ is short for $0 ~ /.../). Here the regexp matches https:// at the beginning (^), optionally (?) followed by a sequence of non-/ characters and a . (which we ignore), followed by one or more characters other than . and / (which is captured into $1 thanks to the (...)), followed by a literal . and any number of characters other than / (the .net/.com... tld parts).
  • In what we print if the regexp matches, $_ is the whole line as said above, $& is what is matched by the whole regexp, $1 by the first capture group.

If you add the -MEnglish option, you can replace $_ with $ARG and $& with $MATCH, or you can make it even more explicit by naming the capture groups:

perl -lne '
  print qq{<outline type="rss" title="$+{title}" text="$+{title}" version="RSS" xmlUrl="$+{feed}" htmlUrl="$+{site}/"/>}
    if m{^(?<feed>(?<site>https://(?:[^/]*\.)?(?<title>[^./]+)\.[^/]*).*)}' < links.txt

Where %+ is the associative array that maps capture group names to what they matched, each element accessed with $+{key}.

You can learn about

  • the special variables with perldoc -v '$&' or perldoc -v '%+', etc.
  • functions (like print) or operators (like m) with perldoc -f print / perldoc -f m
  • how to invoke perl with perldoc perlrun (for those -l, -n, -e options).
  • the syntax with perldoc perlsyn (perldoc -f if will point you to that).
  • Modules (such as English) with perldoc with the name of the module as argument (perldoc English).

Beware however that some systems don't come with the perl documentation installed by default. You may need an apt install perl-doc or equivalent, or read the documentation online (see links above).

Your second example would be trivial:

perl -lpe '$_ = qq{ --slave /usr/bin/$_ $_ /usr/bin/$_-\${version}}' < input

-lp is the sed without -n mode, so same as with -n except that $_ is -printed after the -expression has been evaluated.

5

What I would do is use a template engine, here Perl's tpage from the Template::Toolkit module, which is the clean and maintainable way:

Template:

cat rss.tmpl
<outline type="rss" title="[% title %]" text="[% title %]" version="RSS" xmlUrl="[% xmlUrl %]" htmlUrl="[% htmlUrl %]"/>

Input:

cat input
site.com/category/feed/
site2.org/feed/
site3.net/science/astronomy/feed/

Shell code:

while IFS= read -r url; do
    url=${url%/} domain=${url%%/*} title=${domain%%.*} htmlurl=$domain/
    tpage --define "title=$title" --define "xmlUrl=$url/" --define "htmlUrl=$htmlurl" rss.tmpl
done < input

Output:

<outline type="rss" title="site" text="site" version="RSS" xmlUrl="site.com/category/feed/" htmlUrl="site.com/"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="site2.org/feed/" htmlUrl="site2.org/"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="site3.net/science/astronomy/feed/" htmlUrl="site3.net/"/>
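
The parameter expansions in the loop do the actual parsing, so here is a minimal sketch of what each one yields for a single line. The split_url function and its |-separated output are hypothetical, added only to make the intermediate values visible.

```shell
# Hypothetical helper (not part of the answer): print the pieces the loop
# derives from one input line, separated by "|".
split_url() {
    url=${1%/}            # drop a single trailing "/"
    domain=${url%%/*}     # remove the longest suffix starting at the first "/"
    title=${domain%%.*}   # remove the longest suffix starting at the first "."
    printf '%s|%s|%s\n' "$url" "$domain" "$title"
}

split_url 'site3.net/science/astronomy/feed/'
```

Note that this relies on the scheme being stripped already, as in the input above: with https:// left in place, ${url%%/*} cuts at the first slash and yields https:. For a host such as feed.site4.info, the title comes out as feed rather than site4.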

Install via system package, example for Debian and derivatives:

apt install libtemplate-perl

Or with Perl's cpan utility:

cpan Template

If you prefer Python's Jinja, check jinja2-cli

cat rss.j2
<outline type="rss" title="{{ title }}" text="{{ title }}" version="RSS" xmlUrl="{{ xmlUrl }}" htmlUrl="{{ htmlUrl }}"/>

Code:

while IFS= read -r url; do
    url=${url%/} domain=${url%%/*} title=${domain%%.*} htmlurl=$domain/
    jinja2 --format=env rss.j2<<EOF
        title=$title
        xmlUrl=$url/
        htmlUrl=$htmlurl
EOF
done < input

Install Python's module:

pip install jinja2-cli
3

For the 1st data set ...

Assumptions/understandings:

  • the title, text and htmlUrl attributes are derived from the last 2 dot-delimited strings in the domain (eg, for feed.site4.info we're interested in site4 and site4.info)
  • we do not need to worry about ip addresses and/or ports; otherwise OP will need to provide details on how to parse said addresses (eg, 5.6.7.8, 4.5.6.7:260) into the title and text attributes
  • input will be from a pipe

One awk idea:

$ cat links.awk
BEGIN { FS = "/" }                                   # use "/" as input field delimiter
      { 
        gsub(/^[[:space:]]*|[[:space:]]*$/,"")       # strip leading/trailing white space
        n=split($3,a,/[.]/)                          # split 3rd field (the host) on periods and place results in array a[]

        printf "<outline type=\"rss\" title=\"%s\" text=\"%s\" version=\"RSS\" xmlUrl=\"%s\" htmlUrl=\"%s\"/>\n",
               a[n-1], a[n-1], $0, $1 FS $2 FS a[n-1] "." a[n] FS
      }

NOTE: the gsub(...) line is effectively a 'no-op' if there is no leading and/or trailing white space and could be removed (or commented out with a leading #) if OP is 100% sure there's no need to worry about leading/trailing white space

Taking for a test drive:

$ cat links.txt | awk -f links.awk

This generates:

<outline type="rss" title="site" text="site" version="RSS" xmlUrl="https://site.com/category/feed/" htmlUrl="https://site.com/"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="https://site2.org/feed/" htmlUrl="https://site2.org/"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="https://site3.net/science/astronomy/feed/" htmlUrl="https://site3.net/"/>
<outline type="rss" title="site4" text="site4" version="RSS" xmlUrl="https://feed.site4.info/market/feed/news.xml" htmlUrl="https://site4.info/"/>

For the 2nd data set ...

Assumptions:

  • the intention is to generate visually aligned columns
  • input will be from a pipe

One idea using sed and column:

$ cat llvm.sed
s/(.*)/--slave \/usr\/bin\/\1 & \/usr\/bin\/\1-\${version}/g

Taking for a test drive:

$ cat llvm.txt | sed -E -f llvm.sed | column -t

This generates:

--slave  /usr/bin/llvm-cfi-verify  llvm-cfi-verify  /usr/bin/llvm-cfi-verify-${version}
--slave  /usr/bin/llvm-config      llvm-config      /usr/bin/llvm-config-${version}
--slave  /usr/bin/llvm-cov         llvm-cov         /usr/bin/llvm-cov-${version}
--slave  /usr/bin/llvm-cvtres      llvm-cvtres      /usr/bin/llvm-cvtres-${version}
--slave  /usr/bin/llvm-cxxdump     llvm-cxxdump     /usr/bin/llvm-cxxdump-${version}
--slave  /usr/bin/llvm-cxxfilt     llvm-cxxfilt     /usr/bin/llvm-cxxfilt-${version}
--slave  /usr/bin/llvm-diff        llvm-diff        /usr/bin/llvm-diff-${version}
--slave  /usr/bin/llvm-dis         llvm-dis         /usr/bin/llvm-dis-${version}
--slave  /usr/bin/llvm-dlltool     llvm-dlltool     /usr/bin/llvm-dlltool-${version}
--slave  /usr/bin/llvm-dwarfdump   llvm-dwarfdump   /usr/bin/llvm-dwarfdump-${version}
--slave  /usr/bin/llvm-dwp         llvm-dwp         /usr/bin/llvm-dwp-${version}

NOTE: this assumes the input does not contain white space (column's default field delimiter); otherwise we could modify the sed and column calls to use a different character as the field delimiter
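
One possible way to do that, sketched here under the assumption that a tab never appears in the input: emit the fields separated by tabs and tell column to split on tabs only. The names to_rows and align_rows are hypothetical, and the sed call uses | as the s delimiter so the slashes in /usr/bin/ need no escaping.

```shell
# Sketch only: tab-separated variant of the sed + column idea.
tab=$(printf '\t')

# Build the four fields, separated by literal tabs ("&" is the whole line).
to_rows() {
    sed "s|.*|--slave${tab}/usr/bin/&${tab}&${tab}/usr/bin/&-\${version}|"
}

# Align on tabs only, so spaces inside a field would not start a new column.
align_rows() {
    column -t -s "$tab"
}

# Usage (assumes llvm.txt from the question):
#   to_rows < llvm.txt | align_rows
```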

2

Not sure about sed. But you can do it using awk and paste and some temporary files:

$ awk -F/ '{print $3}' links.txt | tee long | awk -F. '{print $1}' >short

$ paste links.txt long short > final

$ awk  '{print "<outline type=\"rss\" title=\""$3"\" text=\""$3"\" version=\"RSS\" xmlUrl=\""$1"\" htmlUrl=\""$2"\"/>"}' final
<outline type="rss" title="site" text="site" version="RSS" xmlUrl="https://site.com/category/feed/" htmlUrl="site.com"/>
<outline type="rss" title="site2" text="site2" version="RSS" xmlUrl="https://site2.org/feed/" htmlUrl="site2.org"/>
<outline type="rss" title="site3" text="site3" version="RSS" xmlUrl="https://site3.net/science/astronomy/feed/" htmlUrl="site3.net"/>

$ rm long short final
0

Your second example (the llvm stuff) is a bjillion times easier than your first example.

THING=llvm-cfi-verify
echo "--slave  /usr/bin/${THING}  ${THING}  /usr/bin/${THING}-\${version}"

This is because you're not doing anything to process the input, just sticking it in the output unchanged.

You may have heard the saying "make a tool that does one thing and does it well." If you look at that from the point of view of a user of those tools, that means, "if your tool isn't awesome for this job, there's probably a better one out there."

Instead of mastering sed, you should get to be familiar with a wide range of tools and which situations they're good for. You can look up how to do things with a tool you know about, but it's harder to realize you need a tool you've never heard of.

I would use totally different approaches to the two problems you've shown us. I would probably use Perl or Python for the first problem and a three-line bash script for the 2nd.
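
For the second problem, a short loop along these lines might be what such a script looks like. This is only a sketch: the slave_lines name and the column widths (8, 28 and 18) are assumptions chosen to fit the sample names, not something taken from the question.

```shell
# Sketch only: read one name per line on stdin and print one --slave line
# each. The printf widths are arbitrary padding picked for the sample data.
slave_lines() {
    while IFS= read -r name; do
        printf '%-8s %-28s %-18s %s\n' \
            --slave "/usr/bin/$name" "$name" "/usr/bin/$name-\${version}"
    done
}

# Usage (assumes llvm.txt from the question):
#   slave_lines < llvm.txt >> COMBINED
```

Because printf does the padding, no sed escaping is involved and ${version} stays literal in the output.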
