I have the following script for batch PDF OCR processing, and it works fine:
#!/bin/sh
# apt-get install exactimage tesseract-ocr ghostscript
# bash tut: http://linuxconfig.org/bash-scripting-tutorial
# Linux PDF,OCR: http://blog.konradvoelkel.de/2013/03/scan-to-pdfa/
y="`pwd`/$1"
echo Will create a searchable PDF for $y
x=`basename "$y"`
name=${x%.*}
mkdir "$name"
cd "$name"
# splitting to individual pages
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o out_%04d.jpg -f "$y"
# process each page
for f in $( ls *.jpg ); do
# extract text
tesseract -l eng -psm 3 $f ${f%.*} hocr
# echo Page ?? of ?? done!
# remove the "<?xml" line; it disturbs hocr2pdf
grep -v "<?xml" ${f%.*}.html > ${f%.*}.noxml
rm ${f%.*}.html
# create a searchable page
hocr2pdf -i $f -s -o ${f%.*}.pdf < ${f%.*}.noxml
rm ${f%.*}.noxml
rm $f
done
# combine all pages back to a single file
# from http://www.ehow.com/how_6874571_merge-pdf-files-ghostscript.html
gs -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=../${name}_searchable.pdf *.pdf
cd ..
rm -rf $name
I just want to echo which page has been completed out of the total number of pages in the input PDF file. How can I do that?
Use `for f in *.jpg` instead of `for f in $( ls *.jpg )`; you'll thank me later. Your approach will break if any of your file names contain spaces, for example.
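One possible way to get the progress message, as a minimal sketch rather than a drop-in replacement: once Ghostscript has split the PDF, the number of page images equals the number of pages, so the total can be taken from the argument count and a counter incremented inside the loop. The function name `report_progress` is hypothetical (it is not part of the original script), and the per-page OCR commands are left as a placeholder comment.

```shell
#!/bin/sh
# Hypothetical helper: call it with the page images as arguments.
report_progress() {
    total=$#    # number of files passed in = total number of pages
    page=0
    for f in "$@"; do
        page=$((page + 1))
        # The per-page work on "$f" (tesseract, grep, hocr2pdf, rm)
        # from the original loop would go here.
        echo "Page $page of $total done!"
    done
}
```

In the script above it would be called as `report_progress out_*.jpg` in place of the `for` loop. Because the files arrive as quoted positional parameters, names containing spaces are handled as well. Note that if no `out_*.jpg` file exists, the shell passes the unexpanded pattern itself, so this sketch would report one page.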