I want to create a custom list of scientific words, for purposes such as spell checking and OCR, based on my collection of scientific papers in PDF format. Using pdftotext I can easily create a text file that contains the words I want for my field. However, the file will be polluted with:
- words that are not specific to science (and that would also be contained in a common dictionary)
- words that result from improper conversion of, e.g., formulas (including words that contain special characters)
I want to get rid of the latter by requiring that individual words have a minimum length, contain no special characters, and appear several times in the list. Then I want to get rid of the former by comparing against a second, general-purpose word list. My questions:
Does this sound like a good plan to you? Are there existing tools for this task? How would you do it?
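
To make the plan concrete, here is a minimal Python sketch of the filtering I have in mind. The function name `build_wordlist`, the thresholds `min_length` and `min_count`, and the rule that a word may contain only ASCII letters are all my own assumptions, not something taken from an existing tool; the letters-only rule in particular would also throw away legitimate terms with hyphens, digits, or accented characters.

```python
import re
import string
from collections import Counter

# Assumption: a "clean" word consists of ASCII letters only.
WORD_RE = re.compile(r"[A-Za-z]+")


def build_wordlist(text, common_words, min_length=4, min_count=3):
    """Return the sorted words from `text` that pass all three filters
    and do not occur in `common_words`."""
    counts = Counter()
    for token in text.split():
        # Drop punctuation stuck to the ends of a token, e.g. "matrix,".
        token = token.strip(string.punctuation)
        # Reject anything that still contains non-letter characters.
        if WORD_RE.fullmatch(token):
            counts[token.lower()] += 1

    common = {word.lower() for word in common_words}
    return sorted(
        word
        for word, n in counts.items()
        if len(word) >= min_length and n >= min_count and word not in common
    )


if __name__ == "__main__":
    sample = (
        "The eigenvalue eigenvalue eigenvalue, of the the the "
        "matrix matrix matrix x_i^2 x_i^2 x_i^2 spin spin"
    )
    print(build_wordlist(sample, common_words={"the", "matrix"}))
```

In real use, `text` would be the contents of the file produced by pdftotext and `common_words` the lines of an ordinary dictionary word list; both file names and locations depend on the system, so I left them out.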