Search in PDF

You can ask general questions, share opinions or advice about doPDF.
aras
Posts: 1
Joined: Mon Oct 24, 2011 6:00 am

Post by aras »

Hi all,
My client's requirement is PDF search (for non-English content) in the Search module of his e-learning website. When I extract the contents of a PDF for indexing, some of the characters are dropped during extraction (when I view the indexed contents in Luke, there are empty spaces where they should be). I am seeing this problem with languages such as Tamil and Hindi.
The client is adamant that he wants PDF search.
What is the solution for this? Please give me a ray of light or some guidelines.
Thanks and Regards,

aras


Claudiu (Softland)
Posts: 1500
Joined: Thu May 23, 2013 7:19 am

Post by Claudiu (Softland) »

Hello,
Unfortunately, the PDF format supports only Latin characters by default. Other characters are added to the PDF as embedded CID font subsets with Unicode CMaps. To extract all the characters correctly, you have to use a text search module capable of reading this type of text from PDF files.
Thank you for understanding.
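
To illustrate the CMap point, here is a minimal Python sketch of how an extractor might turn character codes from a CID font into Unicode text. It is only a sketch under assumptions: the CMap fragment is made up for the example (not taken from a real file), the names parse_bfchar and decode_codes are hypothetical, and only the simple "bfchar" entries are handled (real CMaps also use "bfrange"). Codes without a mapping are replaced by U+FFFD, which is one way the missing characters seen in the index can arise.

```python
import re

# Hypothetical ToUnicode CMap fragment (an assumption for illustration only).
# Each entry maps a character code to a UTF-16BE encoded Unicode value.
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0B85>
<0004> <0BAE>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from the bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is hex-encoded UTF-16BE text.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(codes, mapping, missing="\ufffd"):
    """Translate character codes to text; unmapped codes become `missing`."""
    return "".join(mapping.get(code, missing) for code in codes)


if __name__ == "__main__":
    table = parse_bfchar(SAMPLE_CMAP)
    # Code 5 has no entry, so it comes out as the replacement character.
    print(decode_codes([3, 4, 5], table))
```

An extraction tool that ignores these tables, or a PDF whose font has no such table, would yield gaps for Tamil or Hindi text even though the page renders correctly.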

