Search in PDF

You can ask general questions and share opinions or advice about doPDF.
aras
Posts: 1
Joined: Mon Oct 24, 2011 6:00 am

Postby aras » Mon Oct 24, 2011 1:12 pm

Hi all,


My client's requirement is PDF search (for non-English content) in the Search module of his e-learning website. When I try to extract the contents of a PDF for indexing, some of the characters are dropped during extraction (I see empty spaces in those areas when I view the indexed contents in Luke). I am getting this problem for languages like Tamil and Hindi.


The client is adamant that he wants PDF search.


What is the solution for this? Please give me a ray of light or some guidelines.


Thanks and Regards,

aras



Softland
Posts: 1498
Joined: Thu May 23, 2013 7:19 am

Postby Softland » Mon Oct 24, 2011 5:55 pm

Hello,


Unfortunately, the standard font encodings in the PDF format cover only Latin characters. Other characters are added to the PDF as embedded CID font subsets with Unicode CMaps. To extract all the characters correctly, you have to use a text search module that is capable of reading this type of text from PDF files.
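To illustrate why characters can go missing, here is a minimal Python sketch of how a Unicode CMap relates the character codes stored in the PDF to Unicode text. The CMap fragment, the function names parse_bfchar and decode_codes, and the sample codes are all made up for illustration; a real extractor also has to handle range mappings, multiple fonts, and compressed streams, so treat this as a sketch under those assumptions and not a working extractor.

```python
import re

# Hand-written fragment in the style of a Unicode CMap (illustrative only,
# not taken from a real PDF). Each bfchar line maps a font character code
# to a UTF-16BE Unicode value; both are written as hex strings.
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0B85>
<0004> <0B95>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(codes, mapping):
    """Decode character codes; unmapped codes become U+FFFD so gaps stay visible."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    table = parse_bfchar(SAMPLE_CMAP)
    # Code 5 has no entry in the sample CMap, so it cannot be decoded.
    print(decode_codes([3, 4, 5], table))
```

Codes 3 and 4 decode to Tamil letters because the sample CMap lists them, while code 5 has no entry and comes out as a replacement character. An extractor that cannot read these mappings is in the same position for every code, which would be consistent with the empty areas seen in the index.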


Thank you for understanding.



