The content of this blog post has been moved to a new domain. Please visit http://shijualex.in/digitization-of-books-in-wikisource-using-djvu/ to read it.

A really good post and a learning curve for most Indic language Wikilibrarians.
I don’t understand, doesn’t ABBYY work with Malayalam?
What’s the output of texts like http://archive.org/search.php?query=language%3A%28mal%29 ?
Awesome to hear about what the non-English projects are doing.
BTW: it appears there’s something wrong with how your blog is set up for en.planet.wikimedia.org. The links on the en planet for your post don’t go to your blog.
su said
//I don’t understand, doesn’t ABBYY work with Malayalam?//
No. ABBYY does not support Malayalam. In fact, it does not support any Indic language. See ABBYY’s list of supported languages here: http://finereader.abbyy.com/recognition_languages/?width=850&height=600px&blocks=Content
bawolff said
//BTW: it appears there’s something wrong with how your blog is set up for en.planet.wikimedia.org . The links on the en planet for your post don’t go to your blog//
I have no idea about the settings there. I hope the blog planet admins will fix the issue.
Are you aware that Tesseract (an open source OCR engine) does support Malayalam? I believe there’s both built-in support in the newest beta version (3.02, which hasn’t been officially released, but you can compile it from the source code at https://code.google.com/p/tesseract-ocr/ if you’re feeling technical) and a community-driven project (https://code.google.com/p/parichit/ for version 3.01). There’s also support for some other Indian languages. I can’t vouch for the quality though, as I can’t read any of them.
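For anyone who wants to try this on a scanned page, here is a minimal sketch of calling the Tesseract command line from Python. It assumes the `tesseract` binary is installed and on the PATH together with Malayalam trained data (language code `mal`); the function names and the file name `page_001.png` are hypothetical and only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="mal"):
    """Return the argument list for a basic Tesseract run.

    Follows the CLI form: tesseract <image> <outputbase> -l <lang>
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, lang="mal"):
    """OCR one page image and return the recognized text.

    Returns None when the tesseract binary is not found on PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    image_path = Path(image_path)
    # Tesseract appends ".txt" to the output base by itself.
    output_base = image_path.with_suffix("")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang),
        check=True,
    )
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_page("page_001.png")` would write `page_001.txt` next to the image and return its contents. The recognition quality for Malayalam is not something this sketch can say anything about; the output still needs proofreading.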
Is there any OCR for Sanskrit as well? If so, where can it be obtained?
Sorry, but I don’t know of any support for Sanskrit. It might be worth trying to get in contact with the folks behind the Parichit project, though, to hear what they are missing in order to support Sanskrit and whether there’s a way you can help. Their stated goal is to support each of the Indian languages, so maybe, if we’re lucky, they have something on the way. (While training Tesseract for new languages is rather tricky, it does not require programming skills.)
Dear Alberti, it would be better if I could get the contact details of the Parichit project people; I could help them with the language part. But the PDFs of many Sanskrit works might be in a font that may not be open source. So how do we overcome this issue?
The only contact information I have for the Parichit project people is in the link I gave above. There’s a Google Groups mailing list, though it doesn’t seem to be very active.
Training Tesseract requires images of text; it does not use fonts directly (there’s a technical guide at https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 ). So it is possible to use scans of old texts, although that is probably tricky for connected scripts. And since an image of text generated using a proprietary font is usually not a derivative work of that font, except perhaps when the font is particularly decorative, it is also possible to use fonts that aren’t open source.