First of all, what is a Gazetteer?
Here is the definition from the Karnataka Government Gazetteer website from which I have downloaded all the gazetteers:
Earlier, a Gazetteer signified a geographical index or geographical dictionary or guidebook of important places and people. But with the passage of time its range has vastly widened and it had come to mean a veritable mine of knowledge about the numerous aspects of life of the people and of the country or region they inhabit.
Until I looked up the definition for this article, I thought all of these were gazettes. But as mentioned on the Gazetteer website, “Gazetteers are distinctly reference volumes of lasting value while the Gazettes are official newspapers or bulletins.”
Benefits
Gazetteers, as mentioned earlier, give extensive details of a region. Because they come from the government, they are a well-researched, credible, and authoritative source.
The historical insights, from the story behind the name of an area to the chronology of events and the important people who shaped its history, help us appreciate the region’s rich past.
Population and its composition, languages, occupations, cultural beliefs, tools and techniques, demographics, etc., provide a wealth of information to those interested in knowing about their heritage and diversity.
Administrative details, government policies, divisions and sub-divisions of a region, maps, public services, etc., help in knowing the evolution of boundaries, infrastructure, and decision-making process of the bygone era.
Agriculture and irrigation methods give us information on what worked earlier and why.
Town planning, revenue streams, transportation and communication, health, etc., give us vital geographical details of an area and help in making informed decisions about its future.
Researchers and students will find this information useful for their projects.
They are also a treasure trove for quizzers.
Outcome
Total PDFs downloaded: 1200
Total non-converted files: 127
Total searchable files: 1073
I have shared all 1200 files on my OneDrive.
Notes on non-converted files are available in respective folders. Some gazetteers are made searchable by the government itself. I have included them here for easy access.
The rest is my story about this project. If you are only interested in the files, you can stop reading here.
Why did I take up this task?
I was curious about freedom fighters from my hometown, Pavagada, and the role my native villages played in the freedom struggle. Around the same time, I had lost my book on Hoysala temples. So, I started looking for sources that would provide me with reliable information on these. Whenever I need such reliable or authentic sources, I generally look for official documents. In this case, it was a gazetteer.
(Why was I curious about freedom fighters from Pavagada? That is for another day, another story.)
Soon, I realized I had to read through several pages in multiple PDFs to get even a small piece of information. This was the trigger. I had to make them searchable. I had the right tools to do that. But time was a challenge. So, I set myself a very generous deadline: finish the project by the end of the year. My passion for accessibility, and for making information readily and easily available to everyone, added to the cause.
Another, rather minor, reason is that I don’t want these documents to disappear when government websites migrate or update. (See the challenges later in this article for an example.)
Process
The process is simple but monotonous.
- Access the Gazetteer website.
- Click the links.
- Download PDFs.
- Verify all PDFs are downloaded.
- Run Scan and OCR in Adobe Acrobat Pro.
- Do random checks in converted PDFs by searching for text.
- Make a note of the ones that are not converted, along with the reason(s).
That’s it.
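For readers who prefer to script the "verify all PDFs are downloaded" step, here is a minimal Python sketch. It is not what I used; I checked manually. It assumes you have a list of the file names you expect (the names in the usage note are made up) and only confirms that each file exists and starts with the standard `%PDF-` signature. It says nothing about whether the OCR worked.

```python
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file starts with the PDF signature '%PDF-'.

    This is a strict check: an empty file or a saved HTML error page fails it.
    """
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


def check_downloads(folder: Path, expected_names: list[str]) -> dict[str, list[str]]:
    """Compare the expected file names against what is actually on disk."""
    missing = []  # names with no file in the folder
    broken = []   # files that exist but do not look like PDFs
    for name in expected_names:
        path = folder / name
        if not path.is_file():
            missing.append(name)
        elif not looks_like_pdf(path):
            broken.append(name)
    return {"missing": missing, "broken": broken}
```

Calling `check_downloads(Path("gazetteers"), ["chapter1.pdf", "chapter2.pdf"])` returns two lists, one of missing files and one of files that are probably failed downloads. Both the folder name and the file names here are hypothetical.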
Challenges
There are hundreds of PDFs, 1200 to be precise. Each PDF has to be downloaded separately. There is no bulk download option. (My techie friends might have easier methods like scraping to download all PDFs from a webpage in one go.)
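The scraping idea could look roughly like the Python sketch below. I did not use it for this project, and it rests on assumptions: that the PDF links are plain `<a href="...pdf">` links on a page you can fetch directly, and that the site permits automated downloads. The function names and the URL in the usage note are placeholders, not the real Gazetteer site.

```python
import shutil
import time
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return every PDF link found in the given HTML, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_all(page_url, target_dir, pause_seconds=1.0):
    """Download each PDF linked from page_url, skipping files already saved."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    for link in find_pdf_links(html, page_url):
        destination = target / Path(urlparse(link).path).name
        if destination.exists():
            continue
        with urlopen(link) as pdf, destination.open("wb") as out:
            shutil.copyfileobj(pdf, out)
        time.sleep(pause_seconds)  # be polite to the server
```

Usage would be something like `download_all("https://example.org/gazetteer/district.html", "downloads")`. Broken links would still raise an error and need handling by hand, which matches my experience with the website.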
Some links are broken on the website. So, PDFs are missing.
Some links are broken in the Kannada version of the website but work in the English version.
Accessing each district/taluk/publication year, clicking the chapter’s PDF link to download, and running the OCR in Acrobat Pro is excruciatingly time-consuming. I started in mid-2023 and completed it on 18th Feb 2024. There were days I downloaded or converted 100+ PDFs. There were months without a single download/conversion.
Fonts are a nightmare for any document digitization project. These gazetteers use a whole lot of Kannada fonts: Nudi non-Unicode fonts, Baraha fonts, and some unrecognizable ones. Generally, old Kannada documents use Shree Lipi fonts. But even the official Shree Lipi to Unicode converter didn’t recognize these. I have no clue what those fonts are.
But what kept me going? It was helping me. So, I had to do it. Also, I had spoken about this project with too many people. I couldn’t embarrass myself after that. 🙂
Next steps (or pending tasks)
OCR is not perfect. While I have randomly checked the conversion accuracy, I know there will be some misread content. So, all these PDFs must be proofread.
About 100 PDFs use non-Unicode Nudi or Baraha fonts. Currently, their content can be searched only after converting to Word format. But I guess these will be searchable PDFs if the font is converted to Unicode. And the only tool I know for this job is Aravinda’s Sanka Unicode Converter. Converting this massive quantity of text is a mini-project in itself.
I have not downloaded or converted the special editions that are available on the Gazetteer website. They were not in my scope when I started. But they are precious and must be preserved.
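One way to triage which Kannada files still need font conversion is to look at the text you get out of them. Text in legacy non-Unicode fonts typically comes out as Latin-range characters, while properly encoded Kannada falls in the Unicode Kannada block (U+0C80 to U+0CFF). The Python sketch below measures that. It assumes you already have the extracted text as a string (from a Word export, for example), it is only meaningful for Kannada-language files, and the 0.5 threshold is my guess, not a tested value.

```python
def kannada_ratio(text: str) -> float:
    """Return the share of letters in text that lie in the Unicode Kannada block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    kannada = sum(1 for ch in letters if "\u0c80" <= ch <= "\u0cff")
    return kannada / len(letters)


def needs_font_conversion(text: str, threshold: float = 0.5) -> bool:
    """Flag text from a Kannada document that is mostly non-Kannada letters.

    The threshold is an assumption; tune it on a few known files first.
    """
    return kannada_ratio(text) < threshold
```

For example, `needs_font_conversion("ಕನ್ನಡ")` is `False` because every letter is Unicode Kannada, while a string of Latin-range characters from a legacy font would be flagged. This only identifies candidates; the actual conversion still needs a tool such as the Sanka converter mentioned above.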
Got any suggestions or feedback? Let me know in the comments. (Comments are moderated to avoid spam.)