Analysis of the Indic Language Wikipedia Statistical Report 2012

Last week we saw the statistical report on Indic language Wikipedias for the year 2012. That report was based entirely on the statistical data available at (If you haven’t read that report, I strongly recommend that you do, since most of the information here is connected to it.) After it was published, a few readers asked for a detailed analysis of the report. They also asked for more statistical data, such as the percentage of articles above 2 kB, the percentage above 5 kB, and so on. Unfortunately, some of this data was not readily available, so I went in search of it. Malayalam Wikipedian Jyothis helped me dig out the number of articles above 2 kB, 5 kB, and so on. With this newly available data and some other data, I have prepared a detailed analysis of the report that I published last week.


In the previous report we saw the number of articles that each Indic language Wikipedia has. In the Wikimedia world, as we know, a page with just one sentence (or even less) still counts as an article. In the past, many communities have used this feature to increase their article counts. Here I analyze the data that Jyothis dug out for us using the Toolserver account. This data will help you understand the actual situation in Indic Wikipedias regarding the number of articles.

The following table gives an executive summary of the percentage of articles above 2 kB, 5 kB, and 10 kB. The 2 kB threshold actually suits Latin scripts better, since an article with 2 kB of content in English or another Latin-script language has around 6-7 sentences, which gives a minimum level of information. For example, see this English Wikipedia article, which has 2 kB of content, and for comparison see an article in Hindi Wikipedia with 2 kB of content. Since Indic scripts take more bytes per character, an article can reach 2 kB with just two or three sentences. So, to look into the actual state of the article counts in Indic Wikipedias, we have captured the number of articles above 2 kB, 5 kB, and 10 kB.
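The byte-per-character difference can be checked directly. Here is a minimal Python sketch, assuming that article size is counted in bytes of UTF-8 text; the sample words, the function name, and the reading of 2 kB as 2048 bytes are only illustrative and are not part of the data Jyothis extracted:

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of text."""
    return len(text.encode("utf-8")) / len(text)

LIMIT = 2 * 1024  # the 2 kB threshold, taken here as 2048 bytes (an assumption)

latin = utf8_bytes_per_char("India")  # Latin letters take 1 byte each
indic = utf8_bytes_per_char("भारत")   # Devanagari characters take 3 bytes each

print(latin, indic)                            # bytes per character
print(int(LIMIT / latin), int(LIMIT / indic))  # characters that fit in 2 kB
```

Under these assumptions, 2 kB holds roughly one third as many Devanagari characters as Latin letters, which is in line with the 6-7 sentences versus two or three sentences mentioned above. Spaces and wiki markup still take one byte each, so real articles will fall somewhere in between.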


If we look at the article count, as mentioned in the last post, Hindi Wikipedia is the biggest Indic language Wikipedia, with about 1 lakh articles.

  • Now if we look at the percentage of articles above 2 kB, we can see that Bishnupriya Manipuri Wikipedia comes first: more than 96% of its articles are above 2 kB. Assamese comes second with 88% and Malayalam third with 83%. Oriya has 82% and Bengali has 81%.
  • This also reveals some interesting facts about languages that have the upper hand only because of their article count. For example, even though Hindi has close to 1 lakh articles, only 58% of them are above 2 kB. This means that more than 40,000 articles in Hindi Wikipedia are one-liners (or two-liners) or even smaller. If anyone is interested in improving those articles, you can find them using this link.
  • The case of Telugu (the third biggest Indic Wikipedia) is even worse. Only 27% of its articles are above 2 kB, which means almost 73% of Telugu Wikipedia consists of articles that have just one or two sentences. If anyone is interested in improving those articles, you can find them using this link.
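The stub estimates above are simple percentage arithmetic. A rough sketch in Python, assuming a round figure of 1 lakh articles for Hindi (the post only says "close to 1 lakh") and using a hypothetical helper name:

```python
def articles_not_above(total_articles: int, percent_above: int) -> int:
    """Estimate how many articles are NOT above the size threshold."""
    return total_articles * (100 - percent_above) // 100

HINDI_TOTAL = 100_000  # assumption: "close to 1 lakh" rounded to 1 lakh

hindi_small = articles_not_above(HINDI_TOTAL, 58)  # 58% of Hindi is above 2 kB
telugu_small_share = 100 - 27                      # 27% of Telugu is above 2 kB

print(hindi_small)         # estimated Hindi articles at or below 2 kB
print(telugu_small_share)  # percentage of Telugu articles at or below 2 kB
```

With these inputs the Hindi estimate comes to 42,000 articles, consistent with the "more than 40,000" quoted above, and the Telugu share comes to 73%.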

NOTE: While using the Toolserver to extract the different types of data, we found that for Hindi Wikipedia there is a difference of almost 8,000 articles between the article count shown on Hindi Wikipedia and the number of articles reported by the Toolserver. I think this is the reason why in all Wikimedia stats (for example, this) the Hindi Wikipedia article count is around 96,000. This indicates that there are some hidden pages in Hindi Wikipedia that are adding to its article count. I request Hindi Wikipedians to look into this.

As mentioned above, to compare Indic content with that of Latin-script languages, we have also captured the 5 kB and 10 kB data. Analyzing that data gives some interesting and eye-opening facts about Indic wikis.

  • Here Assamese tops the list, with more than 67% of its articles above 5 kB. Malayalam comes second with 44%, then come Bengali and Kannada with 36% each, and then Tamil with 33% of its articles above 5 kB.
  • Assamese really impresses me in this category, with 67% of its articles above 5 kB. This shows that the community is putting in a lot of effort to make sure sufficient content is added to every article it creates.
  • Malayalam, with more than 44% of its articles above 5 kB, is also showing an impressive performance.
  • A surprise entry in this category is Kannada, with 36% of its articles above 5 kB. In the recent past we haven’t seen Kannada in the top positions in quality metrics, so this is a bit of a surprise for me. Bengali also has close to 36% of its articles above 5 kB.
  • Another notable language is Tamil, with 33% of its articles above 5 kB.
  • Even though Bishnupriya Manipuri has 96% in the 2 kB category, the percentage drops to just 2% when it comes to 5 kB articles. This clearly shows that a whopping 98% of its articles are stubs.
  • Personally, I know the story of the Hindi, Telugu, and Bishnupriya Manipuri Wikipedias, where bots were used to create thousands (actually tens of thousands) of stub articles to increase the article count, in the hope that it could be used as a selling point to build the community. (This happened in 2006-2007 for Telugu and Bishnupriya Manipuri, and in 2010 for Hindi.) But now, 4-5 years down the line, this has become a burden for these 3 languages. Bishnupriya Manipuri does not have a community at all. And the community strength in Hindi and Telugu has not increased at all (it has actually decreased) after this mass creation of articles. This clearly shows the urgent need to grow the community for all these languages, so that these tens of thousands of stub articles can be taken care of.

Now, to understand the situation in more depth, we also dug out the > 10 kB data. Here is the information we got. (This is not that important for Indic wikis, since all our wikis are still small, but it is analyzed here to understand the situation better.)

  • Even in the > 10 kB category, Assamese has 40% of its articles above 10 kB. This is really impressive considering that the Assamese Wikipedia community became active only in 2011 and has around 20 active users now. This data clearly shows that Assamese Wikipedians are writing really big articles, and that there are far fewer stubs there than in other Indic language Wikipedias. But this also suggests to me why the Assamese Wikipedia community hasn’t grown much in 2012. I feel the community is putting too much stress on the length of articles, and this might have prevented many new users from joining Assamese Wikipedia. This is not good, especially considering that Assamese Wikipedia is small and its community is also small. It is good to maintain a reasonable level of quality for all articles created in the wiki, but we should make sure that we are not pressuring new users over it; if we do, it will affect community growth. So I request Assamese Wikipedians to find a reasonable balance between the emphasis on article length and the stubs that new users create. Having said this, I would like to congratulate them on reaching the top position in this category. It is a good reason to celebrate.
  • Here Kannada also does well, with 24% of its articles above 10 kB. A part of this might be related to Google’s translation project. But the fact that Kannada has a reasonable percentage of its articles above 10 kB is good news for the readers of Kannada Wikipedia. So why can’t the community use this fact to attract more Kannada speakers to Kannada Wikipedia? As mentioned above for Assamese, the need of the hour for Kannada is also to build the community.
  • Malayalam and Bengali have close to 17% each, and Tamil has 14% of its articles above 10 kB.


In the previous report we saw that Malayalam, with almost 120 active users (as of December 2012), is the biggest Indic language wiki community. In the following table I have taken the number of active users for each month of 2012 and calculated the average. The number of editors per crore speakers is also calculated. Let us see what we can make out of this.


  • In the report we saw that Malayalam, with 120 users, is the biggest Indic language wiki community. When we take the average over all the months, the number of active users drops to 92. This shows that the number of active users of Malayalam Wikipedia fluctuates throughout the year. The fluctuation is closely related to the hype created among the Malayalee population when the community runs major events (like the wiki conference, the 10th anniversary, and so on). Even so, with 92 active users, Malayalam continues to be the biggest Indic wiki community.
  • In the previous report we saw that the number of active users of Tamil has come down from last year. When we average this over the whole year, we can see that Tamil has almost 80 active users, almost the same as the Dec 2012 figure. This shows that the number of active users does not fluctuate the way it does in Malayalam. The Tamil community has maintained its number of active users throughout the year.
  • Bengali, with an average active user base of 58, comes next.

Now when we come to the number of editors per crore population, we get some interesting statistics.

  • Sanskrit, with a speaker base of hardly 50,000 (first language speakers), has 2700 editors per crore population, a whopping figure by Indian standards. The reason might be the keen interest in the Sanskrit language throughout India (due to historic reasons). For example, we have Samskrita Bharati, an organization that is itself working for the progress of Sanskrit wiki projects. So it is no wonder that Sanskrit wiki projects are making reasonably good progress compared to many other big and living languages of India.
  • Among other languages, Malayalam tops the list with 24 editors per crore population. This makes sense considering the effort put in by Malayalam Wikipedians to build their community. But we should understand that we were able to convert only 24 people into Wikipedia editors out of 1 crore Malayalam speakers. From this we can see how much more work we need to do to grow the community. It would be nice if we were able to reach at least 50 users per crore speakers by the end of this year.
  • Another language showing an impressive conversion rate (as we saw in other parameters) is Assamese. I have seen the effort (especially the online effort) put in by Assamese Wikipedians to build their community. I hope they will be able to increase it further this year.

All other languages show an even lower editor-speaker ratio, which shows that we are very bad at reaching out to our own speakers.
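The editors-per-crore figure is a plain normalization of the editor count by the speaker base. A small sketch in Python, using only the Sanskrit numbers quoted above (2700 editors per crore, about 50,000 speakers); the function names are hypothetical, and the implied editor count is derived from those two figures rather than taken from the table:

```python
CRORE = 10_000_000  # 1 crore = 10 million

def editors_per_crore(editors: float, speakers: int) -> float:
    """Normalize an editor count to a population of one crore speakers."""
    return editors * CRORE / speakers

def implied_editors(rate_per_crore: float, speakers: int) -> float:
    """Invert the normalization: how many editors a given rate corresponds to."""
    return rate_per_crore * speakers / CRORE

sanskrit_editors = implied_editors(2700, 50_000)
print(sanskrit_editors)                             # editors implied by the quoted rate
print(editors_per_crore(sanskrit_editors, 50_000))  # back to the per-crore rate
```

If the quoted figures are right, the Sanskrit rate corresponds to only about 13 or 14 editors. A very small speaker base makes the ratio look large, which is worth keeping in mind when comparing it with the bigger languages.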

Page views

In the last report we saw that Hindi, with 78 lakh page views per month, is the Indic Wikipedia with the most views. Here we try to analyze the same information by finding the page views per lakh speakers, to understand how much Wikipedia is used among the speakers of each language.


  • Here, too, Sanskrit is leading, with page views almost nine times its speaker base. This is mainly because Sanskrit has many followers among speakers of other languages. The Bishnupriya Manipuri and Newari Wikipedias also have more page views than their speaker base.
  • Among the other languages of India, Tamil with 9091 page views, Malayalam with 7838 page views, and Nepali with 6962 page views per lakh population take the top positions. For all other languages, page views per lakh population are below 5000.
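The per-lakh figures can be read as follows: a value above 1,00,000 means the wiki gets more page views than the language has speakers. A minimal Python sketch, assuming the Sanskrit speaker base of about 50,000 mentioned earlier and taking "almost nine times" as exactly nine for illustration (the function names are hypothetical):

```python
LAKH = 100_000

def views_per_lakh(monthly_views: int, speakers: int) -> float:
    """Monthly page views normalized to one lakh speakers."""
    return monthly_views * LAKH / speakers

def views_exceed_speakers(per_lakh: float) -> bool:
    """True when page views are higher than the speaker base itself."""
    return per_lakh > LAKH

# Assumption: Sanskrit views taken as exactly 9 x a 50,000 speaker base.
sanskrit = views_per_lakh(9 * 50_000, 50_000)

print(sanskrit, views_exceed_speakers(sanskrit))
print(views_exceed_speakers(9091))  # Tamil's per-lakh figure from the list above
```

Under this reading Sanskrit sits far above the 1 lakh mark, while Tamil, the leader among the other languages, is at roughly 9% of its speaker base.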


Let me end this report with a general piece of advice that comes out of my experience with the Wikimedia movement. Give proper attention to the old and existing articles of your Wikipedia. Otherwise they will later become a burden for us. This is important for building a community for Wikipedia. To find the small or stub articles in your Wikipedia, you can use the short pages link. For example, Tamil Wikipedia short pages are available here:

In this report and in the previous one I have presented my views after analyzing the various available and generated data. It would be really nice if other Wikipedians did the same, since they will be able to find many things that I missed.

Thanks for your time.


8 Responses to Analysis of the Indic Language Wikipedia Statistical Report 2012

  1. Vishnu says:

    Hi Shiju, thanks for this very useful analytical report. The 2 kB, 5 kB, 10 kB parameter does bring in a critical perspective on the quality of the content on Indian language Wikipedias. Thanks to Jyothis for the effort. You mentioned the problem of stubs created by bots for Hindi, Telugu and Bishnupriya Manipuri. Is there any way we can look at the performance of these language wikis over, say, the last 3 years to get a clear picture? I could not agree more with you about the impetus that needs to be given to community building. But you also say (in the earlier report), and I quote, “Even though many outreach programs had happened across country, that is not showing up in terms of number of active editors”. I guess this means that we need to re-look at the way we are doing the outreach programmes in building the community. I think this is an important learning and needs critical reflection from all the Indian language communities, so that we can give a larger chunk of data for Shiju to analyse, at least by this time next year. Thanks again Shiju!

  2. Shiju Alex says:

    \\You mentioned the problem of stubs created by bots for Hindi, Telugu and Bishnupriya Manipuri. Is there any way we can look at the performance of these language wikis over, say, the last 3 years to get a clear picture? \\

    Please see the reports for the last 2 years on this blog itself.
    – 2010 report –
    – 2011 report –

    \\I guess this means that we need to re-look at the way we are doing the Outreach programmes in building the community. \\

    Yes. More focus needs to be on online outreach. But physical outreach should also happen when the need for it arises.

  3. uddip talukdar says:

    From my experience with the Assamese Wikipedia, I have seen that online outreach activities actually convert into real editors, which does not happen much through physical outreach activities. The reason may lie in the actual reach of the internet and computers among the audience. Online viewers definitely have the upper hand, since they can try editing instantly.

    In Assamese Wikipedia, we have a policy against stubs, and I do admit that a few editors might have left because of this. But, at the same time, many have actually understood the problem and have instead focused on writing better articles. Also, most of the veteran Wikipedians are trying to check the quality of each article as much as possible. One side effect of this quality concern is an inhibition among very new editors about writing articles, because they feel that they have to know all of this before they can begin an article! Now, I admit this is probably a very awkward situation to deal with. But with more online programs we should be able to overcome it.

    About outreach programs, I believe that if the goals of online and physical outreach programs are kept separate, we can get better results. For online programs the goal should be to get new editors, while physical outreach programs should be directed towards generating enthusiasm for Wikipedia and promoting its use. Once people realize the importance of Wikipedia in their own language, they will start contributing too.

  4. Shiju Alex says:

    Yes Uddip, you are right about the importance of online outreach. And the goals of online and offline outreach need to be redefined.

    Regarding the policy against stubs in Assamese Wikipedia, I suggest striking a balance, since otherwise it will affect community growth. For Assamese, community building is very, very important at this point in time, so we cannot afford to lose anyone who is probably a good Wikipedian. Maybe, to keep a balance, you can make a policy that the articles users create should have at least 5 sentences. It is not good to pressure users beyond that.

  5. Uddip,
    In Tamil Wikipedia, we have a rule of minimum 3 sentences for an article for newcomers. Once a newcomer crosses around 50 articles, we gently remind her to write more. Until the newcomer is ready, it is the responsibility of the community to ensure the quality.

  6. uddip talukdar says:

    We don’t have a minimum-sentences criterion. Veteran members usually give a very gentle notice and offer their help. Most new members understand almost readily what is expected. Sometimes a member does not follow any advice; in those cases admins move the articles to the member's user page, with a reassurance that once the user expands them to the bare minimum needed, they will be moved back to the main space. This is probably not the main problem in getting new editors, as very few continue like that. What I have noticed is that once all articles seem well written, a new editor feels intimidated, thinking that he/she has to write such big articles to take part. I understand this because I feel the same when I have to edit an already big article in English Wikipedia. We are planning to formulate a drive towards reducing this fear in new editors.

  7. uddip talukdar says:

    A little addendum, the first notice is usually given once it is seen that the user is continuing with such extreme stubs.

  8. Pingback: Wikimedia Research Newsletter, January 2013 — Wikimedia blog
