Last week we saw the statistical report of Indic language Wikipedias for the year 2012. That report was based entirely on the statistical data available at http://stats.wikimedia.org. (If you haven’t read that report, I strongly recommend reading it, since most of the information here is connected to it.) After it was published, a few readers asked for a detailed analysis. They also asked for more statistical data, such as the percentage of articles above 2 kB, the percentage above 5 kB, and so on. Unfortunately, some of this data was not readily available, so I went in search of it. Malayalam Wikipedian Jyothis helped me dig out the number of articles above 2 kB, 5 kB, and so on. With this newly available data and some other data, I have prepared a detailed analysis of the report that I published last week.
In the previous report we saw the number of articles that each Indic language Wikipedia has. In the Wikimedia world, as we know, a page with a single sentence (or even less) still counts as an article. In the past, many communities have used this feature to increase their article counts. Here I analyze the data that Jyothis dug out for us using a Toolserver account. This data will help you understand the actual situation in Indic Wikipedias regarding the number of articles.
The following table gives you an executive summary of the percentage of articles above 2 kB, 5 kB, and 10 kB. The 2 kB threshold actually suits Latin scripts better, since an article with 2 kB of text in English or other Latin-script languages will have around 6-7 sentences, which gives a minimum level of information. For example, see this English Wikipedia article, which has 2 kB of content, and for comparison see an article in Hindi Wikipedia with 2 kB of content. Since Indic scripts take more bytes per character, an article can reach 2 kB with just two or three sentences. So, to look into the actual state of the articles in Indic Wikipedias, we captured the number of articles above 2 kB, 5 kB, and 10 kB.
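The bytes-per-character point can be checked with a small sketch. It assumes article size is counted in UTF-8 bytes; the two sample sentences and the helper name `bytes_per_char` are made up for illustration and are not taken from the articles linked above.

```python
# Compare how many UTF-8 bytes a short sentence needs in Latin and Devanagari script.
# Assumption: article size is measured in UTF-8 bytes.
# The sample sentences are illustrative only.

def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes used per character of text."""
    return len(text.encode("utf-8")) / len(text)

samples = {
    "English": "This is a village.",
    "Hindi": "यह एक गाँव है।",
}

for language, text in samples.items():
    size = len(text.encode("utf-8"))
    print(f"{language}: {len(text)} characters, {size} bytes, "
          f"{bytes_per_char(text):.2f} bytes per character")
```

Basic Latin letters take one byte each in UTF-8, while Devanagari letters and vowel signs take three, so the same byte threshold is reached with far fewer characters in Hindi.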
If we look at article count, as mentioned in the last post, Hindi Wikipedia is the biggest Indic language Wikipedia with about 1 lakh articles.
- Now if we look at the percentage of articles above 2 kB, we can see that Bishnupriya Manipuri Wikipedia comes first: more than 96% of its articles are above 2 kB. Assamese comes second with 88% and Malayalam third with 83%. Oriya has 82% and Bengali has 81%.
- This also reveals some interesting facts about languages that have the upper hand only because of their article count. For example, even though Hindi has close to 1 lakh articles, only 58% of them are above 2 kB. This means that more than 40,000 articles of Hindi Wikipedia are one- or two-line articles, or even smaller. If anyone is interested in improving those articles, you can find them using this link.
- The case of Telugu (the third biggest Indic Wikipedia) is even worse. Only 27% of its articles are above 2 kB, which means almost 73% of Telugu Wikipedia consists of articles that have just one or two sentences. If anyone is interested in improving those articles, you can find them using this link.
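The arithmetic behind the Hindi and Telugu figures above can be sketched as follows. The function names are mine, and the Hindi total is the rounded "about 1 lakh" figure from this post, not an exact count.

```python
# Turn "percentage of articles above a size threshold" into the share and
# approximate count of articles below it.

def share_below(percent_above: float) -> float:
    """Percentage of articles at or below the size threshold."""
    return 100.0 - percent_above

def count_below(total_articles: int, percent_above: float) -> int:
    """Approximate number of articles at or below the size threshold."""
    return round(total_articles * share_below(percent_above) / 100)

hindi_total = 100_000  # "about 1 lakh", rounded; an approximation

print("Hindi articles below 2 kB (approx.):", count_below(hindi_total, 58))
print("Telugu share below 2 kB (%):", share_below(27))
```

With the rounded total, the Hindi figure comes to about 42,000, which matches the "more than 40,000" stated above.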
NOTE: While using Toolserver to extract different types of data, we found that for Hindi Wikipedia there is a difference of almost 8000 articles between the article count shown in Hindi Wikipedia and the number of articles reported by Toolserver. I think this is why all Wikimedia stats (for example, http://stats.wikimedia.org/EN/SummaryHI.htm) show the Hindi Wikipedia article count as around 96,000. This indicates that there are some hidden pages in Hindi Wikipedia that are adding to its article count. I request Hindi Wikipedians to look into this.
As mentioned above, to compare Indic content with that of Latin-script languages we also captured 5 kB and 10 kB data. Analyzing that data gives some interesting and eye-opening facts about Indic wikis.
- Here Assamese tops the list with more than 67% of its articles above 5 kB. Malayalam comes second with 44%, then Bengali and Kannada with 36% each, and then Tamil with 33%.
- Assamese really impresses me in this category with 67% of its articles above 5 kB. This shows that the community is putting in a lot of effort to make sure they add sufficient content to every article they create.
- Malayalam, with more than 44% of its articles above 5 kB, is also showing an impressive performance.
- A surprise entry in this category is Kannada, with 36% of its articles above 5 kB. In the recent past we haven’t seen Kannada near the top in quality metrics, so this is a bit of a surprise for me. Bengali also has close to 36% of its articles above 5 kB.
- Another notable language is Tamil with 33% of its articles above 5 kB.
- Even though Bishnupriya Manipuri has 96% in the 2 kB category, the percentage drops to just 2% when it comes to 5 kB articles. This clearly shows that a whopping 98% of its articles are stubs.
- Personally, I know the story of Hindi, Telugu, and Bishnupriya Manipuri Wikipedias, where bots were used to create thousands (actually tens of thousands) of stub articles to increase the article count, in the hope that it could be used as a selling point to build the community. (This happened in 2006-2007 for Telugu and Bishnupriya Manipuri, and in 2010 for Hindi.) But now, 4-5 years down the line, this has become a burden for these three languages. Bishnupriya Manipuri does not have a community at all, and the community strength in Hindi and Telugu did not increase at all (it actually decreased) after this mass creation of articles. This clearly shows the urgent need to grow the community in all these languages to take care of these tens of thousands of stub articles.
To understand the situation in more depth, we also dug out the > 10 kB data. Here is what we found. (This is not that important for Indic wikis, since all our wikis are still small, but it is analyzed here to understand the situation better.)
- Even in the > 10 kB category Assamese leads, with 40% of its articles above 10 kB. This is really impressive considering that the Assamese Wikipedia community became active only in 2011 and has around 20 active users now. This data clearly shows that Assamese Wikipedians are writing really big articles, and stubs are far less common there than in other Indic language Wikipedias. But this also suggests to me why the Assamese Wikipedia community hasn’t grown much in 2012. I feel the community is putting too much stress on the length of articles, and this might have prevented many new users from joining Assamese Wikipedia. This is not good, especially since both Assamese Wikipedia and its community are small. It is good to maintain a reasonable level of quality for all articles created in the wiki, but we should make sure that we are not putting pressure on new users for that; if we do, it will affect community growth. So I request Assamese Wikipedians to find a reasonable balance between the emphasis on article length and the stubs that new users create. Having said this, I would like to congratulate them for reaching the top position in this category. It is a good reason to celebrate.
- Here also Kannada does well, with 24% of its articles above 10 kB. A part of this might be related to Google’s translation project. But the fact that Kannada has a reasonable percentage of its articles above 10 kB is good news for the readers of Kannada Wikipedia. So why can’t the community use this fact to attract more Kannada speakers to Kannada Wikipedia? As mentioned above for Assamese, the need of the hour for Kannada too is to build its community.
- Malayalam and Bengali have close to 17% each, and Tamil has 14% of its articles above 10 kB.
In the previous report we saw that Malayalam, with almost 120 users (as of December 2012), is the biggest Indic language wiki community. For the following table I took the active users for each month of 2012 and averaged them. Editors per crore speakers is also calculated. Let us see what we can make out of this.
- In the report we saw that Malayalam with 120 users is the biggest Indic language wiki community. When we take the average over all the months, the number of active users drops to 92. This shows that the number of active users of Malayalam Wikipedia fluctuates throughout the year. The fluctuation is closely related to the hype created among the Malayalee population when the community runs major events (like a wiki conference, the 10th anniversary, and so on). Even so, with 92 active users, Malayalam continues to be the biggest Indic wiki community.
- In the previous report we saw that the number of active users of Tamil had fallen from the previous year. When we average this over the whole year, we can see that Tamil has almost 80 active users, almost the same as the December 2012 figure. This shows that the number of active users does not fluctuate as it does for Malayalam. The Tamil community has maintained its number of active users throughout the year.
- Bengali with an average active user base of 58 comes next.
Now, when we come to the number of editors per crore population, we get some interesting statistics.
- Sanskrit, with a base of hardly 50,000 first-language speakers, has 2700 editors per crore population, a whopping figure by Indian standards. The reason might be that there is keen interest in the Sanskrit language throughout India (for historic reasons). For example, we have Samskrita Bharati, an organization that is itself working for the progress of Sanskrit wiki projects. So it is no wonder that Sanskrit wiki projects are making reasonably good progress compared to many other big, living languages of India.
- Among other languages, Malayalam tops the list with 24 editors per crore population. This makes sense considering the efforts put in by the Malayalam community to build itself. But we should understand that we were able to convert only 24 people into Wikipedia editors out of every 1 crore Malayalam speakers. From this we can understand how much more we need to work to grow the community. It will be nice if we are able to make it at least 50 users per crore speakers by the end of this year.
- Another language showing an impressive conversion rate (as we saw in other parameters) is Assamese. I have seen the efforts (especially online efforts) put in by Assamese Wikipedians to build their community. I hope they will be able to increase it further this year.
All other languages show an even lower editor-to-speaker ratio, which shows that we are doing very badly at reaching out to our own speakers.
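The editors-per-crore figure is a simple normalization. The sketch below shows my reading of it, with function names of my own choosing. The only inputs are the rounded Sanskrit figures quoted above (about 50,000 first-language speakers and 2700 editors per crore), so the implied editor count is a back-calculation, not a number from the table.

```python
# Editors per crore speakers: a per-capita rate scaled to 1 crore (10 million).
CRORE = 10_000_000

def editors_per_crore(active_editors: float, speakers: int) -> float:
    """Active editors per 1 crore speakers of the language."""
    return active_editors / speakers * CRORE

def implied_editors(rate_per_crore: float, speakers: int) -> float:
    """Back-calculate the editor count that a given rate implies."""
    return rate_per_crore * speakers / CRORE

# Sanskrit, using the rounded figures from this post.
sanskrit_speakers = 50_000
sanskrit_rate = 2700

print("Implied Sanskrit active editors:",
      implied_editors(sanskrit_rate, sanskrit_speakers))
```

This illustrates why a language with a very small speaker base can show a very high rate: a little over a dozen editors is enough to reach 2700 per crore when there are only about 50,000 speakers.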
In the last report we saw that Hindi, with 78 lakh page views per month, is the most viewed Indic Wikipedia. Here we try to analyze the same information in terms of page views per lakh speakers, to understand the usage of Wikipedia among the speakers of each language.
- Here also Sanskrit leads, with page views almost nine times its speaker base. This is mainly because Sanskrit has many followers among speakers of other languages. Bishnupriya Manipuri and Newari Wikipedias also have more page views than speakers.
- Among other languages of India, Tamil with 9091 page views, Malayalam with 7838 page views, and Nepali with 6962 page views per lakh population come out on top. For all other languages, page views per lakh population are below 5000.
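To put the per-lakh figures on the same footing as the "nine times its speaker base" statement for Sanskrit, they can be converted to page views per speaker. This is a small sketch with function names of my own; it assumes all the figures cover the same monthly period, and it uses only the numbers quoted above.

```python
# Convert between "page views per lakh speakers" and "page views per speaker".
LAKH = 100_000

def views_per_lakh(page_views: float, speakers: int) -> float:
    """Page views per 1 lakh speakers of the language."""
    return page_views / speakers * LAKH

def views_per_speaker(rate_per_lakh: float) -> float:
    """Page views per individual speaker, from a per-lakh rate."""
    return rate_per_lakh / LAKH

# Per-lakh rates quoted in this post.
rates = {"Tamil": 9091, "Malayalam": 7838, "Nepali": 6962}

for language, rate in rates.items():
    print(f"{language}: {views_per_speaker(rate):.3f} page views per speaker")
```

On this scale even the top three living languages are below 0.1 page views per speaker, against roughly nine for Sanskrit.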
Let me end this report with a piece of general advice that comes out of my experience with the Wikimedia movement. Give proper attention to the old and existing articles of Wikipedia; otherwise they will later become a burden for us. This is important for building the community for Wikipedia. To find the small or stub articles in your Wikipedia, you can use the short pages link. For example, Tamil Wikipedia short pages are available here: http://ta.wikipedia.org/w/index.php?title=Special:ShortPages
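The same link pattern works for other language editions by swapping the language code in the host name. Here is a minimal sketch; the helper name `short_pages_url` is mine, and the URL shape is simply copied from the Tamil example above.

```python
# Build the Special:ShortPages link for a given Wikipedia language code,
# following the pattern of the Tamil example above.

def short_pages_url(language_code: str) -> str:
    """Return the short pages URL for the given Wikipedia language code."""
    return f"http://{language_code}.wikipedia.org/w/index.php?title=Special:ShortPages"

# Language codes: ta = Tamil, hi = Hindi, te = Telugu, ml = Malayalam.
for code in ("ta", "hi", "te", "ml"):
    print(short_pages_url(code))
```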
In this report and in the previous one I have presented my views after analyzing the various available and generated data. It would be really nice if other Wikipedians did the same, since they will be able to find many things that I missed.
Thanks for your time.