Saturday, February 2, 2008

The TIOBE index is meaningless

The TIOBE index ranks programming languages. It claims to be based "on the world-wide availability of skilled engineers, courses and third party vendors". But how can they reliably and automatically mine such information using just search engine results?

Actually, not only is their data not very reliable, but it is also prone to "spamming", because search engines are! And this is why we see a totally obscure experimental Forth-like language such as "Factor" get into the top 50. The obvious explanation: the TIOBE index is simply a combination of the number of results of some search queries at major search engines; as a handful of people regularly post articles about Factor at social bookmarking sites such as Reddit or at Wikipedia, this artificially inflates its position.
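To make the mechanism concrete, here is a minimal sketch of a hit-count ranking. It assumes the score is just a language's average share of hits across a few search engines; that is my reading of how such an index could work, not TIOBE's actual formula. The engine names, language names and hit counts are all made up for illustration.

```python
def hit_share_ranking(hits_by_engine):
    """Rank languages by their average share of search hits across engines.

    hits_by_engine maps engine -> {language: hit_count}.
    This is an assumed scoring rule, not TIOBE's published one.
    """
    languages = {lang for counts in hits_by_engine.values() for lang in counts}
    scores = {}
    for lang in languages:
        shares = []
        for counts in hits_by_engine.values():
            total = sum(counts.values())
            shares.append(counts.get(lang, 0) / total if total else 0.0)
        scores[lang] = sum(shares) / len(shares)
    # Highest average share first; ties broken alphabetically.
    return sorted(scores, key=lambda lang: (-scores[lang], lang))


# Purely illustrative numbers, not real search results.
baseline = {
    "engine_a": {"established": 900, "niche": 100, "hyped": 50},
    "engine_b": {"established": 800, "niche": 120, "hyped": 40},
}

# The same data after a few enthusiasts add 200 pages mentioning "hyped".
inflated = {
    engine: {**counts, "hyped": counts["hyped"] + 200}
    for engine, counts in baseline.items()
}

before = hit_share_ranking(baseline)
after = hit_share_ranking(inflated)
print(before)
print(after)
```

With these invented numbers, the "hyped" language overtakes the "niche" one without a single extra user, course or vendor: only the number of pages mentioning it changed.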

The other explanation is that Factor is legitimately getting a lot of web attention. But that's absurd, since it doesn't deserve any serious attention. I mean, it is on the same level as Brainfuck. Brainfuck is interesting to programming language geeks. Factor can be interesting to Forth geeks, or compilation geeks. But that's not what TIOBE is about.

In the real world, there is no Factor. It is just a virtually unknown obscure experimental language with a small fandom that managed to get into a mostly meaningless index. You want proof?

There is not a single scholarly article about it, not a single PhD thesis about it, actually not a single known application written in Factor, and not a single school giving courses in Factor; in fact, Factor isn't even in the Debian distribution, while Brainfuck, which is also an obscure language, is. How many people in the world are paid to write Factor code?

But then it could be that Factor is the language of the future, and TIOBE is very good at picking languages of the future?

It seems that TIOBE is just very good at picking up on spamming efforts. Consider the following important languages, which are not in the top 50.

Let's show that the rankings at the TIOBE index do not map to language importance according to any criteria other than web hype:
  • VHDL, an industry-standard hardware description language, is not even in the top 50. Verilog isn't even mentioned on the TIOBE page.
  • Ocaml is a well-known, academically developed, state-of-the-art functional language that has been around for ten years (and much more if you count its direct ancestor Caml). Typing ocaml OR "objective caml" OR caml at Google Scholar returns about ten thousand results. Ocaml is also used as a language in 173 Debian packages, of which 40 are end-user applications (i.e., not dependencies). Ocaml has thousands of users, is taught at hundreds of schools, and has Intel, Dassault Systems and Microsoft in its consortium. F# is an Ocaml derivative for .NET. Yet, Ocaml is not in the top 50, while the obscure Factor is. This simply means that the TIOBE metric is absolutely meaningless.
  • Actually there is an ML at position 42, but which ML is that? SML? XML? HTML? YAML? Whichever it is, it doesn't include Ocaml, since Ocaml is listed separately elsewhere.
Languages which legitimately have buzz around them include Scala, which is academically developed and has many posts about it at Reddit. Still, it is not in the top 50.

The other languages cited in the top 50 are usually vendor-specific languages of products that have some momentum; for many of those languages, knowledge of the language is indistinguishable from knowledge of the particular software product. And what the hell is PL fucking I doing in a 2008 list of the top 50 languages?

So, while obscure experimental languages and vendor-specific scripting frameworks clutter the top 50 list, industrially and academically important real-world languages such as VHDL, Verilog or Ocaml are relegated to the end or not mentioned at all.