Saturday, February 2, 2008

The TIOBE index is meaningless

The TIOBE index ranks programming languages. It claims to be based "on the world-wide availability of skilled engineers, courses and third party vendors". But how can they reliably and automatically mine such information using just search engine results?

Actually, not only is their data not very reliable, it is also prone to "spamming", because search engines are! This is why we see a totally obscure experimental Forth-like language such as "Factor" get into the top 50. The most likely explanation is that the TIOBE index is simply a combination of the number of results of some search queries at major search engines; a handful of people regularly post articles about Factor at social bookmarking sites such as Reddit or at Wikipedia, and this artificially inflates its position.
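
To make the mechanism concrete, here is a minimal sketch in Python. It assumes the index is nothing more than each language's share of search hits, averaged over a few engines; that formula is an assumption for illustration, not TIOBE's published method, and the engine names and hit counts are invented.

```python
# Hypothetical model: rating = average share of search hits across engines.
# The formula, the engine names and all hit counts are invented for illustration.

def ratings(hits_by_engine):
    """Return each language's hit share (in percent), averaged over engines."""
    n_engines = len(hits_by_engine)
    scores = {}
    for hits in hits_by_engine.values():
        total = sum(hits.values())
        for lang, count in hits.items():
            scores[lang] = scores.get(lang, 0.0) + 100.0 * count / total / n_engines
    return scores

def ranking(hits_by_engine):
    """Return the languages sorted from highest to lowest rating."""
    scores = ratings(hits_by_engine)
    return sorted(scores, key=lambda lang: scores[lang], reverse=True)

# Made-up hit counts before a small group starts posting about Factor.
before = {
    "engine_a": {"Java": 1_000_000, "VHDL": 3_000, "Factor": 2_000},
    "engine_b": {"Java": 800_000, "VHDL": 2_500, "Factor": 1_500},
}

# The same counts after 2,000 extra pages mentioning Factor get indexed.
after = {
    engine: {lang: count + (2_000 if lang == "Factor" else 0)
             for lang, count in hits.items()}
    for engine, hits in before.items()
}

print(ranking(before))
print(ranking(after))
```

Under these assumptions, a couple of thousand extra pages are enough to move Factor ahead of VHDL, because near the bottom of the list the differences between languages are tiny compared with the totals.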

The other explanation is that Factor is legitimately getting a lot of web attention. But that's absurd, since it doesn't deserve any serious attention. I mean, it is on the same level as Brainfuck. Brainfuck is interesting to programming language geeks. Factor can be interesting to Forth geeks, or compilation geeks. But that's not what TIOBE is about.

In the real world, there is no Factor. It is just a virtually unknown obscure experimental language with a small fandom that managed to get into a mostly meaningless index. You want proof?

There is not a single scholarly article about it, not a single PhD thesis about it, actually not a single known application written in Factor, and not a single school giving courses in Factor; in fact, Factor isn't even in the Debian distribution, while Brainfuck, which is also an obscure language, is. How many people in the world are paid to write Factor code?

But then it could be that Factor is the language of the future, and TIOBE is very good at picking languages of the future?

It seems that TIOBE is just very good at picking up spamming efforts. Consider the following important languages, which are not in the top 50.

Let's show that the rankings at the TIOBE index do not map to language importance according to any criteria other than web hype:
  • VHDL, an industry-standard hardware description language, is not even in the top 50. Verilog isn't even mentioned on the TIOBE page.
  • Ocaml is a well-known, academically developed state-of-the-art functional language that has been around for ten years (and much more if you count its direct ancestor Caml). Typing ocaml OR "objective caml" OR caml at Google Scholar returns about ten thousand results. Ocaml is also used as a language in 173 Debian packages, of which 40 are end-user applications (i.e., not dependencies). Ocaml has thousands of users, is taught at hundreds of schools, and has Intel, Dassault Systèmes and Microsoft in its consortium. F# is an Ocaml derivative for .NET. Yet, Ocaml is not in the top 50, while the obscure Factor is. This simply means that the TIOBE metric is absolutely meaningless.
  • Actually there is an ML at position 42, but which ML is that? SML? XML? HTML? YAML? Whichever it is, it doesn't include Ocaml, since that is mentioned elsewhere.
Languages which legitimately have buzz around them include Scala, which is academically developed and has many posts about it at Reddit. Still, it is not in the top 50.

The other languages cited in the top 50 are usually vendor-specific languages of products that have some momentum; for many of those languages, knowledge of the language is indistinguishable from knowledge of the particular software product. And what the hell is PL fucking I doing in a 2008 list of the top 50 languages?

So, while obscure experimental languages and vendor-specific scripting frameworks clutter the top 50 list, industrially and academically important real-world languages such as VHDL, Verilog or Ocaml are relegated to the end or not mentioned at all.

8 comments:

Unknown said...

This is an incredibly weak, and probably personal-peeve-motivated, attack rather than a serious article. As with all polls, ratings and rankings, the accuracy decreases as you go further down the list as the margins of difference approach a margin of error. Quibbling about something not being in the top 50 is just stupid. There are lots of problems with the TIOBE rankings (to mention just one, the failure to exclude "Java Script" from Java's results).

Maybe they should put in some mention of potential margins of error, but then they'd lose the publicity they get from moronic fanboys on Reddit I guess.

semmelweis said...

"As with all polls, ratings and rankings, the accuracy decreases as you go further down the list as the margins of difference approach a margin of error."


It's not presented as a simple Google fight, but as a serious index whose complete version sells for $1,500; but it has none of the characteristics of a semi-serious poll or survey. Verilog isn't even talked of, VHDL isn't ranked, and ridiculous languages make the top 50 because their proponents focused on the exact search terms used by the index.

Nikhil said...

I totally agree with your gripe with the TIOBE rankings, but I'm disappointed with your picking on Factor. It is not something like Brainfuck. It is a usable, powerful language, with a growing number of standard libraries. Even if it may not dominate, it is not a Turing tarpit.

semmelweis said...

@nikhil:
"I'm disappointed with your picking on Factor"

Well they brought it on themselves by their constant marketing efforts.

"It is a usable, powerful language, with a growing number of standard libraries."
It is an untyped, hype-based Forth derivative with no interesting features and which addresses none of the problems of existing programming languages; no one sane would develop anything serious in it. No one uses it, and no one will use it. It is a toy, and I don't say it's a bad thing to have toys; but it has no place in a list of real languages. It is also irritating because its proponents are incessantly marketing it.

Mark Lee Smith said...

I have to agree, though I would have left out my personal gripes. It just detracts from the real message.


I think I realized TIOBE's results were utter shit when I noticed that Objective-C barely ranks. That seems very strange since it's the principal language behind OS X (Desktop, Server and Mobile) and Mac applications in general.

Apple's market share isn't huge, but it's more than enough to make Objective-C count.


Mark.


Oh yeah, and Factor outranks Objective-C... lol.

semmelweis said...

Well I sound quite like a Factor-hater.
Experimental languages are nice, experimenting is nice, and tinkering is nice. Factor is certainly an entertaining learning experiment for its participants. But they must not buy into their own hype and commit the folly of developing serious applications in it. This kind of mistake can be very, very costly, and I know what I'm talking about. Now I don't know who, but some fanboys are hyping that thing as if it was a language meant for real development. That is very wrong, and it's that fanboy buzz that drives me crazy. You can hack a language all you want, but you don't market it to innocent people who might buy it and actually use it for something real.

Dejan Lekić said...

I think the story is much more complex than how you wanted to present it. I think the Factor language got into the top 50 so quickly because TIOBE could not distinguish the language from the famous "X Factor" TV show.
Sure, I might be wrong here, but this is what I suspect is the reason why Factor got into the top 50 so fast.
VHDL is indeed a standard language, but it is hardly believable that programmers talk more about VHDL on the Net than about other (read: top 50) TIOBE languages.
TIOBE is all about popularity, talks, discussions, blogs, rumors, etc. This is why Visual Basic is so high: there is a lot of talk about it. Same with Java and C#.

Justin George said...

I'm going to make a language called "sex", which will automatically be the top language of all time, forever. It'll be a Scheme derivative; "SEX" stands for "SEX equals Xtreme".