Large corpus Lolcat research
Aug. 29th, 2014 09:34 pm
A question arose in comments to a previous (friends-locked) post. I'm going to paraphrase it as asking which of these is standard Lolcat:
a. I R SRS [NP1], THIS R SRS [NP2]
b. I IZ SRS [NP1], THIS IZ SRS [NP2]
My instinct was to go for a., and diatom was going for b. I realized this was empirically testable, because we have access to a massive Lolcat dataset (aka "Google"). Wondering is not a virtue in this situation! Here's what I found:
| Phrase | Google hits |
| --- | --- |
| I R SRS | 9550 |
| I IZ SRS | 6460 |
| THIS R SRS | 62100 |
| THIS IZ SRS | 9820 |
That's an R/IZ ratio of 1.48 for the first pair, and 6.32 for the second.
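For anyone who wants to check the arithmetic, here is a minimal Python sketch of how those ratios fall out of the hit counts. The dictionary layout and the `r_iz_ratio` helper are just illustrative names, not part of any real tool, and the counts are the Google estimates from the table above.

```python
# Hit counts copied from the whole-web Google search table above.
hits = {
    "I R SRS": 9550,
    "I IZ SRS": 6460,
    "THIS R SRS": 62100,
    "THIS IZ SRS": 9820,
}


def r_iz_ratio(counts, subject):
    """Return hits for '<subject> R SRS' divided by hits for '<subject> IZ SRS'."""
    return counts[f"{subject} R SRS"] / counts[f"{subject} IZ SRS"]


for subject in ("I", "THIS"):
    print(f"{subject}: R/IZ = {r_iz_ratio(hits, subject):.2f}")
```

A ratio above 1 means the 'R' variant got more hits than the 'IZ' variant for that subject.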
At first glance, it looks like the 'R' variant is somewhat more common. However, the full Google search might not be a good test, since Google's advanced search doesn't let you restrict results to Lolcat. So some of these hits are probably English speakers borrowing what they think is a catchy Lolcat phrase, regardless of whether an actual L1 Lolcat speaker would ever utter it.
So I decided to not only deliberately restrict it to Lolcat, but to break it out by register. I think we can agree that cheezburger.com is a good source for Standard Lolcat — the written equivalent to the kind you'd hear on TV.
Here are the results after adding site:cheezburger.com to the search:
| Phrase | cheezburger.com hits |
| --- | --- |
| I R SRS | 68 |
| I IZ SRS | 13 |
| THIS R SRS | 16 |
| THIS IZ SRS | 14 |
Those are dramatic results. Now we see an R/IZ value of 5.23 for the first pair and 1.14 for the second. Given the low N and small ratio, the "THIS" phrases seem like a statistical tie, and the first pair has the lopsided ratio here. The high ratio of "I R" to "THIS R" might suggest that the original example a. is a colloquial catchphrase or just isn't idiomatic in Standard Lolcat. I'm not sure how to analyze that — it would be nice to have a larger corpus here.
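One rough way to put a number on "statistical tie" is an exact two-sided binomial test against a 50/50 split. The sketch below assumes each hit is an independent coin flip between R and IZ, which search hits almost certainly are not, so treat the p-values as a sanity check and nothing more. The function name `binomial_two_sided_p` is made up for this example.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials when p = 0.5.

    Sums the probability of every outcome that is no more likely than k.
    """
    pmf = [comb(n, i) / 2**n for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(p for p in pmf if p <= threshold))


# (R hits, IZ hits) from the site:cheezburger.com search above.
site_hits = {"I": (68, 13), "THIS": (16, 14)}

for subject, (r, iz) in site_hits.items():
    p = binomial_two_sided_p(r, r + iz)
    print(f"{subject}: R={r}, IZ={iz}, p={p:.3g}")
```

Under that (generous) assumption, 16 versus 14 is entirely consistent with a coin flip, while 68 versus 13 is not.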
So, that's Standard Lolcat. However, we also have a very nice corpus of Literary Lolcat in the form of the Lolcat Bible. If we restrict the search to lolcatbible.com, though, we get only one instance, in the sentence "i r srs huzband." from 1 Samuel 1:8.
Well, maybe this is unfair, since a lot of the Bible is in the past tense. So if we search for the word "srs", do we get any other instances at all of the form "[Pronoun] [DO+TNS] SRS [NP]"? I would only judge Acts 25:7 to count ("Paul came n Joos from Jooroosulum were srs cat and charged Paul n sed he wuz bad."). Literary Lolcat seems to have much more varied syntax than Standard, and consequently it's harder to find multiple instances of any given n-gram in there. I think we'd need a larger corpus to draw any conclusions about Literary Lolcat.
Anyway, to repeat my disclaimer: THIS R NOT SRS BLOG. THIS R NOT SRS RESIRCH.