The diver Tom Daley posted a video on YouTube in which he told his viewers that he has a boyfriend (if he made any more revelations in the last minute or so, I missed them because I stopped watching - he doesn’t half go on). Some delightful person called RogerGT posted a comment saying ‘Of course he’s gay! He has a sibilant ‘s’.’
This is obviously a gross stereotype. While there may be a stereotypical ‘gay’ manner of speech, this is about campness rather than gayness, and Daley does not have camp mannerisms. This is likely to be a post-hoc application of new knowledge causing delusions of prior knowledge. Similarly, an acquaintance of mine said people often tell her they could hear the German in her accent after she reveals her nationality, despite the fact that she sounds entirely English.
More to the point, though, is that everyone’s ’s’ is sibilant: that’s the term for that particular type of ‘hissing’ sound. ‘S’, as normally pronounced, is the canonical example of a sibilant. So, not a good diagnostic for someone’s sexuality then.
What RogerGT is probably referring to is the camp stereotype of ‘lisping’: producing a sound more like ‘th’ /θ/ instead of ‘s’ /s/. Daley does actually do this a tiny bit, though I’d never noticed it before I went back and listened out for it.
So what have we learnt today? Well, from a linguistic point of view, we’ve learnt that people are quick to stereotype or judge people based on their accent, that people don’t use linguistic terminology right (jk we all knew that already), and that people are really extraordinarily good at spotting even tiny differences in the production of speech sounds.
Some tangential notes: not only are we very sensitive to variation in speech sounds, we’re also quick to attach social meaning to these features. Sometimes communities adopt features as markers of in-group identity; sometimes a group uses markers to identify people outside the group. This is natural, normal, and not something we can judge from a moral standpoint.
What we can do, however, is double-check why we attribute certain features to certain groups, and whether those assumptions are harmful. Sometimes they’re neutral, and that’s fine. But when you hear a speech feature you attribute to a certain group, it’s always best to check where that attribution comes from, and whether the association is harmful or neutral. You know, “check your linguistic privilege”, like tumblr has taught us.
Hi, new followers! I’m suddenly inundated with alerts, so I must ask, where are you all coming from?
So I found this site today, which procedurally generates headlines. It’s a great way to waste a few [dozen] hours, and occasionally produces results like “UPDATED: There Are No Vampires at the Denver International Airport” or (my favorite) “It’s Time to Get a Rental Boyfriend”.
So how does it work?
Well, they don’t say, so I’m making assumptions here based on an analysis of the headlines it’s produced. But I’d be very surprised if I was wrong. It looks like Headline Smasher works through an algorithm called “Dissociated Press”, assembling text chains from a corpus gathered from the sites you pick in the sidebar. Dissociated Press is one of my favorite applications of computer science. It’s fun to play with, can eat up hours of your life, and is occasionally actually useful if your imagination is temporarily out of service.
Procedural generators based on Dissociated Press work like this: First, you feed in a corpus of text. The analysis phase of the program iterates over every item (it can operate on a character level or a word level; the character level is great for creating new words that kinda-sorta obey the phonotactics of the source language and the word level is great for producing headlines). It adds item n to a database, then moves on to item n+1, and adds a link to n+1 to the entry for n.
In the generation phase, the program picks a random entry from the database, prints it, makes some kind of determination as to whether to terminate, and if not, starts over, picking the next word from the list attached to the previous word.
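The two phases described above amount to a first-order Markov chain over the corpus. Here’s a minimal word-level sketch in Python; the corpus, the names, and the termination rule (a simple length cap) are my own placeholders, not anything Headline Smasher documents:

```python
import random
from collections import defaultdict

def analyze(corpus_words):
    """Analysis phase: for each item n, record which items follow it."""
    links = defaultdict(list)
    for n in range(len(corpus_words) - 1):
        links[corpus_words[n]].append(corpus_words[n + 1])
    return links

def generate(links, max_len=12):
    """Generation phase: pick a random entry, then keep following
    links until we hit the length cap or a dead end."""
    word = random.choice(list(links))
    out = [word]
    while len(out) < max_len and links[word]:
        word = random.choice(links[word])
        out.append(word)
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the cat".split()
links = analyze(corpus)
print(generate(links))
```

The character-level variant is the same code run over `list(text)` instead of `text.split()`.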
Conceptually, it’s extremely elegant. Implementation-wise, not so much; unless you’re using a high-level language with support for strings and hash maps, expect to either have a cartoonishly inefficient program or spend several hours working out the best data structures to use and implementing them. This is one of the few circumstances when even I’d probably resort to using Ruby, although I wrote a shitty, segfaulty Dissociated Press program in C++ in my incredibly boring and painful astronomy class last year out of a desperate need for survival, and I probably wouldn’t trust or distribute any implementation I hadn’t carefully put together by hand in C.
So why does this work so well for language generation? Well, syntax essentially works by constructing phrases according to set syntactic rules. Let’s take a language where the following things are grammatical utterances:
- Slumbat plót.
- Bulgrim plót.
- Dirgól plótag.
- Slumbat bo Bulgrim plótun.
- Slumbat sterka plót.
- Slumbat krenka zí sterka plót.
- Slumbat si plongo sterka plót.
- Slumbat si sterka plót.
- Sterka creltar.
- Sterka bo plongo creltuna.
- Slumbat ning Dirgól sterka plótag pungot.
And the following are ungrammatical:
- *Dirgól plót.
- *Slumbat zi sterka plót.
- *Dirgól plótag bo Bulgrim.
- *Sterka crelt.
- *Sterka bo plongo creltun.
There’s a lot going on here. But based on this data, it’s fair to make these assumptions about the grammar:
- SOV order. Potentially also OSV, but given that proper nouns seem to show up clause-initially pretty often, it’s reasonable to assume for now that they’re agents. [(2), (4), (6)] (Note that even if we didn’t know which words were proper nouns, we could still guess from gender agreement and from “Slumbat”, which is an Austronesian name, and so quite possibly not a native of whatever culture produced this language, since this language is demonstrably not Austronesian.)
- At least three grammatical genders, possibly male/female/neuter or male/female/inanimate; or depending on the species, possibly something like queen/worker/inanimate. [(2), (4), (17)]
- Nominative-accusative alignment - this is a bit trickier, but the verb conjugates for gender, and conjugates the same in (2) and (6), even though Slumbat and sterka are different genders.
- It has some level of recursion. [(12)] This is normally a given for human languages, but Pirahã is believed by some to lack recursion, so it’s worth keeping in mind. Also, this might not be a human language!
With these out of the way, we can start assembling some syntax rules. Parentheses indicate that an element is optional.
- S → (NP) VP [evident in (1), (2), (6)]
- NP → NP (CC NP)
- | NP → Comp S
- | NP → PN
- VP → (AdP) (NP) V
- AdP → PrP | PoP
- PrP → Pr NP
- PoP → NP Po
(Here S means “clause”, N “noun”, NP “noun phrase”, PN “Proper Noun”, V “verb”, VP “verb phrase”, Comp “complementizer,” CC “coordinating conjunction”, AdP “adpositional phrase”, Pr “preposition”, Po “postposition”, PrP “prepositional phrase”, and PoP “postpositional phrase”. | marks alternatives).
The individual lexemes would then be divided up among the categories NP, N, V, Pr, Po, CC, and Comp.
If the flu hasn’t eaten the part of my brain that does linguistics, that should be a complete syntax for that language (we’ll call it West Smoppian), at least based on these examples. Note that there are other ways we could analyze this, and they might be more or less correct. You’d have to amass and analyze a significant West Smoppian corpus to make more definitive judgments about the syntax.
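A Smoppian speaker filling in these rules can be mimicked with a toy recursive generator. This sketch uses a simplified subset of the rules above (plus an NP → N rule for bare nouns), and the lexical categories are my own guesses from the example sentences: sterka and plongo as nouns, bo as the conjunction, plót as the verb. Agreement morphology (plót/plótag/plótun…) is ignored entirely:

```python
import random

# A simplified subset of the West Smoppian rules. Category assignments
# are guesses from the example corpus, and agreement is ignored.
rules = {
    "S":  [["NP", "VP"]],
    "NP": [["PN"], ["N"], ["NP", "CC", "NP"]],
    "VP": [["NP", "V"], ["V"]],
}
lexicon = {
    "PN": ["Slumbat", "Bulgrim", "Dirgól"],
    "N":  ["sterka", "plongo"],
    "CC": ["bo"],
    "V":  ["plót"],
}

def expand(symbol, depth=0):
    """Recursively rewrite a symbol until only lexemes remain."""
    if symbol in lexicon:
        return [random.choice(lexicon[symbol])]
    # Past a small depth, drop the recursive NP -> NP CC NP option
    # so that generation always terminates.
    options = rules[symbol] if depth < 3 else rules[symbol][:2]
    out = []
    for sym in random.choice(options):
        out.extend(expand(sym, depth + 1))
    return out

print(" ".join(expand("S")) + ".")
```

Because every VP rule ends in V, everything this generator produces is verb-final, matching the SOV analysis.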
An aside: All it takes is one piece of data to completely upset the syntax analysis. Say we add the following to our Smoppian corpus:
- *Slumbat sterka bo plongo plóte.
- *Slumbat sterka vu plongo plót.
- Slumbat sterka vu plongo plóte.
Everything instantly falls apart. “Bo” doesn’t function the way we thought it did. There are a couple ways we could analyze this, but the simplest is that West Smoppian uses a different coordinating conjunction in certain circumstances. We’d need more data to tell what those circumstances are, but a reasonable guess is that “bo” is used in the subject NP and “vu” is used in the object NP. We know it’s not gender, because “sterka bo plongo” is attested in the subject position. There’s also an indication that verbs mark for object number at the very least.
Anyway. A speaker of West Smoppian (I’d say a headline writer, but West Smoppians don’t have newspapers; they have shouting and smoke signals) produces hir language by filling in these rules with the appropriate words or phrases. So if the subject is Slumbat, the object is sterka, and the action is plót-ing, the Smoppian will say “Slumbat sterka plót.”
Here is where the problems arise. Humans (even West Smoppians, when they’re not too drunk on cactus vodka) are capable of recursion, but DP is strictly linear in the generation phase, with no recursion to speak of. This is evident in the output. Syntax is extremely recursive, so the sentence “Bob and Jane attacked and devoured the convoy of merchants and pilgrims” breaks down into nested phrases, roughly [S [NP Bob and Jane] [VP [V attacked and devoured] [NP the convoy [PP of [NP merchants and pilgrims]]]]]. (I think. Please please please tell me if I got something wrong, because I’m still kind of new at syntax trees.)
Dissociated Press, however, doesn’t see any of that. It just reads the whole sentence as one flat structure. It doesn’t have the rich syntax system that humans have for our languages, it just has a list of words and potential links between them. So if you feed it that English sentence, you can and most certainly will get results like “Bob and devoured,” “Jane attacked and Jane,” and the like. And the words “devoured,” “the,” “convoy,” “of,” and “merchants” will always be treated as a unit, as there are no other contexts in which they appear, though they might get truncated if the algorithm decides to stop before reaching the end of the chain.
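The flat view is easy to produce directly. Listing the links DP would extract from that one sentence shows that “and” is the only branch point, which is exactly where the grammar-breaking splices happen (a word-level sketch, ignoring sentence boundaries):

```python
from collections import defaultdict

sentence = ("Bob and Jane attacked and devoured "
            "the convoy of merchants and pilgrims").split()

# The only structure DP sees: each word and the words that may follow it.
links = defaultdict(list)
for a, b in zip(sentence, sentence[1:]):
    links[a].append(b)

print(links["and"])       # -> ['Jane', 'devoured', 'pilgrims']
print(links["devoured"])  # -> ['the'], part of an unbreakable chain
```

Any generation pass that lands on “and” can jump between those three continuations, which is how you get “Bob and devoured” and friends.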
You also won’t get a huge number of variations like “Jane devoured the merchants and attacked Bob and the pilgrims,” because there’s no path from the input data to this output. DP generally has to compensate for this with very large corpora, which alleviates the problem to an extent, but as long as there are still points of overlap like conjunctions, DP will never be limited to producing things that are grammatical.
An interesting implication is that the grammar of the source language can hugely affect the potential results of a DP engine. In an SVO language like English, the verb will be determined by the subject, and the object by the verb. But in an SOV language, the subject will determine the object and the object or the subject will determine the verb. And of course, DP works better for languages with less inflection: the less inflection the words have, the more grammatical the DP results will be.
Linguistics people! I am doing research on Multiple Modal constructions and would greatly appreciate responses to this survey. It only takes 5ish minutes and it is 99% painless :) Many thanks!