Teaching Robots to Feel: Emoji & Deep Learning 👾 💭 💕

Recently, neural networks have become the tool of choice for a variety of tough computer-science problems: Facebook uses them to identify faces in photos, Google uses them to identify everything in photos, Apple uses them to figure out what you’re saying to Siri, and IBM uses them for operationalizing business unit synergies.

It’s all very impressive. But what about the real problems? Can neural networks help you find the 💯 emoji when you really need it?

Why, yes. Yes they can. 😏

This post will outline some of the engineering behind Dango: how we automatically learn from hundreds of millions of real-world uses of emoji, and how we distill that knowledge into a tool small and fast enough to predict emoji for you in real time on your phone. 📱 💭 💡 📲 🍡

What is Dango?

Dango is a floating assistant that runs on your phone and predicts emoji, stickers, and GIFs based on what you and your friends are writing in any app. This lets you have the same rich conversations everywhere: Messenger, Kik, WhatsApp, Snapchat, whatever. (Just making this possible in every app is an engineering challenge of its own, but that’s another story.)

A conversation with Dango

Suggesting emoji is hard: it means Dango needs to understand the meaning of what you’re writing in order to suggest emoji you might want to use. At its core, Dango’s predictions are powered by a neural network. Neural nets are computational structures with millions of adjustable parameters connected in ways that are loosely inspired by neurons in the brain.

A neural network is taught by randomly initializing these parameters and then showing the network millions of real-world examples of emoji use taken from across the web, like Hey how’s it going 👋, Want to grab a 🍻 tonight?, Ugh 😡, and so on. At first the network just guesses randomly, but over time with each new training example, it slightly adjusts its millions of parameters so it performs better on that example. After a few days on a top-of-the-line GPU, the network starts outputting more meaningful suggestions:

Want to grab a drink tonight? 🍹 🍺 🍷 🍸 😁

Things we’ve learned about emoji

The data-driven approach to emoji prediction means that Dango is smarter about emoji than we are. Dango has taught us new slang, and inventive ways that people around the world tell stories with emoji.

For instance: if you write “Kanye is the”, Dango will predict the 🐐 emoji. This goat of course represents Greatest of All Time (G.O.A.T.), a title Kanye bestowed upon himself earlier this year.

Dango can express things that aren’t represented by any single emoji. For instance, if you’re a resident of B.C. or Colorado and enjoy “relaxing”, Dango speaks your language:

420 tonight? 😙 💨 🚬 🍁

If you’re mad at someone and just want them to GTFO, Dango will helpfully show them the door:

GTFO 👉 🚪 👈

Dango has also learned plenty from internet culture. It understands memes and trends. For instance, if you’ve seen the “but that’s none of my business” image of Kermit the Frog sipping tea:

Kermit the frog, sitting by the window happily lifting a mug of tea to his eager lips. The light catches in the beautiful ochre of the liquid for a moment. A glimpse of the sublime in the quotidian.

but that’s none of my business 🐸 ☕

There are many other subtle references and jokes that Dango understands, and it’s always learning to make sure that it keeps up to date:

Beyoncé 👑 🐝 (she’s the Queen Bee).
Pizza rat 🐭 🍕 ayy lmao 👽

And certainly many we’ve not yet discovered.

More than just emoji

Given that Dango is trained on emoji, it might at first seem that the number of concepts it can understand and represent is small: as of this writing, the Unicode Consortium has standardized 1,624 emoji, which, despite being a headache for font designers, is still a relatively small number.

However, this doesn’t mean that there are only 1,624 meanings. When you use emoji, their meaning is determined by how they look and by the context of their usage, which can be highly diverse. 🙏 can mean “high-five” or “thank you” or “please”. 🍆 can mean… eggplant, exclusively. What’s more, emoji can be combined to express new concepts. 😙 is a kissing face, but 😙🎶 is whistling, and 😙💨 is exhaling smoke. These emoji combos can become quite elaborate:

at the dentist 😷 💉 😬
stuck in traffic 🚦 🚗 🚕 🚙

All this means that the number of semantic concepts that Dango can represent is much greater than simply the number of individual emoji. This is a powerful concept, because it gives Dango a way of understanding a wide variety of general concepts, regardless of whether the Consortium has recognized them with their own symbols.

Dango is therefore also able to suggest stickers and GIFs. Since, as shown earlier, Dango knows about get out 👉 🚪 👈, it can suggest this GIF for you as well:

get out
Man autodefenestrating with gusto

Going deeper

Let’s dig a little deeper into how that works.

A naïve approach to suggesting emoji (and the approach we first tried with Dango) would be to directly map some words to emoji: pizza 🍕, dog 🐶, etc. But this approach is limited and doesn’t reflect how emoji (and language) are actually used.
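Here’s a minimal sketch of that first approach (the lookup table, tokenization, and function names are illustrative, not Dango’s actual implementation):

```python
# A hypothetical word-to-emoji lookup table.
WORD_TO_EMOJI = {
    "pizza": "🍕",
    "dog": "🐶",
    "beer": "🍺",
}

def naive_suggest(text):
    """Suggest emoji by looking up each word independently."""
    return [WORD_TO_EMOJI[w] for w in text.lower().split() if w in WORD_TO_EMOJI]

print(naive_suggest("Want to grab pizza tonight?"))  # ['🍕']
print(naive_suggest("My girlfriend left"))           # [] -- no single word maps to 💔
```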

There are many examples where a subtle combination of words determines meaning in a way that is impossible to concisely describe with a simple mapping.

My girlfriend left 💔
you got it 👌✊
you know it 😏
he’s the one ❤ 👫
She said yes! 😍 💍 🙌

To handle these cases, Dango uses a recurrent neural network (RNN). An RNN is a particular neural network architecture that is well suited to sequential input, and is therefore used in areas as diverse as natural language processing, speech processing, and financial time-series analysis. We’ll quickly go over what an RNN is at a high level here, but for a more in-depth introduction take a look at Andrej Karpathy’s great overview.

RNNs handle sequential input by maintaining an internal state, a memory that lets them keep track of what they saw earlier. This is important for telling the difference between I’m very happy 😊 😄 😃 and I’m not very happy 😔 😞 😒.
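Here’s a toy sketch of that recurrence, with made-up dimensions and random stand-in word vectors (Dango’s actual network is larger and more sophisticated):

```python
import numpy as np

# Toy sizes, for illustration only.
input_dim, hidden_dim = 8, 16
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(input_dim, hidden_dim))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights: the "memory"
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """One step of a vanilla RNN: the new state mixes the current word
    with the state accumulated from all earlier words."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Stand-in vectors for a five-word sentence.
sentence_vectors = [rng.normal(size=input_dim) for _ in range(5)]

h = np.zeros(hidden_dim)      # empty memory before the sentence starts
for x_t in sentence_vectors:
    h = rnn_step(x_t, h)      # h now reflects every word seen so far
```

Because h is threaded through every step, the state changes the moment the network reads “not”, so “very happy” gets processed in a different context.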

Multiple RNNs can also be stacked on top of each other: each RNN layer takes its input sequence and transforms it into a new, more abstract representation that is then fed into the next layer, and so on. The deeper you stack these networks, the more complex the functions they can represent. Incidentally, this is where the now-popular term “deep learning” comes from. Major breakthroughs on hard problems like computer vision have come partly from simply using deeper and deeper stacks of network layers.
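Continuing the sketch above, stacking just means running one layer over the whole sequence and handing its outputs to the next layer as a new input sequence (a real second layer would have its own weights, sized for hidden_dim inputs):

```python
def run_layer(inputs, step_fn, dim):
    """Run one RNN layer over a sequence, returning the state at every
    position; this becomes the input sequence for the next layer."""
    h = np.zeros(dim)
    outputs = []
    for x_t in inputs:
        h = step_fn(x_t, h)
        outputs.append(h)
    return outputs

layer1_out = run_layer(sentence_vectors, rnn_step, hidden_dim)
# A second layer, with its own weight matrices, would consume layer1_out.
```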

Dango’s neural network ultimately spits out a list of hundreds of numbers. This list can be interpreted as a point in a high-dimensional space, just as a list of three numbers can be interpreted as the x-, y-, and z-coordinates of a point in three-dimensional space.

We call this high-dimensional space semantic space: think of it as a multi-dimensional grid where various ideas exist at various points. In this space, similar ideas are close together. Deep learning pioneer Geoff Hinton evocatively refers to points in this space as “thought vectors”. What Dango learned during the training process was how to convert both natural-language sentences and emoji into individual vectors in this semantic space.

So when Dango receives some text, it maps it into this semantic space. To decide which emoji to suggest, it then projects each emoji’s vector onto this sentence vector. Projection is a simple operation that gives a measure of similarity between two vectors. Dango then suggests the emoji with the longest projections: these are the ones closest in meaning to the input text.
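Here’s roughly what that ranking step looks like, with random stand-in vectors (in reality, both the sentence vector and the emoji vectors come out of the trained network; the names here are hypothetical):

```python
import numpy as np

def suggest_emoji(sentence_vec, emoji_vecs, top_k=5):
    """Rank emoji by the length of their projection onto the sentence
    vector: a dot product scaled by the sentence vector's norm."""
    norm = np.linalg.norm(sentence_vec)
    scores = {e: np.dot(v, sentence_vec) / norm for e, v in emoji_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Stand-in 300-dimensional vectors for a handful of emoji.
rng = np.random.default_rng(1)
emoji_vecs = {e: rng.normal(size=300) for e in "🍕🐶💔😂🙏"}
print(suggest_emoji(rng.normal(size=300), emoji_vecs, top_k=2))
```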

Visualizing semantic space 😮💭🌌

For those of us who are visual thinkers, this spatial metaphor is a powerful tool to help us intuit and talk about neural networks. (At Whirlscape we are addicted to spatial metaphors; see our earlier post about the algorithms of the Minuum keyboard.)

To help us visualize Dango’s semantic space, we can use a popular technique for visualizing high-dimensional spaces called t-distributed stochastic neighbour embedding, or t-SNE. This technique tries to place each high-dimensional point into two dimensions in such a way that points that were close to each other in the original space remain close in the two-dimensional space. Although this mapping will be imperfect, it can still tell us a lot. Let’s use t-SNE to visualize the emoji floating in semantic space:
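If you’d like to try this yourself, here’s a sketch using scikit-learn’s t-SNE implementation, assuming an emoji_vecs table like the one in the ranking sketch above, but covering the full emoji set:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emojis = list(emoji_vecs)                       # assumed: one entry per emoji
X = np.stack([emoji_vecs[e] for e in emojis])   # one row per emoji

# Compress hundreds of dimensions down to two, keeping neighbours close.
X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)

plt.figure(figsize=(10, 10))
plt.scatter(X_2d[:, 0], X_2d[:, 1], s=0)        # invisible points set the axis limits
for (x, y), e in zip(X_2d, emojis):
    plt.annotate(e, (x, y))                     # draw each emoji at its 2-D position
plt.show()
```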

Notice here how semantically similar emoji are clustered together automatically in this space. For example, most of the faces are clustered together in “Face peninsula”. The faces arrange themselves with the happy 😀😁😊 in one region and the angry 😠😡😖 in another. All of the heart emoji are also clustered right nearby, at the peak that we call “Point Love”.

Further along the tail of the shape you can see other interesting groupings: 🏀🏈⚾⚽ are all near each other, and the emoji faces-with-hair 👨👩👧👦 are clustered in isolation, away from the faces-without-hair (because why would they want to hang out?). Right towards the end you see a number of flags and less popular emoji, like the filing cabinet and the fast-forward sign.

Again, Dango was never explicitly told that faces are somehow different from hearts, or beers, or farm animals. Dango generated this semantic map by training on hundreds of millions of examples of real-world emoji use taken from across the web. So what do we mean by training?

Before training, a neural network is initialized: it is given a set of more-or-less random values; it is, essentially, a clean slate. Sentences map randomly into semantic space, and the emoji are randomly scattered within it.

To train a neural network, we define an objective function: essentially, a way of grading the network’s performance on a given example. The objective function outputs a score telling us how well or how badly Dango did at predicting that example; the smaller the score, the better. We then use a very simple algorithm called gradient descent. With each training example, gradient descent slightly adjusts the values of all of the millions of parameters in the neural network, in whatever direction most reduces the objective function.
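Here’s gradient descent in miniature, on a toy two-parameter objective (Dango’s real objective grades emoji predictions over millions of parameters, but the update rule is the same):

```python
import numpy as np

target = np.array([3.0, -1.0])   # the toy objective: get close to this point

def objective(theta):
    return np.sum((theta - target) ** 2)   # smaller is better

def gradient(theta):
    return 2 * (theta - target)            # direction of steepest increase

theta = np.random.default_rng(2).normal(size=2)  # random initialization
learning_rate = 0.1
for _ in range(100):
    theta -= learning_rate * gradient(theta)     # nudge parameters downhill

print(theta, objective(theta))  # theta ends up close to the target, objective near 0
```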

After several days of this procedure running on GPUs, the objective function can’t be improved any further, and Dango is fully trained and ready to take on the world!

The future of language

Language is becoming visual. Emoji, stickers, and GIFs are exploding in popularity, despite the fact that it’s still labour-intensive to use them in an advanced way. Enthusiasts create personal collections of images for every situation and have memorized every page of the emoji keyboard, but the rest of us rely on the emoji immediately accessible in our “most used” menu and sometimes forward a GIF here and there.

This visual language has matured alongside technology, and this symbiotic relationship will continue, with new technology informing new language, which in turn informs the technology again. Communication in the future will have artificial intelligence tools adapted to you, helping you seamlessly weave imagery with text, and Dango is proud to be at the cutting edge of this progression.

Hopefully you’ve been inspired by this under-the-hood look, and now, like us, you’ll picture your every sentence floating somewhere in semantic space, surrounded by hundreds of emoji. Maybe you’ll start playing around with neural networks yourself. Let us know!

And, of course, please try Dango and give us feedback, so that whenever you ask yourself “What emoji should I use?”, Dango will be there with the answer.

Download me please! 🙏 😭 📲