Analyzing visual content (frames?) using the Google Vision API and topic modeling

The aim of my study on "visual self-presentation of politicians on Facebook" was to find out to what extent relatively simple and low-threshold tools (such as the Google Vision API) can support the analysis of visual framing. Hence, in this blog post I want to recap that work.

The term "frame" is quite problematic, as it is used across multiple disciplines and with many differing meanings. Even within communication research we are far from a unified definition. In my study, frames are understood as "horizons of meaning" that emphasize certain information and/or positions and suppress others (see also Matthes 2014, p. 10). The term "horizons of meaning" seems a little esoteric at first. However, it makes it possible to apply the same concept of framing to political and social actors, to media content, and to the audience.

In my case study I focus on the strategic positioning of politicians on social media. Instead of analyzing the media content they publish, an alternative would have been to conduct interviews with the political actors and their social media teams to gain insights into their strategic framing. But the advantage of analyzing the manifest social media content is twofold: First, by analyzing the manifest output I get an understanding of what the actors actually do, not of what they strategically think they do. Second, this manifest output links directly to the audience, who receive the messages posted on Facebook and probably base their perception on that content.

Of course, it has been acknowledged before that media frames are often multimodal in nature and that it is somewhat artificial to focus on either text or visuals alone. Nonetheless, most past and current framing studies concentrate primarily on textual content, which I find quite problematic. Admittedly, I do something similar here by ignoring the textual parts of the posts and taking a closer look only at the pictures. But there is more to come in future analyses of the sample.

But how do we identify frames in the first place? Matthes (2014, p. 39), who refers primarily to the analysis of texts, differentiates between four approaches to identifying frames: (1) qualitative, (2) manual holistic, (3) manual clustering and (4) computer-assisted. Since it was my aim to apply a computer-assisted approach here, I will briefly expand on that. The computer-assisted approach (also called frame mapping) is based on the assumption that frames manifest themselves in the word choice of texts. The co-occurrence of particular words can be quantified, and texts can be clustered according to these co-occurrences (Miller & Riechert 2001, p. 63). An advantage is that this automated analysis is less vulnerable to subjective interpretation than manual coding or qualitative analysis. It should be added, however, that the automated approach is often criticized for being rather coarse: it is hard to identify important words that do not occur very often, or that occur in different contexts and with different meanings. Certainly, an automated analysis cannot replace the interpretation of the researcher and needs to be handled with care and considerable analytic expertise.
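To make the frame-mapping idea concrete, here is a minimal sketch in Python (not the method used in the study, which works on image labels rather than sentences): documents are represented by their weighted word profiles and then clustered, so that documents with similar word co-occurrence patterns end up in the same candidate frame. The toy documents and all parameter choices are illustrative assumptions.

```python
# Minimal frame-mapping sketch: cluster "documents" by their word profiles.
# Toy data and parameters are illustrative, not from the original study.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "economy jobs growth labour market",
    "jobs labour wages economy",
    "garden plants community integration",
    "refugees integration community project",
]

# Represent each document by its weighted word profile ...
X = TfidfVectorizer().fit_transform(docs)

# ... and cluster documents with similar word use into candidate frames.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)
```

With this toy corpus, the two "economy" documents and the two "integration" documents should land in separate clusters, which is the basic intuition behind frame mapping.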

So far so good. However, I collected more than 300,000 pictures. How to compress that? The first step was an automated analysis of all pictures using the Google Vision API. The R script which sends the visuals to Google and receives the keywords back is quite simple and is already published on my GitHub page (here). The Google Vision API offers several features, such as label detection (automated content identification), text extraction, logo identification, web search and facial recognition. I chose to work only with label detection, which returns keywords describing the content of the picture together with the likelihood of each particular keyword. It costs $1.50 per 1,000 API requests; the first 1,000 requests per month are free of charge, so everyone can use it for smaller studies. The collection took a while, but I finally have a dataset including 3,078,899 labels for 327,530 pictures. That is roughly 9.4 labels per picture.
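For readers who want to try this themselves: the Vision API accepts a JSON body posted to its `images:annotate` endpoint, with the image sent as base64 and the requested feature set to `LABEL_DETECTION`. The sketch below (in Python rather than the R of the original script; the function name is mine) only builds that request body, leaving out the HTTP call and the API key handling:

```python
# Build the request body for the Google Vision API's label detection.
# The body is POSTed to https://vision.googleapis.com/v1/images:annotate
# with an API key; the function name here is illustrative.
import base64
import json

def build_label_request(image_bytes, max_results=10):
    """Return the annotate-request body for LABEL_DETECTION."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
        }]
    }

body = build_label_request(b"...raw image bytes...")
print(json.dumps(body)[:80])
```

The API's response then contains a `labelAnnotations` list with a `description` and `score` per label, which is exactly the keyword/likelihood pair shown in the examples below.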

Here are two examples:

Example 1:

Originally posted on Facebook by Jana Schimke (CDU), 2014-11-03

Labels: Communication  70%; Public Relations  64%; Conversation  62%; Collaboration  58%

Text of post: "Heute Abend kommen alle ostdeutschen Abgeordneten von CDU und SPD zusammen. Gemeinsam beraten wir Fragen zur wirtschaftlichen Entwicklung und zum Länderfinanzausgleich, zur Rente Ost sowie zur Arbeitsmarktpolitik." [Tonight, all East German members of parliament from the CDU and SPD are coming together. Together we will discuss questions of economic development, fiscal equalisation between the states, pensions in the East, and labour market policy.]

Example 2:

Originally posted on Facebook by Luise Amtsberg (Grüne), 2017-09-01

Labels: Plant  98%; Tree  96%; Garden  92%; Woody Plant  92%; Community  85%; Grass  84%; Yard  79%; Soil  79%; Shrub  75%; Backyard  73%; Lawn  64%; Gardener  63%; Recreation  61%; Gardening  60%; Plantation  59%; Grass Family  59%; Play  57%; Fun  55%

Text of post: "In Steinbergkirche gärtnern deutsche und geflüchtete Familien gemeinsam. Über die Arbeit im Garten wird nicht nur frisches Gemüse und Obst produziert, es werden auch Sprachbarrieren überwunden und Integration gelebt. Schönes Projekt!" [In Steinbergkirche, German and refugee families garden together. Working in the garden not only produces fresh fruit and vegetables, it also overcomes language barriers and puts integration into practice. Great project!]

The examples show that the labels are, of course, very much bound to the concrete image content and represent the mere facts that are shown. Moreover, someone reading only the labels could reconstruct the image only vaguely; especially the labels of the first example are rather generic to my mind. Also, the many keywords regarding plants and gardening in the second picture highlight that Google stays very close to the factual level and sets very different priorities than a human observer probably would. Nonetheless, the labels are still meaningful and appropriate for differentiating between the pictures on a factual level.
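Before topic modeling, each picture's labels have to be turned into a pseudo-document. One way to do this (my assumption; the original R pipeline may differ) is to keep labels above a confidence threshold and join multi-word labels with underscores, as in "public_relations" or "woody_plant" in the keyword lists below:

```python
# Turn one picture's (label, confidence) pairs into a pseudo-document.
# Threshold and underscore convention are illustrative assumptions.
def labels_to_document(labels, threshold=0.5):
    """labels: list of (label, confidence) pairs from the Vision API."""
    return " ".join(
        label.lower().replace(" ", "_")
        for label, conf in labels
        if conf >= threshold
    )

doc = labels_to_document([("Public Relations", 0.64), ("Conversation", 0.62),
                          ("Collaboration", 0.58), ("Noise", 0.10)])
print(doc)  # public_relations conversation collaboration
```

The resulting strings, one per picture, are what the topic model treats as "documents".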

OK, now we have many pictures and we have even more labels. Time to reduce the number of relevant units of meaning! But how do we do that? Since I attended the topic modeling workshop held by Wouter van Atteveldt at our CCS conference last spring, this was an obvious approach to try. The workshop materials are online here. Broadly speaking, topic models cluster "documents" simultaneously with the "topics" occurring in those documents. The topics can then be interpreted as frames.

A key parameter of a topic model is the number of topics to extract, called k. I started my analysis with very few topics (k = 14 seemed like a reasonable number according to a perplexity plot). But it turned out that this led to rather mixed results, with very diverse pictures grouped into one topic. Thus, I changed my strategy: instead of extracting only a few topics, I decided to extract a lot more, experimenting with 100 and even more topics. The aim was to see whether this would lead to more homogeneous topics that could then be merged manually, and to filter out topics that are not interpretable. This strategy worked out quite well, although I haven't found the perfect number of topics yet. It's probably somewhere between 80 and 150. But each calculation takes, well, long (up to 24 hours).
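The k-selection step can be sketched as follows (a toy Python version using scikit-learn's LDA rather than the R tooling from the workshop; corpus and k values are stand-ins): fit the model for several values of k and compare perplexity, where lower is better. On 300,000+ pseudo-documents each fit is expensive, which is why a single run can take many hours.

```python
# Toy sketch of choosing k: fit LDA for several k and compare perplexity.
# Corpus and parameters are illustrative stand-ins, not the study's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "seminar meeting convention audience lecture",
    "meeting convention public_speaking lecture",
    "plant tree garden grass shrub",
    "garden grass yard soil gardening",
] * 5  # tiny stand-in corpus

X = CountVectorizer().fit_transform(docs)

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=42).fit(X)
    print(k, round(lda.perplexity(X), 1))  # lower perplexity = better fit
```

On real data one would compute perplexity on held-out documents rather than the training set, and plot it over a grid of k values, which is the "perplexity plot" mentioned above.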

To give a little insight into my analysis so far, I showcase some of the frames which were extracted in the k = 100 solution.

Topic 27: Cat Topic

Top 10 keywords: "green", "product", "grass", "advertising", "organism", "font", "photo_caption", "text", "snout", "fauna"


Topic 5: Eyewear Topic

Top 10 keywords: "vision_care", "glasses", "eyewear", "smile", "product", "sunglasses", "facial_hair", "selfie", "fun", "cool"

Naturally, on a factual level the classification of an "eyewear topic" or a "cat topic" makes sense, but it is not really useful for strategic framing analysis. In a next step, it would probably be a good idea to filter out keywords which lead to meaningless frames (meaningless always with regard to the purpose of the study). Alternatively, topics which are not interpretable could be dropped from the analysis.
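Such a filter could look roughly like this (a sketch under assumptions: the stop-list of "merely factual" labels and the threshold are purely illustrative, not part of the study):

```python
# Illustrative topic filter: drop topics whose top keywords overlap too
# strongly with a hand-made stop-list of generic, "merely factual" labels.
STOPLIST = {"product", "font", "text", "photo_caption", "fun", "cool"}

def is_interpretable(top_keywords, max_stop_share=0.4):
    """Keep a topic only if few of its top keywords are stop-listed."""
    stop_share = sum(kw in STOPLIST for kw in top_keywords) / len(top_keywords)
    return stop_share < max_stop_share

# The "cat topic" above has 4 of 10 keywords on the stop-list -> dropped.
print(is_interpretable(["green", "product", "grass", "advertising", "organism",
                        "font", "photo_caption", "text", "snout", "fauna"]))
```

Either variant (filtering keywords before modeling, or dropping topics afterwards) shifts interpretive work from the model back to the researcher, which matches the caveat about analytic expertise above.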

But it is not all disappointing. There are also topics that can be interpreted as strategic political frames. For instance:

Topic 9: “Me at important meetings or in parliament”

Top 10 keywords: "seminar", "meeting", "academic_conference", "institution", "convention", "audience", "city_council", "conference_hall", "public_speaking", "lecture"

Topic 53: “Visiting children”

Top 10 keywords: "institution", "professional", "communication", "learning", "education", "student", "classroom", "class", "school", "room"

So, all in all, the answer to the question of whether low-threshold automation tools can be used for framing analysis is: Yes! But framing analysis is still a lot of work, and even if automated analysis can support you, a lot of thinking and manual correction will be involved.



Matthes, J. (2014). Framing. Baden-Baden: Nomos.
