Analyzing visual content (frames?) using Google Vision API and topic modeling

The aim of my study on “visual self-presentation of politicians on Facebook” was to find out to what extent relatively simple, low-threshold tools (such as the Google Vision API) can support the analysis of visual framing. In this blog post I recap the approach and the first results.

The term “frame” is quite problematic, as it is used across multiple disciplines and with many differing meanings. Even within communication research we are far from a unified definition. In my study, frames are understood as “horizons of meaning” that emphasize certain information and/or positions and suppress others (see also Matthes 2014, p. 10). The term “horizons of meaning” seems a little esoteric at first. However, it makes it possible to apply the same concept of framing to political and social actors, to media content and to the audience.

In my case study I focus on the strategic positioning of politicians on social media. Instead of analyzing the content they publish, I could have interviewed the political actors and their social media teams to gain insights into their strategic framing. But analyzing the manifest social media content has two advantages: First, I get an understanding of what they actually do, not of what they think they do strategically. Second, this manifest output is directly linked to the audience, who receive the messages posted on Facebook and probably base their perception on that content.

Of course, it has been acknowledged before that media frames are often multimodal in nature and that it is somewhat artificial to focus on either text or visuals alone. Nonetheless, most past and current framing studies focus primarily on textual content, which I think is quite problematic. Admittedly, I do something similar here by ignoring the textual parts of the posts and taking a closer look only at the pictures. But there is more to come in future analyses of the sample.

But how do we identify frames in the first place? Matthes (2014, p. 39), who refers primarily to the analysis of texts, differentiates between four approaches to identifying frames: (1) qualitative, (2) manual holistic, (3) manual clustering and (4) computer-assisted. Of course, my aim was to apply a computer-assisted approach here, so I will briefly expand on that. The computer-assisted approach (also called frame mapping) is based on the assumption that frames are manifested in the word choice of texts. The co-occurrence of particular words can be quantified, and texts can be clustered according to these co-occurrences (Miller & Riechert 2001, p. 63). An advantage is that this automated analysis is less vulnerable to subjective interpretation than manual coding or qualitative analysis. It should be added, however, that the automated approach is often criticized for being rather coarse: it is hard to identify important words that do not occur very often, or that do not always occur in the same context and with the same meaning. Certainly, an automated analysis cannot replace the interpretation of the researcher and needs to be handled with care and analytic expertise.
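The core idea of frame mapping, counting which words appear together, can be sketched in a few lines. This is only a minimal Python illustration of co-occurrence counting, not the procedure described by Miller & Riechert; the toy documents and the function name `cooccurrence_counts` are made up for this sketch.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered pair of words shares a document."""
    counts = Counter()
    for words in documents:
        # Use the set so a word repeated within one document counts once.
        for pair in combinations(sorted(set(words)), 2):
            counts[pair] += 1
    return counts


# Toy documents (hypothetical, for illustration only).
docs = [
    ["meeting", "audience", "lecture"],
    ["meeting", "audience", "garden"],
]
pairs = cooccurrence_counts(docs)
print(pairs[("audience", "meeting")])
```

Pairs are stored in alphabetical order, so each pair has exactly one key. A clustering step would then work on these counts; that part is left out here.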

So far so good. However, I collected more than 300,000 pictures. How to condense that? The first step was an automated analysis of all pictures using the Google Vision API. The R script which sends the visuals to Google and receives the keywords back is quite simple and is already published on my GitHub page (here). The Google Vision API offers several features, such as labels (automated content identification), text extraction, logo identification, web search and face detection. I chose to work only with label detection, which returns keywords describing the content of the picture together with a confidence score for each keyword. It costs $1.50 per 1,000 API requests. The first 1,000 requests per month are free of charge, so everyone can use it in smaller studies. The collection took a while, but I finally have a dataset of 3,078,899 labels for 327,530 pictures. So it's roughly 9.4 labels per picture.
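As a quick check of these figures, the average number of labels per picture and a rough cost estimate can be computed directly. The sketch below uses only the numbers quoted in the post. It assumes one request per picture and that all requests fall into a single month, which is a simplification because the free quota applies per month.

```python
# Figures reported in the post.
n_labels = 3_078_899
n_pictures = 327_530

# Pricing as quoted in the post: $1.50 per 1,000 requests, first 1,000 free.
# Assumption for this sketch: one request per picture, all in a single month.
price_per_1000 = 1.50
free_requests = 1000

labels_per_picture = n_labels / n_pictures
billable = max(n_pictures - free_requests, 0)
estimated_cost = billable / 1000 * price_per_1000

print(f"{labels_per_picture:.1f} labels per picture")
print(f"about ${estimated_cost:.0f} for label detection")
```

Spreading the collection over several months would lower the estimate slightly, since each month comes with its own free requests.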

Here are two examples:

Example 1:

Originally posted on Facebook by Jana Schimke (CDU), 2014-11-03

Labels: Communication  70%; Public Relations  64%; Conversation  62%; Collaboration  58%

Text of post: “Heute Abend kommen alle ostdeutschen Abgeordneten von CDU und SPD zusammen. Gemeinsam beraten wir Fragen zur wirtschaftlichen Entwicklung und zum Länderfinanzausgleich, zur Rente Ost sowie zur Arbeitsmarktpolitik.” [All East German members of parliament from the CDU and SPD are meeting tonight. Together we will discuss questions of economic development and fiscal equalisation between the states, pensions in the East and labour market policy.]

Example 2:

Originally posted on Facebook by Luise Amtsberg (Grüne), 2017-09-01

Labels: Plant  98%; Tree  96%; Garden  92%; Woody Plant  92%; Community  85%; Grass  84%; Yard  79%; Soil  79%; Shrub  75%; Backyard  73%; Lawn  64%; Gardener  63%; Recreation  61%; Gardening  60%; Plantation  59%; Grass Family  59%; Play  57%; Fun  55%

Text of post: „In Steinbergkirche gärtnern deutsche und geflüchtete Familien gemeinsam. Über die Arbeit im Garten wird nicht nur frisches Gemüse und Obst produziert, es werden auch Sprachbarrieren überwunden und Integration gelebt. Schönes Projekt!“ [In Steinbergkirche, German families and refugees garden together. Working in the garden not only produces fresh fruit and vegetables, it also overcomes language barriers and integration is lived. Great project!]

The examples show that the labels are, of course, closely bound to the concrete image content and represent the mere facts that are shown. Moreover, someone who reads only the labels could reconstruct the image only vaguely; the labels of the first example in particular are rather generic to my mind. Also, the many keywords regarding plants and gardening in the second picture highlight that Google is very attached to the factual level and sets very different priorities than a human observer probably would. Nonetheless, the labels are still meaningful and suitable for differentiating between the pictures on a factual level.

OK, now we have many pictures and even more labels. Time to reduce the number of relevant units of meaning! But how do we do that? Since I attended the topic modeling workshop held by Wouter van Atteveldt at our CCS conference last spring, this was the obvious approach. The workshop materials are online here. Broadly speaking, topic models simultaneously group co-occurring words into “topics” and describe each “document” by the topics occurring in it. In my case, each picture is a document and its labels are the words. The topics could be interpreted as frames.
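To feed labels into a topic model, each picture can be treated as a “document” whose “words” are its labels. The sketch below shows one possible preparation step in Python and is not the author's R code. It assumes that multi-word labels are joined with underscores (as keywords like “photo_caption” further down suggest); the `min_score` threshold is a hypothetical parameter.

```python
def labels_to_document(labels, min_score=0.5):
    """Turn one picture's (label, score) pairs into a list of tokens."""
    tokens = []
    for description, score in labels:
        if score >= min_score:
            # "Public Relations" -> "public_relations"
            tokens.append(description.strip().lower().replace(" ", "_"))
    return tokens


# Labels of Example 1 from the post, with scores as fractions.
example_1 = [
    ("Communication", 0.70),
    ("Public Relations", 0.64),
    ("Conversation", 0.62),
    ("Collaboration", 0.58),
]
print(labels_to_document(example_1, min_score=0.6))
```

Joining multi-word labels keeps them as one unit, so that “public_relations” is not split into two unrelated words by the topic model.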

A key parameter of a topic model is the number of topics to be extracted, called k. I started my analysis with very few topics (k = 14 seemed like a reasonable number according to a perplexity plot). But it turned out that this led to rather mixed results, and very diverse pictures were put together in one topic. Thus, I changed my strategy. Instead of extracting only a few topics, I decided to extract a lot more and experimented with 100 and even more topics. The aim was to see whether this would lead to more homogeneous topics that could be merged manually afterwards. I also wanted to filter out topics that are not interpretable. This strategy worked out quite well, although I haven't found the perfect number of topics yet. It's probably somewhere between 80 and 150. But each calculation takes, well, long (up to 24 hours).

To give a little insight into my analysis so far, I showcase some of the Frames which were extracted in the k = 100 solution.

Topic 27: Cat Topic

Top 10 keywords: “green”, “product”, “grass”, “advertising”, “organism”, “font”, “photo_caption”, “text”, “snout”, “fauna”


Topic 5: Eyewear Topic

Top 10 keywords: “vision_care”, “glasses”, “eyewear”, “smile”, “product”, “sunglasses”, “facial_hair”, “selfie”, “fun”, “cool”

Naturally, on a factual level the classification of an “eyewear topic” or a “cat topic” makes sense, but it is not really useful in a strategic framing analysis. As a next step, it would probably be a good idea to filter out keywords which lead to meaningless frames (meaningless always in regard to the purpose of the study). Alternatively, topics which are not interpretable could be dropped from the analysis.
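Filtering out keywords before re-running the model could look roughly like this. It is a minimal Python sketch under stated assumptions: the stoplist is hypothetical and would have to be chosen with the purpose of the study in mind, and the documents are toy examples.

```python
def drop_labels(documents, stoplist):
    """Remove unwanted labels from every document; drop emptied documents."""
    blocked = set(stoplist)
    cleaned = []
    for tokens in documents:
        kept = [t for t in tokens if t not in blocked]
        if kept:
            cleaned.append(kept)
    return cleaned


# Hypothetical stoplist; which labels count as meaningless depends on the study.
stoplist = ["eyewear", "glasses", "sunglasses", "vision_care"]
documents = [
    ["glasses", "smile", "selfie"],
    ["eyewear", "sunglasses"],
    ["meeting", "audience"],
]
print(drop_labels(documents, stoplist))
```

Note that dropping emptied documents changes the position of the remaining ones, so picture identifiers would need to be carried along in a real analysis.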

But it is not all disappointing. There are also topics that can be interpreted as strategic political frames. For instance:

Topic 9: “Me at important meetings or in parliament”

Top 10 keywords: “seminar“, „meeting“, „academic_conference“, „institution“, „convention“, „audience“, „city_council“, „conference_hall“, „public_speaking“, „lecture“

Topic 53: “Visiting children”

Top 10 keywords: “institution“, „professional“, „communication“, „learning“, „education“, „student“, „classroom“, „class“, „school“, „room“

So all in all, the answer to the question of whether low-threshold automation tools can be used for framing analysis is: Yes! But framing analysis is still a lot of work, and even though automated analysis can support you, a lot of thinking and manual correction will be involved.



Matthes, J. (2014). Framing. Baden-Baden: Nomos

How open can communication research data be?

A question that has been bothering me since the start of the Fellow Program is how communication scientists can handle the disclosure of data in a sensible way. The advantages of open data are obvious: Reproducibility and reusability of already processed data increase the quality of research and save resources. The demand for open data that is spilling over to us from the natural sciences and psychology is therefore understandable, and the desire for disclosure is generally very welcome. In my experience, however, especially in our small discipline, ownership thinking (“I collected it, thus this is MY data”), the fear of giving away competitive advantages and the reluctance to show one's cards are particularly strong. I can understand that; I have been socialized in this way, too.

But to accuse communication science of simply insisting on old ways of thinking and lacking innovative power would be too short-sighted. In fact, an “AG Forschungsdaten” (working group on research data) in the DGPuK has been dealing with this topic since last year. So far, they have communicated a general commitment to open data, but it is not very concrete yet, because they have not released any best-practice examples or guidelines so far (but that is certainly being planned). I think it is especially necessary to take a closer look at the specific kinds of data we process. Communication science data is very diverse and distinct from that of other social sciences. There are special challenges that cannot always be solved by long-term planning, for example obtaining informed consent from respondents in advance to re-use and publish their data.

At the intersection at which I am currently working, i.e. the social media analysis of digital trace data, obtaining informed consent is unrealistic and would entail severe limitations of representativeness as well as a high expenditure of resources. Moreover, re-publishing social media posts is diametrically opposed to the “right to be forgotten”, and even anonymization strategies help only to a limited extent, since they have too often proved ineffective. Not to mention that there is already an ethical debate about whether public posts by private users on Facebook, Instagram, Twitter and in blogs should be investigated at all without the users' consent (boyd & Crawford 2012, Zimmer 2010). Even if institutions or non-private individuals are of interest, as for example in my fellowship project, there are legal hurdles: The copyright of the content usually lies with the platform operator or the users themselves.

Since I am very interested in a solution to this problem (after all, I would like to make my data from the “Visual framing of politicians on Facebook” project openly available), I was looking for positive examples to guide me. And I actually found something: The data of the “Social Media Monitoring of the German Federal Election Campaign 2017” project has been online in the GESIS datorium since the end of February. You can find it here. I was curious about the datasets because Facebook and Twitter posts were collected, even if the project only covered textual content and not the pictures. Unfortunately, the result is somewhat disappointing. For legal reasons, the Facebook posts were not published at all, and for the Twitter posts only the IDs were released. With the help of the IDs you can of course try to reconstruct the tweets afterwards. Due to the volatility of the data, however, this is a makeshift solution and will become less and less practicable as time passes (cf. Bachl 2018). Of course, the dataset is still great; it offers a good starting point when planning a study on political actors and media on social media channels. However, to me it would make even more sense to maintain a database in addition to the original data, which could be updated and expanded collaboratively (just thinking aloud).

Of course, this whole problem of the impossibility of publishing communication data would not arise at all if the platform providers worked with scientists and made the data available on a broad basis. Demands for this collaboration were recently published in an open letter by Axel Bruns, AoIR president. I do not want to discuss the matter here, but I find the open letter entirely worthy of support. Nevertheless, I wonder how realistic the demands for unhindered access for scientists on a broad basis really are in the medium term. And what do we do in the meantime? Young scientists in particular have no time to wait for implementation by platform operators and politicians.

Perhaps we should set up a joint infrastructure in which the data is collected and stored? At the moment, in my estimation, many resources are tied up because researchers and/or institutes build duplicate structures or, worse, are not able to carry out a comprehensive social media study on their own. Thus, a joint data repository would be really desirable! This could perhaps work, at least at the national level or within scientific societies.

Well. I still don’t have a solution for publishing my project data. At least it is a first step towards openness to know who has collected which data at all – perhaps this will at least make it possible to re-use the data, link relevant data sources or even to establish new cooperations.

Politicians Facebook Posts: First descriptive results on parties and politicians

During the last weeks I was busy getting an overview of the data. Being kind of a graphics nerd, I wanted to create not only functional but also aesthetic outputs. While ggplot2 certainly has its quirks, it was fun to puzzle out charts that worked for me. Here come the first results.

The basis of the data collection was a list of the 2653 politicians who were either members of the 18th German Bundestag or running as candidates for the 19th election period starting in 2017. Since the parties are free to nominate as many candidates as they like, this distribution hardly resembles the vote share of any election. The politicians are distributed among the parties as follows:

As stated in a previous post, naturally not all politicians maintain Facebook profiles, nor are their profiles necessarily public and available for download. The final sample of politicians is shown below:

Clearly, the politicians of some parties are much more involved in Facebook campaigning than others. For instance, nearly as many SPD politicians as CDU/CSU politicians have a public profile, although there were fewer SPD names on the list. Further, the FDP politicians seem to be quite present on the platform, while politicians of the Grüne are not so well represented in the Facebook sample. To make this insight more comparable, the following chart presents the share of politicians with a Facebook profile per party.

The chart makes it clear that FDP politicians are the most likely to maintain a profile (65 percent). Looking at individual FDP profiles, I think one can see that the FDP social media team provided uniform visual material that was distributed widely among the candidates. In addition to the likewise strong presence of the SPD on Facebook, it may come as a surprise that the AfD candidates are not so strongly represented. After all, the party's success was partly attributed to its successful presence in the social networks. At the level of individual politicians, however, no particularly large number of profiles can be found. What is also surprising is the below-average presence of the Greens in the sample. Although the party appeals to a rather young target group and the Green parliamentarians are also younger than their colleagues from other parties, the Greens have the fewest Facebook profiles, both in absolute and in relative terms.

Well, although these findings provide interesting background knowledge, the parties are not the main focus of the current study, which is concerned with the strategic communication of individual politicians. At the level of politicians, I first analyzed popularity in terms of fan count. Unsurprisingly, current chancellor Angela Merkel leads the top 10 by a wide margin with more than 2.5 million fans. Second is not her main contender Martin Schulz (SPD) but former opposition leader Gregor Gysi (Die Linke). Both have close to 0.5 million fans and are followed by the current parliamentary party leader of Die Linke, Sahra Wagenknecht (0.4 million fans). The leading candidate of the FDP, Christian Lindner, cannot compete with these numbers, but he still has 0.2 million fans. On rank 6 follows Frauke Petry, the former party spokeswoman of the AfD, who left the party right after the election. Because she was still a member for most of the study period, I will keep her as an AfD member. Next, on rank 7, is the first leading politician of the Greens, Cem Özdemir (0.1 million fans). This completes the list of all parliamentary groups now represented in the Bundestag. On ranks 8 to 10, more leading candidates from AfD (Alice Weidel), CDU (Jens Spahn) and SPD (Sigmar Gabriel) follow. From 10th place onward, the number of fans drops below the 100,000 mark.

Rank  Name               Party            Fans
1     Angela Merkel      CDU              2,522,139
2     Gregor Gysi        DIE LINKE        476,350
3     Martin Schulz      SPD              470,114
4     Sahra Wagenknecht  DIE LINKE        405,792
5     Christian Lindner  FDP              243,822
6     Frauke Petry       AfD/independent  214,936
7     Cem Özdemir        GRÜNE            142,475
8     Alice Weidel       AfD              108,334
9     Jens Spahn         CDU              105,997
10    Sigmar Gabriel     SPD              83,299

The next step will be to analyze the more than 710,000 collected posts with regard to the buzz they created as well as their textual and visual aspects. I have already calculated some results, but I'm still searching for a good way to present them online. This is difficult because so many pictures are involved…

Politicians Facebook Posts: Lab report on data collection

Since the data collection for my project on strategic communication of politicians on Facebook has been completed, it's about time I wrote an extensive lab report on how it went. I have experimented with web scraping in R and Python for a while now, but this was by far the most extensive data collection I have ever conducted. In total, I collected the Facebook posts of 1,398 political candidates from the last four years, covering the whole 18th election period of the German Bundestag. The total sample comprises about 710,000 posts. Of those, ca. 390,000 were classified as photo posts, and in order to also consider the visual aspects of strategic framing, I collected the pictures as well.

The starting point of my data collection was a list of the 2653 politicians who were running in the German federal election in 2017 (19th election period of the German parliament) as well as those parliamentarians who were members during the preceding election period but did not run in the current election. This list was compiled using several online sources such as the website of the German Bundestag, the website (an initiative comparable to Wahl-O-Mat, where voters can check the opinions of individual candidates and not only of parties), as well as Wikipedia. These sources supply social media links for some of the politicians, but it turned out that many Facebook profiles were missing from the automated data collection and some Facebook links seemed to be outdated. Thus, I corrected and supplemented the list with the politicians' Facebook identifiers in a manual search (*phew!*). The search resulted in 2066 Facebook profiles of election candidates. Conversely, this means that for 587 politicians no Facebook profile was available (22.1 percent of all candidates). Although one could assume that a Facebook profile is a standard instrument in modern campaigning, even some very prominent politicians do not maintain one. For instance, Federal Minister of the Interior Thomas de Maizière (CDU) dropped out of the sample for this reason.

My list of election candidates contains other social media links, but I did not systematically check those, because they are not relevant for the specific research purpose of the current project. Anyway, for those who are interested in conducting a similar analysis of German parliamentarians in social media, the list can be downloaded here and it will also be available on git.

Next, I had to choose appropriate storage for the data. Since all individual records are relatively homogeneous in their attributes and the records are related to each other (politicians' Facebook profiles, their posts, and the visuals contained in these posts), a relational SQL database was the natural choice. Moreover, SQL databases can hold not only textual but also visual data in blob objects, which was a further advantage. Thus, I installed MariaDB as well as phpMyAdmin on my server and was ready for data collection.

I chose to conduct the scraping as well as the analysis in R and not in Python. Besides a general preference for the R language, which has a low threshold and is very flexible, my main reason for sticking with R is that it is far more common among my colleagues in communication research than Python is. Moreover, this project should serve as a proof of what the R language is capable of with regard to openness: I wanted my own showcase of how R can be used for the entire research process, from data collection to analysis and publishing. And last but not least, there is already a package to access the Facebook API via R: Rfacebook. Although it doesn't solve every problem (more about this later), the package considerably facilitated my data collection (for this project I used the latest stable version 0.6.15).

The first step in the data collection was to store the .csv file of politicians in the database. Doesn’t sound too difficult, does it? Well, it nearly drove me crazy! The challenge was to get the encoding right. I don’t know if this is a problem that only Windows users will encounter. I finally found a workaround, which I document here so that I remember it on future occasions:

  • Save the .csv file from Excel “separated by separators” (“mit Trennzeichen getrennt”).
  • Open the .csv in the simple editor provided by Windows and save it again with encoding = “UTF-8”.
  • When importing this .csv file in R, I set the encoding attribute to “UTF-8_bin” in read.csv2(). Weirdly, when I check the data frame with the View() function in R after this procedure, it seems to be all messed up. But more importantly, the import into the SQL database works correctly.
  • Write the data frame to the database using the RMySQL::dbWriteTable function.
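For readers who work in Python rather than R, a comparable import step might look like the sketch below. It is not a translation of the author's script. It assumes a semicolon-separated file saved as UTF-8 from Windows Notepad (which may add a byte order mark), and the column names, the name and the ID value are made up.

```python
import csv
import os
import tempfile


def read_politicians(path):
    """Read a semicolon-separated, UTF-8 encoded CSV into a list of dicts.

    "utf-8-sig" also strips the byte order mark that Windows Notepad
    may write at the start of a UTF-8 file.
    """
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle, delimiter=";"))


# Write a small demo file the way Notepad might save it (with BOM).
# Column names, the name and the ID value are made up for this sketch.
demo_path = os.path.join(tempfile.mkdtemp(), "politicians.csv")
with open(demo_path, "w", newline="", encoding="utf-8-sig") as handle:
    handle.write("name;party;facebook_id\n")
    handle.write("Erika Müller;GRÜNE;123456789012345\n")

rows = read_politicians(demo_path)
print(rows[0]["name"], rows[0]["facebook_id"])
```

Because the csv module reads every field as text, long numeric identifiers are kept as strings and are not converted to numbers along the way.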

The next step, and the beginning of the actual data collection, was to check whether the politicians' Facebook profiles I had collected manually were a) publicly available via the API and b) set up as “user” or as “page”. Although this information is already listed in the table of politicians provided above, it might change over time, so it's worth redoing this check if the list of politicians is used in another context.

  1. Regarding the publicity of Facebook profiles, site admins may set the visibility of their profile to public or private. Of course, only those profiles whose owners have chosen to make their content available can be accessed via the Facebook API. Nonetheless some profiles are still accessible via manual search on the platform. Self-evidently, I respect user privacy. Nonetheless, I quarrel with this situation since some of these profiles are obviously not personal or private by content but clearly aim at a broader public. Hence, I guess that some of the politicians and/or their social media staff members are not aware of the fact that their Facebook profile is not completely public (and thus cannot be found via search engines etc.) or they do not care. In total, 671 profiles are configured as private and thus dropped out of data collection.
  2. Most politicians (n = 1315; 94.3 %) in the remaining sample set up their Facebook presence as a Facebook “page”. This makes sense since Facebook “pages” distinguish professional or business accounts from ordinary “user” profiles. Nonetheless, the sample still contains 80 non-private “user” profiles (5.7 %). This has consequences for the profile attributes that can be collected via the API: a user profile does not reveal as many attributes, e.g. information on affiliation, birthday, biography or category of the profile cannot be downloaded from user profiles. Since the collection of posts is not affected by this distinction, it is only marginally relevant for the project, but it needs to be taken into account when the politicians' profile information is downloaded (which I did, see next step).
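The sample sizes reported so far can be cross-checked with a little arithmetic. The sketch below uses only figures from this post; the variable names are chosen for illustration.

```python
# Figures reported in this post.
listed = 2653           # politicians on the list
with_profile = 2066     # Facebook profiles found
private = 671           # profiles not accessible via the API
pages = 1315            # accessible profiles set up as "page"
users = 80              # accessible profiles set up as "user"

without_profile = listed - with_profile
accessible = with_profile - private

print(without_profile, f"{without_profile / listed:.1%}")
print(accessible, pages + users)
print(f"{pages / accessible:.1%}", f"{users / accessible:.1%}")
```

The page and user counts add up to the number of accessible profiles, and the shares match the percentages given in the text.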

The third step of the data collection was to access the politicians' profiles. I wanted to collect them to gather some background information on the sample as well as to crosscheck whether I got the “right” profiles. Some politicians have very common names, and there are even duplicate names within the sample (like two “Michael Meisters”, one CDU, one AfD). I plan another report on the crosschecks I did on the data. But for now let's get back to data collection. Accessing the profiles was the first challenge for the Rfacebook package. I didn't find a function which extracted exactly the info I wanted. Hence, I wrote a simple GET request which returned the specific fields I was interested in. The next challenge was again to store the newly collected data in the database and keep the right encoding. This was ensured by setting the encoding to “UTF-8_bin” for every non-English text variable. In total, I collected the profiles of 1395 campaigning politicians.
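A request for selected profile fields could be built along the following lines. This Python sketch only constructs the URL and does not call the API, and it is not the author's R request. The API version, the field names, the ID and the token are assumptions or placeholders, and the Graph API may no longer grant access to such data.

```python
from urllib.parse import urlencode


def profile_request_url(profile_id, fields, token, version="v2.10"):
    """Build a Graph API URL that asks for selected fields of one profile."""
    query = urlencode({"fields": ",".join(fields), "access_token": token})
    return f"https://graph.facebook.com/{version}/{profile_id}?{query}"


# Placeholder ID and token; the field names are assumptions for this sketch.
url = profile_request_url(
    profile_id="123456789012345",
    fields=["id", "name", "category", "fan_count"],
    token="YOUR_TOKEN",
)
print(url)
```

Asking only for the fields that are needed keeps the responses small, which matters when more than a thousand profiles are queried.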

Up to this point, the data is neither very big nor does the collection take very long. This changed in the next steps, the collection of the posts and the collection of visuals, because my aim was to download all posts from all campaigning politicians with Facebook profiles during the whole 18th election period of the German Bundestag (4 years). I decided to separate these steps from each other and to use two tables in the database, one for the posts and one for the visuals. The script to collect posts on Facebook is not very notable; again I had to do several checks on the encoding before everything worked fine. Moreover, I decided to collect the download errors in a separate database table to keep track of them. The script ran for several days (or weeks?). It was a bit annoying, though, since I had to restart the script every two hours because the token to access the API was only valid for that long. I also found that some politicians had changed or deleted their Facebook profiles in the meantime, which forced me to update the sample along the way. To be able to trace when I saved a certain post, I wrote the download time into the database.
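The idea of logging download errors in a separate table and stamping every post with its download time can be sketched as follows. This is a simplified Python illustration that uses an in-memory SQLite database as a stand-in for MariaDB; the table layout, column names and sample data are made up.

```python
import sqlite3
from datetime import datetime, timezone


def store_results(conn, results):
    """Write successful downloads to `posts` and failures to `download_errors`."""
    now = datetime.now(timezone.utc).isoformat()
    for profile_id, outcome in results:
        if isinstance(outcome, Exception):
            conn.execute(
                "INSERT INTO download_errors VALUES (?, ?, ?)",
                (profile_id, str(outcome), now),
            )
        else:
            for post_id, message in outcome:
                conn.execute(
                    "INSERT INTO posts VALUES (?, ?, ?, ?)",
                    (post_id, profile_id, message, now),
                )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post_id TEXT, profile_id TEXT, message TEXT, downloaded_at TEXT)"
)
conn.execute(
    "CREATE TABLE download_errors (profile_id TEXT, error TEXT, occurred_at TEXT)"
)

# Made-up outcomes: one profile returned a post, one request failed.
results = [
    ("111", [("111_1", "Schönes Projekt!")]),
    ("222", RuntimeError("profile not found")),
]
store_results(conn, results)
print(conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0])
print(conn.execute("SELECT profile_id, error FROM download_errors").fetchall())
```

With the errors in their own table, failed profiles can be retried later, for example after the access token has been renewed.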

Deletion of profiles or single posts was also a problem in the final step, the collection of the visuals, and one that could not be resolved. For data safety and practical reasons, I decided to save the visuals in two ways: first as a blob object in the database and second as a .jpg on my local hard drive. I also decided to collect only visuals which were posted in “photo posts”; video material and other visuals were left out for practical as well as conceptual reasons. In total, 389,741 pictures were downloaded, which take up nearly 30 GB. Given this amount of data, I will probably have to rethink the scope of this project and reduce the sample to maybe only one year of posts. I know this project cannot be considered really big data, but for me this is quite an impressive number!
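Saving each picture twice, once as a blob and once as a file, might be sketched like this. Again, SQLite stands in for MariaDB, and the table name, picture identifier and image bytes are placeholders invented for this illustration.

```python
import os
import sqlite3
import tempfile


def save_picture(conn, folder, picture_id, data):
    """Store image bytes as a BLOB and as a .jpg file; return the file path."""
    conn.execute("INSERT INTO visuals VALUES (?, ?)", (picture_id, data))
    conn.commit()
    path = os.path.join(folder, f"{picture_id}.jpg")
    with open(path, "wb") as handle:
        handle.write(data)
    return path


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visuals (picture_id TEXT PRIMARY KEY, image BLOB)")

# Placeholder bytes stand in for a downloaded JPEG.
fake_jpeg = b"\xff\xd8\xff\xe0 not a real image"
path = save_picture(conn, tempfile.mkdtemp(), "111_1", fake_jpeg)

stored = conn.execute(
    "SELECT image FROM visuals WHERE picture_id = ?", ("111_1",)
).fetchone()[0]
print(stored == fake_jpeg, os.path.getsize(path))
```

Using the picture identifier as both primary key and file name makes it easy to match the two copies later.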

All in all I'm pretty pleased with how the data collection went. I learned a lot about R, the Facebook API, as well as SQL databases. The next task will be to describe and visualize characteristic features of the sample. Of course, I will proudly present some of the insights here soon. Before I close this post, which has become incredibly long, I would like to record the five most annoying things I encountered during data collection, since it is good practice to document not only the triumphs but also the failures. So here are the top five things that annoyed me during data collection:

  1. Bad Encoding. It took me a while, but now I found some working solutions, although they feel kind of wonky.
  2. Politicians changing their Facebook profiles or deleting their profiles, posts and/or visuals.
  3. Caching of the phpMyAdmin interface (due to the caching issues I was not able to log into my account for nearly a day. Of course I didn't know it was a caching issue then…)
  4. Renewing the Facebook token over and over again… and again…
  5. Excel's nasty habit of displaying and saving large integers in scientific notation. Of course, the Facebook identifier can be seen as a large integer (it has 15 digits or so). Well, but feeding the Facebook API with 1,2345E+14 and similar does not really work…
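Identifiers damaged by Excel can at least be detected before they are sent to the API. The check below is a minimal Python sketch that assumes identifiers consist of digits only; the sample values are made up.

```python
import re


def is_valid_identifier(value):
    """Return True if the value looks like an intact numeric identifier."""
    return re.fullmatch(r"\d+", value.strip()) is not None


# Made-up identifiers; the last two show what Excel may turn an ID into.
candidates = ["123456789012345", "1,2345E+14", "1.2345E+14"]
damaged = [value for value in candidates if not is_valid_identifier(value)]
print(damaged)
```

Detection is all that is possible at this stage: once an identifier has been saved in scientific notation, the lost digits cannot be recovered, so the column should be formatted as text before the file is saved.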

Fellow program Freies Wissen

The fellow program Freies Wissen (free knowledge), founded by Stifterverband, Wikimedia and VolkswagenStiftung, is entering its second round and I'm on board! The program will support 20 young scholars in making their own research as well as their teaching open and transparent. Further, the young scholars are encouraged to take a leading role in the open science movement and to spread the word in their scientific disciplines. The program includes expert talks, workshops, webinars, a mentoring program as well as exchange with the other fellows. I'm excited to take part and I'm looking forward to learning more about the ideas behind open science and getting to know some useful practical tools.

Also, as part of the program, I will challenge myself to design my current research project as openly as possible. The project is concerned with the (visual) communication of politicians on Facebook, and computational methods will be applied (here is the link to the project's outline on Wikiversity, in German). The project does not address open science as a topic in itself, like some of the other fellows' projects do. Rather, I want to apply open science ideas and tools and see how they can be integrated into my workflow as a communication researcher. I want to evaluate what works for me and what doesn't. Ideally, I will create a “best practice” example I can refer to. As part of this, I plan to document my progress and my thoughts on open science here on the blog. So stay tuned 🙂

Computational Communication Science – Towards A Strategic Roadmap

We are happy that the VolkswagenStiftung has finally accepted our proposal for a one-week conference event on computational communication science! In February 2018, the Department of Journalism and Communication Research at Hanover University of Music, Drama, and Media will host an event that brings together young scholars and experts from the field. Our aims are twofold: First, we want to qualify young scholars so that they can adopt computational methods in their research as well as their teaching. Thus, various training courses on computational research methods will be organized. Second, a workshop aims to explore and elucidate the challenges that hinder communication scientists from applying the new methods in their work. Together, we will craft a strategic roadmap that shapes the future of computational communication science. More information can be found on our new website.