A comparison of German health insurance supplemental premia

Sep 1, 2018 webscraping

Admittedly, the post at hand emerged from a very plain issue. In Germany, statutory health insurances charge a supplemental premium which differs between the respective providers. Recently, my current provider increased its’ premium and I was confronted with the option to switch to a potentially less expensive one. To make an informed decision, I tried to compare the supplemental premia across providers available at my place of residence (Berlin). Data on the respective premia are available at the website of the “National Association of Statutory Health Insurance Funds”. The website even provides a table that can be filtered by location. Unfortunately, it is not possible to arrange the table by supplemental premia. This, however, is an easy task in R. The task was then to grab the data and apply some basic data wrangling which I present in the remainder of this post.

Is scraping permitted?

Getting data from a website is an easy task for rvest. Still, before attempting to scrape a site, we should also check whether an API is available. In addition, we should consult the robot.txt in order to check whether scraping is in line with the terms of the provider.

But let’s first load the packages needed for the task at hand.

if (!require("pacman")) install.packages("pacman")

p_load(tidyverse, rvest, robotstxt, qdap, janitor, knitr)

I learned about the robotstxt package from a blogpost by Steven Mortimer on “Scraping responsibly with R” and wanted to test that package. So here we go:

Using paths_allowed from robotstxt, it is easy to check whether scraping is permitted:

paths_allowed(
  paths  = "/service/versicherten_service/krankenkassenliste/", 
  domain = "gkv-spitzenverband.de", 
  bot    = "*"
  )

## [1] TRUE

The function returns TRUE and we should be fine with scraping the table. Moreover, there is no API available. Accordingly, I proceed with the scraping procedure.

Setting up the scraper

The respective table is created dynamically which sometimes causes trouble. In this case, however, everything can be steered through the URL. At the moment, there are eight pages containing information on the statutory health insurances. The URL holds a place for the respective page number which we can use to loop over the individual pages. However, I learned that it is often easier to start with a single page before attempting to put everything into a loop. So, let’s go:

We need the URL and a CSS selector that identifies the relevant part of the website. Unfortunately, the ID of the table changes from page to page. Accordingly, we can’t use the convenient html_table function and must resort to an identifier for each row of the table. The respective CSS tags can be singled out using Selector Gadget or the developer tools of your browser.

url <- "https://www.gkv-spitzenverband.de/service/versicherten_service/krankenkassenliste/krankenkassen.jsp?pageNo=1&filter=0#krankenkassen"

# CSS selectors
insurance.css <- "tbody th"
premium.css <- "td.alignRight"
location.css <- "td:nth-child(3)"

Now, all the necessary arguments are safely stored in objects. Passing those arguments to the respective functions of rvest gives us the following results:

url %>%
  read_html() %>% # Create a html document
  html_nodes(css = insurance.css) %>% # Identify the information
  html_text() %>% # Get the text
  bracketX() %>% # Remove unnecessary strings (qdap package)
  head(n = 5) %>%
  kable(col.names = "Providers")

Providers
actimonda BKK
AOK - Die Gesundheitskasse für Niedersachsen
AOK - Die Gesundheitskasse in Hessen
AOK Baden-Württemberg
AOK Bayern - Die Gesundheitskasse

In the next step, we want to get all rows and transform them into a tibble. Now that we got an example code, this is again an easy task.

page <- read_html(url)

insurance <- html_nodes(page, css = insurance.css) %>%
  html_text() %>%
  bracketX()

premium <- html_nodes(page, css = premium.css) %>%
  html_text() %>%
  str_replace(",", ".") %>%
  str_extract_all("[:digit:].[:digit:]{2}") %>%
  as.numeric()

location <- html_nodes(page, css = location.css) %>%
  html_text() %>%
  bracketX()

Finally, we can create a tibble that contains the relevant information of the first page.

tibble(insurance, premium, location) %>%
  head(n = 10) %>%
  kable()

insurance	premium	location
actimonda BKK	1.0	bundesweit
AOK - Die Gesundheitskasse für Niedersachsen	0.8	Niedersachsen
AOK - Die Gesundheitskasse in Hessen	1.0	Hessen
AOK Baden-Württemberg	1.0	Baden-Württemberg
AOK Bayern - Die Gesundheitskasse	1.1	Bayern
AOK Bremen / Bremerhaven	0.8	Bremen
AOK Nordost - Die Gesundheitskasse	0.9	Berlin, Brandenburg, Mecklenburg-Vorpommern
AOK NordWest - Die Gesundheitskasse	0.9	Nordrhein-Westfalen, Schleswig-Holstein
AOK PLUS - Die Gesundheitskasse für Sachsen und Thüringen	0.6	Sachsen, Thüringen
AOK Rheinland/Hamburg - Die Gesundheitskasse	1.4	Hamburg, Nordrhein-Westfalen

Extracting several pages of information

Now that the information is available in R, it is a breeze to extract relevant information using the usual toolbox of packages. Before approaching this task, we have to grab the whole table which spans several pages on the website.

First, we need to create a vector that consists of the URLs that lead us to the individual pages. As aforementioned, the URL has a placeholder for the page number. This can be leveraged:

urls <- vector(mode = "character", length = 8)
for(i in seq_along(urls)) {
  urls[[i]] <- paste0("https://www.gkv-spitzenverband.de/service/versicherten_service/krankenkassenliste/krankenkassen.jsp?pageNo=", i, "&filter=0#krankenkassen") 
}

The vector urls now contains the links to the respective pages. Next, we can wrap up everything in a loop that runs over the newly created vector of URLs.

tables <- vector(mode = "list", length = 8)

for (i in seq_along(urls)){
  page <- read_html(urls[i])
  insurance <- html_text(html_nodes(page, css = "tbody th"))
  premium <- html_text(html_nodes(page, css = "td.alignRight"))
  location <- html_text(html_nodes(page, css = "td:nth-child(3)"))
  
  table <- tibble(insurance = insurance,
                  premium = premium,
                  location = location
                  )
  
  tables[[i]] <- table
}

table.df <- as.tibble(
  bind_rows(tables)
)

Unfortunately, the column premium is a character vector but this issue can be resolved with some basic lines of dplyr. At the same time, I use the function scrubber from the package janitor to clean up the remaining character vectors.

Finally, we can subset the table according to our location and arrange it concerning the premia.

healthB.df <- table.df %>%
  filter(str_detect(location, "Berlin|bundesweit") == TRUE) %>%
  arrange(premium)

healthB.df %>%
  .[1:10,] %>%
  kable()

insurance	premium	location
hkk	0.59	bundesweit
BKK firmus	0.60	bundesweit
IKK gesund plus	0.60	bundesweit
Audi BKK	0.70	bundesweit
BKK VerbundPlus	0.70	bundesweit
SKD BKK	0.70	Baden-Württemberg, Bayern, Berlin, Bremen, Hamburg, Hessen, Niedersachsen, Nordrhein-Westfalen, Rheinland-Pfalz, Saarland, Sachsen
AOK Nordost - Die Gesundheitskasse	0.90	Berlin, Brandenburg, Mecklenburg-Vorpommern
BKK Freudenberg	0.90	Baden-Württemberg, Bayern, Berlin, Hessen, Niedersachsen, Nordrhein-Westfalen, Rheinland-Pfalz, Sachsen
Bosch BKK	0.90	Baden-Württemberg, Bayern, Berlin, Brandenburg, Hamburg, Hessen, Mecklenburg-Vorpommern, Niedersachsen, Nordrhein-Westfalen, Rheinland-Pfalz, Saarland, Sachsen, Sachsen-Anhalt, Thüringen
Debeka BKK	0.90	bundesweit

Of course, this post is not a recommendation. Health insurances differ quite a lot when it comes down to their service provision and quality. So one should always take further steps before choosing a health insurance. Still, the cheapest provider in Berlin charges a premium of 0.59%, while the most expensive one has a premium of 1.7%. That’s surely something to consider.