Preparing Data with Package stringr

by Karl-Kuno Kunze

At the beginning of most data science projects, data must be prepared for the task at hand. Very often only parts of the entries in a column are needed, or entries must be re-formatted, especially date and time entries. For this task you need good tools to manipulate text.

Even in its base version, R comes packed with many functions for that purpose. However, their syntax is partly inconsistent, they lag somewhat behind the abilities of languages like Python, or they require rather complex code. The package stringr by Hadley Wickham comes in quite handy to fill the gap.

Let me walk you through some simple examples where we use regular expressions. You can find a nice tool for learning and experimenting right here. You may also consult the Wiki.

Preparation

First, we load the package:
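A minimal sketch of this step, assuming stringr is installed from CRAN (the commented-out install line only needs to be run once):

```r
# install.packages("stringr")  # run once if the package is not yet installed
library(stringr)               # attach stringr so the str_*() functions are available
```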

This is our test string (meaning: Kick-Off European Soccer-Championship):
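The original string is not reproduced here, so the following is only a stand-in. The name `test_string` and its exact content (including the date and the kick-off time) are assumptions chosen to fit the examples that follow:

```r
# Hypothetical test string (an assumption, not the original one):
# "Kick-off European Soccer Championship: 10 June 2016, 9 pm" in German notation
test_string <- "Anpfiff Fussball-Europameisterschaft: 10. Juni 2016, 21.00 Uhr"
test_string
```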

Find and extract strings

Does the string contain a certain pattern?
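One way to answer this question is `str_detect()`, which returns `TRUE` or `FALSE` for each input string. The sketch below uses an assumed stand-in for the test string:

```r
library(stringr)

# Hypothetical test string (an assumption, not the original one)
test_string <- "Anpfiff Fussball-Europameisterschaft: 10. Juni 2016, 21.00 Uhr"

# Is the pattern present anywhere in the string?
has_europa <- str_detect(test_string, "Europa")             # TRUE
has_wm     <- str_detect(test_string, "Weltmeisterschaft")  # FALSE
```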

You may also extract values that match a pattern. In the example we would like to extract the time. For our purposes, a time consists of two pairs of digits, separated by a period and followed by ‘ Uhr’ (which means o’clock):
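The pattern described above can be sketched with `str_extract()`. The test string is again an assumed stand-in; the second call uses a lookahead so that only the digits are returned, which is an optional variation rather than part of the original example:

```r
library(stringr)

# Hypothetical test string (an assumption, not the original one)
test_string <- "Anpfiff Fussball-Europameisterschaft: 10. Juni 2016, 21.00 Uhr"

# Two digits, a literal period, two digits, then " Uhr"
time_with_unit <- str_extract(test_string, "\\d{2}\\.\\d{2} Uhr")      # "21.00 Uhr"

# Same pattern, but " Uhr" is only required to follow, not returned
time_only <- str_extract(test_string, "\\d{2}\\.\\d{2}(?= Uhr)")       # "21.00"
```

Note that the period must be escaped (`\\.`), because an unescaped `.` matches any character.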

Replace strings

You may also replace strings. Simple things first. Replace the name of the month with its number:
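A minimal sketch with `str_replace()`, which replaces the first match of a pattern. The test string and the choice to include the surrounding spaces in the pattern are assumptions:

```r
library(stringr)

# Hypothetical test string (an assumption, not the original one)
test_string <- "Anpfiff Fussball-Europameisterschaft: 10. Juni 2016, 21.00 Uhr"

# Replace " Juni " (with its surrounding spaces) by "06." to get a compact date
with_month_number <- str_replace(test_string, " Juni ", "06.")
with_month_number
# "Anpfiff Fussball-Europameisterschaft: 10.06.2016, 21.00 Uhr"
```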

Be careful when doing multiple replacements. For example, here:
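The pitfall can be reproduced by passing vectors of patterns and replacements to `str_replace()`. Because the function is vectorised, the single input string is recycled, and the result has one element per pattern. The test string and the replacement values are assumptions:

```r
library(stringr)

# Hypothetical test string (an assumption, not the original one)
test_string <- "Anpfiff Fussball-Europameisterschaft: 10. Juni 2016, 21.00 Uhr"

# Two patterns and two replacements: the string is recycled to length 2
separate_copies <- str_replace(test_string,
                               c(" Juni ", "Uhr"),
                               c("06.", "o'clock"))
separate_copies
# [1] "Anpfiff Fussball-Europameisterschaft: 10.06.2016, 21.00 Uhr"
# [2] "Anpfiff Fussball-Europameisterschaft: 10. Juni 2016, 21.00 o'clock"
```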

Both rules have been applied, but each one to a different copy of the string.

If we prefer to apply both replacements to the same string, we should write:
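A sketch using `str_replace_all()` with a named character vector, where each name is a pattern and each value is its replacement. The test string and the replacement values are assumptions:

```r
library(stringr)

# Hypothetical test string (an assumption, not the original one)
test_string <- "Anpfiff Fussball-Europameisterschaft: 10. Juni 2016, 21.00 Uhr"

# Names are the patterns, values are the replacements;
# all rules are applied to the same string
both_applied <- str_replace_all(test_string,
                                c(" Juni " = "06.", "Uhr" = "o'clock"))
both_applied
# "Anpfiff Fussball-Europameisterschaft: 10.06.2016, 21.00 o'clock"
```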

The replacements are passed as a named vector: each name is a pattern and each value is its replacement. An extremely powerful concept, as you can see.

Functions for stringr: 101

For your first steps with stringr you may want to check out the following functions in addition to the ones above:

  • str_length()
  • str_locate()
  • str_match()
  • str_split()
  • str_sub()
  • str_trim()
  • str_wrap()
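To get a first feeling for these functions, here is a short sketch that applies each of them once. The input strings are made up for illustration and all variable names are hypothetical:

```r
library(stringr)

raw <- "  21.00 Uhr  "                      # made-up input with surplus whitespace

trimmed <- str_trim(raw)                    # "21.00 Uhr": strip leading/trailing whitespace
n_chars <- str_length(trimmed)              # 9: number of characters
time    <- str_sub(trimmed, 1, 5)           # "21.00": characters 1 to 5
loc     <- str_locate(trimmed, "Uhr")       # matrix with columns start = 7, end = 9
parts   <- str_split(time, fixed("."))[[1]] # c("21", "00"): split at the literal period
groups  <- str_match(trimmed, "(\\d{2})\\.(\\d{2})")
# groups is a matrix: full match "21.00", first group "21", second group "00"

# str_wrap() inserts line breaks so that lines stay within the given width
long_text <- "Anpfiff Fussball-Europameisterschaft: 10. Juni 2016, 21.00 Uhr"
cat(str_wrap(long_text, width = 30), "\n")
```

`fixed(".")` tells stringr to treat the period literally instead of as a regular expression.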

Summary

With the package stringr you have a well-stocked toolbox that turns cumbersome operations on data into a pleasure. Have fun!


Cover image by Rainer Sturm / pixelio.de.