Data Journalism on a Shoestring Budget: 5 Easy Steps
By Sam Petulla, May 29th, 2014
Welcome to the latest post in our Contently Labs series, where we answer common questions we hear from current or prospective brand publishers. Today’s question: “How do I start telling cool stories with data?”
“Data science” is the buzzword du jour, with marquee publications such as The New York Times hiring physics Ph.D.’s to shuffle through haystacks of information and find the needles that will unlock major stories.
But not all publishers have those resources at their disposal—and not all content strategists have advanced degrees in probabilistic theory. Still, there are plenty of data-science tricks that can help those on a shoestring budget.
It all starts with understanding that data storytelling has a clear goal: To use simple rules of statistics to deliver valuable content to your readers. As the Harvard Business Review recently put it, “You’re going to draw someone in with a story so that they’re consuming the stuff they didn’t want to consume—the data.” In other words, it’s a way of conveying what was previously incommunicable.
Still, we know that data science can seem really scary. At high levels, it kind of is. Formulas like this one can give you nightmares:
But with some basic building blocks (no sigma notation required), you’ll be well on your way to sourcing great stories through numbers.
1. Finding data
The first thing you want to do is to think about where to find data to tell your story; curiosity is key here. The data should interest you enough that you want to explore it for stories, and you want to keep an open mind so that you don’t find yourself trying to find ways to make the data fit a preconceived theory.
With that in mind, go ahead and get started with large data sets. For example, New York City now releases Excel data files every month that include every single ride on its Citibike service, a public biking system used tens of thousands of times per month.
That may sound like a massive, unusable dataset, but it is actually manageable. You can create visuals like the one below. The datasets are in the tens of megabytes: large, but not so large that they hit the maximum worksheet size for Excel (65,536 rows in Excel 2003 and earlier; 1,048,576 rows in Excel 2007 and later) or require cloud computing resources.
You’ll want to find similarly sized datasets when you set out to tell your stories—or even much smaller. Datasets in the tens or hundreds of rows are fine.
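If you are unsure whether a download will open in Excel, a quick row count settles it. The following is a minimal Python sketch, not part of any official tooling; the function names and the tiny sample data are made up for illustration, and the limits are the worksheet row limits of the legacy .xls format and the newer .xlsx format.

```python
import csv
import io

# Worksheet row limits: 65,536 for the legacy .xls format, 1,048,576 for .xlsx.
XLS_MAX_ROWS = 65_536
XLSX_MAX_ROWS = 1_048_576


def count_rows(csv_file):
    """Count the rows in an open CSV file, header row included."""
    return sum(1 for _ in csv.reader(csv_file))


def excel_fit(row_count):
    """Say which Excel format, if any, can hold this many rows."""
    if row_count <= XLS_MAX_ROWS:
        return "fits in any version of Excel"
    if row_count <= XLSX_MAX_ROWS:
        return "fits in .xlsx only"
    return "too large for Excel"


# A tiny made-up sample standing in for a real download.
sample = io.StringIO("trip_id,duration_sec\n1,634\n2,1547\n3,178\n")
rows = count_rows(sample)
print(rows, "rows:", excel_fit(rows))
```

For a real file, you would pass `open("your_file.csv", newline="")` to `count_rows` instead of the in-memory sample.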
So where might you find data? There are plenty of open data resources you can go to. Open Data Stack Exchange is a good place to request data and learn where to find it, and Visually has aggregated a fantastic list of open data sources.
Also, never underestimate the stories that can be gleaned from your own data. Here are some examples of companies that have learned just that:
- We like to cover the work of PornHub’s data team, partially because if PornHub can tell data-driven stories, anyone can. PornHub uses its viewership data to tell stories about how political leanings, geography, and U.S. holidays change how much of its content is watched.
- IBM’s “A Boy and His Atom” film takes data science to an atomic level, winning many awards for its simple but effective storytelling.
- Foursquare provided an awesome series of maps showing check-ins in major cities over a one-year period. It’s a simple, available dataset for them that tells a fascinating story for the reader.
2. Using tools to organize data
In terms of labor, the hardest part of working with data is often organizing it into a usable format.
It’s not uncommon to find data with spelling or format variations like this:
As the visual shows, your computer reads these values differently depending on their variations.
Before you start on your analysis and visualization, you will need to set aside some clean-up time. Some basic best practices include: homogenizing spellings, data types, spacing, and ordering; avoiding blank entries or NaN values; and using a column and row format with a single header.
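If you are comfortable with a little scripting, the same clean-up rules can be written down once and reused. Here is a rough Python sketch under stated assumptions: the list of blank markers, the `CANONICAL` spelling map, and the sample values are all hypothetical, and title-casing is a blunt choice that will not suit every column.

```python
import re


def clean_value(raw):
    """Trim and collapse whitespace, normalize case; return None for blanks."""
    if raw is None:
        return None
    text = re.sub(r"\s+", " ", str(raw)).strip()
    # Treat empty strings and common placeholder markers as missing.
    if text == "" or text.lower() in {"nan", "n/a", "null"}:
        return None
    return text.title()


# Hypothetical mapping from known variants to one canonical spelling.
CANONICAL = {"Nyc": "New York", "New York City": "New York"}


def clean_column(values):
    """Clean every value, drop the blanks, and unify known variants."""
    cleaned = (clean_value(v) for v in values)
    return [CANONICAL.get(v, v) for v in cleaned if v is not None]


messy = ["New York", " new york", "NEW  YORK ", "NYC", "", "NaN"]
print(clean_column(messy))
```

The six messy entries collapse to four identical "New York" values, and the blank and NaN entries are dropped rather than silently counted.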
Noah Veltman, a data journalist at WNYC, also wrote a helpful blog post on tackling these tasks while working with data.
If your data requires more complex processing that can’t be done by hand, with Excel’s “Find and Replace,” or with other simple functions, there are free open-source tools that the data community uses to streamline cleaning, such as these:
- Kimono Labs recently released a data scraping tool that can pull information off of a website without writing code in a way that rivals high-level scripting. If you’re starting out, it’s a good way to get the data you need. The website offers video introductions.
- Google Refine (since renamed OpenRefine) is useful if you can’t pull your data into Excel because of its structure, or if you need to do large-scale transformations of data formatting. It’s best to Google Search your specific problem and find tutorial videos.
Also, remember to always step back as you are working with your data to spot-check your progress. It’s easy to get lost in the analysis stages without thinking about the broader story you’re uncovering. This can lead to duplicate work or unnecessary steps.
3. Finding the story in the data
Once you have your data, you can start to find the stories inside it. This is the fun part! You often find insights you never expected. But remember that “the data doesn’t speak for itself”; that’s your role as a storyteller.
Thankfully, there are some basic best practices you can follow to find your stories.
Data science generally relies on a handful of techniques, which you can carry out by hand before reaching for algorithms: sorting, classifying and clustering, predicting, and finding relationships. For starters, stick to simple regression analysis, which fits a line through your data, and to looking for correlations. Build your own rows and columns.
In Excel, you can perform a basic regression analysis by using the “Add Trendline” option when plotting your data. Pivot tables are a good way to get this process started. This can be very useful if you are trying to show how two or more variables relate to each other. A trendline will also show how those variables change in relation to each other.
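For readers curious about what a fit line actually computes, here is a small Python sketch of an ordinary least-squares line through a set of points. It is an illustration only: the function name `fit_line` and the monthly ride figures are made up.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Spread of x around its mean; zero means a vertical line, which has no slope.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    # How x and y move together around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up numbers: month number vs. rides in thousands.
months = [1, 2, 3, 4, 5]
rides = [12, 15, 19, 22, 27]
slope, intercept = fit_line(months, rides)
print(f"rides = {slope:.2f} * month + {intercept:.2f}")
```

The slope is the part that usually carries the story: in this invented example it says ridership grows by roughly 3.7 thousand rides per month.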
If you want to distill insights from a trendline, looking for correlations is your best bet. Try creating buckets for your data by separating it into different sections, then use the CORREL function to see the relationship between your buckets. Lastly, calculate the variance and standard deviation for your data to ensure your conclusions are reasonably sound. Excel has built-in functions for these as well (VAR and STDEV, or VAR.S and STDEV.S in newer versions), which are easy to locate with a Google Search.
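The same three checks can be sketched outside Excel. The Python below is a minimal illustration with invented numbers: `pearson` is a hand-written helper that computes the Pearson coefficient (the quantity CORREL reports), and `statistics.variance` and `statistics.stdev` are the sample versions, matching Excel's VAR and STDEV.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (spread_x * spread_y) ** 0.5


# Made-up buckets: daily high temperature (F) and rides (thousands) on the same days.
temps = [55, 61, 68, 74, 80]
rides = [11, 14, 18, 20, 26]

print("correlation:", round(pearson(temps, rides), 3))
print("sample variance of rides:", round(statistics.variance(rides), 3))
print("sample std dev of rides:", round(statistics.stdev(rides), 3))
```

A coefficient near 1 or -1 suggests the two buckets move together; a large standard deviation relative to the mean is a hint to be cautious about any single average you quote.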
If you want to try a ready-made solution to algorithmic clustering and correlation finding, take a look at Datahero, a startup trying to make products to find the relationships for you. But take it with a grain of salt: the tool is still only as good as its user.
4. Using tools to visualize data
Once you’ve explored your data and found your story, you can start making it engaging.
There are plenty of free charting tools you can use to add graphics, such as Chartbuilder (Quartz’s open tool, which is used by NPR, WSJ, and other outlets) and Plotly, a new set of charting tools with many options for visualization.
When you make charts, there are a couple of basic rules of thumb to follow:
- Avoid pie charts, donut charts, and the overuse of color. (See below.) Keep the focus on the insights.
- Use consistent axes to avoid misleading the reader into seeing visuals that are inconsistent with your data.
If you want to take things further, try D3.js, an open-source library created by a member of The New York Times graphics department. It’s not for the faint of heart, but you can use its pre-made examples with some minor tweaking by swapping your data out for the data used in the examples.
5. Telling stories with visualized data
Lastly, remember that your story is only as good as the visuals with which you tell it. After all, the data is only a piece of your story, and it is up to you to communicate its importance and relate it to relevant insights.
A great recent example is this blog post on how one man’s life changed after he had a child, told through data.
He quantifies many of the changes in his life, such as the volume of email he sent and received (seen below) prior to his son’s birth.
It’s compelling because you can clearly understand the data, and it tells a vivid story about how the man’s life changed over time.
Bonus: Going full-on data scientist with next-level skills
This guide is meant to provide a solid overview of how to get started with data science, but when you’re ready to go further, consider tackling some of the tasks below:
- Learn to use APIs to pull in data from external feeds. APIs are the information infrastructure of the Internet. They’re how you can grab and post data these days. Learn to use them and you’ll open up your possibilities.
- Break your data sample into test and training sets, and run more experiments to ensure accuracy. This is the next basic process for validating that your conclusions are sound. When you start to take your analyses further, you’ll need to do this.
- Perform network analysis using tools like Gephi or R.
- Subscribe to a data newsletter, such as O’Reilly’s Quick Bytes.
- Take open courseware such as CS109 or any of Coursera’s popular offerings.
- Read a book to improve your hacking or machine learning skills. Some good starting points are Machine Learning for Hackers and Mining the Social Web.
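To make the test-and-training idea from the list above concrete, here is a minimal Python sketch. It is one simple way to do it, not a standard recipe: the function name `train_test_split`, the 20 percent default, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (training, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator makes the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


records = list(range(100))  # stand-in for 100 rows of real data
train, test = train_test_split(records)
print(len(train), "training rows,", len(test), "test rows")
```

The idea is to look for your pattern in the training rows only, then check whether it still holds in the test rows you set aside.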
Image via FourSquare