
Data journalism: How to find stories in numbers

Image credit: Pablo Rojas, Wellcome Images

Speed read

  • Data journalism means finding stories by processing large datasets

  • Free online tools now make it easy to manipulate, interpret and present numerical information

  • It may seem daunting, but data journalism is here to stay

Colleagues often ask me what data journalism is. They're confused by why it needs its own name — don't all journalists use data?

The term is shorthand for 'database journalism' or 'data-driven journalism', where journalists find stories, or angles for stories, within large volumes of data.

It overlaps with investigative journalism in requiring lots of research, sometimes against people's wishes. It can also overlap with data visualisation, as it requires close collaboration between journalists and digital specialists to find the best ways of presenting data.

So why get involved with spreadsheets and visualisation tools? At its most basic, adding data can give a story a new, factual dimension. But delving into datasets can also reveal new stories, or new aspects to them, that may not have otherwise surfaced.

Data journalism can also sometimes tell complicated stories more easily or clearly than relying on words alone — so it's particularly useful for science journalists.

It can seem daunting if you're trained in print or broadcast media. But I'll introduce you to some new skills, and show you some excellent digital tools, so you too can soon find your feet as a data journalist.

Where to begin

As in all journalism, ideas for stories can come from many sources. A statistic might not sound quite right, tempting you to look at the data behind it. Or you might have a question to answer: how has science funding changed in the UK, for example?

One way data journalism differs from other forms is that you may have no inkling of the story until well after you start investigating. That doesn't mean getting hold of any old data and expecting to find a story — rather that the story is what the data tells you. This presentation on The Guardian's Datablog gives an idea of the workflow in data journalism.

So how do you choose what to delve into? It's good to familiarise yourself with data types and sources in your 'beats' and when that data might be released, just as you would know conference or journal publication dates.

It's best to start small with your first data journalism projects, particularly while you get used to processing data and using the available tools. Your main challenge will probably be the time needed to process data. Peter Aldhous, the New Scientist San Francisco bureau chief, has produced a tutorial on how to approach science data journalism projects, and The Data Journalism Handbook also has tips on where to start.

Finding and accessing data

Data journalism experts say that journalists' roles are changing from hunting and gathering scarce information to processing information in 'an age of abundance'.


Data might be abundant, but some types of data are easier to get hold of than others. Governments are beginning to recognise the importance of releasing data — including research findings — but this varies from country to country, and even a government that believes in openness may lack adequate systems for making data accessible.

Some nations, such as Kenya, proactively make data available, while in others you'll have to ask — sometimes through systems such as India's Right to Information Act.

International bodies such as the World Bank release data, and projects such as Gapminder and Google Public Data Explorer collate data from various organisations. For science/health journalists, clinicaltrials.gov is a registry of clinical trial data. And environment or earth science reporters can access information from the US Geological Survey, for example.

You might even find some ready packaged data at your disposal. Data Dredger, a collaboration between Internews and Kenya's open government data initiative, provides links to Kenyan health reports and has infographics on health topics you can download and use in stories.

And the web is full of data; finding it just requires honing your search engine skills. Sometimes you can just search for a term plus 'data', or use a specialised academic search engine such as Google Scholar or Scirus. 'Semantic' web resources such as Wolfram|Alpha, which draw on additional data about content rather than just the keywords within a page, are also useful.

Google's advanced search allows you to narrow your results by domain extension, helping you to search for academic or government data, and file format — such as the Excel files in which you're most likely to find tables of figures or statistics. Tables and graphics are often uploaded as an image, so your data hunt should include Flickr and Google Images.
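
The same narrowing can be done by typing search operators straight into the query box. Here is a minimal Python sketch that assembles such a query using the `site:` and `filetype:` operators. The search term, domain and file extension are made-up examples, and `build_search_url` is a hypothetical helper name, not part of any Google tool.

```python
from urllib.parse import urlencode


def build_search_url(term, site=None, filetype=None):
    """Build a Google search URL narrowed by domain and file type."""
    parts = [term]
    if site:
        parts.append(f"site:{site}")          # e.g. a government or academic domain
    if filetype:
        parts.append(f"filetype:{filetype}")  # e.g. xls or csv for tables of figures
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})


# Hypothetical example: spreadsheets about malaria cases on Kenyan government sites.
url = build_search_url("malaria cases data", site="go.ke", filetype="xls")
print(url)
```

Swapping the domain or extension lets you repeat the same hunt across several countries or file formats.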

You can even retrieve data that have been deleted from the web but were 'cached' or saved as screenshots. Try the Internet Archive and its Wayback Machine to recover old files or broken URLs.
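
If you have many broken links to check, the lookup can be scripted. The sketch below assumes the Internet Archive's public availability endpoint at `https://archive.org/wayback/available` and the JSON layout I understand it to return (`archived_snapshots` containing a `closest` entry). Treat both as assumptions to verify against the Archive's own documentation; the function names are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://archive.org/wayback/available"  # assumed endpoint


def availability_url(page_url, timestamp=None):
    """Build the query asking for the archived copy closest to `timestamp` (YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the archived URL from a decoded JSON reply, or None if nothing is archived."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url, timestamp=None):
    """Make the network request; needs an internet connection."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as reply:
        return closest_snapshot(json.load(reply))
```

Calling `lookup("example.org/report.xls", "20120101")` would ask for the copy saved nearest to 1 January 2012; the address here is a placeholder.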

Social media can also be a data source. Tools such as SocialMention, 48ers, Twitterfall, Addictomatic, Boardreader and Whostalkin allow you to search by name, subject, time and geo-reference. An interesting example of social networks revealing news is the Eye on the Bailout project of ProPublica, an investigative journalism organisation, which has used social media mentions to alert journalists to new data on what has happened to the US 2008 bank bailout money.

Remember — it's good practice to link to, or state the sources of, your data.

Data handling

You've found the data, but can you use it? You'll need to import it into a spreadsheet such as those in Excel or Google Drive, so download data in 'comma-separated values' (CSV) format if possible.
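
CSV is popular because almost any tool can read it, not only spreadsheets. As a rough illustration, this Python sketch loads a small CSV with the standard `csv` module. The column names and figures are invented for the example; in practice the text would come from a downloaded file.

```python
import csv
import io

# Hypothetical CSV text. With a real download you would use something like
# open("funding.csv", newline="", encoding="utf-8") instead of io.StringIO.
raw = """year,country,science_budget_million
2010,Kenya,120
2011,Kenya,135
2012,Kenya,150
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Everything read from a CSV arrives as text, so convert numeric columns explicitly.
for row in rows:
    row["year"] = int(row["year"])
    row["science_budget_million"] = float(row["science_budget_million"])

print(len(rows), "rows; columns:", list(rows[0].keys()))
```

The explicit conversion step matters: figures left as text will sort alphabetically and cannot be averaged.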

You might have a table in a PDF file, or as a JPEG image file. Try a file converter like Zamzar to get these into spreadsheets. Optical character recognition software can also be a big help: a simple, free one is Free OCR. As a last resort you may have to input data manually, which is time-consuming and error-prone.

Wherever your data comes from, it probably needs 'cleaning' to make it useful. This can mean anything from reorganising and deleting data you don't need, to using tools such as OpenRefine (formerly Google Refine) to make the data more consistent (watch the video tutorials for guidance on what this cleaning can mean). Science journalists at least should have access to well-kept scientific data that needs less cleaning.
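
To give a flavour of what 'cleaning' involves, here is a minimal Python sketch of the kind of tidying such tools automate: trimming stray spaces, standardising capitalisation and dropping empty rows. It is not how OpenRefine itself works, and the records and the `clean_name` helper are invented for illustration.

```python
def clean_name(value):
    """Trim stray whitespace, collapse runs of spaces and standardise capitalisation."""
    return " ".join(value.split()).title()


# Hypothetical messy records: one city typed three ways, plus an empty row.
records = [
    {"city": "  new delhi ", "cases": "12"},
    {"city": "NEW  DELHI", "cases": "7"},
    {"city": "New Delhi", "cases": ""},
    {"city": "", "cases": ""},
]

cleaned = []
for record in records:
    city = clean_name(record["city"])
    if not city:
        continue  # drop rows with no usable label
    # Keep missing values as None rather than 0, so gaps are not mistaken for zeros.
    cases = int(record["cases"]) if record["cases"].strip() else None
    cleaned.append({"city": city, "cases": cases})

print(cleaned)
```

Recording a gap as missing rather than zero is a deliberate choice: a blank cell usually means 'not reported', and treating it as zero would distort any totals or averages.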

You'll also need to start doing some basic processing. You might sort data from smallest to largest or by location. You might be looking for averages, or to join or compare two datasets.
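
These operations are a few clicks in a spreadsheet, and just as short in code. The Python sketch below sorts, averages and joins two small datasets; the regions and all figures are hypothetical.

```python
from statistics import mean

# Hypothetical figures for illustration only.
spending = [
    {"region": "North", "research_spend": 40.0},
    {"region": "South", "research_spend": 25.0},
    {"region": "East", "research_spend": 55.0},
]
population = {"North": 8.0, "South": 10.0, "East": 5.0}  # millions, keyed by region

# Sort from smallest to largest.
by_spend = sorted(spending, key=lambda row: row["research_spend"])

# Average across regions.
average_spend = mean(row["research_spend"] for row in spending)

# Join the two datasets on the shared region name and derive a comparable figure.
joined = [
    {**row, "spend_per_person": row["research_spend"] / population[row["region"]]}
    for row in spending
    if row["region"] in population  # skip regions missing from the second dataset
]

print(by_spend[0]["region"], average_spend, joined)
```

Joining is often where the story appears: the region that spends most in total may not be the one that spends most per person.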

Treat data as a 'source': ask it questions as your audience might. And ask it lots of questions — the answer might not be what you first think. For example, a spreadsheet of journal retractions might suggest rising fraud detection, but you also need to ask whether there are other interpretations.
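
One such question for the retractions example is whether the number of papers published grew over the same period. The Python sketch below uses invented counts to show how a raw total and a rate can tell different stories; none of these numbers come from a real dataset.

```python
# Hypothetical counts: retractions and total papers published in three sample years.
retractions = {2001: 30, 2005: 60, 2010: 120}
papers_published = {2001: 1_000_000, 2005: 1_500_000, 2010: 2_000_000}

# Retractions per 100,000 papers published that year.
rates = {
    year: retractions[year] * 100_000 / papers_published[year]
    for year in retractions
}

first, last = min(rates), max(rates)
count_growth = retractions[last] / retractions[first]  # growth in the raw count
rate_growth = rates[last] / rates[first]               # growth in the rate

print(f"Raw count grew {count_growth:.1f}x; rate grew {rate_growth:.1f}x")
```

With these made-up figures the raw count quadruples while the rate only doubles, so half of the apparent rise simply reflects more papers being published.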

Think carefully about your results: do they sound plausible? It's best to check and recheck calculations. Don't ruin your reputation with a basic error.

You can strengthen your conclusions or pinpoint new questions with simple statistical analyses. For example, you might spot more catastrophic storms in your country each year for 20 years. But is this a significant result, or might it be natural variation that arose by chance? Tools such as the R Project and RStudio can help you judge that. You might also want to check your conclusions with experts or other experienced data journalists, particularly when you're starting out.
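
To make the storm question concrete, here is one simple way to test a trend, sketched in Python rather than R. It is a permutation test: shuffle the yearly counts many times and see how often chance alone produces an upward trend as steep as the real one. The storm counts are invented, the function names are mine, and the test assumes each year is independent of the last, which real climate data may not satisfy.

```python
import random


def slope(ys):
    """Least-squares slope of ys against years 0, 1, 2, ... (change per year)."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def permutation_p_value(ys, trials=10_000, seed=0):
    """Share of random reorderings with an upward slope at least as steep as observed."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = slope(ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if slope(shuffled) >= observed:
            hits += 1
    return hits / trials


# Hypothetical yearly counts of severe storms over 20 years.
storms = [3, 2, 4, 3, 5, 4, 4, 6, 5, 5, 7, 6, 6, 8, 7, 9, 8, 8, 10, 9]
p = permutation_p_value(storms)
print(f"slope = {slope(storms):.2f} storms per year, p = {p}")
```

A small value of `p` means a rise this steep would rarely appear by chance. It does not say what caused the rise, which is still a question for the experts.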

Presenting the data

Your presentation will depend on the story. There may be very little to present; you could have slaved to get a single but important figure to report in a conventional news piece — that your government has spent half what it promised on science, for example.

Or you might use data visualisation as an integral part of the story. This investigation from The Seattle Times in the United States combines a written feature with supporting graphs, maps and source documents. One is an interactive map; elements like this can be used within larger stories and projects, or can be self-contained, like this visualisation of the causes of death hosted by the UK newspaper The Guardian.

Online tools such as Tableau Public and Many Eyes can visualise data in various ways, while Google Fusion Tables, Geocommons and Indiemapper produce good maps using longitude/latitude data or more complex GIS data. Many of these tools also let you add an animation layer to show timescales, for example.
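
Mapping tools differ in the formats they accept, so check each one's import options. As one illustration of preparing longitude/latitude data, this Python sketch converts rows into GeoJSON, an open format that many mapping tools can read. The sites and readings are invented, and I am not claiming that any particular tool named above imports GeoJSON.

```python
import json

# Hypothetical monitoring sites with latitude/longitude in decimal degrees.
sites = [
    {"name": "Site A", "lat": -1.2921, "lon": 36.8219, "reading": 42},
    {"name": "Site B", "lat": -4.0435, "lon": 39.6682, "reading": 17},
]


def to_geojson(rows):
    """Convert rows with 'lat' and 'lon' keys into a GeoJSON FeatureCollection of points."""
    features = []
    for row in rows:
        properties = {k: v for k, v in row.items() if k not in ("lat", "lon")}
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


geojson_text = json.dumps(to_geojson(sites), indent=2)
print(geojson_text)
```

Swapping latitude and longitude is one of the most common mapping mistakes, so it is worth spot-checking that a point or two lands where you expect.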

Sometimes it's not just about presenting data, but letting your audience see what it means to them. This ProPublica project shows users whether their doctor receives drug company money, while this Texas Tribune effort shows you how US public money is spent.

Going further, this Guardian project asks readers to help analyse data on UK public spending. This kind of project, called a 'news app', requires collaboration between journalists and programmers to design and build applications that handle and analyse many variables within big databases or across many datasets.

I've been involved in a news app at Argentina's La Nación newspaper as part of my Knight International Journalism Fellowship. It uses national census information from 2001 and 2010, letting people explore how demographics have changed in their areas.

The website Information is Beautiful has examples of creative data visualisation, and shows how working with your publication's digital or graphics team can be productive.

You may need to persuade your editors to make time for data journalism. This gets easier when you see results, and this report (which I co-authored) on integrating data journalism into newsrooms might also help.

It might seem like a big ask, but evidence suggests that data journalism is the journalism of the future. If you can invest the time, you'll not only get better stories but you'll better serve your audience and the public interest.




Sandra Crucianelli is a Knight International Journalism Fellow. She is an investigative journalist and instructor, specialising in digital resources and data journalism. She is the founder and editor of Sololocal.info, an online magazine providing hyperlocal news from Bahía Blanca City, Argentina. See more: www.visualcv.com/sandracrucianelli