Speakers
-
Karthik Ram
testdat: Unit testing for tabular data
-
Peter Murray-Rust
Liberating 100 million facts as CSV
-
Gabriela Rodriguez
Vozdata: Open data with the help of the community
-
Francisco Mekler
Public budget from PDF to CSV
-
Felienne Hermans
Spreadsheets Are Code
-
Alexandre Vallette
Hacking Public Infrastructures
-
Ingrid Burrington
Internet Groundtruth
-
Christopher Gandrud
Improving access to panel series data for social scientists: the `psData` package
-
Thomas Levine
Comma search, and Tom’s views on searching across data tables
-
Adam Retter
CSV Validation at the UK National Archives
-
Bernard Lambeau
Data Deserves a Language Too
-
Friedrich Lindenberg
Dr. Freezefile, or how I learned to stop rendering and freeze my apps
-
James Smith
CSVlint: publishing data that doesn't suck
-
Ed Freyfogle
A living hell: lessons learned in eight years of parsing the world's real estate data
-
Matt Senate
Data-Hacking with Wikimedia Projects: Learn by Example, Including Wikipedia, WikiData and Beyond!
-
Aaron Schumacher
Data and Truth
-
Olaf Veerman
Openrosa for call centers
-
Owen Jones
The COMPADRE and COMADRE population matrix databases
-
Javier Arce
So you have CSVs, now what?
-
Steven Beeckman
Opening Data Within Organisations
-
Brian Jacobs
Querying the sum of all human knowledge
-
Mitar Milutinovic
Collaborating on open dataset of all academic publications
-
Ana Carvalho, Sara Moreira, Ricardo Lafuente
Datacentral: Using Data Packages for static data portals
-
Nick Stenning
Data Packages: Put it in a box
-
Jeni Tennison
CSV on the Web
-
Jeremy Krinsley
Every Parse Is Sacred
-
Alf Eaton
VEGE-TABLE: the data table that grows
-
David McKee
XYPath and Messytables - Traversing Spreadsheets in Python
-
Paul De Schacht
Ease the pain of parsing data
-
Ashley Casovan, Antonio Acuna
How do we improve data quality internationally?
Presentations
-
10:10 - 10:30
CSV on the Web
Galeria
Jeni Tennison
Jeni is the Technical Director at the Open Data Institute and the co-chair of the W3C's CSV on the Web Working Group.
The W3C CSV on the Web Working Group has been looking at how to more tightly specify CSV and the provision of metadata that makes it possible to validate CSV and convert it to other formats. This talk will describe the work of the Working Group and its current status.
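The kind of validation that such metadata enables can be sketched in Python. The metadata dictionary below is a deliberately simplified stand-in, not the Working Group's actual CSVW vocabulary; the point is only the idea that declared column names and datatypes can drive mechanical validation of a CSV file.

```python
import csv
import io

# Hypothetical, much-simplified metadata in the spirit of CSV on the Web.
metadata = {
    "tableSchema": {
        "columns": [
            {"name": "country", "datatype": "string"},
            {"name": "population", "datatype": "integer"},
        ]
    }
}

CASTS = {"string": str, "integer": int}

def validate(csv_text, meta):
    """Return (row_number, message) pairs for cells that fail their datatype."""
    cols = meta["tableSchema"]["columns"]
    errors = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for rownum, row in enumerate(reader, start=2):
        for cell, col in zip(row, cols):
            try:
                CASTS[col["datatype"]](cell)
            except ValueError:
                errors.append((rownum, f"{col['name']}: {cell!r} is not {col['datatype']}"))
    return errors

sample = "country,population\nGermany,80620000\nFrance,not-a-number\n"
print(validate(sample, metadata))  # one datatype error on row 3
```

The same metadata can also drive conversion to other formats, since it tells a consumer how to interpret each column.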
-
Collaborating on open dataset of all academic publications
Room 1
Mitar Milutinovic
Mitar Milutinovic is a PhD student at UC Berkeley researching ways to improve how researchers collaborate. He currently works on PeerLibrary, a platform for collaborative reading of academic publications.
In PeerLibrary we recognize the need for an open dataset of all academic publications. Several issues stand in the way of building such a dataset: multiple existing sources (many of them proprietary), conflicting entries, and legal issues with some parts of the data (e.g. full text). In addition, collaborative curation of this data is not yet possible. I would like to present what we are doing on this topic, but mostly learn from and about others.
-
The COMPADRE and COMADRE population matrix databases
Room 2
Owen Jones
Owen is an Assistant Professor at the University of Southern Denmark, where his research interests are in population biology and evolution. He co-manages the COMPADRE and COMADRE Plant and Animal Matrix Databases (www.compadre-db.org), which hold population dynamics data obtained from the published literature and will soon make it available to all.
Evolutionary biologists aim to make sense of population behaviour in species across the tree of life. However, the collection of animal and plant population data is laborious and costly, so analyses that try to generalise across many species are not feasible unless data are shared among researchers or obtained from the literature. I will report on the 20+ year journey of constructing two databases that collate demographic data from the published literature on more than 2000 species, with the aim of making it openly available to all. I will briefly outline why these data are important, describe the process of data production, and contemplate the lessons learned along the way.
-
10:35 - 10:55
Every Parse Is Sacred
Galeria
Jeremy Krinsley
Jeremy was the first non-CEO/co-founder to begin working at Enigma, and has been a part of the company from pre-seed-money hacking to its current exciting position as a major public data interface. He is also a musician who performs and releases music regularly under various monikers.
Lean ETL, unintentional DDoS attacks, impossible parses, backdoor APIs, practices in humility, and parsing tactics gleaned from three years as a lead parser at Enigma.io, a company that specializes in normalizing siloed public data.
-
Data Packages: put it in a box
Room 1
Nick Stenning
Nick is a programmer. He is currently the technical director of Open Knowledge, and previously worked on the infrastructure of GOV.UK.
Numerous attempts have been made over the years to standardise data interchange formats. XBRL, ASN.1, X12, and other terrifying acronyms have caused thousands of developers years of pain and suffering. By contrast, CSV is simple, clear, and easy to use. But it can sometimes be useful to know a little more about the data you have at hand, ideally in a format that is machine-readable and standardised.
I work at Open Knowledge, where we have authored a number of simple standards to improve the portability and reusability of CSV-based data formats. I'll be talking about what data packages are, the Tabular Data Package specification, and some of the tools we've built to help with packaging and using data in these formats.
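A Data Package is, at heart, a `datapackage.json` descriptor sitting next to the data files it describes. The sketch below shows a minimal descriptor for a tabular package; the field names (`name`, `resources`, `path`, `schema`, `fields`) follow the published specification as I understand it, but the example package itself is invented, not a validated one.

```python
import json

# A minimal datapackage.json descriptor in the spirit of the
# Tabular Data Package specification (illustrative, not validated).
descriptor = {
    "name": "example-gdp",
    "resources": [
        {
            "path": "data/gdp.csv",
            "schema": {
                "fields": [
                    {"name": "year", "type": "integer"},
                    {"name": "gdp", "type": "number"},
                ]
            },
        }
    ],
}

def field_names(pkg):
    """Map each resource path to the column names its schema declares."""
    return {
        r["path"]: [f["name"] for f in r["schema"]["fields"]]
        for r in pkg["resources"]
    }

print(json.dumps(descriptor, indent=2))
print(field_names(descriptor))
```

Because the descriptor is plain JSON next to plain CSV, any tool can read the package without special libraries, which is exactly the portability argument of the talk.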
-
Hacking Public Infrastructures
Room 2
Alexandre Vallette
Data scientist and co-founder of SNIPS, his passion for machine learning applied to geographical data and networks comes from his PhD in chaos theory. An Open Data enthusiast, he is committed to showing how open innovation can lead to better governance and a better economy, hence the creation of ANTS, an innovation lab for public institutions.
Waste management is neither a sexy nor a scalable business. Most of the time these two facts are enough to repel all high-tech startups, leaving a vital public service far away from what can be achieved with modern technology.
Recycling centers are overcrowded. People get angry and don't sort their waste correctly, increasing the amount that is buried or burned. Enter a bunch of hackers, data scientists and makers dedicated to producing the best open solution ever. From Open Data, they build a predictive model indicating the best time to go to the recycling center. The information is piped to a mobile app that crowdsources where waste is produced, which recycling center can take care of it, and all the alternatives for reuse. And when more precise data is needed, they'll hack a solution to produce it.
-
11:30 - 11:50
Internet Groundtruth
Galeria
Ingrid Burrington
Ingrid Burrington lives on an island off the coast of America.
Computers used to be the size of entire rooms. While the hardware has gotten smaller, the space the network occupies has gotten bigger: we're surrounded by hidden infrastructure, from cell towers to fiber lines to networked cameras. For the last few months I've been working on a field guide to seeing that infrastructure, using a mix of sometimes unlikely data sources, including spraypainted street markings for excavation work and DOT permits. This is a talk about finding data in unexpected places, why that data matters, and the value of fieldwork in data-making.
-
Improving access to panel series data for social scientists: the `psData` package
Room 1
Christopher Gandrud
Christopher Gandrud is a post-doctoral fellow at the Hertie School of Governance and a member of the rOpenGov community working to increase the openness of government and social science data.
Social scientists have access to many electronically available panel series datasets. However, downloading, cleaning, and merging them together is time-consuming and error-prone: for example, using Reinhart and Rogoff's data on the fiscal costs of the financial crisis involves downloading, cleaning, and merging four Excel files with over 70 individual sheets, one for each country's data. Furthermore, because such datasets are not bundled in a format that is easy to manipulate, many of them are not updated on a regular basis.
In this talk, we introduce the psData package for the R statistical software. This package is being developed under the rOpenGov framework to solve two problems:
- Time wasted by social scientists downloading, cleaning, and transforming commonly used datasets for their own research
- Errors introduced by data import and transformation scripts that are written individually and never shared across researchers
The psData package aims to address these problems by distributing easy-to-use R functions for downloading, cleaning, and merging datasets used by social scientists. The package focuses on panel series data, which are frequently found in political science and macroeconomics. It is hosted on GitHub and can be easily added to and modified by the community, which allows fixes and patches to distributed datasets to reach all users simultaneously, improving overall data quality.
The team behind the rOpenGov/psData project currently includes contributors from universities in five countries, and many will be present in Berlin at the time of the conference.
GitHub repository: https://github.com/rOpenGov/psData
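psData itself is an R package, but the download-clean-merge workflow it automates can be sketched in Python: several panel datasets share a (unit, time) key and are outer-merged into one table. The dataset names and values below are invented placeholders, not real psData sources.

```python
# Two toy "panel series" datasets keyed on (country_code, year).
# Contents are invented for illustration only.
polity = {("DEU", 2010): {"polity2": 10}, ("FRA", 2010): {"polity2": 9}}
wdi = {("DEU", 2010): {"gdp_growth": 4.1}, ("FRA", 2010): {"gdp_growth": 2.0}}

def merge_panels(*panels):
    """Outer-merge panel datasets that share a (unit, time) key."""
    merged = {}
    for panel in panels:
        for key, row in panel.items():
            merged.setdefault(key, {}).update(row)
    return merged

combined = merge_panels(polity, wdi)
print(combined[("DEU", 2010)])
```

Doing this once, in shared and reviewed code, is the package's answer to every researcher rewriting the same error-prone merge scripts.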
-
Comma search, and Tom’s views on searching across data tables
Room 2
Thomas Levine
Thomas Levine is a dada artist who has recently been doing silly things with lots of spreadsheets.
I see two main issues in the common means of searching across data tables. One issue is that the search is localized to datasets that are published or otherwise managed by a particular entity, and another is that the search mechanism doesn't use information relating to the tabular structure of these data tables.
We can do better! Comma search indexes as many data tables as you want, regardless of where they’re from, and it uses the tabular structure of these data tables to search for them. You provide a spreadsheet as input and receive a list of spreadsheets in return. In this talk, you’ll learn how it works and how to use it.
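The abstract doesn't specify how Comma search ranks tables, so the sketch below uses one plausible structural signal: overlap between the header sets of the query spreadsheet and each indexed spreadsheet. The index contents and the ranking function are illustrative assumptions, not the tool's actual algorithm.

```python
def jaccard(a, b):
    """Similarity between two header sets (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# A toy index: spreadsheet path -> its column headers.
index = {
    "schools.csv": ["name", "city", "enrollment"],
    "budgets.csv": ["year", "agency", "amount"],
    "cities.csv": ["city", "population", "name"],
}

def search(query_headers, index):
    """Rank indexed tables by header similarity to the query table."""
    scored = [(jaccard(query_headers, headers), path)
              for path, headers in index.items()]
    return [path for score, path in sorted(scored, reverse=True) if score > 0]

print(search(["name", "city"], index))  # the two tables sharing headers rank first
```

This captures the talk's key contrast with conventional portals: the input is itself a spreadsheet, and the tabular structure, not free text, drives the match.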
-
11:55 - 12:15
Querying the sum of all human knowledge
Galeria
Brian Jacobs
Brian Jacobs is a designer and interactive developer, currently a Knight-Mozilla Fellow at ProPublica in New York.
Wikipedia can be a powerful open data resource. The official Wikipedia API provides access to page content and metadata, but there are other projects that go much further, offering access to structured and Linked Data knowledge bases that enable advanced querying of data extracted from Wikipedia and beyond. Some of these projects are DBpedia, Freebase, Wikidata and the Encyclopedia of Life. Each is geared towards a different audience, with varying levels of ease of use and maturity. I'll talk about the strengths and weaknesses of each project and how to join and query for data across spatiotemporal, governmental, cultural, and scientific domains using Linked Data query languages.
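As a taste of the Linked Data querying the talk covers, here is a SPARQL query sent to DBpedia's public endpoint, built in Python. The endpoint URL and the `dbo:`/`dbr:` property names follow DBpedia conventions as I know them, but should be double-checked against current documentation before use; the snippet only constructs the request URL rather than fetching it.

```python
from urllib.parse import urlencode

# SPARQL: the five most populous cities recorded for Germany in DBpedia.
query = """
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:country dbr:Germany ;
        dbo:populationTotal ?population .
}
ORDER BY DESC(?population) LIMIT 5
"""

def dbpedia_url(sparql, fmt="text/csv"):
    """Build a request URL for DBpedia's public SPARQL endpoint."""
    return "https://dbpedia.org/sparql?" + urlencode({"query": sparql, "format": fmt})

url = dbpedia_url(query)
print(url[:80])
```

Asking for `text/csv` output brings the answer straight back into the conference's favourite format.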
-
Data Deserves a Language Too
Room 1
Bernard Lambeau
Bernard Lambeau is a postdoc researcher at the University of Louvain (Belgium). His current interests include software engineering and databases. He continuously tries to build bridges between these two domains, and frequently writes open source code towards this goal.
The advent of the Internet and heterogeneous distributed software calls for new tools for validating, documenting, coercing and transforming data. Surprisingly, conventional programming languages provide weak support for this because they strongly focus on behavior. Indeed, conventional type systems are designed towards fast code execution but provide only limited type safety from a pure data perspective. In contrast, data requirements often involve capturing precise data type definitions (e.g. the set of "integers less than 100", or the set {M,F}) but call for no behavior per se.
This talk is about Finitio (finitio.io), a language strongly biased towards data. Finitio has a dedicated type system for capturing, validating and coercing data, and an interoperability layer to play nice with existing programming languages. The talk is a guided tour to Finitio; it also briefly discusses its motivation and origin.
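Finitio has its own syntax for such type definitions; purely to make the idea concrete, the two example types from the abstract can be mimicked in Python as a base type plus a constraint, with a coercion step that either produces a typed value or rejects the input. This is a sketch of the concept, not Finitio's actual API.

```python
class DataType:
    """A data type as base coercion plus a membership constraint."""

    def __init__(self, base, constraint=lambda v: True):
        self.base = base
        self.constraint = constraint

    def coerce(self, value):
        """Coerce a raw value into the type, or raise if it doesn't belong."""
        v = self.base(value)
        if not self.constraint(v):
            raise ValueError(f"{value!r} violates the type constraint")
        return v

# The two examples from the abstract:
small_int = DataType(int, lambda v: v < 100)    # "integers less than 100"
sex = DataType(str, lambda v: v in {"M", "F"})  # the set {M, F}

print(small_int.coerce("42"))  # 42
print(sex.coerce("F"))         # 'F'
```

The point of the talk is that a language biased towards data makes such definitions first-class, instead of scattering them through validation code like this.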
-
CSV Validation at the UK National Archives
Room 2
Adam Retter
Adam is an open source hacker, data architect and freelance consultant. Recently, working with The National Archives (UK), he developed a CSV Schema language and validation tool (https://github.com/digital-preservation/csv-validator). He helps several W3C Working Groups as an Invited Expert, and is currently writing a book for O'Reilly on eXist (http://shop.oreilly.com/product/0636920026525.do).
At the National Archives we have developed a simple schema language for describing the data within a CSV file. This schema language, which we call CSV Schema, enables you to make quite complex rule-based assertions about the data within a CSV file. We have also built a tool, which we call CSV Validator, that takes a CSV Schema and a CSV file and validates the file, producing a report of any problems.
Both the CSV Schema language (http://digital-preservation.github.io/csv-schema/csv-schema-1.0.html) and the CSV Validator tool are released freely as Open Source projects on GitHub. We believe that our tools will be of interest to others, such as those attending this conference.
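CSV Schema has its own textual syntax, defined in the specification linked above. To show only the underlying idea, here is a Python sketch of per-column, rule-based assertions checked against every row of a CSV file, producing a problem report; the column names and rules are invented, and none of this is the validator's real implementation.

```python
import csv
import io
import re

# Invented per-column rules in the spirit of rule-based CSV assertions.
rules = {
    "id": lambda v: re.fullmatch(r"[A-Z]{3}\d{4}", v) is not None,
    "date": lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
}

def validate(csv_text, rules):
    """Check every row against every column rule; return a problem report."""
    reader = csv.DictReader(io.StringIO(csv_text))
    report = []
    for lineno, row in enumerate(reader, start=2):
        for column, ok in rules.items():
            if not ok(row[column]):
                report.append(f"line {lineno}: bad {column}: {row[column]!r}")
    return report

data = "id,date\nABC1234,2014-07-15\nbad-id,2014-07-15\n"
print(validate(data, rules))
```

The real tool goes much further, with cross-column rules and external lookups, but the validate-and-report shape is the same.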
-
12:20 - 12:40
CSVlint: publishing data that doesn't suck
Galeria
James Smith
Software engineer with a passion for using web technology and Open Data to make a better future for everyone, with a particular focus on environmental, social and democratic benefits.
When 2/3 of the CSVs on data.gov.uk aren't easily machine-readable, we have a problem. We need new tools that help data publishers make usable data, and help them maintain it over time. CSVlint.io is such a tool: it validates CSV files, suggests improvements, and checks files against Data Package schemas. Come and learn how it can be used to make data publishing better.
-
A living hell: lessons learned in eight years of parsing the world's real estate data
Room 1
Ed Freyfogle
Ed is a co-founder of Lokku, the company behind Nestoria, OpenCage Data, and #geomob.
I'm one of the founders of Nestoria, a search engine for residential real estate operating in nine markets and used by millions of people. We process (error-check, de-duplicate, geocode, etc.) over 10 million listings a day. Given that purchasing a property is the largest financial transaction most people will ever face, you might think the industry would value high-quality data. You would be very wrong. I'll share some of the challenges we've faced, the lessons we've learned, and hopefully provide a few entertaining examples of data gone bad.
-
Data-Hacking with Wikimedia Projects: Learn by Example, Including Wikipedia, WikiData and Beyond!
Room 2
Matt Senate
Matt believes in the moral imperative to share knowledge far and wide. He is a Californian; he lives in Oakland and collaborates at the Sudo Room, a creative community and hacker space.
How do Wikimedia project communities work? How do data hackers interface and interact with these communities? What is at stake, and who are the stakeholders?
Join this talk to learn by example, through the story of the Open Access Signalling Project. This project's focus is to improve existing Wikipedia citations of Open Access research articles and other such academic works. This is one path among parallel initiatives (past and present) to improve how references work on Wikipedia, and across Wikimedia projects.
"A fact is only as reliable as the ability to source that fact, and the ability to weigh carefully that source." - WikiScholar proposal 'A free and universal bibliography for the world' (circa 2006 - 2010, status: closed)
-
14:00 - 14:20
Vozdata: Open data with the help of the community
Galeria
Gabriela Rodriguez
Gabriela Rodriguez is an activist and hacker who loves the intersection between media and technology. She grew up in Uruguay and is passionate about free software and open knowledge. She co-founded the Uruguayan nonprofit DATA, which works with open data and transparency in South America, and is now a Knight-Mozilla OpenNews fellow at La Nacion in Argentina.
In 2014, La Nacion in Argentina launched VozData, a website to crowdsource senate spending by asking people to transcribe information from 6000 scanned PDF documents from the senate. This is a talk about the code behind that website and how it can be used with any document set and any data you may need to extract from it.
-
Liberating 100 million facts as CSV
Room 1
Peter Murray-Rust
Peter MR from Cambridge, a chemist, builds systems and communities to liberate knowledge. Currently a Shuttleworth Fellow, sponsored for contentmine.org, liberating 100 million facts from the scientific literature.
The ContentMine aims to liberate facts from the scientific literature through Natural Language Processing, Computer Vision and good old Regular Expressions. At present we are liberating chemistry, biodiversity, species, places and dates, not only from text but also from diagrams (PDF, PNG, etc.). Everything is Free/Open (software, specifications and output) and we are working to build a bottom-up community where domain experts can build on our infrastructure.
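In the "good old Regular Expressions" spirit of the abstract, here is a toy fact extractor that pulls species-like binomial names out of running text. The pattern is a naive illustration of the approach, not ContentMine's actual code, and the sample sentence is invented.

```python
import re

# Naive pattern for a Latin binomial: capitalised genus, lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

text = ("Specimens of Panthera leo and Gorilla gorilla were recorded "
        "near the station in 2009.")
print(BINOMIAL.findall(text))
```

Real extraction needs far more care (abbreviated genera, synonyms, false positives from sentence-initial capitals), which is precisely why the project combines regexes with NLP and computer vision.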
-
So you have CSVs, now what?
Room 2
Javier Arce
Half web developer, half illustrator, half bad at dividing. Map maker at @CartoDB.
We have built an online mapping service that lives and bleeds CSVs. CartoDB supports drag-and-drop import of CSVs and in seconds gives you interactive maps of your data. From there, you can share the maps or share links to export your data as new CSVs! It also gives you SQL access to your data, so you can filter and manipulate data and then provide those results as CSV exports! Getting data into CartoDB isn't limited to CSVs on your desktop: you can also point the service at files hosted online and tell it to sync your maps to that source file. This can give you a real-time map of data hosted on servers all around the world! Add on top of that the ability to georeference your CSVs by administrative areas, named places, IP addresses, and many other attributes, and you can turn almost any CSV into a mappable dataset. Here, I'm going to show you how CartoDB is built on and contributes to an ecosystem of CSVs.
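The SQL access and CSV export mentioned above were exposed through CartoDB's HTTP SQL API. The endpoint shape below (`/api/v2/sql` with `q` and `format` parameters) is my recollection of the public documentation at the time, so treat it as illustrative; the snippet only builds the request URL rather than calling the service.

```python
from urllib.parse import urlencode

def sql_api_url(user, query, fmt="csv"):
    """Build a CartoDB SQL API URL that returns query results as CSV.

    Endpoint shape is an assumption from the era's public docs.
    """
    qs = urlencode({"q": query, "format": fmt})
    return f"https://{user}.cartodb.com/api/v2/sql?{qs}"

url = sql_api_url("demo", "SELECT name, pop FROM cities WHERE pop > 1000000")
print(url)
```

Fetching that URL with any HTTP client would hand back a fresh CSV, closing the loop from imported CSV to filtered, exported CSV.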
-
14:25 - 14:45
Dr. Freezefile, or how I learned to stop rendering and freeze my apps
Galeria
Friedrich Lindenberg
Friedrich is a Knight International Journalism Fellow with Code for Africa, where he works with investigative journalists to develop tools and resources for data-driven journalism.
We would all like our applications to run off dat-powered real-time feeds. More often than not, however, the data we serve doesn't change that often - or our data apps are one-off presentations anyway.
In those cases, a dynamic server often isn't needed to power an application - they can just as well run off flat files and a cheap CDN. In this workshop I want to present some recipes and tools for making static file applications, but also have a discussion to learn from others and discuss some of the missing bits for making truly great flat-file apps.
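The "freezing" idea reduces to something very small: render every page of the app once and write the results to flat files a CDN can serve. The toy freezer below makes that concrete; the page names and contents are invented, and real tools (for example static-site generators or app freezers) add routing, asset handling and incremental rebuilds on top of this core.

```python
import pathlib
import tempfile

# Invented "rendered pages" standing in for an app's dynamic routes.
pages = {
    "index.html": "<h1>Spending 2014</h1>",
    "about.html": "<p>How we got the data.</p>",
}

def freeze(pages, outdir):
    """Write each rendered page to a flat file; return the file names."""
    outdir = pathlib.Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    for name, html in pages.items():
        (outdir / name).write_text(html, encoding="utf-8")
    return sorted(p.name for p in outdir.iterdir())

frozen = freeze(pages, tempfile.mkdtemp())
print(frozen)
```

Once frozen, the "app" needs no server-side code at all, which is the whole economic argument of the talk.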
-
Opening Data Within Organisations
Room 1
Steven Beeckman
Steven is a technical project officer and analyst at the Belgian Ministry of Defence and responsible for an internal open data platform. To keep sane he hacks on node.js and Android app prototypes, which he somehow manages to almost never ship. He is the founder of the Node.js User Group Belgium and a conductor for StartupBus.
Even within government agencies, opening data is a hot issue. Confronted with functional silos and difficult access to other silos' data, the Belgian Ministry of Defence started building an internal open data platform powered by open source software and CSV files in 2009.
This talk will provide a brief technical overview of the platform used to distribute the data (including the role of CSV files), the volume of data we ship nightly, whether or not we killed the silos, and what it's like to operate within a special kind of enterprise.
-
Data and Truth
Room 2
Aaron Schumacher
Aaron lives and works in a younger Alexandria, teaches in the District of Columbia, and is more or less scientific with data. He enjoys tiramisu and breakdancing.
If data is fallible, can anything be known? Moreover, how are we going to get a reasonable result from this new dump and still have time for lunch? What is the difference between "cleaning" and "analysis"? Will epistemological theorizing improve our quotidian experiences working with data, or is it all just sesquipedalian nonsense? Does skeptical coding bring us closer to the truth?
-
14:50 - 15:10
Datacentral: Using Data Packages for static data portals
Galeria
Ana Carvalho, Sara Moreira, Ricardo Lafuente
Transparência Hackday is the first open-data collective in Portugal. Since 2010 we have organized monthly hackdays to open up public data and ultimately make our society better informed.
Datacentral is a simple system we made for CentralDeDados.pt, an independent public data hub in Portugal. Making use of the Data Package standard, we wrote a proof-of-concept static site generator that pulls each CSV dataset from a Git repository and generates the site based on the contained metadata (using daily cron jobs to re-generate and update the site). The result is an easily deployable, replicable and complete HTML5 data portal site.
-
Hacking Education: Public budget from PDF to CSV
Room 1
Francisco Mekler
Telematician. I work at http://www.imco.org.mx. In Chile, "Paco" means cop. Proud to work directly on @mejoratuescuela.
As part of mejoratuescuela.org, we opened up public data that was locked in PDFs and obtrusive websites. We produced a study based on this data, achieved public policy changes, and helped the open data movement and community in México. It's incredible the impact a team of only three people can make, and I want to share my side of the story.
-
testdat: Unit testing for tabular data
Room 2
Karthik Ram
Karthik Ram is a co-founder of rOpenSci, and is currently a data science fellow at the University of California's Berkeley Institute for Data Science.
In this talk I will demo a new R package, testdat, that provides a suite of functions that allow users to unit test tabular data, much like unit testing for code. Our package allows researchers to write expectations, as one would do with code, and quickly identify cryptic issues, especially when reading large numbers of files. We also describe the major functionality of testdat along with a few use cases and related tools.
-
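testdat itself is an R package; purely to illustrate what "expectations against data" means, the same idea can be sketched in Python: each expectation is a per-row predicate, and failures are collected into a report instead of silently flowing into an analysis. The rows and checks below are invented, and this is not testdat's API.

```python
# Invented example rows: a species count table with two data problems.
rows = [
    {"species": "gorilla", "count": "12"},
    {"species": "", "count": "7"},
    {"species": "lemur", "count": "-3"},
]

def expect(rows, description, predicate):
    """Return (description, failing 1-based row numbers) for an expectation."""
    failures = [i for i, row in enumerate(rows, start=1) if not predicate(row)]
    return (description, failures)

checks = [
    expect(rows, "species is not blank", lambda r: r["species"].strip() != ""),
    expect(rows, "count is a non-negative integer", lambda r: r["count"].isdigit()),
]
print(checks)
```

Run over hundreds of files, a report like this surfaces exactly the cryptic issues the abstract describes.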
15:45 - 16:05
Openrosa for call centers
Galeria
Olaf Veerman
Olaf is the founder of Flipside, a Lisbon-based agency that uses open source technology to build web-based tools that help organizations create social impact.
Openrosa has emerged as a popular standard for (mobile) data collection, with a growing ecosystem of applications and services. This talk will focus on Airwolf, an Openrosa-compatible tool developed for Text to Change, an organization that, among other things, operates a small call center in Uganda. It will explain for what kinds of data collection effort a call center is the appropriate method and detail some of the challenges that Airwolf has to tackle. These include offline capabilities, anonymization of results, and data export that is both useful for statisticians and readable for less technical people.
-
XYPath and Messytables - Traversing Spreadsheets in Python
Room 1
Dave McKee
Dave - also known as Dragon - is repeatedly told he's a Data Scientist, but still doesn't believe it. He pulls data out of websites at ScraperWiki for a living and dresses up as a wizard at the weekend.
Real-world spreadsheets of data come in many formats and with widely varying headers; we need a language to navigate through them to locate the data which we need.
Inspired by XPath and building upon OKFN's Messytables, XYPath isn't quite that language, but is a working prototype to explore what it might look like.
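XYPath's real API is richer than this, but its core idea can be shown in a few lines of plain Python: treat the spreadsheet as a grid of cells, find a header cell by value, then select cells by their position relative to it, wherever the header happens to sit. The sheet below is invented and the two helpers are a sketch, not XYPath's actual functions.

```python
# An invented messy sheet: a title row, then headers on the second row.
sheet = [
    ["Survey of widgets", "", ""],
    ["", "Year", "Total"],
    ["", "2012", "41"],
    ["", "2013", "57"],
]

def find(sheet, value):
    """Locate the (x, y) position of the first cell with this value."""
    for y, row in enumerate(sheet):
        for x, cell in enumerate(row):
            if cell == value:
                return x, y
    raise LookupError(value)

def below(sheet, x, y):
    """All non-empty cells directly below position (x, y)."""
    return [row[x] for row in sheet[y + 1:] if row[x] != ""]

x, y = find(sheet, "Total")
print(below(sheet, x, y))
```

Because the selection is anchored to the header rather than to a fixed row number, the same expression survives the layout shifts that plague real-world spreadsheets.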
-
VEGE-TABLE: the data table that grows
Room 2
Alf Eaton
Alf writes software to improve the process of publishing, finding, collecting and reading scientific literature; he is currently a senior developer at the open access scientific journal PeerJ.
A vege-table is a data table: each row of the table is an item in a collection, and each column is a property of those items. These properties are all JavaScript functions, so their values can be computed or fetched from remote resources, and merged from different sources. This talk will describe the project's background - including limitations of current tools for building, manipulating and publishing data tables - and examples of its use with real data.
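vege-table's columns are JavaScript functions; the same columns-as-functions idea can be sketched in Python, with each column computed from the row's underlying item (in the real tool, a column could just as well fetch its value from a remote resource). The items and columns below are invented examples.

```python
# Invented collection items: publications with a DOI and a citation count.
items = [
    {"doi": "10.1234/a", "citations": 3},
    {"doi": "10.1234/b", "citations": 11},
]

# Each column is a function of the item, so values can be derived.
columns = {
    "doi": lambda item: item["doi"],
    "cited": lambda item: item["citations"],
    "well_cited": lambda item: item["citations"] >= 10,
}

def evaluate(items, columns):
    """Materialise the table by applying every column function to every item."""
    return [{name: fn(item) for name, fn in columns.items()} for item in items]

table = evaluate(items, columns)
print(table[1])
```

Adding a column is just adding a function, which is how such a table "grows" without reshaping the underlying data.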
-
16:10 - 16:50
Spreadsheets Are Code
Galeria
Felienne Hermans
Felienne is a professor in software engineering. Her PhD dissertation explores the idea that spreadsheets are code and should be treated as such: with tests, refactoring and quality metrics. On weekends, she teaches kids how to program Lego robots.
Many people consider spreadsheets to be data, but more often they are actual programs. In this talk Felienne explains why, and also offers advice on how to deal with 'legacy' spreadsheets.
-
Ease the pain of parsing data
Room 1
Paul De Schacht
Paul De Schacht is a research engineer at Amadeus, where he brings travel data to life. He is happy grokking large amounts of data using emerging technologies.
Open data does not mean that the data can easily be consumed. Every source has a different format that requires a dedicated parser. I've recently open-sourced two tools (https://github.com/pauldeschacht/pdf2csv and https://github.com/pauldeschacht/paxparser) that help with the extraction and standardization of diverse formats. I will briefly discuss these projects, but mostly I hope to build a community around such tools.
-
How do we improve data quality internationally?
Room 2
Ashley Casovan and Antonio Acuna
This is a joint presentation from the Government of Canada and the Government of the United Kingdom.
Open data isn't a new concept, at least to those attending this conference. Now that more governments, agencies, and organizations are on board to release valuable data and information, what standards, guidelines, and other policy instruments need to be established in order to make that data useful to the public? Given that nations have similar data types, creating these policies with an international lens would create greater interoperability and lead to increased use. How do we get there? Who is the governance body? Come to this interactive presentation to share, learn, and move this conversation forward!