Data Ethics and Best Practices: Highlights from Strata Data Conference

March 16, 2018

Amber is the Chief Data Scientist at Terbium Labs. If she’s not working on her latest model, she can probably be found gardening with her kids or getting weird looks for knitting in a sports bar.

Members of the Terbium Labs technical team were in San Jose last week attending Strata Data Conference, a business-focused data science conference.

Formerly known as Strata + Hadoop World, Strata was founded in 2012 by combining O’Reilly’s and Cloudera’s big data conferences. Now the largest data conference series in the world, Strata covers a full range of big data tools and technologies while keeping a casual, informal atmosphere.

Terbium CTO Clare Gollnick presented a thoughtful talk on the limits of inference, in which she shared a framework for avoiding common data science and machine learning pitfalls. She argued that the best way to run a successful machine learning project is to recognize early which questions can be answered with data and which cannot, and she provided examples of how to do so.

Alongside the myriad technical topics at Strata, two threads seemed particularly important in light of recent breaches and upcoming regulations: data ethics and regulatory compliance.

Natalie Evans Harris of BrightHive gave a keynote talk and led a brainstorming session on defining responsible data practices. Both highlighted work over the past year that has culminated in two ethics documents: community principles on ethical data practices and a manifesto for data practices. (Practitioners can sign these documents, and interested volunteers can sign up to help the project.)

We at Terbium consider these efforts to be particularly important. There is no industry standard for the use of sensitive user data, and we heard at least one attendee (who will remain anonymous) argue that all user data is company property. Ethics discussions and agreements like these can help data practitioners see why that view might be problematic.

We also saw several presentations about upcoming data regulations, specifically the EU General Data Protection Regulation (GDPR), which will affect all companies handling EU-citizen data. Many companies are preparing for these regulations to go into effect in May and are retooling their technologies to provide more robust encryption, make sensitive data easier to identify, and ensure that sensitive data can be anonymized or fully deleted on request.

We did notice, however, a lack of discussion about monitoring for data breaches, which is a requirement of GDPR. That gap may have inspired us to start thinking about talk topics for next year.

