May Data Digest

Our May digest covers recent news, such as the implementation of the GDPR and what it means for machine learning, as well as articles and tutorials published over the last month. Learn more about data preprocessing, establishing fairness in ML models, and topic modeling, and enjoy research highlights ranging from GANs to reinforcement learning.

News

GDPR and the Future of Machine Learning

Much has been made about the potential impact of the EU’s General Data Protection Regulation (GDPR) on data science programs. But there’s perhaps no more important—or uncertain—question than how the regulation will impact machine learning (ML) and enterprise data science. This article aims to demystify this intersection between ML and the GDPR, focusing on the three big questions: Does the GDPR prohibit machine learning? Is there a “right to explainability” from ML? Do data subjects have the ability to demand that models be retrained without their data?

Google AI Duplex

The highlight of Google’s I/O keynote earlier this month was the reveal of Duplex, a system that can set up a salon appointment or a restaurant reservation for you by calling the business, chatting with a human, and getting the job done. That demo drew lots of laughs at the keynote, but after the dust settled, plenty of ethical questions popped up because of how Duplex tries to pass as human. Here's a brief overview of Google Duplex, and here's a deeper inquiry into Duplex's ethics.

Tutorials & Overviews

  • Dirty datasets for data preprocessing practice. Looking for datasets to practice data cleaning or preprocessing on? Look no further! Each of these datasets needs a little clean-up before it’s ready for different analysis techniques. For each dataset, there's a link to where you can access it, a brief description of what’s in it, and an “issues” section describing what needs to be done or fixed for it to fit easily into a data analysis pipeline.

  • Improve your training data. There is a difference between deep learning research and production, and the difference often comes down to how many resources are spent on improving models versus preprocessing datasets. There are lots of good reasons why researchers are so fixated on model architectures, but it does mean that there are very few resources available to guide people who are focused on deploying machine learning in production. To address that, Pete Warden's talk at the Train AI conference was on “the unreasonable effectiveness of training data”, and in this blog post he expands on the topic, explaining why data is so important, along with some practical tips on improving it.

  • Why machine learning is hard. While various online courses and manuals have made machine learning widely accessible, it remains quite a hard field, and not only because of the math involved. Read this essay to see what makes ML problems tough.

  • Topic modeling. The process of learning, recognizing, and extracting topics across a collection of documents is called topic modeling - one of the most useful ways to understand text in documents. In a comprehensive overview, the authors explore topic modeling through four of the most popular techniques today: LSA, pLSA, LDA, and the deep learning-based lda2vec.

  • Fairness in ML with PyTorch. Generative adversarial networks come to the rescue when you need to ensure fairness in the predictions your model makes. Just add an adversary module that tries to recover a sensitive attribute (like gender or race) from the classifier's predictions, and let the adversary and the classifier play a zero-sum game: the classifier has to make good predictions but is penalized whenever the adversary detects unfair decisions. The end result of this game is, hopefully, a classifier that is both fair and good at predicting. See the overview of this approach, check the PyTorch guide, and have a look at the minimal sketch below.
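To make the idea concrete, here is a minimal, hypothetical sketch of that adversarial setup (not the code from the linked guide): a classifier learns to predict the label while an adversary tries to recover the sensitive attribute from the classifier's output, and the classifier is penalized whenever the adversary succeeds. The architectures, shapes, and penalty weight below are invented for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical setup: 20 tabular features, binary label y, binary sensitive attribute z.
clf = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))  # task classifier
adv = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # adversary: recovers z from clf output
bce = nn.BCEWithLogitsLoss()
opt_clf = torch.optim.Adam(clf.parameters(), lr=1e-3)
opt_adv = torch.optim.Adam(adv.parameters(), lr=1e-3)
lam = 1.0  # how strongly the classifier is penalized for leaking z

for step in range(1000):
    x = torch.randn(64, 20)                   # stand-in batch of features
    y = torch.randint(0, 2, (64, 1)).float()  # task labels
    z = torch.randint(0, 2, (64, 1)).float()  # sensitive attribute (e.g. gender)

    # 1) Adversary step: learn to predict z from the classifier's (detached) predictions.
    adv_loss = bce(adv(clf(x).detach()), z)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) Classifier step: predict y well while making z hard for the adversary to recover.
    logits = clf(x)
    clf_loss = bce(logits, y) - lam * bce(adv(logits), z)
    opt_clf.zero_grad(); clf_loss.backward(); opt_clf.step()
```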

Research

Primal-Dual Wasserstein GAN

The authors introduce the Primal-Dual Wasserstein Generative Adversarial Network, a new learning algorithm for building latent variable models of the data distribution based on the primal and dual formulations of the optimal transport problem. To learn the generative model, they use the dual formulation and train the decoder adversarially through a critic network that is regularized by the approximate coupling obtained from the primal formulation. To avoid violating properties of the optimal critic, the authors regularize both the norm and the direction of the gradients of the critic function. As a result, the Primal-Dual Wasserstein GAN retains the benefits of auto-encoding models in terms of mode coverage and latent structure while avoiding their undesirable averaging properties, such as the inability to capture sharp visual features when modeling real images.

Self-Attention GANs

A new model, Self-Attention Generative Adversarial Network (SAGAN), allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other.
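For intuition, here is a simplified PyTorch sketch of a SAGAN-style self-attention block (an illustration under simplifying assumptions, not the authors' implementation): every spatial position attends to every other position, and the attended features are blended back into the input through a learned residual weight that starts at zero.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Simplified SAGAN-style self-attention over spatial positions."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as an identity mapping

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2)                          # (b, c//8, h*w)
        k = self.key(x).flatten(2)                            # (b, c//8, h*w)
        v = self.value(x).flatten(2)                          # (b, c,    h*w)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (b, h*w, h*w): each position attends to all others
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)     # aggregate value features from all locations
        return self.gamma * out + x                           # residual blend of attention into the input

# Drop the block between conv layers of a generator or discriminator.
feats = torch.randn(2, 64, 16, 16)
print(SelfAttention2d(64)(feats).shape)  # torch.Size([2, 64, 16, 16])
```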

Meta-Gradient Reinforcement Learning

In reinforcement learning, there’s no teacher available to estimate the value function as in supervised learning. The only option is a proxy for the value function - usually a sampled and bootstrapped approximation to the true value function, known as a return. In a recent paper, the authors propose a gradient-based meta-learning algorithm that adapts the nature of the return online, while interacting with and learning from the environment. This online approach achieved new state-of-the-art performance on 57 Atari 2600 games over 200 million frames.
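To illustrate the mechanism rather than reproduce the paper, here is a toy sketch of the meta-gradient idea: treat the discount as a differentiable meta-parameter, perform one TD update of the value weights while keeping the computation graph, then evaluate the updated weights on a held-out transition under a fixed reference discount and backpropagate that meta-loss into the discount. The linear value function, random "transitions", and learning rates are all invented for illustration.

```python
import torch

torch.manual_seed(0)

theta = torch.randn(4, requires_grad=True)       # linear value-function weights
log_gamma = torch.zeros(1, requires_grad=True)   # meta-parameter: discount, via sigmoid
alpha, beta = 0.1, 0.01                          # inner and meta learning rates

v = lambda w, s: s @ w                           # toy value estimate

for step in range(100):
    # Stand-in transitions in place of real environment interaction.
    s, r, s_next = torch.randn(4), torch.randn(1), torch.randn(4)
    s2, r2, s3 = torch.randn(4), torch.randn(1), torch.randn(4)

    gamma = torch.sigmoid(log_gamma)

    # Inner update: one semi-gradient TD(0) step; keep the graph so the
    # updated weights remain a differentiable function of gamma.
    delta = r + gamma * v(theta, s_next).detach() - v(theta, s)
    grad_theta = torch.autograd.grad(delta.pow(2).mean(), theta, create_graph=True)[0]
    theta_new = theta - alpha * grad_theta

    # Meta objective: TD error of the *updated* weights on a held-out
    # transition, evaluated under a fixed reference discount.
    gamma_bar = 0.99
    meta_delta = r2 + gamma_bar * v(theta_new, s3).detach() - v(theta_new, s2)
    grad_gamma = torch.autograd.grad(meta_delta.pow(2).mean(), log_gamma)[0]

    with torch.no_grad():                        # apply both the inner and the meta update
        theta.copy_(theta_new)
        log_gamma -= beta * grad_gamma
```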

How Many Samples are Needed to Learn a Convolutional Neural Network?

This rigorous study of the sample complexity required to train convolutional neural networks (CNNs) starts from the widespread assumption that a CNN is a more compact representation than a fully connected neural network (FNN) and should therefore require fewer samples to learn. Focusing on the sizes of the input and convolutional layers, the authors derive the sample complexity of achieving a given population prediction error for both CNNs and FNNs, and then compute the sample complexity of training a one-hidden-layer CNN with linear activation in which the weights of both the convolutional and output layers are unknown and the layer sizes are fixed. They express the sample complexity as a function of the convolutional and output layer sizes and the prediction error, and believe these tools may inspire further developments in the theoretical understanding of CNNs.

Adding One Neuron Can Eliminate All Bad Local Minima

One of the main difficulties in analyzing neural networks is the non-convexity of the loss function, which may have many bad local minima. In a recent paper, the authors study the loss landscape of neural networks for binary classification tasks. Under mild assumptions, they prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.
