## Neural and Evolutionary Computing

Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks

Comments: 13 pages, 3 figures

**Subjects**

:

Neural and Evolutionary Computing (cs.NE)

; Learning (cs.LG)

Recently deep neural networks have received considerable attention due to

their ability to extract and represent high-level abstractions in data sets.

Deep neural networks such as fully-connected and convolutional neural networks

have shown excellent performance on a wide range of recognition and

classification tasks. However, their hardware implementations currently suffer

from large silicon area and high power consumption due to the their high degree

of complexity. The power/energy consumption of neural networks is dominated by

memory accesses, the majority of which occur in fully-connected networks. In

fact, they contain most of the deep neural network parameters. In this paper,

we propose sparsely-connected networks, by showing that the number of

connections in fully-connected networks can be reduced by up to 90% while

improving the accuracy performance on three popular datasets (MNIST, CIFAR10

and SVHN). We then propose an efficient hardware architecture based on

linear-feedback shift registers to reduce the memory requirements of the

proposed sparsely-connected networks. The proposed architecture can save up to

90% of memory compared to the conventional implementations of fully-connected

neural networks. Moreover, implementation results show up to 84% reduction in

the energy consumption of a single neuron of the proposed sparsely-connected

networks compared to a single neuron of fully-connected neural networks.

### RenderGAN: Generating Realistic Labeled Data

Leon Sixt , Benjamin Wild , Tim Landgraf **Subjects** : Neural and Evolutionary Computing (cs.NE) ; Computer Vision and Pattern Recognition (cs.CV)

Deep Convolutional Neuronal Networks (DCNN) are showing remarkable

performance on many computer vision tasks. Due to their large parameter space,

they require many labeled samples when trained in a supervised setting. The

costs of annotating data manually can render the usage of DCNNs infeasible. We

present a novel framework called RenderGAN that can generate large amounts of

realistic, labeled images by combining a 3D model and the Generative

Adversarial Network framework. In our approach, image augmentations (e.g.

lighting, background, and detail) are learned from unlabeled data such that the

generated images are strikingly realistic while preserving the labels known

from the 3D model. We apply the RenderGAN framework to generate images of

barcode-like markers that are attached to honeybees. A DCNN is trained on this

data only. It performs better on a test set of real data than an equal DCNN

trained on the limited amounts of real data available.

### A Self-Driving Robot Using Deep Convolutional Neural Networks on Neuromorphic Hardware

Comments: 6 pages, 8 figures

**Subjects**

:

Neural and Evolutionary Computing (cs.NE)

; Robotics (cs.RO)

Neuromorphic computing is a promising solution for reducing the size, weight

and power of mobile embedded systems. In this paper, we introduce a realization

of such a system by creating the first closed-loop battery-powered

communication system between an IBM TrueNorth NS1e and an autonomous

Android-Based Robotics platform. Using this system, we constructed a dataset of

path following behavior by manually driving the Android-Based robot along steep

mountain trails and recording video frames from the camera mounted on the robot

along with the corresponding motor commands. We used this dataset to train a

deep convolutional neural network implemented on the TrueNorth NS1e. The NS1e,

which was mounted on the robot and powered by the robot’s battery, resulted in

a self-driving robot that could successfully traverse a steep mountain path in

real time. To our knowledge, this represents the first time the TrueNorth NS1e

neuromorphic chip has been embedded on a mobile platform under closed-loop

control.

### Demystifying ResNet

Sihan Li , Jiantao Jiao , Yanjun Han , Tsachy Weissman **Subjects** : Neural and Evolutionary Computing (cs.NE) ; Learning (cs.LG); Machine Learning (stat.ML)

We provide a theoretical explanation for the superb performance of ResNet via

the study of deep linear networks and some nonlinear variants. We show that

with or without nonlinearities, by adding shortcuts that have depth two, the

condition number of the Hessian of the loss function at the zero initial point

is depth-invariant, which makes training very deep models no more difficult

than shallow ones. Shortcuts of higher depth result in an extremely flat

(high-order) stationary point initially, from which the optimization algorithm

is hard to escape. The 1-shortcut, however, is essentially equivalent to no

shortcuts. Extensive experiments are provided accompanying our theoretical

results. We show that initializing the network to small weights with

2-shortcuts achieves significantly better results than random Gaussian (Xavier)

initialization, orthogonal initialization, and shortcuts of deeper depth, from

various perspectives ranging from final loss, learning dynamics and stability,

to the behavior of the Hessian along the learning process.

### Semantic Noise Modeling for Better Representation Learning

Hyo-Eun Kim , Sangheum Hwang , Kyunghyun Cho **Subjects** : Learning (cs.LG) ; Neural and Evolutionary Computing (cs.NE)

Latent representation learned from multi-layered neural networks via

hierarchical feature abstraction enables recent success of deep learning. Under

the deep learning framework, generalization performance highly depends on the

learned latent representation which is obtained from an appropriate training

scenario with a task-specific objective on a designed network model. In this

work, we propose a novel latent space modeling method to learn better latent

representation. We designed a neural network model based on the assumption that

good base representation can be attained by maximizing the total correlation

between the input, latent, and output variables. From the base model, we

introduce a semantic noise modeling method which enables class-conditional

perturbation on latent space to enhance the representational power of learned

latent feature. During training, latent vector representation can be

stochastically perturbed by a modeled class-conditional additive noise while

maintaining its original semantic feature. It implicitly brings the effect of

semantic augmentation on the latent space. The proposed model can be easily

learned by back-propagation with common gradient-based optimization algorithms.

Experimental results show that the proposed method helps to achieve performance

benefits against various previous approaches. We also provide the empirical

analyses for the proposed class-conditional perturbation process including

t-SNE visualization.

### Combating Reinforcement Learning's Sisyphean Curse with Intrinsic Fear

Zachary C. Lipton , Jianfeng Gao , Lihong Li , Jianshu Chen , Li Deng **Subjects** : Learning (cs.LG) ; Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

To use deep reinforcement learning in the wild, we might hope for an agent

that would never make catastrophic mistakes. At the very least, we could hope

that an agent would eventually learn to avoid old mistakes. Unfortunately, even

in simple environments, modern deep reinforcement learning techniques are

doomed by a Sisyphean curse. Owing to the use of function approximation, these

agents eventually forget experiences as they become exceedingly unlikely under

a new policy. Consequently, for as long as they continue to train,

state-aggregating agents may periodically relive catastrophic mistakes. We

demonstrate unacceptable performance of deep Q-networks on two toy problems. We

then introduce intrinsic fear, a method that mitigates these problems by

avoiding dangerous states.

## Computer Vision and Pattern Recognition

### UMDFaces: An Annotated Face Dataset for Training Deep Networks

Comments: 9 pages, submitted to WACV 2017

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Recent progress in face detection (including keypoint detection), and

recognition is mainly being driven by (i) deeper convolutional neural network

architectures, and (ii) larger datasets. However, most of the largest datasets

are maintained by private companies and are not publicly available. The

academic computer vision community needs larger and more varied datasets to

make further progress.

In this paper we introduce a new face dataset, called UMDFaces, which has

367,920 face annotations of 8,501 subjects. We discuss how a large dataset can

be collected and annotated using human annotators and deep networks. We provide

human curated bounding boxes for faces. We also provide estimated pose (roll,

pitch and yaw), locations of twenty-one key-points and gender information

generated by a pre-trained neural network. Finally, we compare the quality of

the dataset with other publicly available face datasets at similar scales.

### STDP-based spiking deep neural networks for object recognition

Saeed Reza Kheradpisheh , Mohammad Ganjtabesh , Simon J Thorpe , Timothée Masquelier **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

Previous studies have shown that spike-timing-dependent plasticity (STDP) can

be used in spiking neural networks (SNN) to extract visual features of low or

intermediate complexity in an unsupervised manner. These studies, however, used

relatively shallow architectures, and only one layer was trainable. Another

line of research has demonstrated – using rate-based neural networks trained

with back-propagation – that having many layers increases the recognition

robustness, an approach known as deep learning. We thus designed a deep SNN,

comprising several convolutional (trainable with STDP) and pooling layers. We

used a temporal coding scheme where the most strongly activated neurons fire

first, and less activated neurons fire later or not at all. The network was

exposed to natural images. Thanks to STDP, neurons progressively learned

features corresponding to prototypical patterns that were both salient and

frequent. Only a few tens of examples per category were required and no label

was needed. After learning, the complexity of the extracted features increased

along the hierarchy, from edge detectors in the first layer to object

prototypes in the last layer. Coding was very sparse, with only a few thousands

spikes per image, and in some cases the object category could be reasonably

well inferred from the activity of a single higher-order neuron. More

generally, the activity of a few hundreds of such neurons contained robust

category information, as demonstrated using a classifier on Caltech 101,

ETH-80, and MNIST databases. We think that the combination of STDP with latency

coding is key to understanding the way that the primate visual system learns,

its remarkable processing speed and its low energy consumption. These

mechanisms are also interesting for artificial vision systems, particularly for

hardware solutions.

### Nonnegative Matrix Underapproximation for Robust Multiple Model Fitting

Mariano Tepper , Guillermo Sapiro **Subjects** : Computer Vision and Pattern Recognition (cs.CV)

In this work, we introduce the first algorithm to truly address the

nonnegative matrix underapproximation (NMU) problem, i.e., nonnegative matrix

factorization (NMF) with an additional underapproximation constraint. NMU

results are interesting as, compared to traditional NMF, they present

additional sparsity and part-based behavior, explaining unique data features.

To show an example of these features in practice, we first present an

application to the analysis of climate data. We then present an NMU-based

algorithm to robustly fit multiple parametric models to a dataset. The proposed

approach delivers state-of-the-art results for the estimation of multiple

fundamental matrices and homographies, outperforming other alternatives in the

literature and exemplifying the use of efficient NMU computations.

Comments: 8 pages, 6 figures in Proceedings of the XVI Brazilian Symposium on Computer Graphics and Image Processing, 2003. SIBGRAPI 2003. IEEE. arXiv admin note: text overlap with arXiv:1403.7365 , arXiv:1611.00960

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

The computation of 2-D optical flow by means of regularized pel-recursive

algorithms raises a host of issues, which include the treatment of outliers,

motion discontinuities and occlusion among other problems. We propose a new

approach which allows us to deal with these issues within a common framework.

Our approach is based on the use of a technique called Generalized

Cross-Validation to estimate the best regularization scheme for a given pixel.

In our model, the regularization parameter is a matrix whose entries can

account for diverse sources of error. The estimation of the motion vectors

takes into consideration local properties of the image following a spatially

adaptive approach where each moving pixel is supposed to have its own

regularization matrix. Preliminary experiments indicate that this approach

provides robust estimates of the optical flow.

### Learning Identity Mappings with Residual Gates

Pedro H. P. Savarese **Subjects** : Computer Vision and Pattern Recognition (cs.CV) ; Learning (cs.LG)

We propose a technique to augment network layers by adding a linear gating

mechanism, which provides a way to learn identity mappings by optimizing only

one parameter. We also introduce a new metric which served as basis for the

technique. It captures the difficulty involved in learning identity mappings

for different types of network models, and provides a new theoretical intuition

for the increased depths of models such as Highway and Residual Networks. We

propose a new model, the Gated Residual Network, which is the result when

augmenting Residual Networks. Experimental results show that augmenting layers

grants increased performance, less issues with depth, and more layer

independence — fully removing them does not cripple the model. We evaluate our

method on MNIST using fully-connected networks and on CIFAR-10 using Wide

ResNets, achieving a relative error reduction of more than 8% in the latter

when compared to the original model.

### Adversarial Machine Learning at Scale

Comments: 15 pages, 5 figures

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Cryptography and Security (cs.CR); Learning (cs.LG); Machine Learning (stat.ML)

Adversarial examples are malicious inputs designed to fool machine learning

models. They often transfer from one model to another, allowing attackers to

mount black box attacks without knowledge of the target model’s parameters.

Adversarial training is the process of explicitly training a model on

adversarial examples, in order to make it more robust to attack or to reduce

its test error on clean inputs. So far, adversarial training has primarily been

applied to small problems. In this research, we apply adversarial training to

ImageNet. Our contributions include: (1) recommendations for how to succesfully

scale adversarial training to large models and datasets, (2) the observation

that adversarial training confers robustness to single-step attack methods, (3)

the finding that multi-step attack methods are somewhat less transferable than

single-step attack methods, so single-step attacks are the best for mounting

black-box attacks, and (4) resolution of a “label leaking” effect that causes

adversarially trained models to perform better on adversarial examples than on

clean examples, because the adversarial example construction process uses the

true label and the model can learn to exploit regularities in the construction

process.

### Integrating Atlas and Graph Cut Methods for LV Segmentation from Cardiac Cine MRI

Comments: Statistical Atlases and Computational Modelling of Heart workshop 2016

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

Magnetic Resonance Imaging (MRI) has evolved as a clinical standard-of-care

imaging modality for cardiac morphology, function assessment, and guidance of

cardiac interventions. All these applications rely on accurate extraction of

the myocardial tissue and blood pool from the imaging data. Here we propose a

framework for left ventricle (LV) segmentation from cardiac cine-MRI. First, we

segment the LV blood pool using iterative graph cuts, and subsequently use this

information to segment the myocardium. We formulate the segmentation procedure

as an energy minimization problem in a graph subject to the shape prior

obtained by label propagation from an average atlas using affine registration.

The proposed framework has been validated on 30 patient cardiac cine-MRI

datasets available through the STACOM LV segmentation challenge and yielded

fast, robust, and accurate segmentation results.

### Bayesian Modeling of Motion Perception using Dynamical Stochastic Textures

Jonathan Vacher , Andrew Isaac Meso , Laurent U. Perrinet , Gabriel Peyré **Subjects** : Neurons and Cognition (q-bio.NC) ; Computer Vision and Pattern Recognition (cs.CV)

A common practice to account for psychophysical biases in vision is to frame

them as consequences of a dynamic process relying on optimal inference with

respect to a generative model. The present study details the complete

formulation of such a generative model intended to probe visual motion

perception. It is first derived in a set of axiomatic steps constrained by

biological plausibility. We then extend previous contributions by detailing

three equivalent formulations of the Gaussian dynamic texture model. First, the

composite dynamic textures are constructed by the random aggregation of warped

patterns, which can be viewed as 3D Gaussian fields. Second, these textures are

cast as solutions to a stochastic partial differential equation (sPDE). This

essential step enables real time, on-the-fly, texture synthesis using

time-discretized auto- regressive processes. It also allows for the derivation

of a local motion-energy model, which corresponds to the log-likelihood of the

probability density. The log-likelihoods are finally essential for the

construction of a Bayesian inference framework. We use the model to probe speed

perception in humans psychophysically using zoom-like changes in stimulus

spatial frequency content. The likelihood is contained within the genera- tive

model and we chose a slow speed prior consistent with previous literature. We

then validated the fitting process of the model using synthesized data. The

human data replicates previous findings that relative perceived speed is

positively biased by spatial frequency increments. The effect cannot be fully

accounted for by previous models, but the current prior acting on the

spatio-temporal likelihoods has proved necessary in accounting for the

perceptual bias.

### RenderGAN: Generating Realistic Labeled Data

Leon Sixt , Benjamin Wild , Tim Landgraf **Subjects** : Neural and Evolutionary Computing (cs.NE) ; Computer Vision and Pattern Recognition (cs.CV)

Deep Convolutional Neuronal Networks (DCNN) are showing remarkable

performance on many computer vision tasks. Due to their large parameter space,

they require many labeled samples when trained in a supervised setting. The

costs of annotating data manually can render the usage of DCNNs infeasible. We

present a novel framework called RenderGAN that can generate large amounts of

realistic, labeled images by combining a 3D model and the Generative

Adversarial Network framework. In our approach, image augmentations (e.g.

lighting, background, and detail) are learned from unlabeled data such that the

generated images are strikingly realistic while preserving the labels known

from the 3D model. We apply the RenderGAN framework to generate images of

barcode-like markers that are attached to honeybees. A DCNN is trained on this

data only. It performs better on a test set of real data than an equal DCNN

trained on the limited amounts of real data available.

### Statistical Inverse Formulation of Optical Flow with Uncertainty Quantification

Jie Sun , Erik Bollt **Subjects** : Machine Learning (stat.ML) ; Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)

Optical flow refers to the visual motion observed between two consecutive

images. Since the degree of freedom is typically much larger than the

constraints imposed by the image observations, the straightforward formulation

of optical flow inference is an ill-posed problem. By setting some type of

additional “regularity” constraints, classical approaches formulate a

well-posed optical flow inference problem in the form of a parameterized set of

variational equations. In this work we build a mathematical connection, focused

on optical flow methods, between classical variational optical flow approaches

and Bayesian statistical inversion. A classical optical flow solution is in

fact identical to a maximum a posteriori estimator under the assumptions of

linear model with additive independent Gaussian noise and a Gaussian prior

distribution. Unlike classical approaches, the statistical inversion approach

to optical flow estimation not only allows for “point” estimates, but also

provides a distribution of solutions which can be used for ensemble estimation

and in particular uncertainty quantification.

## Artificial Intelligence

### Estimating Causal Direction and Confounding of Two Discrete Variables

Krzysztof Chalupka , Frederick Eberhardt , Pietro Perona **Subjects** : Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Learning (cs.LG)

We propose a method to classify the causal relationship between two discrete

variables given only the joint distribution of the variables, acknowledging

that the method is subject to an inherent baseline error. We assume that the

causal system is acyclicity, but we do allow for hidden common causes. Our

algorithm presupposes that the probability distributions (P(C)) of a cause (C)

is independent from the probability distribution (P(Emid C)) of the

cause-effect mechanism. While our classifier is trained with a Bayesian

assumption of flat hyperpriors, we do not make this assumption about our test

data. This work connects to recent developments on the identifiability of

causal models over continuous variables under the assumption of “independent

mechanisms”. Carefully-commented Python notebooks that reproduce all our

experiments are available online at

### Understanding Deep Neural Networks with Rectified Linear Units

Raman Arora , Amitabh Basu , Poorya Mianjy , Anirbit Mukherjee **Subjects** : Learning (cs.LG) ; Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML)

In this paper we investigate the family of functions representable by deep

neural networks (DNN) with rectified linear units (ReLU). We give the

first-ever polynomial time (in the size of data) algorithm to train a ReLU DNN

with one hidden layer to global optimality. This follows from our complete

characterization of the ReLU DNN function class whereby we show that a

(mathbb{R}^n o mathbb{R}) function is representable by a ReLU DNN if and

only if it is a continuous piecewise linear function. The main tool used to

prove this characterization is an elegant result from tropical geometry.

Further, for the (n=1) case, we show that a single hidden layer suffices to

express all piecewise linear functions, and we give tight bounds for the size

of such a ReLU DNN.We follow up with gap results showing that there is a

smoothly parameterized family of (mathbb{R} o mathbb{R}) “hard” functions

that lead to an exponential blow-up in size, if the number of layers is

decreased by a small amount. An example consequence of our gap theorem is that

for every natural number (N), there exists a function representable by a ReLU

DNN with depth (N^2+1) and total size (N^3), such that any ReLU DNN with depth

at most (N + 1) will require at least (frac12N^{N+1}-1) total nodes.

Finally, we construct a family of (mathbb{R}^n o mathbb{R}) functions for

(ngeq 2) (also smoothly parameterized), whose number of affine pieces scales

exponentially with the dimension (n) at any fixed size and depth. To the best

of our knowledge, such a construction with exponential dependence on (n) has

not been achieved by previous families of “hard” functions in the neural nets

literature.

### Ways of Conditioning Generative Adversarial Networks

Hanock Kwak , Byoung-Tak Zhang **Subjects** : Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

The GANs are generative models whose random samples realistically reflect

natural images. It also can generate samples with specific attributes by

concatenating a condition vector into the input, yet research on this field is

not well studied. We propose novel methods of conditioning generative

adversarial networks (GANs) that achieve state-of-the-art results on MNIST and

CIFAR-10. We mainly introduce two models: an information retrieving model that

extracts conditional information from the samples, and a spatial bilinear

pooling model that forms bilinear features derived from the spatial cross

product of an image and a condition vector. These methods significantly enhance

log-likelihood of test data under the conditional distributions compared to the

methods of concatenation.

### Learning Continuous Semantic Representations of Symbolic Expressions

Miltiadis Allamanis , Pankajan Chanthirasegaran , Pushmeet Kohli , Charles Sutton **Subjects** : Learning (cs.LG) ; Artificial Intelligence (cs.AI)

The question of how procedural knowledge is represented and inferred is a

fundamental problem in machine learning and artificial intelligence. Recent

work on program induction has proposed neural architectures, based on

abstractions like stacks, Turing machines, and interpreters, that operate on

abstract computational machines or on execution traces. But the recursive

abstraction that is central to procedural knowledge is perhaps most naturally

represented by symbolic representations that have syntactic structure, such as

logical expressions and source code. Combining abstract, symbolic reasoning

with continuous neural reasoning is a grand challenge of representation

learning. As a step in this direction, we propose a new architecture, called

neural equivalence networks, for the problem of learning continuous semantic

representations of mathematical and logical expressions. These networks are

trained to represent semantic equivalence, even of expressions that are

syntactically very different. The challenge is that semantic representations

must be computed in a syntax-directed manner, because semantics is

compositional, but at the same time, small changes in syntax can lead to very

large changes in semantics, which can be difficult for continuous neural

architectures. We perform an exhaustive evaluation on the task of checking

equivalence on a highly diverse class of symbolic algebraic and boolean

expression types, showing that our model significantly outperforms existing

architectures.

## Information Retrieval

### Learning to Rank Scientific Documents from the Crowd

Comments: 12 pages, 1 figure

**Subjects**

:

Information Retrieval (cs.IR)

; Computation and Language (cs.CL); Digital Libraries (cs.DL); Learning (cs.LG); Social and Information Networks (cs.SI)

Finding related published articles is an important task in any science, but

with the explosion of new work in the biomedical domain it has become

especially challenging. Most existing methodologies use text similarity metrics

to identify whether two articles are related or not. However biomedical

knowledge discovery is hypothesis-driven. The most related articles may not be

ones with the highest text similarities. In this study, we first develop an

innovative crowd-sourcing approach to build an expert-annotated

document-ranking corpus. Using this corpus as the gold standard, we then

evaluate the approaches of using text similarity to rank the relatedness of

articles. Finally, we develop and evaluate a new supervised model to

automatically rank related scientific articles. Our results show that authors’

ranking differ significantly from rankings by text-similarity-based models. By

training a learning-to-rank model on a subset of the annotated corpus, we found

the best supervised learning-to-rank model (SVM-Rank) significantly surpassed

state-of-the-art baseline systems.

### URL ordering policies for distributed crawlers: a review

Deepika ,

Comments: 6 Pages, 5 figures, 1 Table, International Conference on Recent Trends in Computer and Information Technology Research September 2015

**Subjects**

:

Information Retrieval (cs.IR)

With the increase in size of web, the information is also spreading at large

scale. Search Engines are the medium to access this information. Crawler is the

module of search engine which is responsible for download the web pages. In

order to download the fresh information and get the database rich, crawler

should crawl the web in some order. This is called as ordering of URLs. URL

ordering should be done in efficient and effective manner in order to crawl the

web in proficient manner. In this paper, a survey is done on some existing

methods of URL ordering and at the end of this paper comparison is also carried

out among them.

### Generalized Topic Modeling

Avrim Blum , Nika Haghtalab **Subjects** : Learning (cs.LG) ; Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

Recently there has been significant activity in developing algorithms with

provable guarantees for topic modeling. In standard topic models, a topic (such

as sports, business, or politics) is viewed as a probability distribution (vec

a_i) over words, and a document is generated by first selecting a mixture (vec

w) over topics, and then generating words i.i.d. from the associated mixture

(A{vec w}). Given a large collection of such documents, the goal is to recover

the topic vectors and then to correctly classify new documents according to

their topic mixture.

In this work we consider a broad generalization of this framework in which

words are no longer assumed to be drawn i.i.d. and instead a topic is a complex

distribution over sequences of paragraphs. Since one could not hope to even

represent such a distribution in general (even if paragraphs are given using

some natural feature representation), we aim instead to directly learn a

document classifier. That is, we aim to learn a predictor that given a new

document, accurately predicts its topic mixture, without learning the

distributions explicitly. We present several natural conditions under which one

can do this efficiently and discuss issues such as noise tolerance and sample

complexity in this model. More generally, our model can be viewed as a

generalization of the multi-view or co-training setting in machine learning.

## Computation and Language

### Sequence to Sequence Transduction with Hard Monotonic Attention

Comments: Under review as a conference paper at ICLR 2017

**Subjects**

:

Computation and Language (cs.CL)

We present a supervised sequence to sequence transduction model with a hard

attention mechanism which combines the more traditional statistical alignment

methods with the power of recurrent neural networks. We evaluate the model on

the task of morphological inflection generation and show that it provides state

of the art results in various setups compared to the previous neural and

non-neural approaches. Eventually we present an analysis of the learned

representations for both hard and soft attention models, shedding light on the

features such models extract in order to solve the task.

### Learning Recurrent Span Representations for Extractive Question Answering

Kenton Lee , Tom Kwiatkowski , Ankur Parikh , Dipanjan Das **Subjects** : Computation and Language (cs.CL)

The reading comprehension task, that asks questions about a given evidence

document, is a central problem in natural language understanding. Recent

formulations of this task have typically focused on answer selection from a set

of candidates pre-defined manually or through the use of an external NLP

pipeline. However, Rajpurkar et al. (2016) recently released the SQuAD dataset

in which the answers can be arbitrary strings from the supplied text. In this

paper, we focus on this answer extraction task, presenting a novel model

architecture that efficiently builds fixed length representations of all spans

in the evidence document with a recurrent network. We show that scoring

explicit span representations significantly improves performance over other

approaches that factor the prediction into separate predictions about words or

start and end markers. Our approach improves upon the best published results of

Wang & Jiang (2016) by 5% and decreases the error of Rajpurkar et al.’s

baseline by > 50%.

### Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies

Comments: 15 pages; to appear in Transactions of the Association for Computational Linguistics

**Subjects**

:

Computation and Language (cs.CL)

The success of long short-term memory (LSTM) neural networks in language

processing is typically attributed to their ability to capture long-distance

statistical regularities. Linguistic regularities are often sensitive to

syntactic structure; can such dependencies be captured by LSTMs, which do not

have explicit structural representations? We begin addressing this question

using number agreement in English subject-verb dependencies. We probe the

architecture’s grammatical competence both using training objectives with an

explicit grammatical target (number prediction, grammaticality judgments) and

using language models. In the strongly supervised settings, the LSTM achieved

very high overall accuracy (less than 1% errors), but errors increased when

sequential and structural information conflicted. The frequency of such errors

rose sharply in the language-modeling setting. We conclude that LSTMs can

capture a non-trivial amount of grammatical structure given targeted

supervision, but stronger architectures may be required to further reduce

errors; furthermore, the language modeling signal is insufficient for capturing

syntax-sensitive dependencies, and should be supplemented with more direct

supervision if such dependencies need to be captured.

### Answering Complicated Question Intents Expressed in Decomposed Question Sequences

Mohit Iyyer , Wen-tau Yih , Ming-Wei Chang **Subjects** : Computation and Language (cs.CL)

Recent work in semantic parsing for question answering has focused on long

and complicated questions, many of which would seem unnatural if asked in a

normal conversation between two humans. In an effort to explore a

conversational QA setting, we present a more realistic task: answering

sequences of simple but inter-related questions. We collect a dataset of 6,066

question sequences that inquire about semi-structured tables from Wikipedia,

with 17,553 question-answer pairs in total. Existing QA systems face two major

problems when evaluated on our dataset: (1) handling questions that contain

coreferences to previous questions or answers, and (2) matching words or

phrases in a question to corresponding entries in the associated table. We

conclude by proposing strategies to handle both of these issues.

### Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

Comments: 10 pages, 2 figures, 3 tables

**Subjects**

:

Learning (cs.LG)

; Computation and Language (cs.CL); Machine Learning (stat.ML)

Recurrent neural networks have been very successful at predicting sequences

of words in tasks such as language modeling. However, all such models are based

on the conventional classification framework, where model is trained against

one-hot targets, and each word is represented both as an input and as an output

in isolation. This causes inefficiencies in learning both in terms of utilizing

all of the information and in terms of the number of parameters needed to

train. We introduce a novel theoretical framework that facilitates better

learning in language modeling, and show that our framework leads to tying

together the input embedding and the output projection matrices, greatly

reducing the number of trainable variables. Our LSTM model lowers the state of

the art word-level perplexity on the Penn Treebank to 68.5.

### Learning to Rank Scientific Documents from the Crowd

Comments: 12 pages, 1 figure

**Subjects**

:

Information Retrieval (cs.IR)

; Computation and Language (cs.CL); Digital Libraries (cs.DL); Learning (cs.LG); Social and Information Networks (cs.SI)

Finding related published articles is an important task in any science, but

with the explosion of new work in the biomedical domain it has become

especially challenging. Most existing methodologies use text similarity metrics

to identify whether two articles are related or not. However biomedical

knowledge discovery is hypothesis-driven. The most related articles may not be

ones with the highest text similarities. In this study, we first develop an

innovative crowd-sourcing approach to build an expert-annotated

document-ranking corpus. Using this corpus as the gold standard, we then

evaluate the approaches of using text similarity to rank the relatedness of

articles. Finally, we develop and evaluate a new supervised model to

automatically rank related scientific articles. Our results show that authors’

ranking differ significantly from rankings by text-similarity-based models. By

training a learning-to-rank model on a subset of the annotated corpus, we found

the best supervised learning-to-rank model (SVM-Rank) significantly surpassed

state-of-the-art baseline systems.

### Generalized Topic Modeling

Avrim Blum , Nika Haghtalab **Subjects** : Learning (cs.LG) ; Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

Recently there has been significant activity in developing algorithms with

provable guarantees for topic modeling. In standard topic models, a topic (such

as sports, business, or politics) is viewed as a probability distribution (vec

a_i) over words, and a document is generated by first selecting a mixture (vec

w) over topics, and then generating words i.i.d. from the associated mixture

(A{vec w}). Given a large collection of such documents, the goal is to recover

the topic vectors and then to correctly classify new documents according to

their topic mixture.

In this work we consider a broad generalization of this framework in which

words are no longer assumed to be drawn i.i.d. and instead a topic is a complex

distribution over sequences of paragraphs. Since one could not hope to even

represent such a distribution in general (even if paragraphs are given using

some natural feature representation), we aim instead to directly learn a

document classifier. That is, we aim to learn a predictor that given a new

document, accurately predicts its topic mixture, without learning the

distributions explicitly. We present several natural conditions under which one

can do this efficiently and discuss issues such as noise tolerance and sample

complexity in this model. More generally, our model can be viewed as a

generalization of the multi-view or co-training setting in machine learning.

## Distributed, Parallel, and Cluster Computing

### Multi-level Simulation of Internet of Things on Smart Territories

Comments: To appear in Simulation Modelling Practice and Theory, Elsevier

**Subjects**

:

Performance (cs.PF)

; Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)

In this paper, a methodology is presented and employed for simulating the

Internet of Things (IoT). The requirement for scalability, due to the possibly

huge amount of involved sensors and devices, and the heterogeneous scenarios

that might occur, impose resorting to sophisticated modeling and simulation

techniques. In particular, multi-level simulation is regarded as a main

framework that allows simulating large-scale IoT environments while keeping

high levels of detail, when it is needed. We consider a use case based on the

deployment of smart services in decentralized territories. A two level

simulator is employed, which is based on a coarse agent-based, adaptive

parallel and distributed simulation approach to model the general life of

simulated entities. However, when needed a finer grained simulator (based on

OMNeT++) is triggered on a restricted portion of the simulated area, which

allows considering all issues concerned with wireless communications. Based on

this use case, it is confirmed that the ad-hoc wireless networking technologies

do represent a principle tool to deploy smart services over decentralized

countrysides. Moreover, the performance evaluation confirms the viability of

utilizing multi-level simulation for simulating large scale IoT environments.

## Learning

### Improving Stochastic Gradient Descent with Feedback

In this paper we propose a simple and efficient method for improving

stochastic gradient descent methods by using feedback from the objective

function. The method tracks the relative changes in the objective function with

a running average, and uses it to adaptively tune the learning rate in

stochastic gradient descent. We specifically apply this idea to modify Adam, a

popular algorithm for training deep neural networks. We conduct experiments to

compare the resulting algorithm, which we call Eve, with state of the art

methods used for training deep learning models. We train CNNs for image

classification, and RNNs for language modeling and question answering. Our

experiments show that Eve outperforms all other algorithms on these benchmark

tasks. We also analyze the behavior of the feedback mechanism during the

training process.

Comments: 10 pages, 2 figures, submitted to RECOMB 2017

**Subjects**

:

Learning (cs.LG)

; Biomolecules (q-bio.BM)

Recently developed deep learning techniques have significantly improved the

accuracy of various speech and image recognition systems. In this paper we

adapt some of these techniques for protein secondary structure prediction. We

first train a series of deep neural networks to predict eight-class secondary

structure labels given a protein’s amino acid sequence information and find

that using recent methods for regularization, such as dropout and weight-norm

constraining, leads to measurable gains in accuracy. We then adapt recent

convolutional neural network architectures–Inception, ReSNet, and DenseNet

with Batch Normalization–to the problem of protein structure prediction. These

convolutional architectures make heavy use of multi-scale filter layers that

simultaneously compute features on several scales, and use residual connections

to prevent underfitting. Using a carefully modified version of these

architectures, we achieve state-of-the-art performance of 70.0% per amino acid

accuracy on the public CB513 benchmark dataset. Finally, we explore additions

from sequence-to-sequence learning, altering the model to make its predictions

conditioned on both the protein’s amino acid sequence and its past secondary

structure labels. We introduce a new method of ensembling such a conditional

model with our convolutional model, an approach which reaches 70.6% Q8 accuracy

on CB513. We argue that these results can be further refined for larger boosts

in prediction accuracy through more sophisticated attempts to control

overfitting of conditional models. We aim to release the code for these

experiments as part of the TensorFlow repository.

### Understanding Deep Neural Networks with Rectified Linear Units

Raman Arora , Amitabh Basu , Poorya Mianjy , Anirbit Mukherjee **Subjects** : Learning (cs.LG) ; Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML)

In this paper we investigate the family of functions representable by deep

neural networks (DNN) with rectified linear units (ReLU). We give the

first-ever polynomial time (in the size of data) algorithm to train a ReLU DNN

with one hidden layer to global optimality. This follows from our complete

characterization of the ReLU DNN function class whereby we show that a

(mathbb{R}^n o mathbb{R}) function is representable by a ReLU DNN if and

only if it is a continuous piecewise linear function. The main tool used to

prove this characterization is an elegant result from tropical geometry.

Further, for the (n=1) case, we show that a single hidden layer suffices to

express all piecewise linear functions, and we give tight bounds for the size

of such a ReLU DNN.We follow up with gap results showing that there is a

smoothly parameterized family of (mathbb{R} o mathbb{R}) “hard” functions

that lead to an exponential blow-up in size, if the number of layers is

decreased by a small amount. An example consequence of our gap theorem is that

for every natural number (N), there exists a function representable by a ReLU

DNN with depth (N^2+1) and total size (N^3), such that any ReLU DNN with depth

at most (N + 1) will require at least (frac12N^{N+1}-1) total nodes.

Finally, we construct a family of (mathbb{R}^n o mathbb{R}) functions for

(ngeq 2) (also smoothly parameterized), whose number of affine pieces scales

exponentially with the dimension (n) at any fixed size and depth. To the best

of our knowledge, such a construction with exponential dependence on (n) has

not been achieved by previous families of “hard” functions in the neural nets

literature.

### Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling

Comments: 10 pages, 2 figures, 3 tables

**Subjects**

:

Learning (cs.LG)

; Computation and Language (cs.CL); Machine Learning (stat.ML)

Recurrent neural networks have been very successful at predicting sequences

of words in tasks such as language modeling. However, all such models are based

on the conventional classification framework, where model is trained against

one-hot targets, and each word is represented both as an input and as an output

in isolation. This causes inefficiencies in learning both in terms of utilizing

all of the information and in terms of the number of parameters needed to

train. We introduce a novel theoretical framework that facilitates better

learning in language modeling, and show that our framework leads to tying

together the input embedding and the output projection matrices, greatly

reducing the number of trainable variables. Our LSTM model lowers the state of

the art word-level perplexity on the Penn Treebank to 68.5.

### Multi-task learning with deep model based reinforcement learning

Asier Mujika **Subjects** : Learning (cs.LG)

In recent years, model-free methods that use deep learning have achieved

great success in many different reinforcement learning environments. Most

successful approaches focus on solving a single task, while multi-task

reinforcement learning remains an open problem. In this paper, we present a

model based approach to deep reinforcement learning which we use to solve

different tasks simultaneously. We show that our approach not only does not

degrade but actually benefits from learning multiple tasks. For our model, we

also present a new kind of recurrent neural network inspired by residual

networks that decouples memory from computation allowing to model complex

environments that do not require lots of memory. The code will be released

before ICLR 2017.

### Learning heat diffusion graphs

Dorina Thanou , Xiaowen Dong , Daniel Kressner , Pascal Frossard **Subjects** : Learning (cs.LG) ; Social and Information Networks (cs.SI); Machine Learning (stat.ML)

Effective information analysis generally boils down to properly identifying

the structure or geometry of the data, which is often represented by a graph.

In some applications, this structure may be partly determined by design

constraints or pre-determined sensing arrangements, like in road transportation

networks for example. In general though, the data structure is not readily

available and becomes pretty difficult to define. In particular, the global

smoothness assumptions, that most of the existing works adopt, are often too

general and unable to properly capture localized properties of data. In this

paper, we go beyond this classical data model and rather propose to represent

information as a sparse combination of localized functions that live on a data

structure represented by a graph. Based on this model, we focus on the problem

of inferring the connectivity that best explains the data samples at different

vertices of a graph that is a priori unknown. We concentrate on the case where

the observed data is actually the sum of heat diffusion processes, which is a

quite common model for data on networks or other irregular structures. We cast

a new graph learning problem and solve it with an efficient nonconvex

optimization algorithm. Experiments on both synthetic and real world data

finally illustrate the benefits of the proposed graph learning framework and

confirm that the data structure can be efficiently learned from data

observations only. We believe that our algorithm will help solving key

questions in diverse application domains such as social and biological network

analysis where it is crucial to unveil proper geometry for data understanding

and inference.

### Ways of Conditioning Generative Adversarial Networks

Hanock Kwak , Byoung-Tak Zhang **Subjects** : Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

The GANs are generative models whose random samples realistically reflect

natural images. It also can generate samples with specific attributes by

concatenating a condition vector into the input, yet research on this field is

not well studied. We propose novel methods of conditioning generative

adversarial networks (GANs) that achieve state-of-the-art results on MNIST and

CIFAR-10. We mainly introduce two models: an information retrieving model that

extracts conditional information from the samples, and a spatial bilinear

pooling model that forms bilinear features derived from the spatial cross

product of an image and a condition vector. These methods significantly enhance

log-likelihood of test data under the conditional distributions compared to the

methods of concatenation.

### Semi-supervised deep learning by metric embedding

Elad Hoffer , Nir Ailon **Subjects** : Learning (cs.LG)

Deep networks are successfully used as classification models yielding

state-of-the-art results when trained on a large number of labeled samples.

These models, however, are usually much less suited for semi-supervised

problems because of their tendency to overfit easily when trained on small

amounts of data. In this work we will explore a new training objective that is

targeting a semi-supervised regime with only a small subset of labeled data.

This criterion is based on a deep metric embedding over distance relations

within the set of labeled samples, together with constraints over the

embeddings of the unlabeled set. The final learned representations are

discriminative in euclidean space, and hence can be used with subsequent

nearest-neighbor classification using the labeled samples.

### Learning Continuous Semantic Representations of Symbolic Expressions

Miltiadis Allamanis , Pankajan Chanthirasegaran , Pushmeet Kohli , Charles Sutton **Subjects** : Learning (cs.LG) ; Artificial Intelligence (cs.AI)

The question of how procedural knowledge is represented and inferred is a

fundamental problem in machine learning and artificial intelligence. Recent

work on program induction has proposed neural architectures, based on

abstractions like stacks, Turing machines, and interpreters, that operate on

abstract computational machines or on execution traces. But the recursive

abstraction that is central to procedural knowledge is perhaps most naturally

represented by symbolic representations that have syntactic structure, such as

logical expressions and source code. Combining abstract, symbolic reasoning

with continuous neural reasoning is a grand challenge of representation

learning. As a step in this direction, we propose a new architecture, called

neural equivalence networks, for the problem of learning continuous semantic

representations of mathematical and logical expressions. These networks are

trained to represent semantic equivalence, even of expressions that are

syntactically very different. The challenge is that semantic representations

must be computed in a syntax-directed manner, because semantics is

compositional, but at the same time, small changes in syntax can lead to very

large changes in semantics, which can be difficult for continuous neural

architectures. We perform an exhaustive evaluation on the task of checking

equivalence on a highly diverse class of symbolic algebraic and boolean

expression types, showing that our model significantly outperforms existing

architectures.

### A Communication-Efficient Parallel Algorithm for Decision Tree

Qi Meng , Guolin Ke , Taifeng Wang , Wei Chen , Qiwei Ye , Zhi-Ming Ma , Tie-Yan Liu **Subjects** : Learning (cs.LG)

Decision tree (and its extensions such as Gradient Boosting Decision Trees

and Random Forest) is a widely used machine learning algorithm, due to its

practical effectiveness and model interpretability. With the emergence of big

data, there is an increasing need to parallelize the training process of

decision tree. However, most existing attempts along this line suffer from high

communication costs. In this paper, we propose a new algorithm, called

emph{Parallel Voting Decision Tree (PV-Tree)}, to tackle this challenge. After

partitioning the training data onto a number of (e.g., (M)) machines, this

algorithm performs both local voting and global voting in each iteration. For

local voting, the top-(k) attributes are selected from each machine according

to its local data. Then, globally top-(2k) attributes are determined by a

majority voting among these local candidates. Finally, the full-grained

histograms of the globally top-(2k) attributes are collected from local

machines in order to identify the best (most informative) attribute and its

split point. PV-Tree can achieve a very low communication cost (independent of

the total number of attributes) and thus can scale out very well. Furthermore,

theoretical analysis shows that this algorithm can learn a near optimal

decision tree, since it can find the best attribute with a large probability.

Our experiments on real-world datasets show that PV-Tree significantly

outperforms the existing parallel decision tree algorithms in the trade-off

between accuracy and efficiency.

### Semantic Noise Modeling for Better Representation Learning

Hyo-Eun Kim , Sangheum Hwang , Kyunghyun Cho **Subjects** : Learning (cs.LG) ; Neural and Evolutionary Computing (cs.NE)

Latent representation learned from multi-layered neural networks via

hierarchical feature abstraction enables recent success of deep learning. Under

the deep learning framework, generalization performance highly depends on the

learned latent representation which is obtained from an appropriate training

scenario with a task-specific objective on a designed network model. In this

work, we propose a novel latent space modeling method to learn better latent

representation. We designed a neural network model based on the assumption that

good base representation can be attained by maximizing the total correlation

between the input, latent, and output variables. From the base model, we

introduce a semantic noise modeling method which enables class-conditional

perturbation on latent space to enhance the representational power of learned

latent feature. During training, latent vector representation can be

stochastically perturbed by a modeled class-conditional additive noise while

maintaining its original semantic feature. It implicitly brings the effect of

semantic augmentation on the latent space. The proposed model can be easily

learned by back-propagation with common gradient-based optimization algorithms.

Experimental results show that the proposed method helps to achieve performance

benefits against various previous approaches. We also provide the empirical

analyses for the proposed class-conditional perturbation process including

t-SNE visualization.

### Generalized Topic Modeling

Avrim Blum , Nika Haghtalab **Subjects** : Learning (cs.LG) ; Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

Recently there has been significant activity in developing algorithms with

provable guarantees for topic modeling. In standard topic models, a topic (such

as sports, business, or politics) is viewed as a probability distribution (vec

a_i) over words, and a document is generated by first selecting a mixture (vec

w) over topics, and then generating words i.i.d. from the associated mixture

(A{vec w}). Given a large collection of such documents, the goal is to recover

the topic vectors and then to correctly classify new documents according to

their topic mixture.

In this work we consider a broad generalization of this framework in which

words are no longer assumed to be drawn i.i.d. and instead a topic is a complex

distribution over sequences of paragraphs. Since one could not hope to even

represent such a distribution in general (even if paragraphs are given using

some natural feature representation), we aim instead to directly learn a

document classifier. That is, we aim to learn a predictor that given a new

document, accurately predicts its topic mixture, without learning the

distributions explicitly. We present several natural conditions under which one

can do this efficiently and discuss issues such as noise tolerance and sample

complexity in this model. More generally, our model can be viewed as a

generalization of the multi-view or co-training setting in machine learning.

### Sample Efficient Actor-Critic with Experience Replay

Comments: 20 pages. Prepared for ICLR 2017

**Subjects**

:

Learning (cs.LG)

This paper presents an actor-critic deep reinforcement learning agent with

experience replay that is stable, sample efficient, and performs remarkably

well on challenging environments, including the discrete 57-game Atari domain

and several continuous control problems. To achieve this, the paper introduces

several innovations, including truncated importance sampling with bias

correction, stochastic dueling network architectures, and a new trust region

policy optimization method.

### Combating Reinforcement Learning's Sisyphean Curse with Intrinsic Fear

Zachary C. Lipton , Jianfeng Gao , Lihong Li , Jianshu Chen , Li Deng **Subjects** : Learning (cs.LG) ; Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

To use deep reinforcement learning in the wild, we might hope for an agent

that would never make catastrophic mistakes. At the very least, we could hope

that an agent would eventually learn to avoid old mistakes. Unfortunately, even

in simple environments, modern deep reinforcement learning techniques are

doomed by a Sisyphean curse. Owing to the use of function approximation, these

agents eventually forget experiences as they become exceedingly unlikely under

a new policy. Consequently, for as long as they continue to train,

state-aggregating agents may periodically relive catastrophic mistakes. We

demonstrate unacceptable performance of deep Q-networks on two toy problems. We

then introduce intrinsic fear, a method that mitigates these problems by

avoiding dangerous states.

PrivLogit: Efficient Privacy-preserving Logistic Regression by Tailoring Numerical Optimizers

Wei Xie ,

Comments: 24 pages, 4 figures. Work done and circulated since 2015

**Subjects**

:

Learning (cs.LG)

; Cryptography and Security (cs.CR); Machine Learning (stat.ML)

Safeguarding privacy in machine learning is highly desirable, especially in

collaborative studies across many organizations. Privacy-preserving distributed

machine learning (based on cryptography) is popular to solve the problem.

However, existing cryptographic protocols still incur excess computational

overhead. Here, we make a novel observation that this is partially due to naive

adoption of mainstream numerical optimization (e.g., Newton method) and failing

to tailor for secure computing. This work presents a contrasting perspective:

customizing numerical optimization specifically for secure settings. We propose

a seemingly less-favorable optimization method that can in fact significantly

accelerate privacy-preserving logistic regression. Leveraging this new method,

we propose two new secure protocols for conducting logistic regression in a

privacy-preserving and distributed manner. Extensive theoretical and empirical

evaluations prove the competitive performance of our two secure proposals while

without compromising accuracy or privacy: with speedup up to 2.3x and 8.1x,

respectively, over state-of-the-art; and even faster as data scales up. Such

drastic speedup is on top of and in addition to performance improvements from

existing (and future) state-of-the-art cryptography. Our work provides a new

way towards efficient and practical privacy-preserving logistic regression for

large-scale studies which are common for modern science.

### Estimating Causal Direction and Confounding of Two Discrete Variables

Krzysztof Chalupka , Frederick Eberhardt , Pietro Perona **Subjects** : Machine Learning (stat.ML) ; Artificial Intelligence (cs.AI); Learning (cs.LG)

We propose a method to classify the causal relationship between two discrete

variables given only the joint distribution of the variables, acknowledging

that the method is subject to an inherent baseline error. We assume that the

causal system is acyclicity, but we do allow for hidden common causes. Our

algorithm presupposes that the probability distributions (P(C)) of a cause (C)

is independent from the probability distribution (P(Emid C)) of the

cause-effect mechanism. While our classifier is trained with a Bayesian

assumption of flat hyperpriors, we do not make this assumption about our test

data. This work connects to recent developments on the identifiability of

causal models over continuous variables under the assumption of “independent

mechanisms”. Carefully-commented Python notebooks that reproduce all our

experiments are available online at

Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks

Comments: 13 pages, 3 figures

**Subjects**

:

Neural and Evolutionary Computing (cs.NE)

; Learning (cs.LG)

Recently deep neural networks have received considerable attention due to

their ability to extract and represent high-level abstractions in data sets.

Deep neural networks such as fully-connected and convolutional neural networks

have shown excellent performance on a wide range of recognition and

classification tasks. However, their hardware implementations currently suffer

from large silicon area and high power consumption due to the their high degree

of complexity. The power/energy consumption of neural networks is dominated by

memory accesses, the majority of which occur in fully-connected networks. In

fact, they contain most of the deep neural network parameters. In this paper,

we propose sparsely-connected networks, by showing that the number of

connections in fully-connected networks can be reduced by up to 90% while

improving the accuracy performance on three popular datasets (MNIST, CIFAR10

and SVHN). We then propose an efficient hardware architecture based on

linear-feedback shift registers to reduce the memory requirements of the

proposed sparsely-connected networks. The proposed architecture can save up to

90% of memory compared to the conventional implementations of fully-connected

neural networks. Moreover, implementation results show up to 84% reduction in

the energy consumption of a single neuron of the proposed sparsely-connected

networks compared to a single neuron of fully-connected neural networks.

### Information-Theoretic Bounds and Approximations in Neural Population Coding

Wentao Huang , Kechen Zhang **Subjects** : Information Theory (cs.IT) ; Learning (cs.LG)

Information theory is a powerful tool for neuroscience and other disciplines.

Efficient calculation of Shannon’s mutual information (MI) is a key

computational step that often presents the biggest difficulty for practical

applications. In this paper, we propose effective approximation methods for

evaluating MI in the context of neural population coding, especially for

high-dimensional inputs. We prove several information-theoretic asymptotic

bounds and approximation formulas for large size neural populations. We also

discuss how optimization of neural population coding based on these

approximation formulas leads to a convex problem about the population density

distribution in neural population parameter space. Several useful techniques,

including variable transformation and dimensionality reduction, are proposed

for more efficient computation of the approximations. Our numerical simulation

results show that our asymptotic formulas are highly accurate for approximating

MI in neural populations. For some special cases, the approximations by our

formulas are exactly equal to the true MI. The approximation methods for MI may

have a wide range of applications in various disciplines.

### Learning to Rank Scientific Documents from the Crowd

Comments: 12 pages, 1 figure

**Subjects**

:

Information Retrieval (cs.IR)

; Computation and Language (cs.CL); Digital Libraries (cs.DL); Learning (cs.LG); Social and Information Networks (cs.SI)

Finding related published articles is an important task in any science, but

with the explosion of new work in the biomedical domain it has become

especially challenging. Most existing methodologies use text similarity metrics

to identify whether two articles are related or not. However biomedical

knowledge discovery is hypothesis-driven. The most related articles may not be

ones with the highest text similarities. In this study, we first develop an

innovative crowd-sourcing approach to build an expert-annotated

document-ranking corpus. Using this corpus as the gold standard, we then

evaluate the approaches of using text similarity to rank the relatedness of

articles. Finally, we develop and evaluate a new supervised model to

automatically rank related scientific articles. Our results show that authors’

ranking differ significantly from rankings by text-similarity-based models. By

training a learning-to-rank model on a subset of the annotated corpus, we found

the best supervised learning-to-rank model (SVM-Rank) significantly surpassed

state-of-the-art baseline systems.

### Information Dropout: learning optimal representations through noise

Alessandro Achille , Stefano Soatto **Subjects** : Machine Learning (stat.ML) ; Learning (cs.LG); Computation (stat.CO)

We introduce Information Dropout, a generalization of dropout that is

motivated by the Information Bottleneck principle and highlights the way in

which injecting noise in the activations can help in learning optimal

representations of the data. Information Dropout is rooted in information

theoretic principles, it includes as special cases several existing dropout

methods, like Gaussian Dropout and Variational Dropout, and, unlike classical

dropout, it can learn and build representations that are invariant to nuisances

of the data, like occlusions and clutter. When the task is the reconstruction

of the input, we show that the information dropout method yields a variational

autoencoder as a special case, thus providing a link between representation

learning, information theory and variational inference. Our experiments

validate the theoretical intuitions behind our method, and we find that

information dropout achieves a comparable or better generalization performance

than binary dropout, especially on smaller models, since it can automatically

adapt the noise to the structure of the network, as well as to the test sample.

### Learning Identity Mappings with Residual Gates

Pedro H. P. Savarese **Subjects** : Computer Vision and Pattern Recognition (cs.CV) ; Learning (cs.LG)

We propose a technique to augment network layers by adding a linear gating

mechanism, which provides a way to learn identity mappings by optimizing only

one parameter. We also introduce a new metric which served as basis for the

technique. It captures the difficulty involved in learning identity mappings

for different types of network models, and provides a new theoretical intuition

for the increased depths of models such as Highway and Residual Networks. We

propose a new model, the Gated Residual Network, which is the result when

augmenting Residual Networks. Experimental results show that augmenting layers

grants increased performance, less issues with depth, and more layer

independence — fully removing them does not cripple the model. We evaluate our

method on MNIST using fully-connected networks and on CIFAR-10 using Wide

ResNets, achieving a relative error reduction of more than 8% in the latter

when compared to the original model.

### Reparameterization trick for discrete variables

Seiya Tokui , Issei sato **Subjects** : Machine Learning (stat.ML) ; Learning (cs.LG)

Low-variance gradient estimation is crucial for learning directed graphical

models parameterized by neural networks, where the reparameterization trick is

widely used for those with continuous variables. While this technique gives

low-variance gradient estimates, it has not been directly applicable to

discrete variables, the sampling of which inherently requires discontinuous

operations. We argue that the discontinuity can be bypassed by marginalizing

out the variable of interest, which results in a new reparameterization trick

for discrete variables. This reparameterization greatly reduces the variance,

which is understood by regarding the method as an application of common random

numbers to the estimation. The resulting estimator is theoretically guaranteed

to have a variance not larger than that of the likelihood-ratio method with the

optimal input-dependent baseline. We give empirical results for variational

learning of sigmoid belief networks.

### Adversarial Machine Learning at Scale

Comments: 15 pages, 5 figures

**Subjects**

:

Computer Vision and Pattern Recognition (cs.CV)

; Cryptography and Security (cs.CR); Learning (cs.LG); Machine Learning (stat.ML)

Adversarial examples are malicious inputs designed to fool machine learning

models. They often transfer from one model to another, allowing attackers to

mount black box attacks without knowledge of the target model’s parameters.

Adversarial training is the process of explicitly training a model on

adversarial examples, in order to make it more robust to attack or to reduce

its test error on clean inputs. So far, adversarial training has primarily been

applied to small problems. In this research, we apply adversarial training to

ImageNet. Our contributions include: (1) recommendations for how to succesfully

scale adversarial training to large models and datasets, (2) the observation

that adversarial training confers robustness to single-step attack methods, (3)

the finding that multi-step attack methods are somewhat less transferable than

single-step attack methods, so single-step attacks are the best for mounting

black-box attacks, and (4) resolution of a “label leaking” effect that causes

adversarially trained models to perform better on adversarial examples than on

clean examples, because the adversarial example construction process uses the

true label and the model can learn to exploit regularities in the construction

process.

### Deep Information Propagation

Samuel S. Schoenholz , Justin Gilmer , Surya Ganguli , Jascha Sohl-Dickstein **Subjects** : Machine Learning (stat.ML) ; Learning (cs.LG)

We study the behavior of untrained neural networks whose weights and biases

are randomly distributed using mean field theory. We show the existence of

depth scales that naturally limit the maximum depth of signal propagation

through these random networks. Our main practical result is to show that random

networks may be trained precisely when information can travel through them.

Thus, the depth scales that we identify provide bounds on how deep a network

may be trained for a specific choice of hyperparameters. As a corollary to

this, we argue that in networks at the edge of chaos, one of these depth scales

diverges. Thus arbitrarily deep networks may be trained only sufficiently close

to criticality. We show that the presence of dropout destroys the

order-to-chaos critical point and therefore strongly limits the maximum

trainable depth for random networks. Finally, we develop a mean field theory

for backpropagation and we show that the ordered and chaotic phases correspond

to regions of vanishing and exploding gradient respectively.

### Conspiracies between Learning Algorithms, Circuit Lower Bounds and Pseudorandomness

Igor C. Oliveira , Rahul Santhanam **Subjects** : Computational Complexity (cs.CC) ; Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Learning (cs.LG)

We prove several results giving new and stronger connections between

learning, circuit lower bounds and pseudorandomness. Among other results, we

show a generic learning speedup lemma, equivalences between various learning

models in the exponential time and subexponential time regimes, a dichotomy

between learning and pseudorandomness, consequences of non-trivial learning for

circuit lower bounds, Karp-Lipton theorems for probabilistic exponential time,

and NC(^1)-hardness for the Minimum Circuit Size Problem.

### Demystifying ResNet

Sihan Li , Jiantao Jiao , Yanjun Han , Tsachy Weissman **Subjects** : Neural and Evolutionary Computing (cs.NE) ; Learning (cs.LG); Machine Learning (stat.ML)

We provide a theoretical explanation for the superb performance of ResNet via

the study of deep linear networks and some nonlinear variants. We show that

with or without nonlinearities, by adding shortcuts that have depth two, the

condition number of the Hessian of the loss function at the zero initial point

is depth-invariant, which makes training very deep models no more difficult

than shallow ones. Shortcuts of higher depth result in an extremely flat

(high-order) stationary point initially, from which the optimization algorithm

is hard to escape. The 1-shortcut, however, is essentially equivalent to no

shortcuts. Extensive experiments are provided accompanying our theoretical

results. We show that initializing the network to small weights with

2-shortcuts achieves significantly better results than random Gaussian (Xavier)

initialization, orthogonal initialization, and shortcuts of deeper depth, from

various perspectives ranging from final loss, learning dynamics and stability,

to the behavior of the Hessian along the learning process.

## Information Theory

### Almost universal codes for MIMO wiretap channels

Comments: 31 pages, 1 figure

**Subjects**

:

Information Theory (cs.IT)

; Number Theory (math.NT)

Despite several works on secrecy coding for fading and MIMO wiretap channels

from an error probability perspective, the construction of information

theoretically secure codes over such channels remains as an open problem. In

this paper, we consider a fading wiretap channel model where the transmitter

has only partial statistical channel state information. We extend the flatness

factor criterion from the Gaussian wiretap channel to fading and MIMO wiretap

channels, and propose concrete lattice codes with a vanishing flatness factor

to achieve information theoretic security. These codes are built from algebraic

number fields with constant root discriminant for the single-antenna fading

wiretap channel, and from division algebras centered at such number fields for

the MIMO wiretap channel, respectively. The proposed lattice codes achieve

strong secrecy and semantic security over any ergodic stationary fading/MIMO

wiretap channel with sufficiently fast decay of time correlations, for all

secrecy rates (R < C_b-C_e -kappa), where (C_b) and (C_e) are Bob and Eve’s

channel capacities respectively, and (kappa) is an explicit constant gap.

Moreover, these codes are almost universal in the sense that a fixed code is

good for secrecy for a wide range of fading models.

### Information-Theoretic Bounds and Approximations in Neural Population Coding

Wentao Huang , Kechen Zhang **Subjects** : Information Theory (cs.IT) ; Learning (cs.LG)

Information theory is a powerful tool for neuroscience and other disciplines.

Efficient calculation of Shannon’s mutual information (MI) is a key

computational step that often presents the biggest difficulty for practical

applications. In this paper, we propose effective approximation methods for

evaluating MI in the context of neural population coding, especially for

high-dimensional inputs. We prove several information-theoretic asymptotic

bounds and approximation formulas for large size neural populations. We also

discuss how optimization of neural population coding based on these

approximation formulas leads to a convex problem about the population density

distribution in neural population parameter space. Several useful techniques,

including variable transformation and dimensionality reduction, are proposed

for more efficient computation of the approximations. Our numerical simulation

results show that our asymptotic formulas are highly accurate for approximating

MI in neural populations. For some special cases, the approximations by our

formulas are exactly equal to the true MI. The approximation methods for MI may

have a wide range of applications in various disciplines.

### Denoising based Vector Approximate Message Passing

Philip Schniter , Sundeep Rangan , Alyson Fletcher **Subjects** : Information Theory (cs.IT)

The denoising-based approximate message passing (D-AMP) methodology, recently

proposed by Metzler, Maleki, and Baraniuk, allows one to plug in sophisticated

denoisers like BM3D into the AMP algorithm to achieve state-of-the-art

compressive image recovery. But AMP diverges with small deviations from the

i.i.d.-Gaussian assumption on the measurement matrix. Recently, the vector AMP

(VAMP) algorithm has been proposed to fix this problem. In this work, we show

that the benefits of VAMP extend to D-VAMP.

### Out-of-Band Radiation from Large Antenna Arrays

Christopher Mollén , Erik G. Larsson , Ulf Gustavsson , Thomas Eriksson , Robert W. Heath Jr **Subjects** : Information Theory (cs.IT)

Co-existing wireless systems that share a common spectrum need to mitigate

out-of-band (OOB) radiation to avoid excessive interference. In legacy SISO

transmitters and small MIMO arrays, OOB radiation is well understood and is

commonly handled by digital compensation techniques. In large arrays, however,

new phenomena and hardware limitations have to be considered: First, signals

can be radiated directionally, which might focus the OOB radiation. Second,

low-complexity hardware with poor linearity has to be used for cost reasons,

which increases the relative amount of OOB radiation. Given that massive MIMO

and millimeter wave communication rely on base stations with a huge number of

antennas, the spatial behavior of OOB radiation from large arrays will have

significant implications for future hardware requirements. We show that, if the

OOB radiation is beamformed, its array gain is never larger than that of the

in-band signal. In many cases, the OOB radiation is even close to isotropic

also when the in-band signal is highly directive. With the same total radiated

power, the OOB radiation from large arrays is therefore never more severe than

from a legacy system with the same adjacent-channel-leakage ratio. The OOB

radiation is even less detrimental than from a legacy system since the high

array gain of the in-band signal allows large arrays to radiate less total

power than legacy systems. We also show how OOB radiation from large arrays

varies with location in static propagation environments and how these effects

vanish when averaged over the small-scale fading. A main conclusions is that

the linearity requirement for large arrays can be relaxed as compared to legacy

systems. Specifically, less stringent linearity requirements on each

transmitter makes it possible to build large arrays from low-complexity

hardware.

### Phi-Entropic Measures of Correlation

Salman Beigi , Amin Gohari **Subjects** : Information Theory (cs.IT)

A measure of correlation is said to have the tensorization property if it is

unchanged when computed for i.i.d. copies. More precisely, a measure of

correlation between two random variables ((X, Y)) denoted by (

ho(X, Y)), has

the tensorization property if (

ho(X^n, Y^n)=

ho(X, Y)) where ((X^n, Y^n)) is

(n) i.i.d. copies of ((X, Y)).Two well-known examples of such measures are the

maximal correlation and the hypercontractivity ribbon (HC~ribbon). We show that

the maximal correlation and HC ribbons are special cases of (Phi)-ribbon,

defined in this paper for any function (Phi) from a class of convex functions

((Phi)-ribbon reduces to HC~ribbon and the maximal correlation for special

choices of (Phi)). Any (Phi)-ribbon is shown to be a measures of correlation

with the tensorization property. We show that the (Phi)-ribbon also

characterizes the (Phi)-strong data processing inequality constant introduced

by Raginsky. We further study the (Phi)-ribbon for the choice of (Phi(t)=t^2)

and introduce an equivalent characterization of this ribbon.

### On TDMA Optimality in Locally Connected Networks with no CSIT

Comments: submitted to IEEE International Conference on Communications (ICC 2017)

**Subjects**

:

Information Theory (cs.IT)

In this work, we study scenarios of wireless networks where using simple

Time-Division-Multiple-Access (TDMA) is optimal if no channel state information

is available at the transmitters (no CSIT). We consider single-hop locally

connected interference networks where each transmitter is connected to the

receiver that has the same index as well as L succeeding receivers. The

considered rate criterion is the per user Degrees of Freedom as the number of

transmitter-receiver pairs goes to infinity. For the case when L=1, it was

shown in previous work that TDMA is optimal, even if each message can be

available at multiple transmitters and linear cooperation schemes are allowed.

Here, we extend this conclusion to the case where L=2, by proving optimality of

TDMA in this case as well. We then study the problem for general values of L

without allowing for cooperative transmission. We prove optimality of TDMA when

each transmitter is serving the receiver with the same index, and finally

conjecture – based on an example – that the same conclusion applies even if

each receiver can be served by any single transmitter connected to it.

### Capacity-Achieving Rate-Compatible Polar Codes for General Channels

Comments: 6 pages, 2 figures, submitted to WCNC’17 workshop on polar coding

**Subjects**

:

Information Theory (cs.IT)

We present a rate-compatible polar coding scheme that achieves the capacity

of any family of channels. Our solution generalizes the previous results [1],

[2] that provide capacity-achieving rate-compatible polar codes for a degraded

family of channels. The motivation for our extension comes from the fact that

in many practical scenarios, e.g., MIMO systems and non-Gaussian interference,

the channels cannot be ordered by degradation. The main technical contribution

of this paper consists in removing the degradation condition. To do so, we

exploit the ideas coming from the construction of universal polar codes.

Our scheme possesses the usual attractive features of polar codes: low

complexity code construction, encoding, and decoding; super-polynomial scaling

of the error probability with the block length; and absence of error floors. On

the negative side, the scaling of the gap to capacity with the block length is

slower than in standard polar codes, and we prove an upper bound on the scaling

exponent.

### Hierarchical Overlapping Clustering of Network Data Using Cut Metrics

Comments: Submitted to IEEE Transactions on Signal and Information Processing Over Networks

**Subjects**

:

Social and Information Networks (cs.SI)

; Information Theory (cs.IT); Physics and Society (physics.soc-ph)

A novel method to obtain hierarchical and overlapping clusters from network

data -i.e., a set of nodes endowed with pairwise dissimilarities- is presented.

The introduced method is hierarchical in the sense that it outputs a nested

collection of groupings of the node set depending on the resolution or degree

of similarity desired, and it is overlapping since it allows nodes to belong to

more than one group. Our construction is rooted on the facts that a

hierarchical (non-overlapping) clustering of a network can be equivalently

represented by a finite ultrametric space and that a convex combination of

ultrametrics results in a cut metric. By applying a hierarchical

(non-overlapping) clustering method to multiple dithered versions of a given

network and then convexly combining the resulting ultrametrics, we obtain a cut

metric associated to the network of interest. We then show how to extract a

hierarchical overlapping clustering structure from the aforementioned cut

metric. Furthermore, the so-called overlapping function is presented as a tool

for gaining insights about the data by identifying meaningful resolutions of

the obtained hierarchical structure. Additionally, we explore hierarchical

overlapping quasi-clustering methods that preserve the asymmetry of the data

contained in directed networks. Finally, the presented method is illustrated

via synthetic and real-world classification problems including handwritten

digit classification and authorship attribution of famous plays.

### On the primitivity of PRESENT and other lightweight ciphers

Riccardo Aragona , Marco Calderini , Antonio Tortora , Maria Tota **Subjects** : Group Theory (math.GR) ; Cryptography and Security (cs.CR); Information Theory (cs.IT)

We provide two sufficient conditions to guarantee that the round functions of

a translation based cipher generate a primitive group. Furthermore, if a round

of the cipher is strongly proper and consists of m-bit S-Boxes, with m = 3, 4

or 5, we prove that such a group is the alternating group. As an immediate

consequence, we deduce that the round functions of some lightweight translation

based ciphers, such as the PRESENT cipher, generate the alternating group.

### Adaptive Geometric Multiscale Approximations for Intrinsically Low-dimensional Data

Wenjing Liao , Mauro Maggioni **Subjects** : Machine Learning (stat.ML) ; Information Theory (cs.IT); Statistics Theory (math.ST)

We consider the problem of efficiently approximating and encoding

high-dimensional data sampled from a probability distribution (

ho) in

(mathbb{R}^D), that is nearly supported on a (d)-dimensional set (mathcal{M})

– for example supported on a (d)-dimensional Riemannian manifold. Geometric

Multi-Resolution Analysis (GMRA) provides a robust and computationally

efficient procedure to construct low-dimensional geometric approximations of

(mathcal{M}) at varying resolutions. We introduce a thresholding algorithm on

the geometric wavelet coefficients, leading to what we call adaptive GMRA

approximations. We show that these data-driven, empirical approximations

perform well, when the threshold is chosen as a suitable universal function of

the number of samples (n), on a wide variety of measures (

ho), that are

allowed to exhibit different regularity at different scales and locations,

thereby efficiently encoding data from more complex measures than those

supported on manifolds. These approximations yield a data-driven dictionary,

together with a fast transform mapping data to coefficients, and an inverse of

such a map. The algorithms for both the dictionary construction and the

transforms have complexity (C n log n) with the constant linear in (D) and

exponential in (d). Our work therefore establishes adaptive GMRA as a fast

dictionary learning algorithm with approximation guarantees. We include several

numerical experiments on both synthetic and real data, confirming our

theoretical results and demonstrating the effectiveness of adaptive GMRA.

欢迎加入我爱机器学习QQ6群： **337537549**

微信扫一扫，关注我爱机器学习公众号

微博：我爱机器学习