Here’s our weekly digest – a curated list of articles published this week on Big Data Analytics, Machine Learning and Data Science. We hope you find it useful – please don’t forget to subscribe! You can find the archives here.
- Rapid Big Data Prototyping with Microsoft R Server on Apache Spark: Context Switching & Spark Tuning
In this blog post, Max Kaznady demonstrates how to develop a predictive model on 170 million rows (37GB of raw data in Apache Hive) of the public NYC Taxi fare dataset for 2013 using MRS’s local compute context, and derives insights into which factors contribute to passenger tipping on a trip.
- Image Completion with Deep Learning in TensorFlow
In this blog post, Brandon Amos covers one method of completing images with deep learning in TensorFlow: it interprets images as samples from a probability distribution, generates fake images, and finds the best fake image for the completion.
- Creating a Multinode Hadoop cluster in 4 mins using docker containers
- Predicting house prices with regression-type Machine Learning methods
Several regression methods are compared on the same dataset, with house prices predicted as a function of each house’s attributes. The dataset used is the Boston house-prices dataset, which contains 506 instances representing houses in the suburbs of Boston, each described by 14 features, one of which (the median value of owner-occupied homes) is the target variable.
- Interactive Data Visualization of Geospatial Data using D3.js, DC.js, Leaflet.js and Python
This tutorial introduces the steps for building an interactive visualization of geospatial data, using a dataset from a Kaggle competition that shows the distribution of mobile phone users in China.
- Infographic: The 8 Most Common Hadoop Cluster Ailments
Pepperdata recently performed a study on over one hundred Hadoop clusters and unearthed the most common symptoms of cluster flux within businesses of every size. The presented infographic shares their findings.
- Avro Schema Registry with Apache Atlas for Streaming Data
Extending Apache Atlas to store and curate Avro Schemas provides substantial benefits beyond a simple Avro Schema Registry.
- Mesosphere + DataStax + Confluent + Lightbend = Container 2.0… But is it complicated?
For the last couple of years, there has been a lot of discussion on the merits of Containers versus Virtual Machines (VMs). Even though containers are generally seen as more agile, better suited to cloud architectures such as microservices, and more performant than VMs, one of their major limitations has been the lack of support for stateful applications. So, simplistically put, Container 2.0 = Container + State.
- An Intuitive Explanation of Convolutional Neural Networks
Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, as well as powering vision in robots and self-driving cars.
- Machine Learning Exercises In Python, Part 8
This post is part of a series covering the exercises from Andrew Ng’s machine learning class on Coursera. The original code, exercise text, and data files for this post are available here: https://github.com/jdwittenauer/ipython-notebooks.
Part 1 – Simple Linear Regression
Part 2 – Multivariate Linear Regression
Part 3 – Logistic Regression
Part 4 – Multivariate Logistic Regression
Part 5 – Neural Networks
Part 6 – Support Vector Machines
Part 7 – K-Means Clustering & PCA
Part 8 – Anomaly Detection & Recommendation