This Week in Big Data Analytics (July 24, 2016) – Weekly roundup

Here’s our weekly roundup of articles related to Big Data Analytics, Machine Learning and Data Science. Hope you find it useful – please don’t forget to subscribe!

  • Monitoring Hadoop’s health and performance metrics
    This article explores the key Hadoop metrics – HDFS metrics (NameNode and DataNode), MapReduce counters (Job, Task, Filesystem etc.), YARN metrics (Cluster, Application, NodeManager), ZooKeeper metrics – that should be monitored to keep tabs on the health and performance of the Hadoop cluster.

    Hadoop Metrics
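    As a rough illustration of watching a few of the HDFS metrics mentioned above, the sketch below inspects a JSON document of the kind a NameNode serves from its `/jmx` endpoint. It is not taken from the linked article. The function name `hdfs_warnings`, the 10% free-capacity threshold, and the sample payload are all assumptions made up for the example; only the `FSNamesystem` bean and its metric names come from Hadoop's metrics documentation.

```python
import json

# Bean that carries the NameNode's file-system level metrics.
FSNAMESYSTEM_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"


def hdfs_warnings(jmx_payload, min_free_ratio=0.10):
    """Return a list of warning strings derived from a parsed /jmx document.

    min_free_ratio is an arbitrary illustrative threshold, not a recommendation.
    """
    beans = {bean.get("name"): bean for bean in jmx_payload.get("beans", [])}
    fs = beans.get(FSNAMESYSTEM_BEAN)
    if fs is None:
        return ["FSNamesystem bean not found"]

    warnings = []
    if fs.get("MissingBlocks", 0) > 0:
        warnings.append("missing blocks: %d" % fs["MissingBlocks"])
    if fs.get("UnderReplicatedBlocks", 0) > 0:
        warnings.append("under-replicated blocks: %d" % fs["UnderReplicatedBlocks"])

    total = fs.get("CapacityTotal", 0)
    if total and fs.get("CapacityRemaining", 0) / total < min_free_ratio:
        warnings.append("low free capacity")
    return warnings


# Made-up payload for illustration; a real one has many more beans and fields.
sample = json.loads("""
{"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystem",
            "MissingBlocks": 0,
            "UnderReplicatedBlocks": 12,
            "CapacityTotal": 1000,
            "CapacityRemaining": 50}]}
""")
warnings = hdfs_warnings(sample)
print(warnings)
```

    In practice the payload would be fetched from the NameNode's web port rather than hard-coded, and the thresholds tuned to the cluster.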
  • Using Apache Spark & Scala to analyze Apache access logs
    This step-by-step tutorial shows how to use Spark and Scala to find broken URLs in Apache access log files and generate a list of those URLs sorted by hit count.

    Apache Spark and Scala
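    The tutorial itself uses Spark and Scala. As a small stand-in for the same idea, the plain-Python sketch below counts requests that returned a 404 status and sorts the URLs by hit count. The regular expression assumes Common or Combined Log Format, and the function name `broken_urls` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Captures the requested path and the status code from a log line such as
#   ... "GET /path HTTP/1.1" 404 512
LOG_PATTERN = re.compile(r'"(?:[A-Z]+) (\S+) [^"]*" (\d{3}) ')


def broken_urls(lines, status="404"):
    """Return (url, hits) pairs for requests with the given status, most-hit first."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and match.group(2) == status:
            counts[match.group(1)] += 1
    return counts.most_common()


# Made-up sample lines for illustration only.
sample = [
    '10.0.0.1 - - [24/Jul/2016:10:00:00 +0000] "GET /old-page HTTP/1.1" 404 512',
    '10.0.0.2 - - [24/Jul/2016:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.3 - - [24/Jul/2016:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 512',
    '10.0.0.4 - - [24/Jul/2016:10:00:03 +0000] "GET /missing.png HTTP/1.1" 404 128',
]
result = broken_urls(sample)
print(result)
```

    On a cluster, the same filter-then-count-then-sort shape would be expressed over a distributed dataset instead of a Python list.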
  • Cloudera’s Spark Guide
    This book gives an overview of Spark, talks about running the first Spark application, Spark application development practices, and Spark & Hadoop integration. It also touches on Oozie, Hive, HBase, PySpark, YARN, Spark Streaming, Spark MLlib, Spark SQL, Amazon S3 integration etc.

  • Why Twitter built yet another real-time stream processing engine – Heron?
    This article gives an overview of Apache Storm, which Twitter’s engineering team used before deciding to build Heron, a streaming engine that is API-compatible with Storm, and explains the motivating factors behind that decision.

    Twitter Heron
  • Real-Time Stream Processing Architecture for IoT using Google Cloud
    This article describes the infrastructure to handle streams of data fed from millions of intelligent devices in the Internet of Things (IoT). The architecture for this type of real-time stream processing must deal with data import, processing, storage, and analysis of hundreds of millions of events per hour. The architecture below depicts just such a system.

    IoT Stream Processing Architecture
  • Learn How To Secure A Hadoop Cluster Using Kerberos
    This article introduces Kerberos as a way of adding security to the Hadoop cluster. It discusses basic Kerberos concepts, installation of client and server components, and configuring SSH and Hadoop to use Kerberos.

    Hadoop - Kerberos
  • Hands-on introduction to Data Science with Apache Spark – Crash Course
    This crash course introduces Data Science and Machine Learning, walks through Machine Learning examples, and gives an overview of Machine Learning methods, including K-means, Decision Trees and Random Forests, as well as the Spark ML library.
  • How to approach (almost) any Machine Learning problem? 
    An average data scientist deals with loads of data daily. Some say that 60-70% of that time is spent on data cleaning, munging and bringing data into a format to which machine learning models can be applied. This post focuses on the second part, i.e., applying machine learning models, including the preprocessing steps.

    Machine Learning - Data Pipeline
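    To make the "preprocess, then apply a model" flow concrete, here is a minimal pure-Python sketch. It is not the linked post's code: the nearest-centroid classifier is only a stand-in model, and the helper names and toy data are made up. The one point it tries to show is that the scaling parameters are learned from the training rows and then reused unchanged on the test rows.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, scaler):
    """Standardize rows using previously learned means and deviations."""
    means, stds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def fit_centroids(rows, labels):
    """Compute one mean vector (centroid) per class label."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in grouped.items()}


def predict(row, centroids):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(row, centroids[label]))


# Made-up toy data: [height_cm, weight_kg] -> label.
train_X = [[150, 50], [155, 55], [180, 85], [185, 90]]
train_y = ["small", "small", "large", "large"]
test_X = [[152, 52], [183, 88]]

scaler = fit_scaler(train_X)                       # preprocessing step
centroids = fit_centroids(transform(train_X, scaler), train_y)  # model step
predictions = [predict(row, centroids) for row in transform(test_X, scaler)]
print(predictions)
```

    A real project would normally use a library pipeline for this, but the ordering of the steps stays the same.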
  • How To Use Regression Machine Learning Algorithms in Weka
    Algorithms reviewed: Linear Regression, k-Nearest Neighbors, Decision Tree, Support Vector Machines and Multi-Layer Perceptron
  • Installing Keras, a Python package for deep learning – Step by step guide
    The purpose of this blog post is to demonstrate how to install the Keras library for deep learning. The installation procedure will show how to install Keras with and without GPU support.
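    After following an installation guide like this one, a quick sanity check can confirm that the package is visible to the Python interpreter in use. The sketch below is an assumption-laden convenience, not part of the linked post: `keras_status` is a made-up helper, and it only reports whether `keras` can be found and imported, without saying anything about GPU support.

```python
import importlib
import importlib.util


def keras_status():
    """Report whether the keras package can be found and imported here."""
    if importlib.util.find_spec("keras") is None:
        return "keras is not installed"
    try:
        keras = importlib.import_module("keras")
    except Exception as exc:  # e.g. a missing or misconfigured backend
        return "keras is installed but failed to import: %s" % exc
    version = getattr(keras, "__version__", "(unknown version)")
    return "keras %s is available" % version


status = keras_status()
print(status)
```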