Here’s our weekly roundup of articles related to Big Data Analytics, Machine Learning and Data Science. Hope you find it useful – please don’t forget to subscribe!
- Monitoring Hadoop’s health and performance metrics
This article explores the key Hadoop metrics – HDFS metrics (NameNode and DataNode), MapReduce counters (Job, Task, Filesystem etc.), YARN metrics (Cluster, Application, NodeManager), ZooKeeper metrics – that should be monitored to keep tabs on the health and performance of the Hadoop cluster.
- Using Apache Spark & Scala to analyze apache access logs
This step-by-step tutorial shows how to find broken URLs in Apache access log files and generate a list of those URLs, sorted by hit count, using Spark and Scala.
- Cloudera’s Spark Guide
This book gives an overview of Spark, talks about running the first Spark application, Spark application development practices, and Spark & Hadoop integration. It also touches on Oozie, Hive, HBase, PySpark, YARN, Spark Streaming, Spark MLlib, Spark SQL, Amazon S3 integration etc.
- Why Twitter built yet another real-time stream processing engine – Heron?
This article gives an overview of Apache Storm, which Twitter’s engineering team used before they decided to build Heron, another streaming engine that is API compatible with Storm, and explains the motivating factors behind that decision.
- Real-Time Stream Processing Architecture for IoT using Google Cloud
This article describes the infrastructure needed to handle streams of data fed from millions of intelligent devices in the Internet of Things (IoT). The architecture for this type of real-time stream processing must deal with data import, processing, storage, and analysis of hundreds of millions of events per hour. The article presents an architecture for just such a system.
- Learn How To Secure A Hadoop Cluster Using Kerberos
This article introduces Kerberos as a way of adding security to the Hadoop cluster. It discusses basic Kerberos concepts, installation of client and server components, and configuring SSH and Hadoop to use Kerberos.
- Hands-on introduction to Data Science with Apache Spark – Crash Course
This crash course introduces Data Science & Machine Learning, walks through Machine Learning examples, and gives an overview of Machine Learning methods, including K-means, Decision Trees & Random Forests, the Spark ML library etc.
- How to approach (almost) any Machine Learning problem?
An average data scientist deals with loads of data daily. Some say that 60-70% of the time is spent on cleaning and munging the data and bringing it into a format to which machine learning models can be applied. This post focuses on the second part, i.e., applying machine learning models, including the preprocessing steps.
- How To Use Regression Machine Learning Algorithms in Weka
Algorithms reviewed: Linear Regression, k-Nearest Neighbors, Decision Tree, Support Vector Machines, and Multi-Layer Perceptron.
- Installing Keras, a python package for deep learning – Step by step guide
The purpose of this blog post is to demonstrate how to install the Keras library for deep learning. The installation procedure will show how to install Keras with and without GPU support.
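
As a rough illustration of the access-log analysis described in the Spark & Scala tutorial listed above, here is a minimal sketch of the same idea: keep only the requests that returned 404, count them per URL, and sort by hit count. Note that it uses plain Python rather than Spark and Scala, so it is not the tutorial's code. The sample log lines, the `LOG_PATTERN` regular expression, and the `broken_urls` helper are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical sample lines in Apache common log format (not taken from the tutorial).
SAMPLE_LOG = [
    '127.0.0.1 - - [10/Oct/2016:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2016:13:55:40 -0700] "GET /missing.html HTTP/1.1" 404 209',
    '127.0.0.1 - - [10/Oct/2016:13:56:01 -0700] "GET /old-page HTTP/1.1" 404 209',
    '127.0.0.1 - - [10/Oct/2016:13:56:12 -0700] "GET /missing.html HTTP/1.1" 404 209',
    '127.0.0.1 - - [10/Oct/2016:13:57:45 -0700] "GET /index.html HTTP/1.1" 200 2326',
]

# Assumed pattern: captures the requested URL and the three-digit status code
# from the quoted request section of each line.
LOG_PATTERN = re.compile(r'"[A-Z]+ (?P<url>\S+) [^"]*" (?P<status>\d{3})\b')


def broken_urls(lines, status="404"):
    """Return (url, hits) pairs for the given status, most-hit first."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        # Lines that do not match the pattern are skipped rather than treated as errors.
        if match and match.group("status") == status:
            counts[match.group("url")] += 1
    return counts.most_common()


if __name__ == "__main__":
    for url, hits in broken_urls(SAMPLE_LOG):
        print(hits, url)
```

In a Spark job, the same filter, count-per-key, and sort steps would typically run in parallel across a cluster; the single-process version here is only meant to show the logic.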