Here’s our weekly roundup of articles related to Big Data Analytics, Machine Learning and Data Science. Hope you find it useful – please don’t forget to subscribe!
- All the Apache Streaming Projects: An Exploratory Guide [Learning Spark]
This article attempts to help customers navigate the complex maze of Apache streaming projects by calling out the key differentiators for each. We will discuss the use cases and key scenarios addressed by Apache Kafka, Apache Storm, Apache Spark, Apache Samza, Apache Beam and related projects.
- Learning Path To Become Data Scientist – Step by Step Guide
This article covers what you need to learn to become a data scientist, including linear algebra and matrix factorizations, distributed database systems, statistical analysis, computational mathematics, machine learning, and signal detection and estimation.
- Using a Java UDF with Hive in Azure HDInsight
Hive is great for working with data in HDInsight, but sometimes you need a more general-purpose language. Hive allows you to create user-defined functions (UDFs) using a variety of programming languages. In this document, you will learn how to use a Java UDF from Hive.
Azure HDInsight deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.
- Hadoop Audit and Logging “Back in Time” [Additional Reference]
This article covers the audit capabilities (audit being the third of the three As of information security) of several Hadoop components, including HDFS, MapReduce, YARN, Hive, HBase, Sentry, and Cloudera Impala.
- Hadoop Summit 2016: Debugging Apache Hadoop YARN Cluster in Production
YARN is a generic resource management platform that can host a multitude of applications and services. We all put a lot of effort into making YARN run smoothly for our customers and organizations, but we have also unavoidably dealt with various kinds of nasty bugs.
- Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Amazon’s product catalog is huge compared to the number of products that a customer has purchased, making our datasets extremely sparse. Neural network models often have to be distributed across multiple GPUs to meet space and time constraints – so we have created and open-sourced DSSTNE, the Deep Scalable Sparse Tensor Neural Engine, which runs entirely on the GPU.
- How to Transform Your Machine Learning Data in Weka [Learn Weka]
This article explains how to convert a real-valued attribute into a discrete one (discretization), how to convert a discrete attribute into multiple real-valued attributes (dummy variables), and when to discretize or create dummy variables from your data.
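To make the two transformations above concrete, here is a minimal sketch in plain Python. It is not Weka's implementation and does not use Weka's filters. It assumes equal-width binning for discretization and one 0/1 column per category for dummy variables. The function names `discretize` and `dummy_variables` and the sample data are hypothetical, chosen only for illustration.

```python
def discretize(values, n_bins):
    """Equal-width binning: map each real value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would otherwise land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def dummy_variables(labels):
    """Return the sorted categories and one 0/1 row per label (one column per category)."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical sample data, for illustration only.
ages = [18.0, 25.0, 40.0, 62.0, 90.0]
print(discretize(ages, 3))

colors = ["red", "green", "red", "blue"]
categories, rows = dummy_variables(colors)
print(categories)
print(rows)
```

Equal-width binning is only one possible choice. Equal-frequency bins or supervised methods can give different results on skewed data, so treat this as a starting point rather than a recommendation.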
- A Guide to Machine Learning in Python
This tutorial aims to give you an accessible introduction to using machine learning techniques in your projects and data sets. In just 20 minutes, you will learn how to use Python to apply different machine learning techniques, from decision trees to deep neural networks, to a sample data set.
- Selecting the right chart for data visualization