We researched the must-have and nice-to-have skills that Fortune 500 companies look for when hiring engineers to work on solutions requiring expertise in machine learning, data science, and big data. We aim to keep this list updated so it stays relevant.
[Here’s a report on how much data scientists were paid in the US in 2016: Data Scientist Salary Insights 2016]
Here are some of these skills:
- Proficient in querying and manipulating large data sets for analytical purposes using SQL-like languages (Hive / Impala)
- Apache Ecosystem – Hadoop, Hadoop Distributed File System (HDFS), MapReduce/YARN (Yet Another Resource Negotiator), Hive (data warehouse infrastructure), HBase (distributed column-oriented NoSQL database), Oozie (workflow scheduling), Sqoop (data ingestion), ZooKeeper, Pig (scripting), Ambari (Hadoop cluster management platform), Spark (big data processing engine), Flink (streaming dataflow / analytics engine), Storm (real-time data processing), Flume (log data processing), Avro (data serialization)
- Machine learning techniques such as neural networks, Hidden Markov Models (HMM), maximum entropy models, and other popular algorithms
- Feature engineering and statistical modeling methods such as Conditional Random Fields (CRF), HMM, Support Vector Machines (SVM), and Gradient Boosted Decision Trees (GBDT)
- Statistical methods such as Categorical Data Analysis, Multivariate Analysis, Regression Analysis, Survey Sampling Design, Survival/Reliability Analysis, Design of Experiments, and Analysis of Variance
- Building machine learning systems for modern parallel-computing environments (GPU, Multicore Symmetric Multiprocessing (SMP), Distributed Clusters); CUDA kernels
- Machine learning frameworks such as Caffe, Theano, Torch, TensorFlow, MXNet, Apache Mahout, Spark MLlib; scikit-learn, SciPy, NumPy; Amazon Machine Learning
- Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), supervised and unsupervised learning, and optimization techniques
- Traditional/Modern statistical techniques, including SVM, Regularization, Boosting, Random Forests, and other Ensemble Methods
- Natural language processing (NLP) problems, including predictive typing, input method conversion, tokenization, tagging, language modeling, language identification, sentiment analysis, named entity recognition, lemmatization, summarization
- Building solutions for spelling correction, related searches, synonym/acronym expansion, query rewriting, metrics accumulation, spam prevention, ranking, and recommendations
- Proficiency in predictive modeling and data mining tools such as SQL, R, SAS, JMP, Python, Watson, and Aster
- Experience with reporting/analytics/data visualization tools such as D3.js, Tableau, QlikView, Datameer, Platfora, ELK, and Cognos
- Familiarity with commercial ETL platforms such as Informatica, SSIS, and Talend
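
To make the first item on the list a little more concrete (querying and aggregating data with a SQL-like language), here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Hive or Impala. The `page_views` table, its columns, the `top_pages` helper, and the sample rows are all hypothetical and made up for illustration. A real Hive or Impala query would run against a cluster through its own client, though the `GROUP BY` aggregation pattern looks much the same.

```python
import sqlite3


def top_pages(rows, limit=2):
    """Return (page, total_views) pairs, highest total first.

    `rows` is an iterable of (page, views) tuples. The table and
    column names are hypothetical, chosen only for this example.
    """
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
    try:
        conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        query = """
            SELECT page, SUM(views) AS total_views
            FROM page_views
            GROUP BY page
            ORDER BY total_views DESC, page ASC
            LIMIT ?
        """
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


# Hypothetical sample data: (page, views) pairs.
sample_rows = [("home", 10), ("about", 3), ("home", 5), ("pricing", 7)]

if __name__ == "__main__":
    print(top_pages(sample_rows))  # [('home', 15), ('pricing', 7)]
```

SQLite is used here only because it ships with Python and needs no cluster. On large data sets, the same query shape would typically be submitted to an engine such as Hive or Impala instead.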