A Blog that Shows Ivan's Portfolio

Twitter Data Ingesting

Apache NiFi (Hortonworks Data Flow)

This project ingests tweet data through the Twitter API. The raw data is then routed by language into Spanish and non-Spanish tweets.

PDF Link

Data Warehousing using Talend

Talend Open Studio - Data Integration

This project splits the data into multiple *.xlsx files based on certain criteria so that a data scientist can process them further.

PDF Link

Exploring Movie Dataset

Apache Pig/Tez - Hortonworks Data Platform

This program aims to explore information from the Movie Dataset using Apache Pig / Tez found on the Hortonworks Data Platform.

GitHub Link

Dashboarding COVID-19 Data of United States

Google Data Studio & Google BigQuery

This dashboard surfaces insights from COVID-19 data for the United States. It displays deaths per capita, new cases on linear and logarithmic scales, new deaths alongside cumulative cases, and case growth with a 7-day moving average. A table also breaks the figures down by state.
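The 7-day moving average used for the case-growth chart can be sketched in a few lines. This is a minimal illustration in plain Python, not the dashboard's actual BigQuery or Data Studio logic; the function name and the sample numbers are made up for the example.

```python
def moving_average(values, window=7):
    """Trailing moving average; days without a full window yield None."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the trailing window.
            running_sum -= values[i - window]
        averages.append(running_sum / window if i >= window - 1 else None)
    return averages


# Hypothetical daily new-case counts, for illustration only.
daily_new_cases = [10, 12, 9, 15, 20, 18, 14, 21, 25]
smoothed_cases = moving_average(daily_new_cases)
```

A trailing window is used here so that each day's value depends only on past data, which is the usual choice for a live dashboard.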

Google Data Studio Link

ETL Process using PySpark

PySpark (Apache Spark)

This code is an example of an Extract, Transform, and Load (ETL) process using PySpark.

GitHub Link

Using Python (Jupyter Notebook) as Kafka Producer

Apache Kafka and Python

This article uses Python scripts to generate a fake pizza-based dataset and push it to a Kafka topic through a Kafka producer.
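The producer side can be sketched as follows. This is a hedged illustration using only the standard library: the menu, the field names, and the `send` callback are assumptions made for the example. In a real setup, `send` would wrap the producer call of a Kafka client library for a chosen topic.

```python
import json
import random

# Hypothetical menu, for illustration only.
PIZZA_NAMES = ["Margherita", "Pepperoni", "Hawaiian"]


def make_order(order_id, rng):
    """Build one fake pizza order as a plain dictionary."""
    return {
        "order_id": order_id,
        "pizza": rng.choice(PIZZA_NAMES),
        "quantity": rng.randint(1, 5),
    }


def serialize(order):
    """Encode an order as UTF-8 JSON bytes, a common Kafka message format."""
    return json.dumps(order).encode("utf-8")


def produce(count, send, seed=0):
    """Generate `count` orders and hand each serialized message to `send`."""
    rng = random.Random(seed)
    for order_id in range(count):
        send(serialize(make_order(order_id, rng)))


# Stand-in for a real producer: collect the messages in a list.
sent_messages = []
produce(3, sent_messages.append)
```

Passing the send function in as an argument keeps the data generation testable without a running broker.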

Medium Link

Sinking Kafka Topic to MongoDB using Pymongo

Apache Kafka, MongoDB, and Python

One way to store non-relational data, such as tweets from Twitter, is to use a non-relational database such as MongoDB or Cassandra. This article demonstrates how data received from Apache Kafka is saved to MongoDB with the help of the pymongo package.

Medium Link

Clinical Features Model to Predict Stroke

Logistic Regression, Gradient Boosting Classifier, XGBClassifier

This program is used to find the right model for the stroke healthcare dataset.

Kaggle Link

Telecommunication Classification Model

Logistic Regression, Random-Forest Classifier, XGBClassifier

This program is used to find the right model for the Telecommunication dataset.

Kaggle Link

Bankruptcy Predictive Model (Plus Capping Outliers)

Logistic Regression, Random-Forest Classifier, XGBClassifier

This program is used to find the right model for the bankruptcy dataset.

Kaggle Link

Loan Risk Prediction

Logistic Regression, Random-Forest Classification, XGBoost

This project aims to determine whether an applicant can be given a loan or not. The program first checks the dataset for null values, replaces them with new values, and checks for duplicate data. Label encoding is then applied to improve prediction performance. Of the models tried (Logistic Regression, Random Forest Classification, XGBoost), XGBoost was found to be the best, with an accuracy of up to 80%.
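The null-replacement and label-encoding steps described above can be sketched in plain Python. This is an illustration of the technique only, not the notebook's code; the function names and the sample column are hypothetical.

```python
def fill_missing(values, fill_value):
    """Replace None entries with a chosen fill value."""
    return [fill_value if value is None else value for value in values]


def label_encode(values):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: index for index, label in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


# Hypothetical categorical column with one missing entry.
married_column = ["Yes", None, "No", "Yes"]
cleaned_column = fill_missing(married_column, "No")
encoded_column, label_mapping = label_encode(cleaned_column)
```

Sorting the labels before numbering them makes the encoding deterministic across runs.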

Kaggle Link Medium

Boston House Price (Using CRISP-DM Frameworks)

Regression Methods

This program predicts house prices from several parameters using regression. It begins with data cleansing. The cleaned data is then examined for correlations between parameters using the Pearson method. After determining which parameters to predict, models are trained using linear regression, Lasso, ElasticNet, K-Neighbors Regressor, and Gradient Boosting Regressor. The models are then validated to assess their accuracy and the resulting predictions.
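The Pearson correlation step can be written out directly from its definition. This is a minimal sketch in plain Python, assuming two equal-length numeric columns; the function name is chosen for the example, and the notebook itself may rely on a library routine instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)
```

Values near 1 or -1 indicate a strong linear relationship between two parameters, while values near 0 indicate little linear relationship.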

Kaggle Link

Boston House Price

Regression Methods

This program predicts house prices from several parameters using regression. It begins with data cleansing. The cleaned data is then examined for correlations between parameters using the Pearson method. After determining which parameters to predict, models are trained using linear regression, Lasso, ElasticNet, K-Neighbors Regressor, and Gradient Boosting Regressor. The models are then validated to assess their accuracy and the resulting predictions.

Kaggle Link

Digit Image Recognition

Convolutional Neural Network

This program performs image recognition using a convolutional neural network. The network detects edges using mathematical operations such as 2D convolution and max pooling, stacked in a sequential model. Training is optimized using the Poisson function as the loss function. The sequential model and the loss function are then compiled and fitted to the dataset. Finally, the model is evaluated by calculating its accuracy.
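To make the two operations named above concrete, here is a small sketch of 2D convolution and max pooling on nested lists. It is an illustration only: the tiny image and the hand-written edge kernel are assumptions, and a real network learns its kernels during training. As in most deep learning libraries, the kernel is applied without flipping (strictly speaking, a cross-correlation).

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D convolution of a 2D list with a 2D kernel."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kernel_h)
                for dj in range(kernel_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [
        [
            max(
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# A 5x5 image with a vertical edge between the second and third columns.
edge_image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Horizontal difference kernel: responds where brightness increases to the right.
edge_kernel = [[-1, 1]]
edge_response = conv2d(edge_image, edge_kernel)
pooled_response = max_pool2d(edge_response)
```

The convolution output is nonzero only at the column where the edge sits, and pooling keeps that response while shrinking the feature map.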

Kaggle Link

Comparative Study of Deep Learning Methods in Detection Face Mask Utilization

ResNet50, MobileNetV2, and Xception Model

This study detects whether a face in an image is wearing a mask, using the neural network architectures ResNet50V2, MobileNetV2, and Xception, all of which are pre-trained models. To adapt each model to this specific purpose, it goes through a process called fine-tuning, in which the model is trained to recognize whether or not a face is masked. The results show that ResNet50V2 and Xception achieve better validation accuracy than MobileNetV2. Even so, there is a trade-off: MobileNetV2 trains faster than the other two.

Preprint Link

Sentiment Tweet Analysis of Jouska

Natural Language Processing

This program analyzes the sentiment of conversations on the Twitter timeline during the investment scandal involving Jouska. The results show that the timeline treats this issue as a negative one.

GitHub Link

Sentiment Tweet Analysis of Health Minister of Indonesia in July 2020

Natural Language Processing

This program analyzes the sentiment of conversations on the Twitter timeline about the performance of the Minister of Health in July 2020. From the figure, it can be seen that sentiment regarding Minister Terawan is positive.

GitHub Link

Sanbercode Final Project: Wage Prediction

XGBoost

This program predicts a person's salary based on parameters such as working class, age, education, and occupation. It includes data wrangling, modeling, and validation processes. The model ranked 15th out of 420 participants in the competition.

GitHub Link

Image Recognition: Rock Paper Scissors

Convolutional Neural Networks

This program was created to complete a course at Dicoding Indonesia. It classifies hand symbols with a model trained using 1314 training samples and 874 test samples. The data is fed into a sequential model made up of neural network layers, which apply mathematical operations that detect edges. The resulting accuracy of this model is 97.54%.

GitHub Link

Text Classification on COVID-19 Tweets

Natural Language Processing (NLP)

This program classifies tweets into five classes: Extremely Negative, Negative, Neutral, Positive, and Extremely Positive. This code is useful for understanding the public's perception of a particular event.

Kaggle Link

The Framework Process of Data Science: Cross-industry Standard Process for Data Mining (CRISP-DM)

CRISP-DM

Data scientists commonly use two frameworks to gather information and build models from raw data: the Cross-industry Standard Process for Data Mining (CRISP-DM) and the Obtain, Scrub, Explore, Model, and Interpret (OSEMN) framework. I will explain CRISP-DM and apply it directly in a program that has been created.

Medium Link

Understanding Naive-Bayes Classifier

Naive-Bayes Classifier

The Naive Bayes classifier is built on probability theory. For example, suppose we have a data frame of texts, each labeled by whether or not it talks about sports.
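The sports example can be turned into a tiny working classifier. This is a minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, written in plain Python; the training sentences, label names, and function names are made up for illustration and are not taken from the article.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        # Log prior: fraction of training documents in this class.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # Words never seen in training carry no evidence.
            # Add-one smoothing avoids a zero probability for unseen pairs.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, for illustration only.
TRAINING_DATA = [
    ("great match today", "sports"),
    ("team wins the match", "sports"),
    ("election results today", "not_sports"),
    ("new phone released", "not_sports"),
]
model = train(TRAINING_DATA)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.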

Medium Link

Ridge-Regression in a Nutshell

Ridge-Regression

Ridge Regression is a regularized form of linear regression. The cost function, commonly known as the loss function, is the function used to find the parameters of the regression. The parameter θ is obtained by minimizing this function. This forms the statistical basis of the Ridge Regression method.
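For a single feature without an intercept, the minimization has a closed form that is easy to check by hand. Assuming the cost J(θ) = Σ(y − θx)² + λθ², setting its derivative to zero gives θ = Σxy / (Σx² + λ). The sketch below implements exactly that simplified case; the function names and the one-feature setup are assumptions for illustration, not the article's code.

```python
def ridge_cost(theta, xs, ys, lam):
    """Squared-error cost plus the L2 penalty lam * theta**2."""
    squared_error = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return squared_error + lam * theta ** 2


def ridge_fit_1d(xs, ys, lam):
    """Closed-form minimizer of ridge_cost for one feature, no intercept."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    if sum_xx + lam == 0:
        raise ValueError("the problem is degenerate: no unique minimizer")
    return sum_xy / (sum_xx + lam)
```

With λ = 0 this reduces to ordinary least squares, and a larger λ shrinks θ toward zero, which is the regularization effect.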

Medium Link

Learning Logistic Regression

Logistic Regression

In general, logistic regression classifies by setting a probability threshold. If the predicted probability for a sample exceeds this threshold, the model assigns it to the positive class (classified as 1). If it falls below the threshold, the sample is assigned to the negative class (classified as 0).
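The thresholding rule can be sketched as follows. This is a minimal illustration for a single feature; the weight, bias, and the default threshold of 0.5 are assumptions chosen for the example rather than values from the article.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, weight, bias, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    probability = sigmoid(weight * x + bias)
    return 1 if probability >= threshold else 0
```

Raising the threshold makes the model more conservative about predicting the positive class, which trades recall for precision.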

Wordpress Link

Understanding AdaBoost Classifier

AdaBoost Classifier

AdaBoost is one of the most relied-upon tools for classification and one of the oldest to use the boosting method. Its core idea is to combine many weak classifiers into one very strong classifier.

Wordpress Link

Understand the concept of Linear Regression

Linear Regression

Linear regression is a basic statistical tool for predictive analysis.

Wordpress Link

Uranium Decay

Numerical Methods

This program was created to calculate the decay time of Uranium-235. In addition, it calculates the machine rounding error as well as an appropriate grid spacing (step size) for this simulation.
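A decay simulation of this kind can be sketched with the Euler method. The code below assumes the standard decay law dN/dt = −N/τ and uses arbitrary units for τ and the time step; it is an illustration of the numerical ideas (step size and machine rounding), not the repository's code, and all names are chosen for the example.

```python
import math


def euler_decay(n0, tau, dt, steps):
    """Integrate dN/dt = -N / tau with the explicit Euler method."""
    n = n0
    for _ in range(steps):
        n += -(n / tau) * dt
    return n


def exact_decay(n0, tau, t):
    """Analytic solution N(t) = N0 * exp(-t / tau), used as a reference."""
    return n0 * math.exp(-t / tau)


def machine_epsilon():
    """Smallest power of two eps for which 1.0 + eps still differs from 1.0."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


# Compare a coarse and a fine step over the same total time (t = 1 in units of tau).
coarse_error = abs(euler_decay(100.0, 1.0, 0.1, 10) - exact_decay(100.0, 1.0, 1.0))
fine_error = abs(euler_decay(100.0, 1.0, 0.01, 100) - exact_decay(100.0, 1.0, 1.0))
```

Shrinking the step reduces the truncation error roughly in proportion to the step size, while machine epsilon sets the floor below which further refinement is lost to rounding.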

GitHub Link

Classical Mechanics of Velocity

Numerical Methods

This program calculates the speed of a racing athlete when the power output is known. The general formulation is discretized into a form that is computationally easy to solve, so that the speed is obtained according to the defined constraints and parameters.
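One common way to set up this kind of problem is sketched below. It assumes constant power P, mass m, and no drag, so that dv/dt = P / (m v), which discretizes to v[i+1] = v[i] + P / (m v[i]) Δt. That governing equation, the parameter values, and the function names are assumptions for illustration; the repository may use a different formulation, for example one that includes air resistance.

```python
import math


def euler_velocity(v0, power, mass, dt, steps):
    """Euler update for dv/dt = power / (mass * v), starting from v0 > 0."""
    if v0 <= 0:
        raise ValueError("v0 must be positive to avoid division by zero")
    v = v0
    for _ in range(steps):
        v += power / (mass * v) * dt
    return v


def exact_velocity(v0, power, mass, t):
    """Analytic solution v(t) = sqrt(v0**2 + 2 * power * t / mass) without drag."""
    return math.sqrt(v0 ** 2 + 2.0 * power * t / mass)


# Hypothetical values: 400 W of power, 70 kg of mass, 4 m/s start, 10 s of riding.
numerical_speed = euler_velocity(4.0, 400.0, 70.0, 0.1, 100)
reference_speed = exact_velocity(4.0, 400.0, 70.0, 10.0)
```

Comparing the numerical result with the analytic one is a quick check that the chosen time step is small enough.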

GitHub Link

Pendulum Simulation

Comparative Study of Euler, Euler-Cromer, and Verlet Methods

This program compares several numerical methods (Euler, Euler-Cromer, and Verlet) for calculating pendulum motion, with attention to the accuracy of each.
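The difference between the first two methods shows up clearly in the energy. The sketch below integrates a frictionless pendulum with Euler and Euler-Cromer updates and compares the final energy per unit mass; Verlet is left out to keep the example short. The parameter values and function names are assumptions for illustration, not the repository's code.

```python
import math


def simulate(method, theta0=0.2, omega0=0.0, g=9.8, length=1.0, dt=0.01, steps=2000):
    """Integrate a frictionless pendulum and return the final (theta, omega)."""
    theta, omega = theta0, omega0
    for _ in range(steps):
        acceleration = -(g / length) * math.sin(theta)
        if method == "euler":
            # Both updates use the values from the start of the step.
            theta, omega = theta + omega * dt, omega + acceleration * dt
        elif method == "euler-cromer":
            # Update omega first, then use the new omega to update theta.
            omega = omega + acceleration * dt
            theta = theta + omega * dt
        else:
            raise ValueError("unknown method: " + method)
    return theta, omega


def energy(theta, omega, g=9.8, length=1.0):
    """Mechanical energy per unit mass: kinetic plus potential."""
    return 0.5 * (length * omega) ** 2 + g * length * (1.0 - math.cos(theta))


initial_energy = energy(0.2, 0.0)
euler_energy = energy(*simulate("euler"))
cromer_energy = energy(*simulate("euler-cromer"))
```

With these settings the plain Euler energy grows steadily over time, while the Euler-Cromer energy stays close to its initial value, which is why the latter is preferred for oscillatory problems.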

GitHub Link

Electromagnetic Propagation in 1D and 3D Domain

Numerical Methods

This program is designed to simulate the Governing Equation for E and H (1 Dimension) and calculate Hx, Hy, Dz, and Ez (3 Dimension).

GitHub Link

Monte-Carlo and Random Walk

Numerical Methods

This program is designed to solve various problems using Monte-Carlo and Random Walks. One of the problems solved using Monte Carlo is calculating the value of Pi. Meanwhile, the problem that is solved using the Random Walk is finding a solution to an equation.
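The Pi calculation is the classic Monte Carlo example: sample random points in the unit square and count how many fall inside the quarter circle of radius 1. The sketch below is a minimal version in plain Python; the function name, sample count, and fixed seed are assumptions for illustration.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi / 4 of the unit square.
    return 4.0 * inside / samples


pi_estimate = estimate_pi(100_000)
```

The statistical error shrinks in proportion to one over the square root of the number of samples, so each extra digit of accuracy costs about a hundred times more points.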

GitHub Link

Quantum Mechanics - Lennard Jones Potentials

Numerical Methods

This program calculates the motion of electrons in a 1-dimensional NaCl crystal with the help of the Lennard-Jones potential, using Matlab.

GitHub Link

Band-structure Simulation of Si, GaP, GaN, and TiO2 Rutile Using Julia Programming

Density Functional Theory

This program calculates the band structure of various materials using the Julia programming language. The results are then compared with those of similar programs, namely ABINIT and VASP.

Link

Ivan M. Siegfried Data Scientist


Data Engineering Projects

Twitter Data Ingesting

Data Warehousing using Talend

Exploring Movie Dataset

Dashboarding COVID-19 Data of United States

ETL Process using PySpark

Using Python (Jupyter Notebook) as Kafka Producer

Sinking Kafka Topic to MongoDB using Pymongo

Data Science and AI Portfolio

Clinical Features Model to Predict Stroke

Telecommunication Classification Model

Bankruptcy Predictive Model (Plus Capping Outliers)

Loan Risk Prediction

Boston House Price (Using CRISP-DM Frameworks)

Boston House Price

Digit Image Recognition

Comparative Study of Deep Learning Methods in Detection Face Mask Utilization

Sentiment Tweet Analysis of Jouska

Sentiment Tweet Analysis of Health Minister of Indonesia in July 2020

Sanbercode Final Project: Wage Prediction

Image Recognition: Rock Paper Scissors

Text Classification on COVID-19 Tweets

The Framework Process of Data Science: Cross-industry Standard Process for Data Mining (CRISP-DM)

Understanding Naive-Bayes Classifier

Ridge-Regression in a Nutshell

Learning Logistic Regression

Understanding AdaBoost Classifier

Understand the concept of Linear Regression

Computational Physics Portfolio

Uranium Decay

Classical Mechanics of Velocity

Pendulum Simulation

Electromagnetic Propagation in 1D and 3D Domain

Monte-Carlo and Random Walk

Quantum Mechanics - Lennard Jones Potentials

Band-structure Simulation of Si, GaP, GaN, and TiO2 Rutile Using Julia Programming