Ivan M. Siegfried Data Scientist

About Me

Hi! My name is Ivan Muhammad Siegfried. I received my Bachelor's degree in Physics from Padjadjaran University in 2016 and my Master's degree in Physics from Institut Teknologi Bandung in 2019. I have a strong interest in Computational Physics, especially Density Functional Theory, Computational Fluid Dynamics, Heat Transfer, and Instrumentation. I now focus on Data Science, Machine Learning, and Computer Vision, and I currently work in the private sector, developing applied technologies that can improve human life.

Codes

Python Matlab Julia VB.NET

These are the programming languages and tools that I have mastered. I often use Python as my main programming language for Data Science, Machine Learning, Computer Vision, and so on. I use the other languages to solve Computational Physics problems.

Other Software, Tools, and Platforms

Google Data Studio    Google BigQuery    
phpMyAdmin     VB.NET     HDP     HDF

These are tools that I have mastered.


Data Engineering Projects

Twitter Data Ingesting

  • Apache NiFi (Hortonworks Data Flow)

This project ingests tweet data using the Twitter API. The raw data is then sorted by language into Spanish and non-Spanish tweets.

PDF Link

Data Warehousing using Talend

  • Talend Open Studio - Data Integration

This project splits the data into multiple *.xlsx files based on certain criteria, so that a data scientist can process them further.

PDF Link

Exploring Movie Dataset

  • Apache Pig/Tez - Hortonworks Data Platform

This program aims to explore information from the Movie Dataset using Apache Pig / Tez found on the Hortonworks Data Platform.

GitHub Link

Dashboarding COVID-19 Data of United States

  • Google Data Studio & Google BigQuery

This dashboard aims to surface insights from COVID-19 data for the United States. It displays deaths per capita, new cases on linear and logarithmic scales, new deaths alongside cumulative cases, and case growth with a 7-day moving average. It also provides a table broken down by state.
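The 7-day moving average used for the case trend can be sketched in plain Python. This is an illustration only: the dashboard itself is built in Google Data Studio and BigQuery, and the function name and the sample numbers below are made up.

```python
def moving_average(values, window=7):
    """Trailing moving average.

    For the first (window - 1) entries, the average is taken over
    however many values are available so far.
    """
    averaged = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Hypothetical daily new-case counts (not real data).
daily_new_cases = [10, 12, 9, 15, 20, 18, 14, 30, 28]
smoothed = moving_average(daily_new_cases)
print(smoothed)
```

A trailing window is used so that each day's value depends only on that day and the six days before it, which smooths out weekday reporting effects.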

Google Data Studio Link

ETL Process using PySpark

  • PySpark (Apache Spark)

This code is an example of an Extract, Transform, and Load (ETL) process using PySpark.

GitHub Link

Using Python (Jupyter Notebook) as Kafka Producer

  • Apache Kafka and Python

This project uses Python scripts as a Kafka Producer: data from a fake pizza-based dataset is produced and then pushed into a Kafka topic.

Medium Link

Sinking Kafka Topic to MongoDB using Pymongo

  • Apache Kafka, MongoDB, and Python

One way to store non-relational data, such as tweets from Twitter, is to use a non-relational database such as MongoDB or Cassandra. In this article, I demonstrate how data received from Apache Kafka is saved to MongoDB with the help of the pymongo package.

Medium Link

Data Science and AI Portfolio

Clinical Features Model to Predict Stroke

  • Logistic Regression, Gradient Boosting Classifier, XGBClassifier

This program is used to find the right model for the stroke healthcare dataset.

Kaggle Link

Telecommunication Classification Model

  • Logistic Regression, Random-Forest Classifier, XGBClassifier

This program is used to find the right model for the Telecommunication dataset.

Kaggle Link

Bankruptcy Predictive Model (Plus Capping Outliers)

  • Logistic Regression, Random-Forest Classifier, XGBClassifier

This program is used to find the right model for the bankruptcy dataset.

Kaggle Link

Loan Risk Prediction

  • Logistic Regression, Random-Forest Classification, XGBoost

This project aims to determine whether someone can be given a loan. The program begins by checking the dataset for null values, replacing them with new values, and checking for duplicate data. Label encoding is then applied to improve prediction performance. Of the models tried (Logistic Regression, Random Forest Classification, XGBoost), XGBoost was found to be the best, with an accuracy of up to 80%.

Kaggle Link Medium

Boston House Price (Using CRISP-DM Frameworks)

  • Regression Methods

This program aims to predict house prices from several parameters using regression methods. It begins with data cleansing. After cleaning, the correlation between parameters is examined using the Pearson method. After determining which parameters will be predicted, the models are trained using linear regression, Lasso, ElasticNet, K-Neighbors Regressor, and Gradient Boosting Regressor. The models are then validated to assess their accuracy and the resulting predictions.
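The Pearson correlation step can be sketched as follows. This is a minimal plain-Python illustration, not the code from the linked notebook; the feature names and numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical values: average rooms per house and house price.
rooms = [4.0, 5.0, 6.0, 6.5, 7.0, 8.0]
price = [15.0, 18.0, 22.0, 24.0, 30.0, 38.0]
print(round(pearson(rooms, price), 3))
```

A coefficient close to +1 or -1 suggests a strong linear relationship with the target, which is one way to decide which parameters are worth feeding to the regression models.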

Kaggle Link

Boston House Price

  • Regression Methods

This program aims to predict house prices from several parameters using regression methods. It begins with data cleansing. After cleaning, the correlation between parameters is examined using the Pearson method. After determining which parameters will be predicted, the models are trained using linear regression, Lasso, ElasticNet, K-Neighbors Regressor, and Gradient Boosting Regressor. The models are then validated to assess their accuracy and the resulting predictions.

Kaggle Link

Digit Image Recognition

  • Convolutional Neural Network

This program performs image recognition using a convolutional neural network. Edges in the image are detected by a set of mathematical operations, such as 2D convolution and max pooling, collected in a sequential model. The model is trained with the Poisson function as the loss function. The sequential model and loss function are then compiled and fitted to the data set. Finally, the model is evaluated by calculating its accuracy.

Kaggle Link

Comparative Study of Deep Learning Methods in Detection Face Mask Utilization

  • ResNet50, MobileNetV2, and Xception Model

This study detects whether a face in an image is wearing a mask, using the ResNet50V2, MobileNetV2, and Xception neural network architectures, all of which are pre-trained models. To adapt each model to this specific purpose, it is fine-tuned: the model is trained to recognize whether or not a face image shows a mask. The results show that ResNet50V2 and Xception achieve better validation accuracy than MobileNetV2. Even so, there is a trade-off: MobileNetV2 trains faster than the other two.

Preprint Link

Sentiment Tweet Analysis of Jouska

  • Natural Language Processing

This program analyzes the sentiment of conversations on the Twitter timeline during the investment scandal involving Jouska. The results show that the timeline views this issue negatively.

GitHub Link

Sentiment Tweet Analysis of Health Minister of Indonesia in July 2020

  • Natural Language Processing

This program analyzes the sentiment of conversations on the Twitter timeline about the performance of the Minister of Health in July 2020. The figure above shows that sentiment regarding Minister Terawan is positive.

GitHub Link

Sanbercode Final Project: Wage Prediction

  • XGBoost

This program predicts a person's salary based on parameters such as working class, age, education, occupation, etc. It includes data wrangling, modeling, and validation processes. The model placed 15th out of the 420 people who participated in this competition.

GitHub Link

Image Recognition: Rock Paper Scissor

  • Convolutional Neural Networks

This program was created to complete a course at Dicoding Indonesia. It classifies hand symbols with a model trained on 1314 training samples and evaluated on 874 test samples. The data is fed into a sequential model consisting of layers of an artificial neural network, whose mathematical operations are used to detect edges. The resulting accuracy of this model is 97.54%.

GitHub Link

Text Classification on COVID-19 Tweets

  • Natural Language Processing (NLP)

This program is used to classify tweets into 5 classifications: Extremely Negative, Negative, Neutral, Positive, and Extremely Positive. This code is useful for understanding the public's perception of a particular event.

Kaggle Link

The Framework Process of Data Science: Cross-industry Standard Process for Data Mining (CRISP-DM)

  • CRISP-DM

In general, two frameworks are commonly used by data scientists to gather information and create models from raw data: the Cross-industry Standard Process for Data Mining (CRISP-DM) and the Obtain, Scrub, Explore, Model, and Interpret (OSEMN) framework. I will explain CRISP-DM and apply it directly in a program that I have created.

Medium Link

Understanding Naive-Bayes Classifier

  • Naive-Bayes Classifier

The Naïve Bayes Classifier uses probability as its statistical basis. For example, suppose we have a data frame of texts, each labeled by whether or not it is about sports.
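The sports example can be sketched with a small multinomial Naive Bayes classifier in plain Python. This is a minimal sketch under stated assumptions, not the code from the linked article: the five training sentences are invented, and Laplace (add-one) smoothing is an assumed choice for handling unseen words.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (text, label). Returns (label_counts, word_counts, vocab)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return label_counts, word_counts, vocab


def predict(text, label_counts, word_counts, vocab):
    """Return the label with the highest log-posterior score."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training data for illustration.
docs = [
    ("a great game", "sports"),
    ("the election was over", "not sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("it was a close election", "not sports"),
]
model = train(docs)
print(predict("a very close game", *model))
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word log-probabilities are simply added.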

Medium Link

Ridge-Regression in the Nutshell

  • Ridge-Regression

Ridge Regression is a regularized form of linear regression. The cost function, commonly known as the loss function, is the function used to find the regression parameters: the parameter θ is obtained by minimizing it. This is the statistical basis of the Ridge Regression method.
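For a single feature with no intercept, minimizing the ridge cost J(θ) = Σ(y − θx)² + λθ² gives the closed form θ = Σxy / (Σx² + λ). The sketch below assumes this simplified one-parameter setting; the function name, the penalty symbol λ (`lam`), and the data are illustrative and not taken from the linked article.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge estimate of theta for y ≈ theta * x (no intercept).

    Minimizes sum((y_i - theta * x_i)^2) + lam * theta^2.
    """
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    return sum_xy / (sum_xx + lam)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

# lam = 0 reduces to ordinary least squares; larger lam shrinks theta toward 0.
for lam in (0.0, 1.0, 14.0):
    print(lam, ridge_slope(x, y, lam))
```

The penalty term is what distinguishes ridge from plain linear regression: it trades a little bias for smaller, more stable coefficients.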

Medium Link

Learning Logistic Regression

  • Logistic Regression

In general, logistic regression classifies by creating a probability boundary. If a value exceeds this probability, the model assigns it to the positive class (classified as 1). If a value is less than this probability, it is assigned to the negative class (classified as 0).
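The probability boundary can be sketched as a sigmoid followed by a threshold. The weight, bias, and the 0.5 threshold below are assumed values for illustration; in practice the weight and bias are learned from data.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(x, weight, bias, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    probability = sigmoid(weight * x + bias)
    return 1 if probability >= threshold else 0


# Hypothetical model parameters.
weight, bias = 1.0, 0.0
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, round(sigmoid(weight * x + bias), 3), classify(x, weight, bias))
```

Raising the threshold makes the model more conservative about predicting the positive class, which can be useful when false positives are costly.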

Wordpress Link

Understanding AdaBoost Classifier

  • AdaBoost Classifier

AdaBoost is one of the tools relied upon for classification, and one of the oldest tools that use the boosting method. The idea of the algorithm is to combine many weak classifiers into one very strong classifier.
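The combining step can be sketched with a small AdaBoost implementation in plain Python. This is a minimal sketch, not the code from the linked post: it assumes one-dimensional inputs, labels in {+1, -1}, and decision stumps (single-threshold rules) as the simple classifiers; the toy data is made up.

```python
import math


def stump_predict(x, threshold, polarity):
    """A decision stump: predicts `polarity` above the threshold, else the opposite."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Pick the stump with the lowest weighted error on the training set."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train an ensemble of (alpha, threshold, polarity) stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

Each stump alone misclassifies at least three of the ten points, yet the weighted vote of three stumps fits this toy set, which is the boosting idea in miniature.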

Wordpress Link

Understand the concept of Linear Regression

  • Linear Regression

Linear regression is a basic statistical tool for predictive analysis.

Wordpress Link

Computational Physics Portfolio

Uranium Decay

  • Numerical Methods

This program was created to calculate the decay time of Uranium-235. In addition, it calculates the machine rounding error (machine epsilon) as well as an appropriate grid width (step size) for the simulation.
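A decay calculation of this kind can be sketched with the Euler method applied to dN/dt = −N/τ. This is an assumption about the approach, not the code from the linked repository; the time constant, step size, and initial count are in arbitrary illustrative units rather than real Uranium-235 values.

```python
import math
import sys


def euler_decay(n0, tau, dt, steps):
    """Integrate dN/dt = -N / tau with the explicit Euler method."""
    n = n0
    history = [n]
    for _ in range(steps):
        n -= (n / tau) * dt
        history.append(n)
    return history


# Illustrative units: tau = 1, integrate up to t = 1.
n0, tau = 1000.0, 1.0
exact = n0 * math.exp(-1.0 / tau)

for dt in (0.1, 0.01, 0.001):
    steps = int(round(1.0 / dt))
    approx = euler_decay(n0, tau, dt, steps)[-1]
    print(dt, approx, abs(approx - exact))

# Machine epsilon: the gap between 1.0 and the next representable float.
print(sys.float_info.epsilon)
```

Shrinking the step size reduces the truncation error, while machine epsilon marks the point below which rounding error dominates, so there is little benefit in making the step arbitrarily small.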

GitHub Link

Classical Mechanics of Velocity

  • Numerical Methods

This program calculates the speed of a racing athlete when the power value is known. The general formulation is discretized into a form that is computationally easy to solve, so that the speed is obtained under the defined constraints and parameters.
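The discretization can be sketched as follows. The governing equation is not stated here, so this sketch assumes a constant-power, drag-free model, dv/dt = P / (m v), stepped with the Euler method; the power, mass, and initial speed are hypothetical values.

```python
import math


def speed_euler(power, mass, v0, dt, t_end):
    """Euler steps for dv/dt = power / (mass * v), an assumed drag-free model."""
    v = v0
    for _ in range(int(round(t_end / dt))):
        v += power / (mass * v) * dt
    return v


def speed_exact(power, mass, v0, t):
    """Analytic solution of the same model: v^2 = v0^2 + 2 * power * t / mass."""
    return math.sqrt(v0 ** 2 + 2.0 * power * t / mass)


# Hypothetical parameters: 400 W, 70 kg, starting at 4 m/s, over 200 s.
v_num = speed_euler(400.0, 70.0, 4.0, 0.1, 200.0)
v_ref = speed_exact(400.0, 70.0, 4.0, 200.0)
print(v_num, v_ref)
```

Without drag the speed grows without bound, which is why a realistic version would add a resistance term as one of the constraints.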

GitHub Link

Pendulum Simulation

  • Comparative Study of Euler, Euler-Cromer, and Verlet Methods

This program compares several numerical methods (Euler, Euler-Cromer, and Verlet) for calculating pendulum motion, taking into account the accuracy of each.
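The three methods can be compared on an undamped pendulum, θ'' = −(g/L) sin θ, by checking how well each conserves energy. This is a minimal sketch, not the code from the linked repository: the value of g/L, the step size, and the initial angle are assumed, and the Verlet variant shown is velocity Verlet.

```python
import math

G_OVER_L = 9.8  # g / L for an assumed 1 m pendulum


def accel(theta):
    return -G_OVER_L * math.sin(theta)


def energy(theta, omega):
    """Energy per unit (mass * length^2)."""
    return 0.5 * omega ** 2 + G_OVER_L * (1.0 - math.cos(theta))


def euler(theta, omega, dt, steps):
    for _ in range(steps):
        # Both updates use the values from the start of the step.
        theta, omega = theta + omega * dt, omega + accel(theta) * dt
    return theta, omega


def euler_cromer(theta, omega, dt, steps):
    for _ in range(steps):
        omega += accel(theta) * dt
        theta += omega * dt  # uses the already-updated omega
    return theta, omega


def verlet(theta, omega, dt, steps):
    a = accel(theta)
    for _ in range(steps):
        theta += omega * dt + 0.5 * a * dt * dt
        a_new = accel(theta)
        omega += 0.5 * (a + a_new) * dt
        a = a_new
    return theta, omega


theta0, omega0, dt, steps = 0.2, 0.0, 0.01, 1000
e0 = energy(theta0, omega0)
for method in (euler, euler_cromer, verlet):
    e_final = energy(*method(theta0, omega0, dt, steps))
    print(method.__name__, e_final / e0)
```

With these settings the plain Euler energy grows steadily, while Euler-Cromer and Verlet keep it close to its initial value, which is the usual reason they are preferred for oscillatory motion.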

GitHub Link

Electromagnetic Propagation in 1D and 3D Domain

  • Numerical Methods

This program is designed to simulate the governing equations for E and H (in one dimension) and to calculate Hx, Hy, Dz, and Ez (in three dimensions).

GitHub Link

Monte-Carlo and Random Walk

  • Numerical Methods

This program solves various problems using Monte Carlo methods and random walks. One problem solved using Monte Carlo is calculating the value of Pi. The random walk, meanwhile, is used to find the solution of an equation.
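The Monte Carlo estimate of Pi can be sketched by sampling random points in the unit square and counting how many fall inside the quarter circle. This is a minimal illustration, not the code from the linked repository; the sample count and the fixed seed are assumed choices.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate Pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of unit square = Pi / 4.
    return 4.0 * inside / samples


for n in (1_000, 10_000, 100_000):
    print(n, estimate_pi(n))
```

The statistical error shrinks roughly as one over the square root of the number of samples, so each extra digit of accuracy costs about a hundred times more points.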

GitHub Link

Quantum Mechanics - Lennard Jones Potentials

  • Numerical Methods

This program calculates the movement of electrons in a 1-dimensional NaCl crystal using the Lennard-Jones potential in Matlab.

GitHub Link

Band-structure Simulation of Si, GaP, GaN, and TiO2 Rutile Using Julia Programming

  • Density Functional Theory

This program calculates the band structure of various materials using the Julia programming language. The results are then compared with those of similar programs, namely Abinit and VASP.

Link