Soumya Ogoti

Software engineer experienced in driving business outcomes through data-driven strategies and current technologies. I combine technical expertise with a consulting background to deliver actionable insights and drive innovation.

Large Language Models (LLMs)

Applied Research Project on Using Large Language Models for MND Assistive Devices

Python Alpaca Langchain Streamlit
DL.

This research introduces a novel communication interface using Large Language Models (LLMs) for Motor Neurone Disease (MND) patients, allowing selection of responses in multiple tones. A lightweight browser-based chat interface facilitates seamless interaction. Evaluation of LLMs considered parameters, training data, and hardware requirements, with prompt engineering for MND-specific criteria. A user study assessed satisfaction and effectiveness, with quantitative evaluation against BERT. Among models evaluated, GPT-3.5 Turbo with memory is identified as superior.
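To illustrate the idea of offering replies in several tones with conversational memory, here is a minimal sketch of how one prompt per tone might be assembled. The tone names, the `build_prompts` helper, and the prompt wording are all hypothetical; the project's actual prompts and LangChain setup are not shown in this description.

```python
# Hypothetical tone set; the tones used in the project are not listed here.
TONES = ("neutral", "friendly", "formal")


def build_prompts(history, incoming, tones=TONES, max_turns=4):
    """Build one prompt per tone, carrying the last few turns as memory."""
    # Keep only the most recent turns so the prompt stays short.
    memory = "\n".join(
        f"{speaker}: {text}" for speaker, text in history[-max_turns:]
    )
    prompts = {}
    for tone in tones:
        prompts[tone] = (
            f"Conversation so far:\n{memory}\n"
            f"Partner: {incoming}\n"
            f"Write a short reply the user could send, in a {tone} tone."
        )
    return prompts


# Toy conversation; each prompt would be sent to the chosen LLM.
history = [("Partner", "How was your day?"), ("User", "Good, thanks.")]
prompts = build_prompts(history, "Do you want tea?")
```

Each resulting prompt would yield one candidate reply, so the user can pick the response whose tone fits best.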

Prediction Analytics

Attrition Forecasting for Merger and Acquisition

Python Excel Pandas Sklearn Matplotlib
EDA and Feature importance.

Identified the factors behind severance acceptance and estimated each employee's probability of accepting using regression models. Objectively categorised employees into groups to offer severances to, without discrimination. Used linear programming to choose the groups for severance, with the objective of minimizing company costs, preventing a mass exodus, and maintaining stable employee proportions.

Data Visualisation: Designing a Deck in Tableau

Exploring Profitable Property Investment in London: A Data-driven Analysis

Python Pandas Tableau
DL.

Identified profitable boroughs in London for property investment and Airbnb rental, using data from three sources: Airbnb listings, historical housing prices, and council tax records. Tableau visualizations were created, including bar charts, heatmaps, time series plots, and lollipop charts, to explore various aspects of the data. Key insights revealed the City of London, Westminster, and Greenwich as the most profitable for investment.

Data Visualisation: Designing Impactful Charts

Exploring Pay Equity in an Organization: A Visual Analysis

Python Pandas Matplotlib
DL.

Examined pay equity within an organization using self-designed custom lollipop charts. Design choices, including color-coding and scaled tips, made it possible to visualize multiple interacting parameters in a single chart, such as salary comparisons based on gender, race, and ethnicity. The visualization offered insights into unbiased pay practices across job positions, supported by regression analysis confirming equitable salary distributions.

Database Design, Descriptive Statistics, and Insights using PostgreSQL

Bug Data Analysis for the Mozilla Project

Python Matplotlib PostgreSQL Psycopg2 SciPy
Database ERD

Using PostgreSQL and Python, bug data was structured into a database with tables for bug reports, users, change history, customer fields, flags, and comments. Descriptive statistics were computed from queries run through psycopg2, covering bug metrics such as distribution by severity and priority, resolution time, user engagement, and bug dependencies. The findings were used to provide recommendations for bug management and customer support strategies.
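As an illustration of the kind of descriptive query involved, the sketch below counts bugs and averages resolution time per severity. It uses Python's built-in `sqlite3` as a stand-in so that it runs without a database server; the project itself used PostgreSQL through psycopg2. The `bugs` table, its columns, and the sample rows are hypothetical and do not reflect the actual schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified bug table with toy data.
conn.executescript("""
CREATE TABLE bugs (
    bug_id       INTEGER PRIMARY KEY,
    severity     TEXT NOT NULL,
    opened_day   INTEGER NOT NULL,
    resolved_day INTEGER              -- NULL while the bug is still open
);
INSERT INTO bugs VALUES
    (1, 'critical', 0, 2),
    (2, 'critical', 1, 5),
    (3, 'minor',    0, 10),
    (4, 'minor',    3, NULL);
""")

# Bug count and mean resolution time per severity.
# AVG skips NULLs, so open bugs do not affect the mean.
rows = conn.execute("""
    SELECT severity,
           COUNT(*) AS n_bugs,
           AVG(resolved_day - opened_day) AS mean_days_to_resolve
    FROM bugs
    GROUP BY severity
    ORDER BY severity
""").fetchall()
conn.close()

for severity, n_bugs, mean_days in rows:
    print(severity, n_bugs, mean_days)
```

The query uses only standard SQL aggregates, so a similar statement could be issued through a psycopg2 cursor against a PostgreSQL table with matching columns.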

Deep Learning for Sequence Prediction

Predicting Wind Turbine Operating Modes

Python Pandas Matplotlib Tensorflow
DL.

Predicted wind turbine operating modes from time series sensor data. Sequences of sensor data were analysed using dense, Conv1D, simple RNN, and GRU networks. An alternative approach, in which the data was transformed into images and fed into 2D CNNs following Rahimilarki et al. (2022), was also explored. The best-performing model, derived from Rahimilarki et al. and enhanced with additional CNN layers, fine-tuning, batch normalization, dropout, and learning rate scheduling, achieved the highest accuracy, 87.3% on the test dataset.

Natural Language Processing: Text Classification

Analyzing BeerAdvocate Reviews

Python Pandas Matplotlib Seaborn Transformers
NLP.

Classified user reviews on BeerAdvocate using natural language processing (NLP) techniques. Several domain-specific feature representations, such as TF-IDF, LDA, and Doc2Vec, were analysed in combination with classifiers including Multinomial Naive Bayes, Random Forests, OneVsOne, and SVMs. Deep learning models such as Bidirectional LSTM and BERT with learnt tokenization and embeddings were also analysed. The BERT model with smart padding emerged as the top performer, showcasing its ability to generalize well across diverse domains like beer reviews.
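The description does not define "smart padding". A common meaning, assumed here, is to sort tokenized reviews by length and pad each batch only to its own longest sequence instead of to a global maximum. The sketch below shows that idea in plain Python; the `smart_batches` helper and the toy token ids are hypothetical.

```python
def smart_batches(token_ids, batch_size, pad_id=0):
    """Sort sequences by length, then pad each batch to its own longest one."""
    ordered = sorted(token_ids, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        # The list is sorted, so the last sequence in the batch is the longest.
        width = len(batch[-1])
        batches.append(
            [seq + [pad_id] * (width - len(seq)) for seq in batch]
        )
    return batches


# Toy tokenized reviews of different lengths (ids are made up).
reviews = [
    [101, 7, 8, 9, 10, 102],
    [101, 5, 102],
    [101, 3, 4, 102],
    [101, 102],
]
batches = smart_batches(reviews, batch_size=2)

# Total tokens processed, including padding.
padded_tokens = sum(len(seq) for batch in batches for seq in batch)
fixed_tokens = len(reviews) * max(len(seq) for seq in reviews)
print(padded_tokens, fixed_tokens)
```

With these toy inputs the length-sorted batches contain fewer padded positions than padding every review to the global maximum, which is the usual motivation for the technique.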

Exploratory Data Analysis and Visualisation

Wine Market Competitor Analysis

Python Selenium Pandas Matplotlib Seaborn
Data visualisation.

Explored the wine market through competitor websites, centering on wine attributes (type, origin, vintage, ABV), pricing (75 cL bottle), and reviews (quantity and scores), aiming to uncover popular products and price ranges. Data was collected using BeautifulSoup and Selenium, while exploratory data analysis and visualisation were performed using Matplotlib and Seaborn.

Predictive Analytics for Risk Management

Forecasting Credit Card Default

Python Pandas Matplotlib Seaborn
Bonus assignment.

Predicted credit card default likelihood for a bank's customers and determined key drivers for credit approval decisions. Developed an MVP with logistic regression to establish a baseline. Addressed data complexities using extensive EDA, feature engineering, and class-balanced sampling. Optimized model performance using hyperparameter tuning and Youden's J statistic, and selected the best model based on ROC-AUC.
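Youden's J statistic is defined as J = TPR - FPR, and a typical use is to choose the classification threshold that maximises it. The sketch below computes it directly in plain Python. The labels and predicted probabilities are toy values, not project data, and the helper names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (threshold, TPR, FPR) for each distinct score used as a cut-off."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # A case is predicted positive when its score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points


def youden_threshold(y_true, scores):
    """Pick the cut-off that maximises Youden's J = TPR - FPR."""
    return max(roc_points(y_true, scores), key=lambda p: p[1] - p[2])


# Toy labels (1 = default) and made-up predicted default probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]

threshold, tpr, fpr = youden_threshold(y_true, scores)
print(threshold, tpr, fpr)
```

On imbalanced data such as credit defaults, a threshold chosen this way weighs sensitivity and specificity equally instead of relying on the default 0.5 cut-off.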

Spatio-Econometric Methods and Machine Learning Models

Predictive Analysis of Airbnb Prices in European Cities

Python Pandas Tensorflow Sklearn
AML.

Examined Airbnb prices in popular European cities, using spatio-econometric methods to predict listing prices from various attributes. After data cleaning and initial EDA, multiple machine learning models, including Decision Trees, Random Forest, and XGBoost, were used to predict pricing. Neural networks such as Multi-Layer Perceptrons and Autoencoders were also employed. In a comparative analysis, the XGBoost model with feature selection emerged as the top performer, offering valuable insights for pricing strategies and investment decisions for Airbnb hosts.

Strategic Business Analytics: Data-driven decision making

Consulting Report: Addressing VFS Global's Key Challenges

SBA-consulting.

This consulting report outlines data-driven solutions to tackle VFS Global's key challenges: visa application processing delays, data security, and customer service issues. Through strategic analysis and predictive modeling, solutions were proposed, including optimization of processing times, implementation of anomaly detection for security, and AI-driven chatbots for enhanced customer support. A comprehensive innovation roadmap was provided to guide VFS Global in implementing these solutions.