Data Science Project Ideas

Build real data science projects spanning EDA, machine learning, NLP, deep learning, and deployment to develop end-to-end skills.

Exploratory Data Analysis Dashboard

beginner

Analyze a public dataset (e.g. Titanic or Iris) and build an interactive dashboard to communicate insights.

Requirements

Load and clean data using pandas, handling nulls and duplicates
Generate descriptive statistics for all features
Create at least 6 visualizations (distributions, correlations, box plots) using matplotlib/seaborn
Build an interactive dashboard with Plotly Dash or Streamlit
Document findings in a written summary section

pandas data wrangling · data visualization · EDA methodology · Streamlit/Dash · statistical summaries
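The load-and-clean requirement can be sketched as follows. The toy DataFrame and the clean helper are hypothetical stand-ins for a real public dataset, and median/mode imputation is only one reasonable choice that a project would need to justify.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a public dataset such as Titanic.
raw = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 53.10, np.nan],
    "embarked": ["S", "C", None, "S", "S", "S"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then impute nulls column by column."""
    out = df.drop_duplicates().copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric columns: fill with the median (robust to outliers).
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical columns: fill with the most frequent value.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


cleaned = clean(raw)
# Descriptive statistics for every feature, numeric and categorical.
summary = cleaned.describe(include="all")
print(summary)
```

Duplicates are dropped before imputation so that repeated rows do not skew the median or mode.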

House Price Regression Model

beginner

Predict house sale prices using the Ames Housing dataset with feature engineering and regression models.

Requirements

Perform EDA and handle missing values with justification
Engineer at least 5 new features from existing columns
Train and compare Linear Regression, Ridge, and Lasso models
Evaluate using RMSE and R² on a held-out test set
Plot feature importances (for linear models, the standardized coefficients) and residuals

scikit-learn · feature engineering · regression modeling · model evaluation · data preprocessing
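A minimal sketch of the model comparison step, assuming synthetic data from make_regression in place of the Ames Housing dataset. The alpha values are arbitrary placeholders, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and sale prices.
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # alpha chosen arbitrarily for illustration
    "lasso": Lasso(alpha=0.1),   # alpha chosen arbitrarily for illustration
}

results = {}
for name, model in models.items():
    # Scale inside the pipeline so test data never influences the scaler.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": float(r2_score(y_test, pred)),
    }

for name, scores in results.items():
    print(f"{name}: RMSE={scores['rmse']:.2f}  R2={scores['r2']:.3f}")
```

Both metrics are computed only on the held-out split, which is what makes the three models comparable.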

Customer Churn Classifier

beginner

Build a binary classification model to predict which telecom customers will churn using real-world structured data.

Requirements

Handle class imbalance using SMOTE (applied to the training split only, to avoid leakage) or class weights
Train at least 3 classifiers (Logistic Regression, Random Forest, XGBoost)
Tune hyperparameters with GridSearchCV or RandomizedSearchCV
Report precision, recall, F1, ROC-AUC, and confusion matrix
Generate a SHAP summary plot for model explainability

classification · imbalanced data handling · hyperparameter tuning · SHAP explainability · scikit-learn pipelines
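The class-weight, tuning, and reporting requirements might look like the sketch below. It assumes a synthetic imbalanced dataset instead of real telecom data and covers only Logistic Regression; the parameter grid is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: roughly 15% positives ("churners").
X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # class_weight="balanced" reweights errors inversely to class frequency.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

pred = search.predict(X_test)
proba = search.predict_proba(X_test)[:, 1]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, pred, average="binary")

report = {
    "best_C": search.best_params_["clf__C"],
    "precision": float(precision),
    "recall": float(recall),
    "f1": float(f1),
    "roc_auc": float(roc_auc_score(y_test, proba)),
    "confusion_matrix": confusion_matrix(y_test, pred),
}
print(report)
```

Tuning on F1 instead of accuracy keeps the search from rewarding a model that simply predicts the majority class.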

Movie Recommendation Engine

intermediate

Build a hybrid recommendation system that combines collaborative and content-based filtering, using the MovieLens dataset.

Requirements

Implement user-based and item-based collaborative filtering from scratch
Build a content-based model using TF-IDF on movie metadata
Combine both into a hybrid recommender with a blending weight parameter
Evaluate with RMSE and Precision@K metrics
Expose recommendations via a simple Streamlit UI

collaborative filtering · TF-IDF · recommendation systems · matrix factorization · model evaluation
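One possible shape for the hybrid step, assuming a tiny hand-made ratings matrix and made-up metadata strings in place of MovieLens. The predict function and its alpha parameter are hypothetical names; the sketch covers item-based filtering only, and the blend is a simple weighted sum of the two similarity matrices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical ratings: rows are users, columns are movies, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
# Hypothetical metadata, one string per movie.
metadata = ["space adventure", "space war adventure",
            "romantic comedy", "romantic drama"]


def cosine_sim(m: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = m / norms
    return unit @ unit.T


# Collaborative signal: items are similar if the same users rate them alike.
item_sim_cf = cosine_sim(ratings.T)
# Content signal: items are similar if their TF-IDF vectors overlap.
item_sim_content = cosine_sim(
    TfidfVectorizer().fit_transform(metadata).toarray())


def predict(user: int, item: int, alpha: float = 0.7) -> float:
    """Similarity-weighted average of the user's own ratings.

    alpha is the blending weight: 1.0 is pure collaborative filtering,
    0.0 is pure content-based.
    """
    sim = alpha * item_sim_cf[item] + (1 - alpha) * item_sim_content[item]
    rated = ratings[user] > 0
    rated[item] = False  # never use the target item itself
    weights = sim[rated]
    if weights.sum() == 0:
        return float(ratings[ratings > 0].mean())  # global-mean fallback
    return float(weights @ ratings[user, rated] / weights.sum())


print(round(predict(user=0, item=2), 2))
```

User 0 rated the two space movies highly and the romantic drama poorly, so the predicted rating for the romantic comedy lands near the low end of the scale.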

Twitter Sentiment Analysis Pipeline

intermediate

Build an end-to-end NLP pipeline to classify tweet sentiment and visualize trends over time.

Requirements

Collect or load labeled tweet data and perform text preprocessing (tokenization, stopword removal, lemmatization)
Train a TF-IDF + Logistic Regression baseline and a fine-tuned BERT model
Compare both models on accuracy, F1, and inference speed
Visualize sentiment trends over time with rolling averages
Package the inference step as a reusable Python module

NLP preprocessing · TF-IDF · HuggingFace Transformers · BERT fine-tuning · pipeline design
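The TF-IDF + Logistic Regression baseline could be packaged roughly as below. The six example tweets are invented, lemmatization is left out for brevity, and preprocess and predict_sentiment are hypothetical helper names.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented examples standing in for a labeled tweet dataset.
texts = [
    "I love this phone, the battery is great",
    "What a fantastic match tonight",
    "Absolutely wonderful service, thank you",
    "This update is terrible and slow",
    "Worst customer support ever",
    "I hate waiting in this awful queue",
]
labels = ["positive"] * 3 + ["negative"] * 3


def preprocess(text: str) -> str:
    """Lowercase, strip URLs and @mentions, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


baseline = make_pipeline(
    # Tokenization and stopword removal happen inside the vectorizer.
    TfidfVectorizer(preprocessor=preprocess, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)


def predict_sentiment(tweets: list[str]) -> list[str]:
    """Reusable inference entry point for raw, uncleaned tweets."""
    return baseline.predict(tweets).tolist()


print(predict_sentiment(["great phone", "awful support"]))
```

Keeping preprocessing inside the pipeline means the same cleaning runs at training and inference time, which matters once the module is reused elsewhere.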

Time Series Demand Forecasting

intermediate

Forecast retail product demand using classical and ML-based time series methods on the Rossmann store dataset.

Requirements

Decompose series into trend, seasonality, and residuals
Implement ARIMA and Exponential Smoothing baselines
Engineer lag features, rolling statistics, and calendar features for an XGBoost model
Compare models using MAE and MAPE on a future holdout window
Plot forecast vs actuals with confidence intervals

time series analysis · ARIMA · feature engineering · XGBoost · forecasting evaluation
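The feature engineering and holdout evaluation can be sketched with pandas alone. The series is synthetic (not Rossmann data), make_features is a hypothetical helper, and a seasonal-naive forecast stands in for the models so the metrics have something to score.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with a weekly pattern, standing in for store data.
dates = pd.date_range("2023-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)
weekly = 10 * np.sin(2 * np.pi * dates.dayofweek.to_numpy() / 7)
df = pd.DataFrame({"sales": 100 + weekly + rng.normal(0, 2, len(dates))},
                  index=dates)


def make_features(frame: pd.DataFrame, lags=(1, 7), window=7) -> pd.DataFrame:
    """Add lag, rolling, and calendar features; drop rows lacking history."""
    out = frame.copy()
    for lag in lags:
        out[f"lag_{lag}"] = out["sales"].shift(lag)
    # shift(1) first so the window ends yesterday and never sees the target.
    out[f"roll_mean_{window}"] = out["sales"].shift(1).rolling(window).mean()
    out["dayofweek"] = out.index.dayofweek
    out["month"] = out.index.month
    return out.dropna()


features = make_features(df)
# Future holdout window: the last 14 days, never shuffled.
train, holdout = features.iloc[:-14], features.iloc[-14:]


def mae(y, yhat) -> float:
    return float(np.mean(np.abs(y - yhat)))


def mape(y, yhat) -> float:
    return float(np.mean(np.abs((y - yhat) / y)) * 100)


# Seasonal-naive forecast: predict the value from 7 days earlier.
naive = holdout["lag_7"]
scores = {"mae": mae(holdout["sales"], naive),
          "mape": mape(holdout["sales"], naive)}
print(scores)
```

The split is chronological on purpose: a random split would let the model train on days that come after the ones it is tested on.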

Image Classification with CNNs

intermediate

Train a convolutional neural network to classify images from CIFAR-10, applying transfer learning and data augmentation.

Requirements

Build and train a custom CNN baseline with PyTorch or TensorFlow
Apply data augmentation (flips, crops, color jitter) to reduce overfitting
Fine-tune a pretrained ResNet-18 model on the same dataset
Track training/validation loss and accuracy with TensorBoard
Compare custom CNN vs transfer learning in a results table

CNNs · transfer learning · data augmentation · PyTorch/TensorFlow · model training & evaluation
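To show what the flip and crop augmentations do to an image, here is a framework-free NumPy sketch. It is not the PyTorch or TensorFlow API (a real project would more likely use the framework's built-in transforms), the augment function is a hypothetical helper, and color jitter is omitted.

```python
import numpy as np


def augment(image: np.ndarray, rng: np.random.Generator,
            pad: int = 4) -> np.ndarray:
    """Random horizontal flip, then a random crop from a padded copy.

    image is height x width x channels; the output has the same shape.
    """
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # mirror left-right
    h, w, _ = image.shape
    # Reflect-pad so the crop can shift by up to `pad` pixels in any direction.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]


rng = np.random.default_rng(0)
# Random pixels standing in for one 32x32 RGB CIFAR-10 image.
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
out = augment(img, rng)
print(out.shape, out.dtype)
```

Augmentation of this kind should be applied to training batches only, so that validation accuracy is measured on unmodified images.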

End-to-End ML Model Deployment

advanced

Train a fraud detection model and deploy it as a production-ready REST API with monitoring and CI/CD.

Requirements

Train an XGBoost fraud classifier with a full preprocessing pipeline built on scikit-learn's Pipeline
Serialize the model with MLflow, logging params, metrics, and artifacts
Wrap the model in a FastAPI REST endpoint with input validation via Pydantic
Containerize with Docker and deploy to a cloud platform (Render, Railway, or AWS EC2)
Implement basic data drift detection using Evidently AI and schedule periodic reports

MLflow experiment tracking · FastAPI · Docker · model deployment · data drift monitoring
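To make the idea behind data drift detection concrete, the sketch below hand-rolls a Population Stability Index for a single numeric feature. This is not Evidently AI's API; psi is a hypothetical helper, and the data is synthetic.

```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray,
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of one feature.

    Bin edges come from quantiles of the reference (training) sample,
    so each reference bin holds roughly the same share of rows.
    """
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(
        np.searchsorted(inner_edges, reference, side="right"), minlength=bins)
    cur_counts = np.bincount(
        np.searchsorted(inner_edges, current, side="right"), minlength=bins)
    # Clip to eps so empty bins do not produce log(0).
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(0)
train_amounts = rng.normal(0.0, 1.0, 5000)  # synthetic training feature
live_amounts = train_amounts + 1.0          # simulated shift in production

print(round(psi(train_amounts, train_amounts), 4))
print(round(psi(train_amounts, live_amounts), 4))
```

An often-quoted rule of thumb treats a PSI above roughly 0.2 as worth investigating, though the right threshold depends on the feature and the cost of a missed shift.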

End-to-End Kaggle Competition Pipeline

advanced

Simulate a full competitive data science workflow by building a stacked ensemble for a structured prediction problem.

Requirements

Perform rigorous EDA and document all hypotheses before modeling
Build 5+ diverse base models (LGBM, XGBoost, CatBoost, RF, ElasticNet)
Implement k-fold stacked generalization with out-of-fold predictions
Tune the meta-learner and compare CV score to individual model scores
Generate a reproducible experiment report with all results and code in a Jupyter notebook

ensemble methods · stacking · cross-validation strategy · CatBoost/LightGBM · competitive ML workflow
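Stacked generalization with out-of-fold predictions can be sketched as below. It assumes synthetic regression data and swaps in three scikit-learn models for the gradient-boosting libraries named in the requirements, so it illustrates the mechanics, not a competitive setup.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=300, n_features=8, noise=15.0,
                       random_state=1)

base_models = {
    "ridge": Ridge(alpha=1.0),
    "enet": ElasticNet(alpha=0.1),
    "rf": RandomForestRegressor(n_estimators=50, random_state=1),
}

kf = KFold(n_splits=5, shuffle=True, random_state=1)
# One column of out-of-fold predictions per base model.
oof = np.zeros((len(y), len(base_models)))
for j, model in enumerate(base_models.values()):
    for train_idx, valid_idx in kf.split(X):
        # Each row is predicted by a model that never saw it in training.
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        oof[valid_idx, j] = fold_model.predict(X[valid_idx])


def rmse(y_true, y_pred) -> float:
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


base_rmse = {name: rmse(y, oof[:, j]) for j, name in enumerate(base_models)}
# Score the meta-learner with its own cross-validation over the OOF matrix.
stack_pred = cross_val_predict(LinearRegression(), oof, y, cv=kf)
stack_rmse = rmse(y, stack_pred)

print(base_rmse, stack_rmse)
```

Training the meta-learner on out-of-fold predictions, not on in-sample ones, is what keeps it from simply trusting whichever base model overfits the most.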
