DonorsChoose.org Application Screening


Project Overview

Founded in 2000, DonorsChoose.org lets public school teachers nationwide request essential materials and experiences for their students. The organization receives a very large number of project proposals each year and currently relies on many volunteers to manually review and approve each submission before it can be featured on the DonorsChoose.org website.

The objective is to develop predictive algorithms capable of determining whether a DonorsChoose.org project proposal submitted by a teacher will be approved.

Performance Metric

Submissions are evaluated on the area under the Receiver Operating Characteristic (ROC) curve (AUC), computed between the predicted probability of approval and the observed approval outcome.
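As a rough illustration of what this metric measures, the sketch below computes AUC as the fraction of (approved, not-approved) pairs in which the approved proposal receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions made for this example, not part of the project code; scikit-learn's `roc_auc_score` computes the same quantity far more efficiently.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = approved, 0 = not approved
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.4, 0.35, 0.8, 0.1]
auc = roc_auc(y_true, y_score)
print(round(auc, 4))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC is insensitive to the decision threshold.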

Technologies Used

Project Details

The problem is formulated as binary classification, where '0' denotes a not-accepted and '1' an accepted project proposal. Three approaches are applied: Naive Bayes, Decision Tree, and Gradient Boosting Decision Trees (GBDT).

Approach 1-Naive Bayes

The following operations are performed for this approach:

  • Featurization:
      • Set 1: categorical and numerical features + preprocessed_eassay (BOW)
      • Set 2: categorical and numerical features + preprocessed_eassay (TFIDF)
  • Hyperparameter tuning
  • Training with the best hyperparameters
  • Results

    Set 1

    Set 2
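The Naive Bayes steps above can be sketched roughly as follows, assuming scikit-learn and SciPy. The toy essays, the `grade` and `price` columns, the choice of `MultinomialNB`, and the `alpha` grid are all assumptions for illustration; the project's actual columns, model variant, and search ranges are not shown in this README. For brevity the featurizers are fit on the whole toy set, whereas a real pipeline would fit them on the training split only.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data standing in for the real proposals
essays = [
    "students need books for reading",
    "our class needs science kits",
    "books and reading corner for students",
    "science lab supplies for learning",
    "need tablets",
    "want new chairs",
    "tablets for games",
    "chairs and tables",
]
grade = np.array([["prek_2"], ["3_5"], ["prek_2"], ["6_8"],
                  ["3_5"], ["6_8"], ["3_5"], ["prek_2"]])
price = np.array([[120.0], [300.0], [80.0], [250.0],
                  [900.0], [700.0], [950.0], [650.0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Set 1 featurization: BOW for text, one-hot for categorical, scaled numeric
X_text = CountVectorizer().fit_transform(essays)
X_cat = OneHotEncoder(handle_unknown="ignore").fit_transform(grade)
# MultinomialNB requires non-negative inputs, so scale price into [0, 1]
X_num = csr_matrix(MinMaxScaler().fit_transform(price))
X = hstack([X_text, X_cat, X_num]).tocsr()

# Tune the smoothing parameter alpha, scoring by AUC
search = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=2,
)
search.fit(X, y)  # refits on all data with the best alpha

proba = search.predict_proba(X)[:, 1]  # predicted probability of approval
print(search.best_params_)
```

Swapping `CountVectorizer` for `TfidfVectorizer` gives the Set 2 variant with no other changes.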

Approach 2-Decision Tree

The following operations are performed for this approach:

  • Featurization:
      • Set 1: categorical and numerical features + preprocessed_eassay (BOW)
      • Set 2: categorical and numerical features + preprocessed_eassay (TFIDF)
  • Hyperparameter tuning
  • Training the model with the best hyperparameters
  • Selecting the features with non-zero feature importance
  • Training a machine learning model on the selected features
  • Results

    Set 1

    Set 2
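The feature-selection step in this approach can be sketched as below, assuming scikit-learn. Synthetic data from `make_classification` stands in for the featurized proposals, the parameter grid is hypothetical, and the final model is assumed to be another decision tree, since the README does not say which model is retrained on the selected features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the featurized data (assumption for this sketch)
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=4, random_state=0
)

# Hyperparameter tuning, scored by AUC (grid values are illustrative)
param_grid = {"max_depth": [3, 5, 10], "min_samples_split": [5, 10, 50]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)
best_tree = search.best_estimator_

# Keep only the features the tuned tree actually used in a split
keep = np.flatnonzero(best_tree.feature_importances_ > 0)
X_reduced = X[:, keep]

# Retrain on the reduced feature set (here: the same kind of tree)
final_model = DecisionTreeClassifier(random_state=0, **search.best_params_)
final_model.fit(X_reduced, y)
print(f"kept {len(keep)} of {X.shape[1]} features")
```

With sparse BOW or TFIDF inputs, this filter typically discards most of the vocabulary, because a shallow tree splits on only a handful of columns.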

Approach 3-XGBoost

The following operations are performed for this approach:

  • Featurization:
      • Set 1: categorical features (response coding, using probability values), numerical features + project title (TFIDF) + essay (TFIDF) + essay sentiment score
      • Set 2: categorical features (response coding, using probability values), numerical features + project_title (TFIDF W2V) + preprocessed_eassay (TFIDF W2V)
  • Hyperparameter tuning
  • Training with the best hyperparameters
  • Results

    Set 1

    Set 2
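Response coding, as used for the categorical features above, replaces each category with the class probabilities observed for it in the training data. The sketch below is one plausible way to do this in plain Python; the function names, the `(0.5, 0.5)` fallback for unseen categories, and the toy state codes are assumptions, not the project's actual implementation.

```python
from collections import Counter, defaultdict


def fit_response_coding(categories, labels):
    """Map each category to (P(y=0 | category), P(y=1 | category))."""
    counts = defaultdict(Counter)
    for cat, label in zip(categories, labels):
        counts[cat][label] += 1

    table = {}
    for cat, c in counts.items():
        total = c[0] + c[1]  # labels are assumed to be 0 or 1
        table[cat] = (c[0] / total, c[1] / total)
    return table


def transform_response_coding(categories, table, default=(0.5, 0.5)):
    """Encode categories; unseen values fall back to an uninformative prior."""
    return [table.get(cat, default) for cat in categories]


# Toy training data (hypothetical): a state column and approval labels
train_states = ["CA", "CA", "NY", "CA", "NY", "TX"]
train_labels = [1, 0, 1, 1, 1, 0]

table = fit_response_coding(train_states, train_labels)
encoded = transform_response_coding(["CA", "TX", "WA"], table)
print(encoded)
```

The table must be fit on the training split only, since fitting it on validation or test rows would leak the target into the features. The resulting two columns per categorical feature can then be stacked with the numerical and text features before training the boosted model.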
