Swathi

Deliverable 2: Classification and Clustering on Heart Data


Description:

This deliverable implements classification and clustering techniques on the Heart data from Tabula Sapiens. Logistic Regression and Support Vector Machine were used to predict the cell type, and K-Means and Hierarchical Clustering were applied to the Heart cell data to look for groups of similar cells.

Implementation Steps:

  1. Download and install Python, Anaconda and Jupyter Notebook.
  2. Download the Tabula Sapiens Heart dataset from the CZ Biohub website.
  3. Install the anndata, scanpy, numpy, pandas, matplotlib and seaborn Python packages.
  4. Execute the code snippets step by step, as described in the Google Colab (ipynb) file.
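
As a rough sketch of what the first notebook cells might look like once the steps above are done, the snippet below reads an .h5ad file and turns it into a feature matrix and a label vector. The file name TS_Heart.h5ad, the label column cell_ontology_class and both helper functions are assumptions made for illustration; they are not taken from the notebook.

```python
import numpy as np


def load_heart(path="TS_Heart.h5ad"):
    """Read an .h5ad file into an AnnData object (the path is a placeholder)."""
    import scanpy as sc  # imported lazily so the helper below works without scanpy

    return sc.read_h5ad(path)


def to_features_and_labels(adata, label_key="cell_ontology_class"):
    """Return a dense (cells x genes) array and one label per cell.

    label_key is an assumed column name in adata.obs.
    """
    X = adata.X
    if hasattr(X, "toarray"):  # sparse matrices expose toarray()
        X = X.toarray()
    y = np.asarray(adata.obs[label_key])
    return np.asarray(X), y
```

With the dataset downloaded, this would be called as adata = load_heart() followed by X, y = to_features_and_labels(adata).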

Results:

  1. Implemented Logistic Regression and Support Vector Machine algorithms on the Heart cell data to predict the cell type.

  2. Classification

    • Logistic Regression

      The Logistic Regression model, trained with 1000 iterations, reached an accuracy of 98.6% on the test data.

      [Figures: Logistic Regression Predict Celltype; Logistic Regression Accuracy]

    • Support Vector Machine

      A Support Vector Machine with a linear kernel reached an accuracy of 98.8%, slightly better than Logistic Regression.

      [Figures: two images, both labeled "Logistic Regression Accuracy"]
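
The two classifiers described above can be sketched as follows. This is a minimal example on synthetic data, assuming scikit-learn is the library used and that "1000 iterations" corresponds to max_iter=1000; the train/test split, the random seeds and the synthetic blobs standing in for the (cells x genes) matrix are all assumptions, so the accuracies it prints are unrelated to the 98.6% and 98.8% reported for the Heart data.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix (rows = cells, columns = genes)
# and the cell type labels. Replace with the real X and y.
X, y = make_blobs(n_samples=600, n_features=20, centers=4, random_state=0)

# Hold out 20% of the cells for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Logistic Regression with the iteration cap raised to 1000.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
lr_acc = accuracy_score(y_test, log_reg.predict(X_test))

# Support Vector Machine with a linear kernel.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)
svm_acc = accuracy_score(y_test, svm.predict(X_test))

print(f"Logistic Regression accuracy: {lr_acc:.3f}")
print(f"Linear SVM accuracy:          {svm_acc:.3f}")
```

Both models are scored on the same held-out cells, so the two accuracy values can be compared directly.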

  3. Clustering

    • K-Means

      Trained a K-Means model with 6 clusters, using the genes of the Heart data as features. Most of the cells were grouped into distinct clusters.

      [Figures: K-means cluster labels; K-means clusters]

      Implemented K-Means with 6 clusters on the first 10 principal components of the genes of the Heart data. These clusters can be distinguished more clearly than the ones plotted on the original data.

      [Figures: K-means XPCA cluster labels; K-means XPCA clusters]
    • Hierarchical Clustering

      Hierarchical Clustering was performed on the Heart cell data with max_distance set to 2000. The clusters are not clearly separated in the plot of the principal components.

      [Figure: Hierarchical clusters]

      The image below shows the dendrogram of the clusters.

      [Figure: Hierarchical dendrogram]
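
The clustering steps above can be sketched as follows, again on synthetic data. The example assumes scikit-learn for K-Means and PCA and SciPy for hierarchical clustering. The Ward linkage, the use of fcluster with a distance criterion to apply max_distance, and the random seeds are assumptions; the write-up does not say which linkage or library call was used. Because the synthetic data has a different scale from the Heart data, the threshold is derived from the merge heights instead of the value 2000.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the (cells x genes) expression matrix.
X, _ = make_blobs(n_samples=300, n_features=50, centers=6, random_state=0)

# K-Means with 6 clusters on the original features.
labels_raw = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# K-Means with 6 clusters on the first 10 principal components.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)
labels_pca = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_pca)

# Hierarchical clustering: build the merge tree, then cut it at a distance
# threshold. The write-up uses max_distance = 2000 on the real data; here the
# threshold is half the height of the final merge (a demo value only).
Z = linkage(X, method="ward")
max_distance = 0.5 * Z[:, 2].max()
labels_hier = fcluster(Z, t=max_distance, criterion="distance")

# Dendrogram layout, truncated to the last 12 merges. no_plot=True only
# computes the layout; drop it to draw the tree with matplotlib.
tree = dendrogram(Z, truncate_mode="lastp", p=12, no_plot=True)

print("K-Means clusters (raw):", len(set(labels_raw)))
print("K-Means clusters (PCA):", len(set(labels_pca)))
print("Hierarchical clusters: ", labels_hier.max())
```

Lowering max_distance cuts the tree nearer the leaves and yields more, smaller clusters; raising it merges them.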