### <center>San Jose State University<br>Department of Applied Data Science<br><br>**DATA 200<br>Computational Programming for Data Analytics**<br><br>Spring 2023<br>Instructor: Ron Mak</center>

# Titanic Survival CSV Data

#### We will analyze actual passenger survival data from the sinking of the Titanic. The first few lines of `TitanicSurvival.csv`:
```
"","survived","sex","age","passengerClass"
"Allen, Miss. Elisabeth Walton","yes","female",29,"1st"
"Allison, Master. Hudson Trevor","yes","male",0.916700006,"1st"
"Allison, Miss. Helen Loraine","no","female",2,"1st"
"Allison, Mr. Hudson Joshua Crei","no","male",30,"1st"
"Allison, Mrs. Hudson J C (Bessi","no","female",25,"1st"
"Anderson, Mr. Harry","yes","male",48,"1st"
"Andrews, Miss. Kornelia Theodos","yes","female",63,"1st"
"Andrews, Mr. Thomas Jr","no","male",39,"1st"
"Appleton, Mrs. Edward Dale (Cha","yes","female",53,"1st"
"Artagaveytia, Mr. Ramon","no","male",71,"1st"
"Astor, Col. John Jacob","no","male",47,"1st"
"Astor, Mrs. John Jacob (Madelei","yes","female",18,"1st"
"Aubart, Mme. Leontine Pauline","yes","female",24,"1st"
"Barber, Miss. Ellen Nellie","yes","female",26,"1st"
"Barkworth, Mr. Algernon Henry W","yes","male",80,"1st"
"Baumann, Mr. John D","no","male",NA,"1st"
"Baxter, Mr. Quigg Edmond","no","male",24,"1st"
"Baxter, Mrs. James (Helene DeLa","yes","female",50,"1st"
```
#### Note that the name column does not have a header. Babies have fractional ages (for example, the first Allison). Not all ages were recorded, and missing ages were entered as `NA` (for example, Baumann).
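#### A minimal sketch of how these quirks show up when the file is read with Python's `csv` module. It uses three rows copied from the sample above; the helper `parse_age` is a hypothetical name introduced only for this illustration, and returning `None` for a missing age is an assumption of the sketch.

```python
import csv
import io

# Header line plus three rows copied from the sample above.
sample = '''"","survived","sex","age","passengerClass"
"Allison, Master. Hudson Trevor","yes","male",0.916700006,"1st"
"Anderson, Mr. Harry","yes","male",48,"1st"
"Baumann, Mr. John D","no","male",NA,"1st"
'''

def parse_age(text):
    """Hypothetical helper: return the age as a float, or None for 'NA'."""
    return None if text == 'NA' else float(text)

reader = csv.reader(io.StringIO(sample))
header = next(reader)   # the first header field is an empty string
parsed_ages = [parse_age(row[3]) for row in reader]

print(header)
print(parsed_ages)
```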
#### After you've successfully identified and accessed the data you want to analyze, a major challenge is how to clean, format, and store the data in **data structures** that are appropriate for the types of analysis you want to do.
#### We want to analyze the Titanic survival data along four dimensions: survived (yes or no), sex (male or female), age (eight age groups, plus one group for unknown ages), and passenger class (1st, 2nd, or 3rd).

## Global constants for the four dimensions

In [None]:
SURVIVAL_GROUPS = 2
SURVIVED_NO = 0
SURVIVED_YES = 1

SEX_GROUPS = 2
SEX_MALE = 0
SEX_FEMALE = 1

AGE_GROUPS = 9
AGE_UNKNOWN = 0
AGE_BABY = 1      # age < 1
AGE_TODDLER = 2   #  1 <= age < 2
AGE_CHILD = 3     #  2 <= age < 13
AGE_TEENAGER = 4  # 13 <= age < 20
AGE_YOUNG = 5     # 20 <= age < 30
AGE_MIDDLE = 6    # 30 <= age < 65
AGE_SENIOR = 7    # 65 <= age < 75
AGE_ELDERLY = 8   # age >= 75

CLASS_GROUPS = 3
CLASS_1 = 0
CLASS_2 = 1
CLASS_3 = 2
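
#### The age-group comments above define half-open ranges. As a sketch only, the same boundaries can be put in a sorted list and searched with `bisect_right` from the standard library. The names `AGE_BOUNDS` and `age_group_of` are hypothetical and are not used elsewhere in this notebook.

```python
from bisect import bisect_right

# Group indices copied from the constants above.
AGE_BABY, AGE_TODDLER, AGE_CHILD, AGE_TEENAGER = 1, 2, 3, 4
AGE_YOUNG, AGE_MIDDLE, AGE_SENIOR, AGE_ELDERLY = 5, 6, 7, 8

# Lower bounds of every group after AGE_BABY, in ascending order.
AGE_BOUNDS = [1, 2, 13, 20, 30, 65, 75]

def age_group_of(age):
    """Hypothetical helper: map a known age to its group index."""
    # bisect_right counts how many bounds are <= age, so an age
    # equal to a bound falls into the higher group.
    return AGE_BABY + bisect_right(AGE_BOUNDS, age)

for age in (0.9167, 1, 12, 13, 29, 30, 71, 80):
    print(age, age_group_of(age))
```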

## Create the four-dimensional `numpy` array

#### We need to count how many passengers fall into each combination of the four dimensions. Therefore, store the counts in a four-dimensional `numpy` array so we can take advantage of `numpy`'s vector and matrix arithmetic.

In [None]:
import numpy as np

#### First, create a list of zeros of the appropriate length. Then convert the list into the `numpy` array `counts` that we can shape into the four dimensions.

In [None]:
multidimensional_list = \
    [0]*SURVIVAL_GROUPS*SEX_GROUPS*AGE_GROUPS*CLASS_GROUPS

counts = np.array(multidimensional_list)\
    .reshape(SURVIVAL_GROUPS, SEX_GROUPS, AGE_GROUPS, CLASS_GROUPS)
counts

#### Can you identify the four dimensions in the above **hypercube**?
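
#### As an alternative sketch, `np.zeros` can build an equivalent zero-filled array directly from a shape tuple. The name `counts_alt` is hypothetical, and `dtype=int` is passed on the assumption that integer counts are wanted (the default would be floating point).

```python
import numpy as np

# Group sizes copied from the constants above.
SURVIVAL_GROUPS, SEX_GROUPS, AGE_GROUPS, CLASS_GROUPS = 2, 2, 9, 3

# dtype=int keeps the elements integers, like the list-based version.
counts_alt = np.zeros(
    (SURVIVAL_GROUPS, SEX_GROUPS, AGE_GROUPS, CLASS_GROUPS), dtype=int)

print(counts_alt.shape)   # one axis per dimension
print(counts_alt.size)    # total number of cells
```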

## For graphing later

In [None]:
age_known_count_1 = 0
age_known_count_2 = 0
age_known_count_3 = 0

age_sum_1 = 0
age_sum_2 = 0
age_sum_3 = 0

ages = []

## Imports

In [None]:
import csv
import re
import matplotlib
import matplotlib.pyplot as plt

## Read and process the rows of the CSV file
#### We will get a **list of values** for each row of the CSV file.

In [None]:
first = True

with open('TitanicSurvival.csv', newline='') as titanic_csv_file:
    titanic_data = csv.reader(titanic_csv_file, delimiter=',', quotechar='"')
    
    # Loop for each row.
    for row in titanic_data:
        # Ignore the column headers.
        if first:
            first = False
            continue
        
        # Unpack the row of values.
        name, survived, sex, age, pclass = row
        
        # Convert the passenger's survival status and sex.
        survived = SURVIVED_NO if survived == 'no' else SURVIVED_YES
        sex      = SEX_FEMALE  if sex == 'female'  else SEX_MALE
        
        # Convert the passenger class.
        if pclass == '1st':
            pclass = CLASS_1
        elif pclass == '2nd':
            pclass = CLASS_2
        else:
            pclass = CLASS_3
        
        # Convert the age. Use a regular expression to
        # check that it is numeric (and therefore not 'NA').
        if re.fullmatch(r'\d+(\.\d*)?', age):
            age = float(age)
            ages.append(round(age))
            
            # Count and sum of known ages in each class.
            if pclass == CLASS_1:
                age_known_count_1 += 1
                age_sum_1 += age
            elif pclass == CLASS_2:
                age_known_count_2 += 1
                age_sum_2 += age
            else:
                age_known_count_3 += 1
                age_sum_3 += age
        
            # Tally each age group.
            if age < 1:
                age_group = AGE_BABY
            elif age < 2:
                age_group = AGE_TODDLER
            elif age < 13:
                age_group = AGE_CHILD
            elif age < 20:
                age_group = AGE_TEENAGER
            elif age < 30:
                age_group = AGE_YOUNG
            elif age < 65:
                age_group = AGE_MIDDLE
            elif age < 75:
                age_group = AGE_SENIOR
            else:
                age_group = AGE_ELDERLY
        
        # The age was 'NA'.
        else:
            age = 0
            age_group = AGE_UNKNOWN
            
        # Update the counts.
        counts[survived][sex][age_group][pclass] += 1
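
#### The update above uses chained indexing, one pair of brackets per dimension. A small sketch on a stand-in array (`demo` is a hypothetical name) suggests that `numpy`'s tuple indexing reaches the same cell in a single lookup.

```python
import numpy as np

demo = np.zeros((2, 2, 9, 3), dtype=int)

# Chained indexing, as in the loop above: each [] selects a view
# into the array, and the final += writes through to the same data.
demo[1][0][5][2] += 1

# Tuple indexing: one lookup with all four indices at once.
demo[1, 0, 5, 2] += 1

print(demo[1, 0, 5, 2])
print(demo.sum())
```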

In [None]:
counts

## Total count of passengers

In [None]:
total_count = np.sum(counts)
total_count

## Count of males and females

In [None]:
female_count = np.sum(counts[:, SEX_FEMALE, :, :])
male_count   = np.sum(counts[:, SEX_MALE,   :, :])

print(f'{female_count = }')
print(f'{male_count = }')

In [None]:
male_count + female_count
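
#### Slicing one sex at a time is one way to get these totals. As a sketch, summing over every axis except the sex axis gives both totals in one call. The array `demo` below holds placeholder values, not the Titanic counts.

```python
import numpy as np

SEX_MALE, SEX_FEMALE = 0, 1

# Placeholder values with the same shape as counts.
demo = np.arange(108).reshape(2, 2, 9, 3)

# Sum over the survival, age, and class axes; the sex axis remains.
by_sex = demo.sum(axis=(0, 2, 3))

print(by_sex[SEX_MALE], by_sex[SEX_FEMALE])
print(by_sex.sum() == demo.sum())
```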

### Quick tutorial on creating simple Python graphs: [Matplotlib Tutorial](https://www.w3schools.com/python/matplotlib_intro.asp)

In [None]:
x = ['males', 'females']
y = [male_count, female_count]

plt.bar(x, y)
plt.show()

In [None]:
plt.pie(y, labels=['males', 'females'])
plt.show()

## Count of passengers in each class

In [None]:
class_1_count = np.sum(counts[:, :, :, CLASS_1])
class_2_count = np.sum(counts[:, :, :, CLASS_2])
class_3_count = np.sum(counts[:, :, :, CLASS_3])

print(f'{class_1_count = }')
print(f'{class_2_count = }')
print(f'{class_3_count = }')

In [None]:
class_1_count + class_2_count + class_3_count

In [None]:
x = ['1st', '2nd', '3rd']
y = [class_1_count, class_2_count, class_3_count]

plt.bar(x, y)
plt.show()

In [None]:
plt.pie(y, labels=['1st', '2nd', '3rd'])
plt.show()

## Histogram of overall ages

In [None]:
plt.hist(ages)
plt.show()

### What's the difference between a histogram and a bar chart? See [Difference Between Histogram and Bar Graph](https://keydifferences.com/difference-between-histogram-and-bar-graph.html)
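
#### One practical difference is that a histogram groups numeric values into bins. A minimal sketch with `np.histogram`, using the known ages from the sample rows at the top of this notebook and bin edges that follow the age-group boundaries. The upper edge of 120 is an arbitrary assumption of this sketch.

```python
import numpy as np

# Known ages from the sample rows at the top of this notebook.
sample_ages = [29, 0.9167, 2, 30, 25, 48, 63, 39, 53,
               71, 47, 18, 24, 26, 80, 24, 50]

# Bin edges that follow the age-group boundaries; 120 is an
# arbitrary upper limit chosen for this sketch.
edges = [0, 1, 2, 13, 20, 30, 65, 75, 120]

# Each bin is half-open, [low, high), except the last one,
# which also includes its upper edge.
bin_counts, _ = np.histogram(sample_ages, bins=edges)
print(bin_counts)
```

#### The same list of edges can be passed to `plt.hist` through its `bins` parameter.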

## Average age in each class

In [None]:
avg_1st = age_sum_1/age_known_count_1
avg_2nd = age_sum_2/age_known_count_2
avg_3rd = age_sum_3/age_known_count_3

print(f'{avg_1st = :.1f}')
print(f'{avg_2nd = :.1f}')
print(f'{avg_3rd = :.1f}')
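
#### The six separate counters could also be kept in two small arrays indexed by the class constants. This is a sketch only: the `(age, class)` pairs below are made up for illustration and do not come from the CSV file, and the names `age_sums` and `age_known_counts` are hypothetical.

```python
import numpy as np

CLASS_1, CLASS_2, CLASS_3 = 0, 1, 2

# Hypothetical (age, class) pairs, not taken from the CSV file.
known = [(29.0, CLASS_1), (48.0, CLASS_1), (30.0, CLASS_2),
         (18.0, CLASS_3), (22.0, CLASS_3), (26.0, CLASS_3)]

age_sums = np.zeros(3)                    # sum of known ages per class
age_known_counts = np.zeros(3, dtype=int) # number of known ages per class

for age, pclass in known:
    age_sums[pclass] += age
    age_known_counts[pclass] += 1

# Element-wise division gives one average per class.
averages = age_sums / age_known_counts
print(averages)
```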

In [None]:
x = ['1st', '2nd', '3rd']
y = [avg_1st, avg_2nd, avg_3rd]

plt.bar(x, y)
plt.show()

## Count of passengers in each age group

In [None]:
age_baby_count     = np.sum(counts[:, :, AGE_BABY,     :])
age_toddler_count  = np.sum(counts[:, :, AGE_TODDLER,  :])
age_child_count    = np.sum(counts[:, :, AGE_CHILD,    :])
age_teenager_count = np.sum(counts[:, :, AGE_TEENAGER, :])
age_young_count    = np.sum(counts[:, :, AGE_YOUNG,    :])
age_middle_count   = np.sum(counts[:, :, AGE_MIDDLE,   :])
age_senior_count   = np.sum(counts[:, :, AGE_SENIOR,   :])
age_elderly_count  = np.sum(counts[:, :, AGE_ELDERLY,  :])
age_unknown_count  = np.sum(counts[:, :, AGE_UNKNOWN,  :])

print(f'{age_baby_count     = :3d}')
print(f'{age_toddler_count  = :3d}')
print(f'{age_child_count    = :3d}')
print(f'{age_teenager_count = :3d}')
print(f'{age_young_count    = :3d}')
print(f'{age_middle_count   = :3d}')
print(f'{age_senior_count   = :3d}')
print(f'{age_elderly_count  = :3d}')
print(f'{age_unknown_count  = :3d}')

In [None]:
age_baby_count + age_toddler_count + age_child_count + \
age_teenager_count + age_young_count + age_middle_count + \
age_senior_count + age_elderly_count + age_unknown_count

In [None]:
np.sum(counts[:, :, 0:AGE_GROUPS, :])

In [None]:
x = ['baby', 'toddler', 'child', 'teenager', 'young',
     'middle', 'senior', 'elderly', 'unknown']
y = [age_baby_count, age_toddler_count, age_child_count,
     age_teenager_count, age_young_count, age_middle_count,
     age_senior_count, age_elderly_count, age_unknown_count]

plt.bar(x, y)
plt.show()

In [None]:
y = [age_baby_count + age_toddler_count, age_child_count,
     age_teenager_count, age_young_count, age_middle_count,
     age_senior_count + age_elderly_count]


plt.pie(y, labels=['baby and toddler', 'child', 
                   'teenager', 'young','middle', 
                   'senior and elderly'])
plt.show()

## Count of survivors and nonsurvivors

In [None]:
perished_count = np.sum(counts[SURVIVED_NO,  :, :, :])
survived_count = np.sum(counts[SURVIVED_YES, :, :, :])

print(f'{perished_count = }')
print(f'{survived_count = }')

In [None]:
perished_count + survived_count

In [None]:
x = ['survived', 'perished']
y = [survived_count, perished_count]

plt.bar(x, y)
plt.show()

In [None]:
plt.pie(y, labels=['survived', 'perished'])
plt.show()

In [None]:
class_1_survivors = np.sum(counts[SURVIVED_YES, :, :, CLASS_1])
class_2_survivors = np.sum(counts[SURVIVED_YES, :, :, CLASS_2])
class_3_survivors = np.sum(counts[SURVIVED_YES, :, :, CLASS_3])

print(f'{class_1_survivors = }')
print(f'{class_2_survivors = }')
print(f'{class_3_survivors = }')

## Survivor percentages by class

In [None]:
class_1_survivor_pct = class_1_survivors/class_1_count
class_2_survivor_pct = class_2_survivors/class_2_count
class_3_survivor_pct = class_3_survivors/class_3_count

print(f'{class_1_survivor_pct = :.1%}')
print(f'{class_2_survivor_pct = :.1%}')
print(f'{class_3_survivor_pct = :.1%}')
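
#### The three divisions above can also be done as one element-wise array division. A sketch on placeholder values (`demo` is not the Titanic data), with the class axis kept and the other axes summed away:

```python
import numpy as np

SURVIVED_YES = 1

# Placeholder values with the same shape as counts.
demo = np.arange(1, 109).reshape(2, 2, 9, 3)

# demo[SURVIVED_YES] has axes (sex, age, class); sum away sex and age.
survivors_by_class = demo[SURVIVED_YES].sum(axis=(0, 1))

# Sum away survival, sex, and age to get the total per class.
totals_by_class = demo.sum(axis=(0, 1, 2))

fractions = survivors_by_class / totals_by_class
for fraction in fractions:
    print(f'{fraction:.1%}')
```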

In [None]:
x = ['1st', '2nd', '3rd']
y = [class_1_survivor_pct, class_2_survivor_pct, class_3_survivor_pct]

plt.bar(x, y)
plt.show()

In [None]:
y = [class_1_survivors, class_1_count - class_1_survivors]

plt.pie(y, labels=['survived in 1st class', 'perished in 1st class'])
plt.show()

In [None]:
y = [class_2_survivors, class_2_count - class_2_survivors]

plt.pie(y, labels=['survived in 2nd class', 'perished in 2nd class'])
plt.show()

In [None]:
y = [class_3_survivors, class_3_count - class_3_survivors]

plt.pie(y, labels=['survived in 3rd class', 'perished in 3rd class'])
plt.show()

## Survivor percentages by age group

In [None]:
baby_survivor_pct     = np.sum(counts[SURVIVED_YES, :, AGE_BABY,     :])/age_baby_count
toddler_survivor_pct  = np.sum(counts[SURVIVED_YES, :, AGE_TODDLER,  :])/age_toddler_count
child_survivor_pct    = np.sum(counts[SURVIVED_YES, :, AGE_CHILD,    :])/age_child_count
teenager_survivor_pct = np.sum(counts[SURVIVED_YES, :, AGE_TEENAGER, :])/age_teenager_count
young_survivor_pct    = np.sum(counts[SURVIVED_YES, :, AGE_YOUNG,    :])/age_young_count
middle_survivor_pct   = np.sum(counts[SURVIVED_YES, :, AGE_MIDDLE,   :])/age_middle_count
senior_survivor_pct   = np.sum(counts[SURVIVED_YES, :, AGE_SENIOR,   :])/age_senior_count
elderly_survivor_pct  = np.sum(counts[SURVIVED_YES, :, AGE_ELDERLY,  :])/age_elderly_count

print(f'{baby_survivor_pct     = :5.1%}')
print(f'{toddler_survivor_pct  = :5.1%}')
print(f'{child_survivor_pct    = :5.1%}')
print(f'{teenager_survivor_pct = :5.1%}')
print(f'{young_survivor_pct    = :5.1%}')
print(f'{middle_survivor_pct   = :5.1%}')
print(f'{senior_survivor_pct   = :5.1%}')
print(f'{elderly_survivor_pct  = :5.1%}')
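
#### These divisions assume that every age group has at least one passenger. If a group were empty, the division would be by zero. A sketch of one way to guard against that with `np.divide` and its `where` and `out` parameters. The counts below are made up, and marking an empty group with `nan` is an assumption of this sketch.

```python
import numpy as np

# Hypothetical survivor and total counts for four groups;
# the last group is empty.
survivors = np.array([3, 10, 0, 0])
totals = np.array([4, 40, 5, 0])

# Divide only where totals is nonzero; elsewhere keep the nan
# that was placed in the output array beforehand.
fractions = np.divide(
    survivors, totals,
    out=np.full(survivors.shape, np.nan),
    where=totals != 0)

print(fractions)
```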

In [None]:
x = ['baby', 'toddler', 'child', 'teenager', 
     'young', 'middle', 'senior', 'elderly']
y = [baby_survivor_pct, toddler_survivor_pct, child_survivor_pct, teenager_survivor_pct,
     young_survivor_pct, middle_survivor_pct, senior_survivor_pct, elderly_survivor_pct]

plt.bar(x, y)
plt.show()

## Survivor percentages by sex

In [None]:
female_survivors = np.sum(counts[SURVIVED_YES, SEX_FEMALE, :, :])

print(f'{female_survivors = }')
print('% Female survivors: '
      f'{female_survivors/female_count:.1%}') 

In [None]:
male_survivors = np.sum(counts[SURVIVED_YES, SEX_MALE, :, :])

print(f'{male_survivors = }')
print('% Male survivors: '
      f'{male_survivors/male_count:.1%}') 

In [None]:
total_survivors = np.sum(counts[SURVIVED_YES, :, :, :])

print(f'{total_survivors = }')
print('% All survivors: '
      f'{total_survivors/total_count:.1%}') 

In [None]:
y = [female_survivors, female_count - female_survivors]

plt.pie(y, labels=['females survived', 'females perished'])
plt.show()

In [None]:
y = [male_survivors, male_count - male_survivors]

plt.pie(y, labels=['males survived', 'males perished'])
plt.show()

## Count of female survivors

#### In 1st class

In [None]:
female_survived_1st = np.sum(counts[SURVIVED_YES, SEX_FEMALE, :, CLASS_1])

print(f'{female_survived_1st = }')

#### In both 2nd and 3rd class

In [None]:
female_survived_2nd3rd = np.sum(counts[SURVIVED_YES, SEX_FEMALE, :, [CLASS_2, CLASS_3]])

print(f'{female_survived_2nd3rd = }')
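
#### Indexing with the list `[CLASS_2, CLASS_3]` is `numpy`'s advanced indexing. Because the two classes are adjacent, a slice selects the same cells. A sketch on placeholder values (`demo` is not the Titanic data) compares the two; the sums agree, although the selected subarrays come back with their axes in a different order.

```python
import numpy as np

SURVIVED_YES, SEX_FEMALE = 1, 1
CLASS_2, CLASS_3 = 1, 2

# Placeholder values with the same shape as counts.
demo = np.arange(108).reshape(2, 2, 9, 3)

# Advanced indexing with a list of class indices.
picked = demo[SURVIVED_YES, SEX_FEMALE, :, [CLASS_2, CLASS_3]]

# Basic slicing over the same two adjacent classes.
sliced = demo[SURVIVED_YES, SEX_FEMALE, :, CLASS_2:CLASS_3 + 1]

print(picked.shape, sliced.shape)
print(picked.sum() == sliced.sum())
```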

In [None]:
x = ['female survivors: 1st class', '2nd+3rd']
y = [female_survived_1st, female_survived_2nd3rd]

plt.bar(x, y)
plt.show()

In [None]:
plt.pie(y, labels=['1st', '2nd+3rd'])
plt.show()

#### (C) Copyright 2023 by Ronald Mak