Deliverable 4: Dataset Collection

DESCRIPTION

The ideal dataset would contain user data in which every relevant condition was recorded: the type of exercise, the language being learned, the user's native language, the total number of times the user encountered a word, the total number of times the user answered that word correctly, and the user identity. The user identity makes it possible to build a per-user profile and assess how effective the learning is.

The first dataset contains 13 million Duolingo student learning traces. Each record consists of the proportion of exercises in which the word was used correctly, the activity timestamp, the time elapsed since the last activity that included the same word, the user identity, the language the user is trying to learn, the user's native language, the word, the total number of times the user has seen the word, and the total number of times the user has correctly identified the word.
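As a rough illustration of how such a record could be read, the sketch below parses a small in-memory CSV sample and derives a per-word accuracy from the two history counts. The column names (p_recall, delta, history_seen, and so on), the sample values, and the assumption that the elapsed time is stored in seconds are all placeholders chosen for this example; they should be checked against the header of the actual file before use.

```python
import csv
import io

# Hypothetical sample: column names and values are assumptions for illustration,
# not rows copied from the real dataset.
SAMPLE = """p_recall,timestamp,delta,user_id,learning_language,ui_language,word,history_seen,history_correct
1.0,1362076081,86400,u001,de,en,lernen,6,4
0.5,1362076090,172800,u001,de,en,die,4,4
"""


def load_traces(text):
    """Parse learning traces and add a few derived fields to each record."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        seen = int(row["history_seen"])
        correct = int(row["history_correct"])
        rows.append({
            "user": row["user_id"],
            "word": row["word"],
            "p_recall": float(row["p_recall"]),
            # Assumes the elapsed time is stored in seconds.
            "days_since_last": int(row["delta"]) / 86400,
            # Fraction of past encounters answered correctly; None if never seen.
            "history_accuracy": correct / seen if seen else None,
        })
    return rows


traces = load_traces(SAMPLE)
for trace in traces:
    print(trace)
```

Keeping the derived history accuracy next to the observed proportion for the current session makes it easy to compare a user's long-run performance on a word with the most recent session.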

The second dataset contains information about 7 million words from more than 6,000 users, collected over a period of 30 days. The dataset is divided between English, French and Spanish words. Each exercise record begins with a header line containing the user identity, the user's country, the number of days since the user started learning the language, the session type, the activity format, and a time stamp. Each line that follows the header contains a unique id, the word, the part of speech in Universal Dependencies format, morphological features, the dependency edge label, and the dependency edge head. Each unique id encodes the session, the index of the activity within that session, and the position of the word within that activity.
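The layout described above can be sketched as a small parser. This is only a minimal example under stated assumptions: header lines are assumed to start with "#" and hold whitespace-separated key:value pairs, token lines are assumed to be whitespace-separated, exercises are assumed to be separated by blank lines, and the unique id is assumed to be 12 characters split as 8 (session), 2 (activity index), and 2 (word position). The sample text and the field names are made up for illustration.

```python
# Hypothetical sample in the assumed layout; not copied from the real files.
SAMPLE = """\
# user:U001 countries:CO days:0.003 session:lesson format:reverse_translate time:9
AAAAAAAA0101 I PRON Number=Sing|Person=1 nsubj 2
AAAAAAAA0102 am VERB Mood=Ind|Tense=Pres root 0

# user:U001 countries:CO days:0.005 session:lesson format:listen time:12
AAAAAAAA0201 hello INTJ _ root 0
"""


def split_token_id(token_id):
    """Split an id into session, activity index, and word position (assumed 8+2+2)."""
    return {
        "session": token_id[:8],
        "exercise_index": int(token_id[8:10]),
        "token_index": int(token_id[10:12]),
    }


def parse_exercises(text):
    """Group token lines under the header line that precedes them."""
    exercises = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the current exercise
            continue
        if line.startswith("#"):
            meta = dict(item.split(":", 1) for item in line[1:].split())
            current = {"meta": meta, "tokens": []}
            exercises.append(current)
            continue
        if current is None:
            continue  # token line without a header; skipped in this sketch
        token_id, word, pos, morph, dep_label, dep_head = line.split()[:6]
        # Morphological features look like "Key=Value|Key=Value"; "_" means none.
        features = dict(f.split("=", 1) for f in morph.split("|") if "=" in f)
        current["tokens"].append({
            "id": split_token_id(token_id),
            "word": word,
            "pos": pos,
            "features": features,
            "dep_label": dep_label,
            "dep_head": int(dep_head),
        })
    return exercises


exercises = parse_exercises(SAMPLE)
for exercise in exercises:
    print(exercise["meta"]["format"], [t["word"] for t in exercise["tokens"]])
```

Grouping tokens under their header keeps the user and session metadata attached to every word, which is convenient when building per-user features later.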

Duolingo spaced repetition dataset

Duolingo SLAM shared task dataset

REFERENCES:

Burr Settles, "Replication Data for: A Trainable Spaced Repetition Model for Language Learning", https://doi.org/10.7910/DVN/N8XJME, Harvard Dataverse, 2017.

Burr Settles, "Data for the 2018 Duolingo Shared Task on Second Language Acquisition Modeling (SLAM)", https://doi.org/10.7910/DVN/8SWHNO, Harvard Dataverse, 2018.