(If you want to skip ahead: our dataset is on our Zenodo community page, and our code is in our GitHub repository.)
Guest post by Irina Preda from the EgBot project, a collaboration between the University of Dundee and Good-Loop
Our research focuses on the role of examples in mathematical discourse. One of the ways in which we examine this is through the construction of an autonomous example generator. The generator would contribute to mathematical discussions by proposing appropriate examples in a socially appropriate way. It would be a first step towards machine-enhanced mathematics, where humans and machines collaborate to produce novel mathematics research.
To build a model of how humans use examples, we need a large dataset of examples and the contexts in which they are given. Unfortunately, no such dataset exists, but there is a lot of potential for generating one. First, we must find a source from which to collect this data. Online collaborative mathematics platforms (such as Polymath and MathOverflow) provide a remarkable opportunity to explore mathematical discourse and elements of the mathematical process. They are also high-quality, data-rich sources, well suited both to analysing discourse and to training models.
StackExchange has abundant data (MathStackExchange has approximately 1 million questions) and a well-documented API, so we decided to use the StackExchange API to extract its Math Q&A data and generate our dataset from it. From the start of the project we focused on making our work accessible, which is why we decided to publish both our code (on GitHub) and our dataset openly. We consider this very important: a good dataset is an extremely valuable resource for the data science and machine learning community and can give a significant boost to further research.
For a dataset to be truly accessible, it must be well constructed and documented so that it is easily understood, and it must live on a platform where it can be easily found. So we turned to Zenodo, an open-access research publication framework. It assigns every publicly available upload a DOI (Digital Object Identifier), which makes the data uniquely citable, and it supports harvesting of all content via well-known open web standards (the OAI-PMH protocol), which makes the metadata easily searchable. The only limitation we found is that Zenodo’s online upload tool doesn’t accept JSON files; this was easily worked around by archiving the files, which conveniently reduces the download size as well.
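As an illustration of the two steps just described (querying the StackExchange API and archiving the resulting JSON for upload), here is a minimal Python sketch. It is not the EgBot code itself. The function names, the file names, the API version (2.3), the sort order and the use of the built-in `withbody` filter are all assumptions made for the example.

```python
import gzip
import json
import urllib.parse
import urllib.request
import zipfile
from pathlib import Path

# Assumption: version 2.3 of the StackExchange API.
API_ROOT = "https://api.stackexchange.com/2.3"


def build_questions_url(page, pagesize=100, site="math"):
    """Build the URL for one page of questions from a StackExchange site."""
    params = {
        "site": site,          # "math" is Mathematics Stack Exchange
        "page": page,
        "pagesize": pagesize,  # the API caps this at 100
        "order": "asc",
        "sort": "creation",
        "filter": "withbody",  # built-in filter that also returns post bodies
    }
    return f"{API_ROOT}/questions?{urllib.parse.urlencode(params)}"


def fetch_page(url):
    """Download one page and decode its JSON wrapper (needs network access)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # The API compresses its responses; urllib does not decompress them.
        if response.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


def archive_json(records, json_name, zip_path):
    """Store records as a JSON file inside a compressed zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(json_name, json.dumps(records, ensure_ascii=False))
    return Path(zip_path)
```

A full harvest would call `fetch_page(build_questions_url(page))` for increasing page numbers, collect each response's `items`, stop once `has_more` is false, and pause whenever the response carries a `backoff` value. The collected list can then be passed to `archive_json` to produce a zip file that Zenodo's upload tool accepts.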
Data collection was only the first stage of our project; the remaining stages are data analysis, building a conversational model, and building an interactive web application. Because we intend to use deep learning for the conversational model, this first stage was very important: deep neural networks need an immense amount of training data. Thankfully, the approach sketched above was successful, and we were able to harvest 6 GB of mathematical question-answering and discourse data. If you would like to take a look at our dataset, visit our Zenodo community page; for our code, see our GitHub repository.