Introduction
Welcome to the regex-ml wiki!
This project tested whether machine learning tools can help with information tagging tasks. Part of a larger project, “The Jewish Book Closet”, it focuses on tagging references to Hebrew sources, in this case the Babylonian Talmud. In the past, regular expressions were used for the task, but they proved difficult to work with, so we checked whether machine learning is a better approach. One of the harder steps in machine learning is creating a data set large enough for the machine to learn from. Our goal was to create that data set using weak supervision methods.
“The Jewish Book Closet” is an ongoing research project of the Computer Science Department at the Technion. Its purpose is to create a computerized database of Jewish texts to enrich our knowledge about them, mainly by using modern technologies that allow a new approach to exploring large amounts of data. One goal is to create a tool that detects references from one Jewish text to another. Our focus was on detecting references to the Babylonian Talmud (BT). The BT is a collection of Jewish law from the first centuries CE that serves as an interpretation of the Mishna. To this day, the Babylonian Talmud is considered a valuable source and is quoted in many later Jewish texts. A tool that outputs the BT references in a given text would benefit its target audience of researchers and teachers.
However, since bibliographic conventions took shape mainly in the last century, earlier texts followed no strict format, so deciding whether a text quotes the BT requires prior knowledge. Usually, a reference contains a masechet (tractate) name and/or a chapter name, a page, and the side of the page from which the quote was taken. A manual review of example references showed that the BT was quoted in different ways; although these are somewhat similar, a search based solely on a regular expression matching the elements above would miss a variety of results. For example, some references cite the chapter name within the masechet instead of the masechet name. In the past, this task was handled by writing various regular expressions based on examples extracted from a text, yet different texts quote in different ways, so this proved too much to do manually. Also, combining Hebrew with regular expressions turned out to be far from programmer-friendly.
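To make the limitation concrete, here is a minimal sketch of a rigid regular expression for one citation format (masechet name, the word for "page", page letters, the word for "side", side letter). The masechet list, the constant names and the `find_references` helper are illustrative assumptions, not the expressions the project actually used.

```python
import re

# Hypothetical, tiny sample of masechet names; a real list would be much longer.
MASECHTOT = ["ברכות", "שבת", "עירובין", "פסחים"]
DAF = "דף"      # the word "page"
AMUD = "עמוד"   # the word "side" (of the page)

# One rigid format only: <masechet> <DAF> <page letters> <AMUD> <side letter>
PATTERN = re.compile(
    "(?P<masechet>" + "|".join(re.escape(name) for name in MASECHTOT) + ")"
    + r"\s+" + DAF + r"\s+(?P<page>[א-ת]{1,3})"
    + r"\s+" + AMUD + r"\s+(?P<side>[אב])"
)


def find_references(text):
    """Return (masechet, page, side) tuples for every match of the rigid pattern."""
    return [(m["masechet"], m["page"], m["side"]) for m in PATTERN.finditer(text)]


# A fully spelled-out reference is found...
full_form = find_references("ברכות דף ב עמוד א")
# ...but an abbreviated form of the same reference is missed.
short_form = find_references('ברכות ב ע"א')
```

Every additional citation style (abbreviations, chapter names instead of masechet names) needs another hand-written pattern, which is the maintenance burden described above.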
Therefore, this project tested whether bringing in machine learning might be more productive. We chose to approach the task using weak supervision, “a branch of machine learning where noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amount of data” (Wikipedia). Using the Python libraries pandas, Snorkel, and scikit-learn, we created a large labeled data set of references to the Babylonian Talmud and tested it on a basic classifier, which labels references to the BT in a given text.
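The idea behind weak supervision can be sketched in plain Python: several noisy heuristics ("labeling functions") each vote on a text or abstain, and the votes are combined into one label. This sketch does not call Snorkel; a simple majority vote stands in for the library's label-combining step, and the heuristics, label constants and masechet list are assumptions made for illustration only.

```python
from collections import Counter

ABSTAIN, NOT_REFERENCE, REFERENCE = -1, 0, 1

# Hypothetical sample of masechet names; not the project's actual list.
MASECHTOT = ["ברכות", "שבת", "עירובין", "פסחים"]


def lf_masechet_name(text):
    # Noisy on purpose: a masechet name can also be an ordinary word.
    return REFERENCE if any(name in text for name in MASECHTOT) else ABSTAIN


def lf_page_keyword(text):
    # Votes REFERENCE when the word for "page" or "side" appears.
    return REFERENCE if ("דף" in text or "עמוד" in text) else ABSTAIN


def lf_no_hebrew(text):
    # Text without any Hebrew letter is assumed not to cite the BT.
    has_hebrew = any("א" <= ch <= "ת" for ch in text)
    return ABSTAIN if has_hebrew else NOT_REFERENCE


LABELING_FUNCTIONS = [lf_masechet_name, lf_page_keyword, lf_no_hebrew]


def majority_vote(votes):
    """Combine votes; abstain when nobody votes or the top two labels tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_label(text):
    return majority_vote([lf(text) for lf in LABELING_FUNCTIONS])
```

Texts on which every heuristic abstains stay unlabeled, so they would simply be left out of the training set rather than guessed at.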
The process consisted of three steps: first, creating a labeled data set; second, applying transformations to the labeled data set to enlarge it; and finally, training the classifier. The full project process is described on the next page.