Project Process
Tagging of Talmud Bavli References Using Machine Learning
Creating the Labeled Data Set
The Snorkel library enables labeling data using tagging functions provided by the user. This meant that we first had to characterize the patterns we wished to label as BT references: we found and went over various BT quotes to manually detect the quotation formats that the tagging functions would need to cover. Though this step was long and tedious, it reinforced the decision to use machine learning, and it allowed us to better understand the complexity of the task and to check ourselves in later steps.
The next step involved preparing the data. First, we prepared variables containing “masachtot” and chapter names, which were used in the labeling functions. The second and more challenging part was splitting an input text into short segments that could be labeled as references. We extracted the text of the Talmud from CSV files and moved it into DataFrame format. Several splitting techniques were considered, and eventually the n-gram method was chosen, in which the text is split into sequences of size n. We chose to create n-grams of sizes 3-7 in order to find the most precise reference. These limits can easily be changed; we chose them because we noticed that at least 3 words were mentioned in a reference, with 7 as a maximum based on the examples we saw. We shortened the process by creating n-grams for each sentence separately, to avoid creating a faulty n-gram that starts with the end of one sentence and ends with the beginning of the next.
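As a rough illustration, here is a minimal sketch of per-sentence n-gram generation; the names MIN_N, MAX_N, and sentence_ngrams are ours for illustration, not the project's actual code:

```python
# Minimal sketch of per-sentence n-gram generation (illustrative names).
MIN_N, MAX_N = 3, 7  # the size range described above

def sentence_ngrams(sentence):
    """Return all n-grams of sizes MIN_N..MAX_N for a single sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n])
            for n in range(MIN_N, MAX_N + 1)
            for i in range(len(words) - n + 1)]

# Generating n-grams per sentence avoids faulty n-grams that straddle
# a sentence boundary.
text = "first sentence with enough words here. second sentence follows it."
ngrams = [ng for s in text.split(".") if s.strip()
          for ng in sentence_ngrams(s.strip())]
```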
With the data prepared for labeling, we moved on to the labeling functions. Labeling functions are heuristics that take a data point as input and either assign a label to it or refrain from doing so (see the Snorkel documentation). In our case, the labels were REF, NO_REF, and ABSTAIN: REF means the n-gram should be tagged as a reference, NO_REF means it should not be tagged as a reference, and ABSTAIN means the function could not decide whether to tag the n-gram and so avoided doing so. Snorkel provides different models for combining the functions' results into a conclusive decision on whether to label or not. We chose the Majority Label Voter model, in which the numbers of REF and NO_REF tags are compared and the label is set by the majority; in case of a tie, the n-gram remains unlabeled.
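A minimal sketch of this flow, assuming the n-grams sit in a pandas DataFrame column called "ngram" (the toy heuristic is ours; the Snorkel imports are the library's actual API):

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import MajorityLabelVoter

ABSTAIN, NO_REF, REF = -1, 0, 1  # Snorkel uses -1 for abstaining

@labeling_function()
def ends_with_parens(x):
    # Toy heuristic: an n-gram ending with ")" might be a reference.
    return REF if x.ngram.rstrip().endswith(")") else ABSTAIN

lfs = [ends_with_parens]  # the real project used several such functions
df = pd.DataFrame({"ngram": ["some quoted reference (daf)", "plain text segment"]})
L = PandasLFApplier(lfs=lfs).apply(df=df)

# Majority vote over REF/NO_REF; ties and all-abstain rows stay unlabeled (-1).
preds = MajorityLabelVoter().predict(L=L)
```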
Based on the examples we encountered manually, we created various functions, mostly using regular expressions. We tried creating as many as we could, thinking this would capture as many variations as possible. A function was created for each element of a Babylonian Talmud quote, for example one for the appearance of a masechet name and one for that of a chapter name. Yet we had only one function that labeled an n-gram as not a reference (checking for inequality in the number and order of parentheses). This meant that most functions labeled an n-gram as a reference, which created a bias when working with the majority model. Indeed, the results showed that although we caught many references, this came with a large percentage of false positives: we had high recall with low precision. After discussions, we decided to change the functions to gain higher precision, even at the cost of lower recall, since it is better for the classifier to find a small number of correct answers than a large number of incorrect ones. In addition, Snorkel's tools provide the coverage percentage of each function, which showed that some of the functions were redundant and only prolonged the program's runtime.
Therefore, we changed the labeling functions into the following set, which keeps a mostly even ratio of positive (REF) and negative (NO_REF) labeling functions (sketches of some of these appear after the list):
- masechet_then_parans - checks whether the n-gram contains "masechet" and ends with parentheses.
- perek_then_parans - checks whether the n-gram contains a chapter of the Babylonian Talmud and ends with parentheses.
- daf_in_parntes - checks whether the n-gram contains the word "daf" and ends with parentheses.
- no_mishna - checks whether the n-gram contains the word "mishna"; if so, marks the n-gram as not a reference.
- no_double_parans - checks whether the n-gram contains only one set of opening and closing parentheses in the correct order; if not, marks the n-gram as not a reference.
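The following sketches illustrate a few of the functions above; the regular expression and the MASACHTOT name list are illustrative stand-ins, not the project's exact patterns:

```python
import re
from snorkel.labeling import labeling_function

ABSTAIN, NO_REF, REF = -1, 0, 1
MASACHTOT = ["shabbat", "berachot"]  # placeholder masachtot names

@labeling_function()
def masechet_then_parans(x):
    # REF if the n-gram mentions a masechet and ends with a closing parenthesis.
    has_masechet = any(m in x.ngram for m in MASACHTOT)
    return REF if has_masechet and x.ngram.rstrip().endswith(")") else ABSTAIN

@labeling_function()
def no_mishna(x):
    # NO_REF if the n-gram mentions "mishna".
    return NO_REF if "mishna" in x.ngram else ABSTAIN

@labeling_function()
def no_double_parans(x):
    # NO_REF unless there is exactly one "(" then one ")" in the correct order.
    balanced = re.fullmatch(r"[^()]*\([^()]*\)[^()]*", x.ngram)
    return ABSTAIN if balanced else NO_REF
```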
For example, running the program on 3,000 CSV rows resulted in the following coverage:

As shown above, Snorkel also provides useful statistics about the extent to which the functions agree or conflict with each other.
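These statistics come from Snorkel's LFAnalysis utility; a minimal usage sketch, with L and lfs as in the earlier snippet:

```python
from snorkel.labeling import LFAnalysis

# Per-function coverage, overlaps, and conflicts over the label matrix L.
summary = LFAnalysis(L=L, lfs=lfs).lf_summary()
print(summary[["Coverage", "Overlaps", "Conflicts"]])
```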
After running the functions, because we used a range of n-gram sizes, some labeled n-grams were contained within other labeled n-grams: if a 3-gram was labeled as a reference, the 4- to 7-grams that included it were probably labeled as references as well. We chose to keep the largest labeled n-gram in order to include as much of the reference as possible, while taking into consideration that a possible downside of this approach is that the largest n-gram may include redundant words.
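One possible implementation of this filtering step, assuming the labeled n-grams are plain strings (the helper name and logic are ours for illustration):

```python
def keep_largest(labeled_ngrams):
    """Drop any labeled n-gram that is contained in a longer labeled one."""
    kept = []
    for ng in sorted(labeled_ngrams, key=len, reverse=True):
        if not any(ng in longer for longer in kept):
            kept.append(ng)
    return kept
```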
Data Augmentation
Snorkel also provides methods for enlarging a data set. Once the data was labeled, we used transformation functions, which take a labeled n-gram, either positive or negative, and insert variations of it into the data set by switching keywords in the n-gram, labeling each variation according to the original n-gram's label. Our transformation functions switched masechet names and masechet chapter names. Adding them resulted in an especially long runtime, so the code allows controlling the number of variations added (via a constant called TRANSFORMATION_FACTOR).
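A sketch of such a transformation function using Snorkel's augmentation API; the swap logic and name list are illustrative, and df is the labeled DataFrame from the previous step:

```python
import random
from snorkel.augmentation import transformation_function, PandasTFApplier, RandomPolicy

MASACHTOT = ["shabbat", "berachot", "eruvin"]  # placeholder names
TRANSFORMATION_FACTOR = 6

@transformation_function()
def swap_masechet(x):
    # Replace the first masechet name found with a randomly chosen one;
    # the row keeps its original label.
    for m in MASACHTOT:
        if m in x.ngram:
            x.ngram = x.ngram.replace(m, random.choice(MASACHTOT))
            return x
    return None  # no masechet name found: this transformation yields nothing

tfs = [swap_masechet]
policy = RandomPolicy(len(tfs), sequence_length=1,
                      n_per_original=TRANSFORMATION_FACTOR, keep_original=True)
df_augmented = PandasTFApplier(tfs, policy).apply(df)
```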
Training The Classifier
Following the Snorkel tutorials, we trained the classifier several times and compared the resulting performance. The main principle was to take all tagged examples (positive and negative), split them 70-30, use the larger set for training and the smaller one for testing, and after each training run evaluate the classifier's accuracy on the held-out test set. For the training model, we vectorized the text of the training set and then applied the Logistic Regression linear model (scikit-learn).
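A minimal sketch of this setup with scikit-learn; the column names and parameters are illustrative, not the project's exact configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# df_labeled holds the tagged n-grams; "label" is 1 for REF, 0 for NO_REF.
X_train, X_test, y_train, y_test = train_test_split(
    df_labeled["ngram"], df_labeled["label"], test_size=0.3)

vectorizer = CountVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(X_train), y_train)

print("test accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```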
We tested two parameters while training the classifier: whether the augmented data added any value to the classifier, and how different amounts of labeled data affected its accuracy. The big sample was 7,000 CSV rows, while the small sample was 800 CSV rows with TRANSFORMATION_FACTOR = 6:

Our expectation was that increasing the sample size and adding the augmented data would significantly increase the accuracy of the classifier. We indeed found that the augmented data improved the accuracy, but to our disappointment, not by much; to our surprise, the larger sample size did not improve the accuracy by much either. It is possible that a greater transformation factor would improve the accuracy further. In addition, we suggest checking the classifier's accuracy on examples not taken from the same input: though the training and test sets are disjoint (they have no n-grams in common), taking them from the same input may have affected the accuracy as well.
It is important to mention that both the creation of the labeled data and the data augmentation prolonged the runtime significantly, especially for larger sample sizes. More testing and tuning of the variables mentioned above are required in order to get the best results with a lower runtime.
Conclusion
In conclusion, this project tested whether machine learning can be useful for finding and tagging references. We focused on creating a tagged data set of references (and non-references) to the Babylonian Talmud using weakly supervised machine learning methods in addition to regular expressions. The working process showed that the task at hand was much easier this way than using regular expressions alone, especially when dealing with Hebrew sources. Most importantly, it resulted in a large tagged data set, which would have been impossible to create manually. As seen above, in order to test the data set, we created a basic classifier using it and checked it on a small test set. The next step is to take the data set this tool creates and train a classifier that will tag any input text. We believe that with further understanding of existing deep learning tools it will be possible to achieve even better and more meaningful results.