Analysis of historical data using machine learning
Introduction
Drawing conclusions from historical datasets is one of the hardest problems historians face: information must be verified, correlations derived, and conclusions defended. This complexity makes reaching a sound conclusion slow.
In this project, we propose a more accurate and faster way of drawing new inferences from historical data.
This methodology has four primary parts:
1. Collect and preprocess the dataset
2. Analyse the data using modern machine learning techniques
3. Validate the results against existing information
4. Derive conclusions
To showcase the proposed methodology, we designed an experiment in which we derived cultural and economic relationships between ancient cities of the Indus Valley Civilization.
Project Description
Experimentation
Data Collection (https://github.com/sidgupta2205/CSS_Project/tree/descriptors/Data)
In the first phase, we worked on research and knowledge gathering about the traditional methods used by researchers in the field of history and archaeology.
It was necessary to gather large amounts of data related to artifacts, scripts, seals, coins, daily usage articles, bones and weapons.
We used the following data collection methods for this project:
- Interviews
- Annotations
- Web Scraping bots (needed quality assurance)
- Documents and Records from ASI and RMRL
For each object, we recorded attributes such as:
- Size of the object
- State of preservation of the object
- Object type and material
- Location
The object categories include:
- Seals
- Coins
- Ornaments
- Daily-use articles
- Burial sites
- Weapons
- Symbols
- Toys
- Sculptures
- Tablets
- Artifacts (graffiti)
After verifying the sources of the data and selecting randomly among them, we considered the following sites:
- Mohenjo-daro
- Harappa
- Lothal
- Kalibangan
- Chanhu-daro
- Banawali
- Alamgirpur
- Amri
- Chandigarh
- Daimabad
- Desalpur
- Dholavira
Interviews:
Dr. Aniket Alam: We conducted two interviews with Dr. Aniket Alam. He guided us in researching the traditional methods used by history researchers.
Dr. Satish Palaniappan (Microsoft) (on 22nd February)
He has worked on OCR of Indus scripts. He guided us on feature extraction from the dataset and taught us about data collection methods.
RMRL, Chennai (on 25th February)
The Roja Muthiah Research Library in Chennai has done extensive work on Indus script recognition. We contacted them via email and phone calls and asked about the sites to use in our research.
Annotations (Mass Collaboration of 20 people for 6000 annotations):
Annotation tool: https://www.cvat.org/
We collected 8013 images during the data collection phase; after preprocessing, we were able to create a quality dataset of around 6000 images.
We used a supervised mass-collaboration approach for labeling the data: 20 of our colleagues first did the annotations, and two of our group members then verified them.
Data preprocessing
Image preprocessing is done to standardize images before they are fed into a neural network. It improves the image data by suppressing undesired distortions and enhancing features that are relevant for further processing and analysis.
The image preprocessing steps used in this project are:
Image Resizing:
Image resizing refers to the scaling of images. It reduces the number of pixels in an image, which has several advantages: it simplifies feature extraction, and it reduces neural-network training time, since more pixels mean more input nodes and thus a more complex model.
We used the cv2.resize() method to resize the image dataset, with cv2.INTER_LINEAR as the interpolation mode.
Fig: Original Image
Image Normalization:
Normalization in image processing is used to change the intensity levels of pixels. It is used to get better contrast in images that have poor contrast, for example due to glare.
Fig: Image after Normalization
Image Thresholding:
Image thresholding is a simple form of image segmentation. It is a way to create a binary image from a grayscale or full-color image. This is typically done in order to separate "object" or foreground pixels from background pixels to aid in image processing.
There are several types of thresholding methods available in Open cv library. We have experimented with some of them given below:
- Global Thresholding
- Binary Thresholding
- Inverse-Binary Thresholding
- Truncate Thresholding
- Threshold to Zero
- Inverted Threshold to Zero
After applying all the methods, we noted that Threshold to Zero works best for our project requirements. In this type of thresholding:
- the destination pixel is set to the value of the corresponding source pixel if that value is greater than the threshold;
- otherwise, it is set to zero;
- the maxValue argument is ignored.
Fig: Image after applying Threshold to Zero
Smoothing and Blurring:
By smoothing an image prior to applying techniques such as edge detection or thresholding we are able to reduce the amount of high-frequency content, such as noise and edges (i.e., the “detail” of an image).
We experimented with 3 different smoothing and blurring techniques:
Gaussian blurring: similar to average blurring, but instead of using a simple mean, it uses a weighted mean, where neighborhood pixels closer to the central pixel contribute more "weight" to the average.
Median blurring: The median blur method is most effective when removing salt-and-pepper noise. It replaces the central pixel with the median of the neighborhood.
Bilateral blurring: To reduce noise while still maintaining edges, we used bilateral blurring. Bilateral blurring accomplishes this by introducing two Gaussian distributions.
Bilateral filtering was the most effective on our dataset, as it reduced noise while still maintaining edges; hence it is the method used in this project.
Fig: Image after bilateral filtering
Image Descriptors:
About
Image descriptors, also known as visual descriptors, are descriptions of the visual aspects of the contents of photos or videos, or the techniques and applications in computer vision that generate such descriptions. They describe basic features such as shape, color, texture, and motion, among other things.
Work Done in the Project
Images of various Seals, Coins, Ornaments, Burial Sites, Weapons, Symbols, Toys, Sculptures, Tablets, and other items were provided to us for this project. It was a challenging task to convert those photos to feature vectors. The Preprocessing team preprocessed the images and we acquired photos of the same size. The brightness, intensity, hue, and saturation of those images were also standardized.
Initially, we studied different classical image descriptors: Harris Corner Detection, Shi-Tomasi Corner Detector, Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), Features from Accelerated Segment Test (FAST), Binary Robust Independent Elementary Features (BRIEF), and Oriented FAST and Rotated BRIEF (ORB). We tried some of them, but these descriptors were too simple and did not give good results.
Then we tried some Deep Learning feature extraction techniques.
SuperPoint: Self-Supervised Interest Point Detection and Description
SuperPoint is a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems. It first pretrains an interest point detector on synthetic data and then learns the descriptors by generating image pairs with a known homography transformation.
The task of finding interest points consists of detection and description. Detection is the localization of an interest point in an image, and description is to describe each of the detected points with a vector. The overall goal is to find characteristic and stable visual features effectively and efficiently.
SuperPoint takes images of any dimensions as input and produces 2D point observations together with their corresponding D-dimensional descriptors. We used SuperPoint to obtain the image descriptors and then performed K-means clustering to obtain clusters of similar objects such as seals and artifacts. The silhouette score obtained using this technique is 0.258.
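The descriptor-clustering step can be sketched with scikit-learn as follows; the random vectors stand in for SuperPoint descriptors, and both the 256-dimensional descriptor size and the choice of 10 clusters are assumptions for illustration, not values taken from the report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in for per-image descriptors: 300 vectors of assumed dimension 256
features = rng.normal(size=(300, 256))

# Cluster similar objects (seals, artifacts, ...) and score the clustering
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(features)
score = silhouette_score(features, labels)  # in [-1, 1]; higher is better
```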
Resnet50
ResNet-50 is a convolutional neural network that is 50 layers deep. A pretrained version of the network, trained on more than a million images from the ImageNet database, can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals; as a result, it has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224. We used the ResNet-50 model to vectorize a subset of images and applied K-means clustering to the resulting vectors.
XCEPTION
Xception stands for "extreme Inception": it takes the principles of Inception to an extreme. In Inception, 1x1 convolutions were used to compress the original input, and different types of filters were then applied to each of the resulting depth spaces. Xception reverses this: it first applies the filters to each depth map and then compresses the input space using a 1x1 convolution applied across the depth. This method is almost identical to a depthwise separable convolution, an operation that has been used in neural network design as early as 2014.
We used the XCEPTION model to vectorize a subset of images and tried K-means clustering to find the result.
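The parameter savings of the depthwise separable factorization that Xception relies on can be checked with simple arithmetic; the layer sizes below (3x3 kernel, 128 input channels, 256 output channels) are illustrative choices, not dimensions from the project.

```python
# One 3x3 convolution layer with 128 input and 256 output channels (bias omitted)
k, c_in, c_out = 3, 128, 256

# Standard convolution: every output channel mixes all input channels spatially
standard = k * k * c_in * c_out

# Depthwise separable: a 3x3 depthwise filter per input channel,
# then a 1x1 pointwise convolution to mix channels
separable = k * k * c_in + 1 * 1 * c_in * c_out

print(standard, separable)  # 294912 33920
```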
All these descriptors and results were sent to the next group to test and experiment with the best algorithm to find the similarity between images.
Unsupervised Techniques
All the feature descriptors collected from the above process were tested with various unsupervised clustering algorithms to learn the pattern of the data. Some of the clustering algorithms that were tested are:
- K-means (cosine and Euclidean distance)
- Agglomerative clustering
- GMM
- Deep clusters
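The comparison above can be sketched with scikit-learn; random vectors stand in for the real descriptors, and cosine-distance K-means is approximated by L2-normalizing the vectors before Euclidean K-means, since scikit-learn has no native cosine K-means. Deep clustering is omitted here because it requires a trained network.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))  # stand-in for the image descriptors

# K-means with Euclidean distance
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
# Cosine-distance K-means, approximated via L2 normalization
km_cos = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(
    normalize(features))
# Agglomerative clustering and a Gaussian mixture model
agg = AgglomerativeClustering(n_clusters=5).fit_predict(features)
gmm = GaussianMixture(n_components=5, random_state=0).fit_predict(features)
```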
Results
Clusters
Confusion Matrix
Conclusion:
From our small yet effective experiment, it can be concluded that machine learning algorithms can be used to derive preliminary conclusions from primary datasets. Although the black-box nature of the feature descriptors remains a major obstacle to historians' acceptance of the proposed methodology, the results show that it is accurate and fast, and that it opens new frontiers of research in historical data analysis.
References:
Documents and Records:
- Corpus of Indus Seals and Inscriptions
- By Jagat Pati Joshi & Asko Parpola, Erja Lahdenperä, Virpi Hämeen-Anttila
- The Indus Script: Texts, Concordance & Tables
- By Iravatham Mahadevan
- Deep Learning the Indus Script
- By Satish Palaniappan and Ronojoy Adhikari
- https://arxiv.org/abs/1702.00523v1
- https://github.com/tpsatish95/indus-script-ocr