Communication through the web is becoming increasingly popular thanks to wireless and cellular networks. As its use spreads across different countries, significant complexities arise in the languages and means of communication used for extracting information from the web. This is particularly true in India, where more than fifteen officially recognized languages and scripts, and even more local dialect variations, exist. An example is Tamil Nadu, where Tamizh, the native language with its own variations such as the Chennai, Madurai and Coimbatore dialects, is combined effectively and easily with Telugu, Kannada and Malayalam from neighbouring states, and with English and Hindi from global and national perspectives. A web document here could therefore be in any one of these languages, or could mix words from different languages to avoid translation, such as the English word ‘computer’, which is used as-is for want of a Tamizh translation. Language advocates and communication engineers view this mixed usage from several angles. The complexity it introduces into web documents, however, makes conventional data mining approaches difficult to apply. The present study focuses on this problem, proceeding from variations in text to words and then to documents. Characters with similar usage, such as ‘a’ in English and its counterparts in Tamizh and Telugu, are taken and their pixel maps are compared for similarities and contrasts. The comparison is then extended to more complex characters such as **** in Telugu, a single character whose English equivalent ‘kO’ takes two letters, which makes representation difficult. At the word level the complexity increases further: ‘temple’ in English may appear as ‘****’ in Telugu script or as ‘mandiram’ written in English letters. Similarities between pixel maps are examined and their characteristics are expressed as matrices, so that content mining on letters or words extracted from a web document can be cast in a probabilistic form, with predictions based on correlations.
Typical histograms highlighting these aspects are presented, and an experiment on a document page dealing with magnetism is then used as a model for predicting content.
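The pixel-map comparison described above can be illustrated with a minimal sketch. It assumes each character is available as a small binary matrix (1 for ink, 0 for background) and uses the Pearson correlation coefficient as the similarity measure. The abstract mentions correlations but does not specify the measure, so this choice is an assumption, as are the 5x5 toy glyphs `GLYPH_A` and `GLYPH_B` and the function names, none of which come from the study.

```python
from math import sqrt

def flatten(pixelmap):
    """Turn a 2-D pixel map (list of rows) into a flat list of floats."""
    return [float(p) for row in pixelmap for p in row]

def correlation(map_a, map_b):
    """Pearson correlation between two pixel maps of identical shape.

    Returns a value in [-1, 1]; 0.0 is returned if either map is constant
    (the coefficient is undefined in that case).
    """
    a, b = flatten(map_a), flatten(map_b)
    if len(a) != len(b):
        raise ValueError("pixel maps must have the same number of pixels")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    numerator = sum(x * y for x, y in zip(dev_a, dev_b))
    denominator = sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    return numerator / denominator if denominator else 0.0

# Hypothetical 5x5 glyphs, for illustration only (not real font data).
GLYPH_A = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
GLYPH_B = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
]

print("A vs A:", correlation(GLYPH_A, GLYPH_A))
print("A vs B:", correlation(GLYPH_A, GLYPH_B))
```

A coefficient near 1 suggests two glyphs share most of their strokes, which is the kind of score a probabilistic prediction over extracted letters or words could build on. Real glyphs would first need rendering and scaling to a common grid, which this sketch leaves out.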