
Urdu Optical Character Recognition Using Neural Networks

Submitted by: Zaheer Ahmad (MS-IT Session 2006-08)

A thesis submitted in partial fulfillment of the requirement for the degree of Master of Science in Information Technology (MS-IT)

In

Institute of Management Sciences Peshawar, Pakistan January, 2009

Certificate of Originality

It is certified that the thesis titled Urdu Optical Character Recognition Using Neural Networks, submitted by the concerned student, meets the requirements of the MS-IT degree. All the work is solely the effort of the student, and the work of other authors is adequately acknowledged and listed as reference material.

1. Supervisor: OWAIS ADNAN
   Designation: Lecturer
   Signature: ________________________

2. External Examiner: ________________________

3. Research In charge: NAFEES-UR-REHMAN
   Signature: ________________________

Acknowledgements

I would like to thank my thesis supervisor, Mr. Awais Adnan. I gratefully acknowledge his warm encouragement and patient guidance through preparing this thesis.

I would also like to thank my other thesis committee members. In particular, I am grateful to Dr. Mohammad Ali, IT coordinator; Mr. Nafees ur Rehman, Research Incharge; Mr. Shahid Nawaz, lecturer; and Mr. Adnan Yousaf, lecturer, for their valuable feedback.

I am deeply indebted to Dr. Nasir Ali Khan, Director IMS who on our request specially arranged Artificial Intelligence classes for us in the Institute of Management Sciences.

Special thanks go to my colleagues in the Literacy For All (LFA) Project / Elementary Education Foundation (EEF), Peshawar, particularly Mr. Amin Khan Bangash, Planning Officer LFA, who always helped me to reach my classes in time.

Finally, I would like to thank my parents, who supported me with their encouragement and understanding, and my well-wishers, who prayed for my success from my admission to the writing of this thesis. "If I have seen further, it is by standing on the shoulders of giants" (Newton).

Zaheer Ahmad

Abbreviations
ANN       Artificial Neural Networks
ANSI      American National Standards Institute
BMP       Bitmap
ECMA      European Computer Manufacturers Association
FFNN      Feedforward Neural Network
Gb        Giga byte
GHZ       Giga Hertz
ISO       International Standards Organization
MATLAB    Matrix Laboratory
M-files   Matlab Files
NN        Neural Networks
NNT       Neural Networks Toolbox
OCR       Optical Character Recognition
PDF       Portable Document Format
RTF       Rich Text Format
TIFF      Tagged-Image File Format
UOCR      Urdu Optical Character Recognition

Figures

Figure 2.1   Character Set (58 alphabets) of Urdu Script
Figure 3.1   Neural Net block diagram
Figure 3.2   Perceptron Neural Network
Figure 3.3   Feedforward Neural Network
Figure 3.4   Back-propagation Neural Network
Figure 4.1   Character Segmentation and Recognition
Figure 4.2   Line of Urdu text (above), segmented character (below)
Figure 4.3   Training set of single and two classes
Figure 4.4   Character input to NN (above), character as a result from NN (below)
Figure 4.5   Character-wise Recognition Percentage
Figure 4.6   Lam or Alif of Islam

Abstract

Urdu Optical Character Recognition is a less developed area and a complex task: Urdu, which belongs to the family of Arabic scripts, is cursive and written from right to left, and its characters change shape and form depending on whether they are placed at the beginning, in the middle, or at the end of a word. In the proposed system, pixel strength is measured to detect the words in a sentence and the joins between characters in a compound (connected) word for segmentation. The segmented characters are then fed to a neural network for classification. A prototype of the system has been developed in Matlab and currently achieves 70% accuracy on average.

Contents
1 Introduction
1.1 Optical Character Recognition (OCR)
1.2 History of OCR
1.3 Common Steps of OCR
1.4 Pattern Recognition
1.5 Application of Pattern Recognition
1.6 Scope of the Work
1.7 Objectives of the Work
1.8 Thesis Organization
2 Urdu a Cursive Script
2.1 Introduction
2.2 Problems of Urdu Script
2.3 Other Problems in Urdu OCR
2.4 Characteristics of Urdu Characters
2.5 Urdu and Devanagari Script
3 Neural Networks
3.1 Introduction
3.2 Perceptron
3.3 Feedforward Neural Network
3.4 Back-propagation Algorithm
3.5 Advantages of Neural Computing
3.6 Limitations of Neural Computing
3.7 Matlab and Neural Network Toolbox
4 Urdu Character Recognition Using Neural Networks
4.1 Introduction
4.2 Urdu Character Recognition
4.3 Structure of UOCR
4.3.1 Segmentation
4.3.2 Recognition using Neural Network
4.4 Simulation Results
4.5 Discussions
5 References
6 Appendix A
7 Appendix B

Chapter 1

Introduction

1.1 Optical Character Recognition

Optical Character Recognition (OCR) is the mechanical or electronic translation (reading) of images of handwritten, typewritten or printed text, usually captured by a scanner, into machine-editable text. OCR is a field of research in pattern recognition, artificial intelligence and machine vision. An OCR system enables you to take a book or a magazine article, feed it directly into an electronic computer file, and then edit the file using a word processor. All OCR systems include an optical scanner for reading text and sophisticated software for analyzing images. Most OCR systems use a combination of hardware (specialized circuit boards) and software to recognize characters, although some inexpensive systems do it entirely through software. Advanced Roman OCR systems can read text in a large variety of fonts, but they still have difficulty with handwritten text.

1.2 History of Optical Character Recognition (OCR)

To understand the phenomena described in the above section, we have to look at the history of OCR [3, 4, 6], its development, recognition methods, computer technologies, and the differences between humans and machines [1, 2, 5, 7, 8]. It is always fascinating to find ways of enabling a computer to mimic human functions, such as the ability to read, to write, to see things, and so on. OCR research and development can be traced back to the early 1950s, when scientists tried to capture the images of characters and texts, first by mechanical and optical means using rotating disks and photomultipliers, then with flying-spot scanners with a cathode ray tube lens, followed by photocells and arrays of them. At first, scanning was slow, and only one line of characters could be digitized at a time by moving the scanner or the paper medium. Subsequently, drum and flatbed scanners were invented, which extended scanning to the full page. Then, advances in digital integrated circuits brought photoarrays with higher density, faster transports for documents, and higher speed in scanning and digital conversion. These important improvements greatly accelerated character recognition, reduced its cost, and opened up the possibility of processing a great variety of forms and documents.

Throughout the 1960s and 1970s, new OCR applications sprang up in retail businesses, banks, hospitals and post offices; in insurance, railroad and aircraft companies; and among newspaper publishers and many other industries [3, 4]. In parallel with these advances in hardware, intensive research on character recognition was taking place in the research laboratories of both the academic and industrial sectors [6, 7]. Because neither the recognition techniques nor the computers were very powerful in the early days (1960s), OCR machines tended to make many errors when the print quality was poor, whether because of wide variations in type fonts, the roughness of the paper surface, or the cotton ribbons of typewriters [5]. To make OCR work efficiently and economically, OCR manufacturers and suppliers pushed strongly toward the standardization of print fonts, paper, and ink quality for OCR applications. New fonts such as OCR-A and OCR-B were designed in the 1970s by the American National Standards Institute (ANSI) and the European Computer Manufacturers Association (ECMA), respectively. These special fonts were quickly adopted by the International Standards Organization (ISO) to facilitate the recognition process [3, 4, 6, 7].

As a result, very high recognition rates became achievable at high speed and at reasonable cost. Such accomplishments also brought better printing quality of data and paper for practical applications. In fact, they completely revolutionized the data input industry [6] and eliminated the jobs of thousands of keypunch operators who were doing the mundane work of keying data into the computer.


1.3 Common Steps of OCR Processing

The process of converting documents into electronic form, usually referred to as digitization, is undertaken in different steps. The process of scanning a document and representing the scanned image for further processing is called the pre-processing or imaging phase. The process of manipulating the scanned image of a document to produce searchable text is called the OCR processing stage.

1.3.1 The Imaging Stage

The imaging process involves scanning the document and storing it as an image. The most popular image format used for this purpose is the Tagged-Image File Format (TIFF). The resolution (number of dots per inch, dpi) determines the accuracy rate of the OCR process.

1.3.2 The OCR Process

The major steps of the OCR processing stage are described below.

1.3.3 Distinguishing between Text and Images (Segmentation)

In this step, the text and image blocks of the scanned image are identified. The boundaries of each block are analyzed in order to recognize the text.

1.3.4 Character Recognition (Feature Extraction)

This step involves recognizing a character using a method known as feature extraction. OCR tools store rules about the characters of a given script through what is known as the learning process. A character is then identified by analyzing its shape and comparing its features against the set of rules, stored in the OCR engine, that distinguishes each character.

1.3.5 Recognition of Words

Following the character recognition process, word identification is performed by comparing the string of characters against an existing dictionary of words. Additional processes such as spell-checking are performed in this step.

1.3.6 Correction of Unrecognized Characters (Error Correction)

In this step, the user is allowed to provide corrections for unrecognized characters.

1.3.7 Output Formatting

The final step involves storing the output in one of the industry-standard formats such as RTF, PDF, Word, or plain Unicode text.
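The dictionary comparison of the word recognition step (1.3.5) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of any particular OCR product: the word list, the `correct_word` helper and the 0.6 similarity cutoff are hypothetical, and Python's standard `difflib` module stands in for a real spell-checker.

```python
import difflib

# Hypothetical mini-dictionary; a real system would load a full word list.
DICTIONARY = ["character", "recognition", "optical", "network", "neural"]

def correct_word(recognized, dictionary=DICTIONARY, cutoff=0.6):
    """Return the closest dictionary word, or the input unchanged if none is close."""
    if recognized in dictionary:
        return recognized
    # get_close_matches ranks candidates by string similarity (0.0 to 1.0).
    matches = difflib.get_close_matches(recognized, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else recognized

# Strings as a character recognizer might emit them, with single-character errors.
recognized_words = ["optica1", "charactcr", "recognitiom", "xyz"]
corrected = [correct_word(w) for w in recognized_words]
print(corrected)
```

Words with no sufficiently similar dictionary entry are passed through unchanged, so they can be handed to the user-correction step (1.3.6).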

1.4 Pattern Recognition

Pattern recognition (also known as classification or pattern classification) is a field within the area of artificial intelligence and can be defined as "the act of taking in raw data and taking an action based on the category of the data". It uses methods from statistics, machine learning and other areas.

Typical applications of pattern recognition are:

- Automatic speech recognition.
- Classification of text into several categories (e.g. spam/non-spam email messages).
- The automatic recognition of handwritten postal codes on postal envelopes.
- The automatic recognition of images of human faces.

The last two examples belong to image analysis, the subtopic of pattern recognition that deals with digital images as input to pattern recognition systems. Some popular techniques for pattern recognition include:

- Neural Networks
- Hidden Markov Models
- Bayesian Networks

The application domains of pattern recognition include:

- Computer Vision
- Machine Vision
- Medical Image Analysis
- Optical Character Recognition
- Credit Scoring

1.5 Applications of the Pattern Recognition Technology

Pattern recognition has many practical applications. Some of them are outlined below.

- As a telecommunication aid for the deaf, in airline reservation, in postal departments for postal address reading (both handwritten and printed postal codes and addresses), and in medical diagnosis.
- In customer billing, as in telephone exchange billing systems; in order data logging; in automatic fingerprint identification; and as an automatic inspection system.
- In automated cartography, metallurgical industries, computer-assisted forensic linguistic systems, electronic mail, information units and libraries, and facsimile.
- For direct processing of documents: as a multipurpose document reader for large-scale data processing, as a microfilm reader data input system, for high-speed data entry, for changing text and graphics into a computer-readable form, and as an electronic page reader to handle large volumes of mail.


1.6 Scope of this Work

The project is designed to classify and recognize a scanned image containing Urdu (Arabic, Persian) characters using a two-step approach. In the first step, the Urdu text image is segmented into words and characters; in the second step, these characters are classified and recognized using a neural network. Throughout the work it is assumed that the image contains no noise and is perfectly scanned, with no deviation from its original angle, that is, no skew. In fact, we typed Urdu words in MS Word and took an image of them to test the application.
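The first step, segmentation, can be illustrated with a column projection profile. This is a minimal sketch under stated assumptions and not necessarily the procedure used in the thesis prototype: the image is taken to be a noise-free, unskewed binary matrix (1 = ink, 0 = background), and the helper names `column_profile` and `split_on_gaps` and the `min_gap` threshold are hypothetical.

```python
def column_profile(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(row[c] for row in image) for c in range(len(image[0]))]

def split_on_gaps(profile, min_gap=2):
    """Return (start, end) column ranges (end exclusive) separated by
    at least `min_gap` consecutive empty columns."""
    segments, start, gap = [], None, 0
    for c, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = c          # a new segment begins at this column
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # gap is wide enough: close the segment
                segments.append((start, c - gap + 1))
                start, gap = None, 0
    if start is not None:          # segment still open at the right edge
        segments.append((start, len(profile) - gap))
    return segments

# Tiny hypothetical "text line" holding two blobs of ink.
image = [
    [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
]
profile = column_profile(image)
segments = split_on_gaps(profile)
# Urdu is read right to left, so the rightmost segment comes first.
reading_order = segments[::-1]
print(profile, reading_order)
```

Each returned range can then be cropped from the image and passed to the classifier in the second step.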

1.7 Objectives and Applications of this Work

Urdu Optical Character Recognition can open a new way of realizing the dream of a natural mode of communication between man and machine in this part of the world. It will expand and multiply already available knowledge to new horizons. Centuries-old rare scripts in Arabic, Urdu and Persian will become available to the common man. The ultimate goal of character recognition is to simulate human reading capabilities. Character recognition systems can contribute tremendously to the advancement of automation and can improve the interaction between man and machine in many applications, including office automation, check verification and a large variety of banking, business and data entry applications, library archives, document identification, e-book production, invoice and shipping receipt processing, subscription collection, questionnaire processing, exam paper processing and many other applications [9], besides online address and signboard reading.


1.8 Thesis Organization

The remaining part of this thesis is divided into three chapters. Chapter 2 describes Urdu as an Arabic script, its peculiarities and its problems. Chapter 3 briefly explains neural networks and Matlab, and Chapter 4 covers the development of Urdu character recognition.


Chapter 2

Urdu a Cursive Script

2.1 Introduction

Urdu is the national language of Pakistan and one of the popular languages of the Indian subcontinent. It evolved in the subcontinent from a mixture of the Arabic, Turkish, Farsi and Hindi languages, and has a set of 58 characters defined by the National Language Authority Pakistan [10, 26], as shown in Figure 2.1. However, only 40 basic characters and the do-chashmi-hey are used to form all composite alphabets; therefore the working set consists of 41 alphabets.

Fig-2.1 Character Set (58 alphabets) of Urdu Script.

The Urdu alphabet is the right-to-left alphabet used for the Urdu language. It is a modification of the Persian alphabet, which is itself a derivative of the Arabic alphabet. Urdu is typically written in the calligraphic Nasta'liq script, whereas Arabic is more commonly written in the Naskh style. Usually, bare transliterations of Urdu into Roman letters omit many phonemic elements that have no equivalent in English or other languages commonly written in the Roman alphabet. The National Language Authority of Pakistan has developed a number of systems with specific notations to signify non-English sounds, but these can only be properly read by someone already familiar with Urdu, Persian, or Arabic for some letters and with Hindi for others [11].

Urdu shares a common script and many characteristics with Arabic, with an additional set of alphabets. Most Urdu characters, when combined, form an angle of about 45 degrees to the horizontal line. Because of this, Urdu script can be read faster than Roman script, but it is also harder for novice readers and for machines to recognize a word or to segment one character from the rest. Unlike English script, Urdu has no capital or small characters. However, the last character of a word can be considered a capital character, because in many cases it presents the full form of the character, while the characters at initial and middle positions can be considered small. Every character has a standalone shape besides its different joining forms, but some alphabets, such as the characters making up the word Urdu (اردو) and others of the same category, are not joinable. The Urdu alphabet uses consonant letters, vowels, diacritic marks, numerals, punctuation and a few superscript signs. The graphical representation of each alphabet has more than one form, depending on its position and context in the word. In general each letter has four forms, namely beginning, middle, final and standalone, as shown in Table-2.1.

Table-2.1 Characters and their different forms

[Forms columns: glyphs not recoverable]

Char #   Name
0        hamzah
1        alif
1a       alif madd
2        bē
2h       bhē
3        pē
3h       phē
4        tē
4h       thē
5        ṭē
5h       ṭhē
6        s̱ē
7        jīm
7h       jhē
8        čē = cē
8h       čhē = chē
9        baṛī Ḥē
10       xē = khē
11       dāl
11h      dhē
12       ḍāl
12h      ḍhē
13       ẕāl
14       rē
14h      rhē
15       ṛē
15h      ṛhē
16       zē
17       žē = zhē
18       sīn
19       šīn = shīn
20       Ṣād, Ṣuād
21       Ẓād, Ẓuād
22       T̤ōē
23       Z̤ōē
24       ʿain
25       ġain
26       fē
27       qāf
28       kāf
28h      khē
29       gāf
29h      ghē
30       lām
30h      lām
31       mīm
31h      mīm
32       nūn
32h      nūn
32a      nūn-e ġunnah
32ah     nūn-e ġunnah
33       vāō
33h      vāō
34       čhōṭī hē
34b      dō-čašmī hē
35       čhōṭī yē
35       čhōṭī yē
35b      baṛī yē
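The positional forms of a letter can also be inspected programmatically, because Unicode encodes them as separate presentation-form code points. The sketch below is an illustration only and is not part of the thesis prototype: the helper name `contextual_forms` is hypothetical, and it relies on Python's standard `unicodedata` module, whose compatibility decompositions tag each glyph as isolated, final, initial or medial.

```python
import unicodedata

POSITION_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")

def contextual_forms(letter):
    """Map each positional form name to the presentation glyph of `letter`."""
    forms = {}
    # Arabic Presentation Forms-A and Forms-B hold the positional glyphs.
    for cp in list(range(0xFB50, 0xFE00)) + list(range(0xFE70, 0xFF00)):
        glyph = chr(cp)
        decomp = unicodedata.decomposition(glyph).split()
        # A positional form of a single letter decomposes as e.g. "<initial> 0628".
        if len(decomp) == 2 and decomp[0] in POSITION_TAGS:
            if chr(int(decomp[1], 16)) == letter:
                forms[decomp[0].strip("<>")] = glyph
    return forms

beh = contextual_forms("\u0628")    # a letter that joins on both sides
alif = contextual_forms("\u0627")   # a letter that joins only to the preceding letter
for name, glyph in beh.items():
    print(name, "U+%04X" % ord(glyph), unicodedata.name(glyph))
print(sorted(alif))
```

A letter such as alif, which does not connect to the following letter, has only isolated and final forms, which is why the number of classes a recognizer must handle is larger than the number of letters but smaller than four times that number.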


2.2 Problems of Urdu Script

Despite its large character set, Urdu/Arabic has only a small set of characters that are easily distinguishable from one another. The remaining characters differ from these only by dots or symbols placed above or below the basic shapes [12]. Table 2.2 shows the groups of similar characters and their derived forms.


Table-2.2 Groups of Similar Characters

[Table columns: S.#, Characters (Standalone Forms), Groups; glyphs not recoverable]

As shown above, only 21 different groups exist in the 41-character set, which complicates the recognition phase for Urdu characters. Further study of the other forms (initial, middle and final) of these characters reveals that ein is similar to hamza, wow may be confused with other characters, ze resembles noon, dhal is a close match to tay, and mem can be confused with the middle form of ein and with the standalone gol he. A key difference between Latin scripts and Arabic script is that many letters differ only by one or more dots, while the primary stroke is exactly the same [13].

2.3 Other Problems in Urdu OCR

The problems listed below are found in Urdu characters, but they are to some extent common to all languages that use the Arabic script, such as Persian, Pashtu, Punjabi, Sindhi, Balochi, Hindko and Saraiky. In addition, Muslims, who make up a large share of the people on earth, can read Arabic because it is the language of Al-Quran, the holy book of Muslims. Even so, Arabic script recognition has not received enough interest from researchers. Little research progress has been achieved compared to that made on Latin and Chinese scripts. The solutions available in the market are still far from perfect [5, 6]. Several reasons have led to this result:

- There is no financial support from the government.
- No text databases or dictionaries are available, except the one under preparation by the Urdu Language Authority, whose website shows slow progress so far.
- No standard keyboard exists. The National Language Authority of Pakistan has devised a keyboard in which the most used characters are placed under the main fingers, but it is very different from the one already in use (the phonetic keyboard of Inpage). Moreover, it is still to be adopted by software vendors; even Windows Vista uses its own version of the Urdu keyboard.
- The research carried out on the Urdu language is mostly scattered and comes from outside the Urdu/Arab world.
- No specialized conferences or symposia have been conducted so far.
- Algorithms developed for other language scripts are not applicable to Urdu.

2.4 Characteristics of Urdu Characters

This section provides a comprehensive list of the characteristics of Urdu/Arabic characters, with figures to illustrate the concepts. We aim to provide a source for researchers to start with. These characteristics are presented from the character recognition point of view [17-25].

1. Urdu is written from right to left in both printed and handwritten forms.
2. No upper or lower cases exist in Urdu, but the last character of a word is sometimes considered upper case because it always remains in its full form.
3. The shape of a character varies according to its position in the word.
4. Each character has either two or four different forms. Of course, this increases the number of classes to be recognized. In our experiments we have used 54 different classes for 41 different Urdu characters.
5. Urdu is always written cursively, and words are separated by spaces. However, six characters can be connected only from the right.
6. Urdu characters are normally connected on an imaginary line called the baseline, and each alphabet in a character has a fixed size depending on the pen (Qalam) used, which is called the khat.
7. A character at a different location changes not only its shape but also its size.
8. Some Urdu characters have dots associated with them; the dots can be above or below the character.
9. Some characters contain a closed loop (refer to Table 1). The loop is an important feature for describing a character, and one character contains two loops. In handwriting, the open portion of some characters is sometimes closed to form a triangle (see Figure 4), and the loop of some characters sometimes becomes so small that the internal opening disappears (Figure 4).
10. Hamza, with its zigzag shape, is not really a letter, but it can cause difficulty in the segmentation process because it resembles the character ein.
11. Only three characters represent vowels. However, there are other, shorter vowels represented by diacritics in the form of overscores or underscores, although overscores and underscores are used less in Urdu than in Arabic.
12. Dots may appear as two separated dots, touching dots, a hat, or a stroke.
13. Another style of Urdu handwriting is artistic or decorative calligraphy, which is usually full of overlapping and makes the recognition process difficult even for human beings, let alone computers.

2.5 Urdu and Devanagari Script

Although Urdu and Hindi speakers can understand each other, their scripts are different: Hindi is written in Devanagari script and Urdu in Arabic script. Hindi is almost entirely phonetic. Table 2.3 shows the Devanagari set of alphabets with respect to the Urdu character set. In Devanagari, each letter in the main grouping has a place in a grid, with vertical columns identified by the nature of the sound and horizontal rows identified by the place in the mouth (moving from back to front) where the sound is made. This is a more lucid, logical, and linguistically sound way of arranging a character set. In Devanagari, letters have no special names; rather, they are defined by their places in the grid.

By contrast, the Urdu script is arranged by shape, in a way that is basically not phonetically useful, whereas the English alphabet is arranged by no criterion whatsoever [14, 15].


Table 2.3 Urdu alphabet with their equivalent Devanagari alphabet

#     Name              Unicode           Nagari equiv.
0     hamzah            U+0621, U+0654    -
1     alif              U+0627            -
1a    alif madd         U+0622            U+0906
2     bē                U+0628            U+092C
2h    bhē               -                 U+092D
3     pē                U+067E            U+092A
3h    phē               -                 U+092B
4     tē                U+062A            U+0924
4h    thē               -                 U+0925
5     ṭē                U+0679            U+091F
5a    ṭē                U+067F            U+091F
5h    ṭhē               -                 U+0920
6     s̱ē                U+062B            U+0938
7     jīm               U+062C            U+091C
7h    jhē               -                 U+091D
8     čē = cē           U+0686            U+091A
8h    čhē = chē         -                 U+091B
9     baṛī Ḥē           U+062D            U+0939
10    xē = khē          U+062E            U+0959
11    dāl               U+062F            U+0926
11h   dhē               -                 U+0927
12    ḍāl               U+0688            U+0921
12a   ḍāl               U+0690            U+0921
12h   ḍhē               -                 U+0922
13    ẕāl               U+0630            U+095B
14    rē                U+0631            U+0930
15    ṛē                U+0691            U+095C
15a   ṛē                U+0699            U+095C
15h   ṛhē               -                 U+095D
16    zē                U+0632            U+095B
17    žē = zhē          U+0698            U+091D U+093C
18    sīn               U+0633            U+0938
19    šīn = shīn        U+0634            U+0936
20    Ṣwād              U+0635            U+0938
21    Ẓwād              U+0636            U+095B
22    T̤ōē               U+0637            U+0924
23    Z̤ōē               U+0638            U+095B
24    ʿain              U+0639            -
25    ġain              U+063A            U+095A
26    fē                U+0641            U+095E
27    qāf               U+0642            U+0958
28    kāf               U+06A9            U+0915
28h   khē               -                 U+0916
29    gāf               U+06AF            U+0917
29h   ghē               -                 U+0918
30    lām               U+0644            U+0932
30a   lām alif          -                 -
31    mīm               U+0645            U+092E
32    nūn               U+0646            U+0928
32a   nūn-e ġunnah      U+06BA            U+0901
33    vāō               U+0648            U+0935
33a   vāv-e mahmūz      U+0624            U+0913
34    čhōṭī hē          U+06C1            U+0939
34a   čhōṭī hē          U+0647            U+0939
34b   dō-čašmī hē       U+06BE
35    čhōṭī yē          U+06CC
35a   hamzah            U+0626
35b   baṛī yē           U+06D2

Nagari equivalents listed for the last group of letters (34b to 35b): U+090F, U+092F.


Note that the Urdu alphabet defined in Table 2.3 is not compatible with the character set defined by the National Language Authority Pakistan. The letters numbered 5a, 12a and 15a are archaic glyph variants of letters 5, 12 and 15 and are not used in working Urdu. Letter 1a denotes initial or medial ā. Letter 32a denotes nasalization. Letter 34a is a glyph variant of letter 34. Letter 34b forms eleven digraphs denoting aspirates. Letter 35b denotes final ē or final ai.

2.6 Summary

Urdu is written in Arabic script with an additional set of alphabets. It inherits all the complexities of Arabic script, including its cursive nature, its right-to-left direction of writing, the change of form and shape when a character is placed at different locations in a word, loops, half-closed characters, and dots above or below a character. The National Language Authority has defined a set of 58 characters, but there are 41 working characters besides numerals and diacritics.


Chapter 3

NEURAL NETWORKS
3.1 Introduction

Neural networks are composed of simple elements operating in parallel. These elements are inspired by biological nervous systems. As in nature, the network function is determined largely by the connections between elements. We can train a neural network to perform a particular function by adjusting the values of the connections (weights) between elements. Commonly neural networks are adjusted, or trained, so that a particular input leads to a specific target output. Such a situation is shown in figure - 3.1. There, the network is adjusted, based on a comparison of the output and the target, until the network output matches the target. Typically many such input/target pairs are used, in this supervised learning.

Figure 3.1: Neural Net Block Diagram

Neural networks have been trained to perform complex functions in various fields of application, including pattern recognition, identification, classification, speech, vision and control systems. Today neural networks can be trained to solve problems that are difficult for conventional computers or human beings. Supervised training methods are commonly used, but other networks can be obtained from unsupervised training techniques or from direct design methods. Unsupervised networks can be used, for instance, to identify groups of data. Certain kinds of linear networks and Hopfield networks are designed directly. In summary, there is a variety of design and learning techniques that enrich the choices a user can make. The field of neural networks has a history of some six decades but has found solid application only in the past fifteen years, and the field is still developing rapidly. Thus, it is distinctly different from the fields of control systems or optimization, where the terminology, basic mathematics, and design procedures have been firmly established and applied for many years.

3.2 Perceptron

The perceptron neural network consists of a single layer of S perceptron neurons connected to R inputs through a set of weights $w_{i,j}$, as shown in two forms in Figure 3.2. As before, the network indices i and j indicate that $w_{i,j}$ is the strength of the connection from the jth input to the ith neuron.

Figure 3.2 Perceptron Neural Network


The perceptron learning rule is capable of training only a single layer. Thus only one-layer networks are considered here. This restriction places limitations on the computation a perceptron can perform.
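The perceptron learning rule for a single neuron can be sketched as below. This is a minimal illustration under stated assumptions rather than the toolbox implementation: it uses a hard-limit transfer function, starts from zero weights and bias, and learns the logical AND of two inputs, a small linearly separable problem chosen here only as an example. The function names are hypothetical.

```python
def hardlim(n):
    """Hard-limit transfer function: 1 if n >= 0, else 0."""
    return 1 if n >= 0 else 0

def predict(w, b, x):
    """Output of one perceptron neuron with weights w and bias b."""
    return hardlim(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(samples, epochs=20):
    """Perceptron rule: w <- w + e * x and b <- b + e, with e = target - output."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in samples:
            e = t - predict(w, b, x)
            if e != 0:
                w = [wi + e * xi for wi, xi in zip(w, x)]
                b += e
                mistakes += 1
        if mistakes == 0:          # every sample classified correctly: stop
            break
    return w, b

# Logical AND: a linearly separable problem a single layer can represent.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_samples)
outputs = [predict(w, b, x) for x, _ in and_samples]
print(w, b, outputs)
```

A problem that is not linearly separable, such as exclusive OR, cannot be solved by this single layer however long it is trained, which is the limitation mentioned above.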

3.3 Feedforward Neural Network

Feedforward networks often have one or more hidden layers of sigmoid neurons followed by an output layer of linear neurons. Multiple layers of neurons with nonlinear transfer functions allow the network to learn nonlinear and linear relationships between input and output vectors. The linear output layer lets the network produce values outside the range -1 to +1. On the other hand, if you want to constrain the outputs of a network (for example, between 0 and 1), then the output layer should use a sigmoid transfer function (such as logsig). As noted in Neuron Model and Network Architectures, for multiple-layer networks the number of layers determines the superscript on the weight matrices. The appropriate notation is used in the two-layer tansig/purelin network shown next.

Figure 3.3 Feedforward Neural Network


This network can be used as a general function approximator. It can approximate any function with a finite number of discontinuities arbitrarily well, given sufficient neurons in the hidden layer.
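The forward pass of such a two-layer tansig/purelin network can be sketched as follows. The weights, biases and the `forward` helper are hypothetical values chosen for illustration, not taken from the thesis; `tansig` is written with `math.tanh`, to which the hyperbolic tangent sigmoid is mathematically equivalent.

```python
import math

def tansig(n):
    """Hyperbolic tangent sigmoid; output lies between -1 and +1."""
    return math.tanh(n)

def forward(x, W1, b1, W2, b2):
    """Two-layer network: tansig hidden layer followed by a linear (purelin) output layer."""
    hidden = [tansig(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return hidden, output

# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output neuron.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
W2 = [[2.0, 3.0]]
b2 = [0.5]

hidden, output = forward([1.0, 1.0], W1, b1, W2, b2)
print(hidden, output)
```

The hidden activations stay within -1 and +1, while the linear output layer is free to leave that range, as it does for this input.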

3.4 Back-propagation Algorithm

The back-propagation (BP) algorithm is the most popular method for training neural networks, and it has been used to solve numerous real-life problems. BP trains a multilayer feedforward neural network by iteratively minimizing a cost function, adjusting the connection weights according to the error between the computed and the desired output values [53]. Figure 3.4 shows a general three-layer network. The following relationships hold for the derivation of back-propagation.

The cost function (error function) is defined as the mean square sum of differences between the output values of the network and the desired target values. The following formula is used for this error:

\[ E = \frac{1}{2} \sum_{p} \sum_{k} \left( t_{pk} - o_{pk} \right)^2 \]

where p is the subscript representing the pattern and k represents the output units. In this way, $t_{pk}$ is the target value of output unit k for pattern p, and $o_{pk}$ is the actual output value of output unit k for pattern p. This error function is the most commonly used; however, other types of error functions can also be applied.

Figure 3.4 Back-Propagation Neural Network

A general three-layer back-propagation network. When $w_{jk}$ changes it affects only the error on one output unit, k. When $w_{ij}$ changes it affects the error on all the output units.

During the training process a set of pattern examples is used, each example consisting of a pair with the input and the corresponding target output. The patterns are presented to the network sequentially, in an iterative manner, and the appropriate weight corrections are performed during the process to adapt the network to the desired behavior. The iterative procedure continues until the connection weight values allow the network to perform the required mapping. Each presentation of the whole pattern set is called an epoch. The minimization of the error function is carried out using a gradient-descent technique. The necessary corrections to the weights of the network for each iteration n are obtained by calculating the partial derivative of the error function with respect to each weight $w_{jk}$. A gradient vector representing the direction of steepest increase in the weight space is thus obtained. Because a minimization is required, the weight update $\Delta w_{jk}$ uses the negative of the corresponding gradient vector component for that weight, which gives the direction of steepest descent. The delta rule determines the amount of weight update based on this gradient direction along with a step size:

\[ \Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} \]

The parameter η represents the step size and is called the learning rate. The partial derivative is evaluated with the chain rule, and the error signal δ_k of output unit k is normally defined from it, so that the delta rule can be written in terms of δ_k.
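For a single pattern these three steps can be sketched as follows. The activation function f, its derivative f' and the net input net_k of output unit k are notation assumed here for illustration; o_j is the output of hidden unit j.

```latex
\begin{aligned}
\frac{\partial E}{\partial w_{jk}} &= -\left( t_k - o_k \right) f'(net_k)\, o_j, \\
\delta_k &= \left( t_k - o_k \right) f'(net_k), \\
\Delta w_{jk} &= \eta \, \delta_k \, o_j .
\end{aligned}
```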


For a hidden neuron, the weight change of w_ij is obtained in a similar way. A change to the weight w_ij changes o_j, and this changes the inputs into each unit k in the output layer. The change in E with a change in w_ij is therefore the sum of the changes to each of the output units. Applying the chain rule to this sum and defining a hidden-layer error signal δ_j gives the weight change in the hidden layer.
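A sketch of these hidden-layer expressions is given below. It assumes an activation function f with derivative f', net inputs net_j and net_k, the output error signal δ_k = (t_k - o_k) f'(net_k), and o_i as the output of input unit i; none of these symbols is defined explicitly in the text.

```latex
\begin{aligned}
\frac{\partial E}{\partial w_{ij}}
  &= \sum_k \frac{\partial E}{\partial net_k}\,
     \frac{\partial net_k}{\partial o_j}\,
     \frac{\partial o_j}{\partial net_j}\,
     \frac{\partial net_j}{\partial w_{ij}}
   = -\Big( \sum_k \delta_k\, w_{jk} \Big) f'(net_j)\, o_i, \\
\delta_j &= f'(net_j) \sum_k \delta_k\, w_{jk}, \\
\Delta w_{ij} &= \eta \, \delta_j \, o_i .
\end{aligned}
```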

The δ_k for the output units can be calculated using directly available values, since the error measure is based on the difference between the desired (t_k) and the actual (o_k) values. However, that measure is not available for the hidden neurons. The solution is to back-propagate the δ_k values layer by layer through the network, so that finally the weights are updated.
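The update described above can be illustrated with a small pure-Python sketch. It is not the thesis implementation: it assumes a logistic activation (so that f'(net) = o(1 - o)), omits bias terms, and uses hypothetical names (`forward`, `backprop_step`, `make_weights`).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_ij, w_jk):
    """Return hidden outputs o_j and output-layer outputs o_k for input x."""
    o_j = [sigmoid(sum(w_ij[i][j] * x[i] for i in range(len(x))))
           for j in range(len(w_ij[0]))]
    o_k = [sigmoid(sum(w_jk[j][k] * o_j[j] for j in range(len(o_j))))
           for k in range(len(w_jk[0]))]
    return o_j, o_k


def pattern_error(t, o_k):
    """Squared error of one pattern, with the conventional factor 1/2."""
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o_k))


def backprop_step(x, t, w_ij, w_jk, eta=0.5):
    """Apply one delta-rule update in place for a single pattern (x, t)."""
    o_j, o_k = forward(x, w_ij, w_jk)
    # Output error signals: delta_k = (t_k - o_k) * f'(net_k).
    delta_k = [(t[k] - o_k[k]) * o_k[k] * (1.0 - o_k[k])
               for k in range(len(o_k))]
    # Hidden error signals, back-propagated through the current w_jk.
    delta_j = [o_j[j] * (1.0 - o_j[j]) *
               sum(delta_k[k] * w_jk[j][k] for k in range(len(o_k)))
               for j in range(len(o_j))]
    for j in range(len(o_j)):
        for k in range(len(o_k)):
            w_jk[j][k] += eta * delta_k[k] * o_j[j]
    for i in range(len(x)):
        for j in range(len(o_j)):
            w_ij[i][j] += eta * delta_j[j] * x[i]


def make_weights(n_from, n_to, rng):
    """Small random initial weights, indexed as w[from_unit][to_unit]."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_to)]
            for _ in range(n_from)]
```

Calling `backprop_step` repeatedly on the same pattern should lower `pattern_error`; the hidden error signals are computed before any weight is changed, which matches the layer-by-layer back-propagation of the δ_k values.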


A momentum term was introduced in the BP algorithm by Rumelhart. The idea consists in incorporating in the present weight update some influence of the past iterations. The delta rule then gains an additional term proportional to the previous weight update.
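Writing α for the momentum parameter and n for the iteration number (symbols assumed here), and δ_k and o_j for the output error signal and the hidden-unit output, a common form of this rule is:

```latex
\Delta w_{jk}(n) = \eta \, \delta_k \, o_j + \alpha \, \Delta w_{jk}(n-1)
```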

The momentum parameter α determines the amount of influence from the previous iteration on the present one. The momentum introduces a damping effect on the search procedure, avoiding oscillations in irregular areas of the error surface by averaging gradient components of opposite sign, and accelerating convergence in long flat areas. In some situations it may keep the search procedure from being trapped in a local minimum, helping it to skip over such regions without performing any minimization there. Momentum may be considered an approximation to a second-order method, as it uses information from the previous iterations. In some applications, it has been shown to improve the convergence of the BP algorithm.
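The acceleration in long flat areas can be seen on a toy error surface. The sketch below is only an illustration: the surface E(w) = 0.5 (w0^2 + 50 w1^2), the function names and the parameter values are all hypothetical and are not taken from the thesis.

```python
def descend(grad, w0, eta, alpha, steps):
    """Gradient descent with momentum: dw(n) = -eta * grad(w) + alpha * dw(n-1)."""
    w = list(w0)
    dw_prev = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        dw = [-eta * gi + alpha * di for gi, di in zip(g, dw_prev)]
        w = [wi + di for wi, di in zip(w, dw)]
        dw_prev = dw
    return w


def valley_grad(w):
    """Gradient of E(w) = 0.5 * (w[0]**2 + 50 * w[1]**2): flat in w[0], steep in w[1]."""
    return [w[0], 50.0 * w[1]]


plain = descend(valley_grad, [1.0, 1.0], eta=0.03, alpha=0.0, steps=100)
with_momentum = descend(valley_grad, [1.0, 1.0], eta=0.03, alpha=0.8, steps=100)
```

With these settings the run with α = 0.8 ends much closer to the minimum along the flat w[0] direction than the run with α = 0, which is the effect the paragraph describes.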

3.5 Advantages of Neural Computing

There are a variety of benefits that an analyst realizes from using neural networks in their work. Pattern recognition is a powerful technique for harnessing the information in the data and generalizing about it. Neural nets learn to recognize the patterns which exist in the data set.

The system is developed through learning rather than programming. Programming is much more time consuming for the analyst and requires the analyst to specify the exact behavior of the model. Neural nets teach themselves the patterns in the data, freeing the analyst for more interesting work.

Neural networks are flexible in a changing environment. Rule-based systems or programmed systems are limited to the situation for which they were designed; when conditions change, they are no longer valid. Although neural networks may take some time to learn a sudden drastic change, they are excellent at adapting to constantly changing information.

Neural networks can build informative models where more conventional approaches fail. Because neural networks can handle very complex interactions, they can easily model data which is too difficult to model with traditional approaches such as inferential statistics or programming logic. Performance of neural networks is at least as good as classical statistical modeling, and better on most problems. Neural networks build models that are more reflective of the structure of the data in significantly less time.

Neural networks now operate well with modest computer hardware. Although neural networks are computationally intensive, the routines have been optimized to the point that they can now run in reasonable time on personal computers. They do not require supercomputers as they did in the early days of neural network research.

3.6 Limitations of Neural Computing

There are some limitations to neural computing. The key limitation is the neural network's inability to explain the model it has built in a useful way. Analysts often want to know why the model is behaving as it is. Neural networks get better answers but they have a hard time explaining how they got there. It is difficult to extract rules from neural networks. This is sometimes important to people who have to explain their answer to others and to people who have been involved with artificial intelligence, particularly expert systems, which are rule-based.

As with most analytical methods, you cannot just throw data at a neural net and get a good answer. You have to spend time understanding the problem or the outcome you are trying to predict. You must also be sure that the data used to train the system are appropriate and are measured in a way that reflects the behavior of the factors. If the data are not representative of the problem, neural computing will not produce good results. This is a classic situation where "garbage in" will certainly produce "garbage out."

Finally, it can take time to train a model from a very complex data set. Neural techniques are computer intensive and will be slow on low-end PCs or machines without math coprocessors. It is important to remember, though, that the overall time to results can still be shorter than with other data analysis approaches, even when the system takes longer to train. Processing speed is not the only factor in performance, and neural networks do not require the time spent on programming, debugging or testing assumptions that other analytical approaches do.

3.7 Matlab and Neural Network Toolbox

The name MATLAB stands for matrix laboratory. MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses include math and computation; algorithm development; modeling, simulation, and prototyping; data analysis, exploration, and visualization; scientific and engineering graphics; and application development, including Graphical User Interface building.

MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. This allows you to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar non-interactive language such as C or Fortran. MATLAB has evolved over a period of years with input from many users. In university environments, it is the standard instructional tool for introductory and advanced courses in mathematics, engineering, and science. In industry, MATLAB is the tool of choice for high-productivity research, development, and analysis.

The reason that I decided to use MATLAB for the development of this project is its toolboxes. Toolboxes allow you to learn and apply specialized technology. They are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems. These include, among others, the image processing and neural network toolboxes.

3.8 Summary

A neural network is a parallel information-processing device that consists of a large number of processing modules, connected by elements that have information storage and programming functions. Neural networks were inspired by biological nervous systems, were developed to mimic the human brain, and are mainly used for pattern recognition. Different architectures exist for neural networks; they consist of an input layer, an output layer and one or more hidden layers, although some may have no hidden layer. A typical neural network is trained by receiving input data through its input layer, performing calculations in its hidden layer, and adjusting its weights according to the difference between the output generated by the output layer and the actual values.

Chapter 4

Urdu Character Recognition Using Neural Networks

4.1 Introduction

The Urdu OCR (UOCR) is designed and developed to recognize images of Urdu text/characters. The system takes a single line of Urdu text and segments it into words and then into characters. A Multilayer Feed Forward Neural Network is trained to recognize these segments as characters. Each character is fed to the trained neural net, which on successful recognition shows the correct character; otherwise a "Character not Recognized" message is generated. The overall recognition rate of the system is 70%.

4.2 Urdu Optical Character Recognition (UOCR)

UOCR will enable a natural mode of communication between man and machine for Urdu speakers. Centuries-old rare scripts in Arabic, Urdu and Persian will become available to the common man. It will help machine translation to Urdu from other languages and vice versa. Character recognition systems can contribute tremendously to the advancement of the automation process and can improve the interaction between man and machine in many applications, including office automation, check verification and a large variety of banking, business and data entry applications, library archives, document identification, e-book production, invoice and shipping receipt processing, subscription collections, questionnaire processing, exam paper processing and many other applications [9], besides online address and signboard reading.

4.3 Structure of UOCR

Any OCR consists of two main modules: one performs feature extraction and segmentation, and the other recognizes the segments as characters. The UOCR works similarly; it is also composed of two main modules, as shown in Figure 4.1.

Figure 4.1 Character Segmentation and Recognition (stages shown: Input Urdu Text Image, Preprocessing, Segmentation, Segmented Character, Binary Character (Resized), Character Code (Results))

The two main modules are given below:
4.3.1 Segmentation
4.3.2 Recognition using Neural Network

4.3.1 Segmentation

This is the first phase of the UOCR system, where the text image is read and possible characters are segmented for feeding to the neural network for recognition. It consists of the following sub-modules.

4.3.2. Preprocessing

The line of Urdu text is converted to pure black and white, and the area above and below the text is discarded. It was assumed that there is no noise or skew in the text image, as the images are taken directly from MS Word in BMP format.
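The two preprocessing steps can be sketched in a few lines. This is an illustration only: the grey-level threshold of 128, the convention that ink is 1 and background is 0, and the function names are assumptions, not details given in the text.

```python
def binarize(gray, threshold=128):
    """Map grey levels (0-255) to 1 for ink (dark) and 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def crop_blank_rows(binary):
    """Drop all-background rows above the first and below the last ink row."""
    ink_rows = [r for r, row in enumerate(binary) if any(row)]
    if not ink_rows:
        return []
    return binary[ink_rows[0]:ink_rows[-1] + 1]
```

Rows that are blank but lie between two ink rows are kept, so marks above or below the main stroke of a character stay attached to the line.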

4.3.3 Feature Extraction and Segmentation

During this phase, pixel strength is measured to detect the words in a sentence and the joins between characters in a word, so that sentences can be segmented into words and words into characters. The method to find the strength is to first find the minimum value in the last row (which becomes the (i,j)th pixel), saving the pixel location so that its status can be changed to 1, and then working backwards by finding the minimum of the 3 neighboring pixels of (i,j) in the (i-1)th row and saving that pixel to the seam path. This process is repeated until the first row is reached, and results in a line (seam) of pixels with minimum strength, an example of which is shown in Table 4.1. After the strength of the seam is found, the pixels that make up the seam are set to 1 in the image to increase its energy level and discourage these pixels from contributing to the next search for seams.

Table 4.1 Pixels Selection (binary pixel values; rows i to ix, columns I to VIII)

As a first priority, seams that run straight vertically are selected for word segmentation. For character segmentation vertical seams are also preferred, but if the size of a segment exceeds a threshold value then horizontal seams are applied to the same segment to segment it further. For word segmentation, seams of zero energy are selected, whereas for character segmentation the energy of the seam is calculated and compared with the average energy of the columns for vertical segmentation. If the energy is lower, the seam is chosen for segmentation; otherwise it is not considered. If more than one contiguous seam is found, the middle of these seams is selected for segmentation. In Table 4.1, columns II, III and IV together make a seam, columns V and VI combined make a seam, and columns I and VIII each make a seam independently. These seams are selected for segmenting the image into words or characters.
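The backward trace described in this section can be sketched as follows. This is a simplified illustration, not the thesis code (which was developed in MATLAB): the image is assumed to be a list of rows of 0/1 values, ties are assumed to prefer the pixel straight above so that seams stay vertical where possible, and the names `trace_seam` and `mark_seam` are hypothetical.

```python
def trace_seam(image):
    """Trace one vertical seam from the last row up to the first row.

    image is a list of rows of 0/1 pixel values. Returns (path, strength),
    where path[r] is the column chosen in row r and strength is the sum of
    the pixel values along the path.
    """
    rows, cols = len(image), len(image[0])
    # Start at the minimum of the last row (first such column on ties).
    j = min(range(cols), key=lambda c: image[rows - 1][c])
    path = [0] * rows
    path[rows - 1] = j
    for i in range(rows - 2, -1, -1):
        # Candidates: straight up first, then the two diagonal neighbours,
        # so that ties keep the seam vertical (an assumed tie-break).
        candidates = [c for c in (j, j - 1, j + 1) if 0 <= c < cols]
        j = min(candidates, key=lambda c: image[i][c])
        path[i] = j
    strength = sum(image[r][path[r]] for r in range(rows))
    return path, strength


def mark_seam(image, path):
    """Set the seam pixels to 1 so that the next search is steered away from them."""
    for r, c in enumerate(path):
        image[r][c] = 1
```

A seam of strength 0 passes only through background pixels, which corresponds to the zero-energy seams used for word segmentation; after `mark_seam`, a second call to `trace_seam` has to find a different path or accept a higher strength.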

4.3.1.2. Character Size and threshold values

The algorithm produces a large number of small segments, which may not each be a whole character. To recover from this error a threshold value is selected: if a segment is smaller than the threshold, it is merged with the previous character encountered; if the segment is larger than a threshold value, it is set to undergo horizontal segmentation. Checking of size and strength proceeds in several steps until the segment fulfills the criteria for a character.
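The size check can be illustrated with a short sketch. It assumes that segments are compared by width and that separate lower and upper thresholds are used; the text does not state the threshold values, and the function name is hypothetical.

```python
def classify_segments(widths, min_width, max_width):
    """Merge too-small segments into the previous one and flag too-large ones.

    Returns a list of (width, needs_horizontal_split) pairs.
    """
    merged = []
    for w in widths:
        if w < min_width and merged:
            # Too small to be a character: merge with the earlier segment.
            merged[-1] += w
        else:
            merged.append(w)
    return [(w, w > max_width) for w in merged]
```

A small segment at the very start of a line has no earlier segment to merge with, so this sketch simply keeps it; how the thesis handles that case is not stated.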

4.3.1.3 Garbage Characters

During segmentation, care is taken to segment each character completely. If a character is segmented into parts, these are merged with its other parts; if two or more characters are segmented as a single character, they are segmented further to obtain single characters. As a result some garbage characters are produced. These are unnecessary, undesired segments of a character which the algorithm is unable to merge with the relevant segments and which are treated as characters until they are declared garbage in the recognition phase. Examples of these garbage characters are shown in Table 4.2. Some of these garbage characters are identical to one or another real character, so they sometimes pass the recognition phase as well, which is a major weakness of this algorithm and should be tackled in future work. Figure 4.2 shows a line of Urdu text (upper line) and the characters segmented from it (lower line). It shows both the correct and the garbage characters produced along the way.

Figure 4.2 Line of Urdu text (above) and segmented characters (below)

The above example shows about 88% correct character segmentation as judged by the human eye, but the recognition process through the neural network gives 70% for the same line. The 5th segmented character (in the second line) from the right and the 2nd-last segmented character from the right do not take their full or distinguishable forms, and even a human eye will not be able to recognize them correctly, as the segment looks more like re than noon or noon ghuna.

4.3.2 Recognition Using Neural Network

The Neural Network module can further be classified into training and simulation parts.

4.3.2.1 Neural Network Architecture And Training

A Multilayer Feedforward Neural Network (FFNN) with 21x15 (315) input nodes, a single hidden layer with 2000 nodes and an output layer of 6 nodes was used. Tansig and logsig were used as the transfer functions of the hidden and output layers respectively. The training function trainscg was used, with all of its defaults, because of its optimized memory usage.
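A forward pass through a network of this shape can be sketched in plain Python. This is an illustration rather than the MATLAB code used in the thesis: tansig is written as the hyperbolic tangent (to which it is mathematically equal), bias terms are assumed, and all function names are hypothetical.

```python
import math


def tansig(x):
    """Hyperbolic tangent sigmoid, output in (-1, 1)."""
    return math.tanh(x)


def logsig(x):
    """Logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases, transfer):
    """One fully connected layer; weights holds one row of weights per node."""
    return [transfer(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(pixels, hidden_w, hidden_b, output_w, output_b):
    """Input vector -> tansig hidden layer -> logsig output layer."""
    hidden = layer(pixels, hidden_w, hidden_b, tansig)
    return layer(hidden, output_w, output_b, logsig)


def parameter_count(n_in, n_hidden, n_out):
    """Number of weights plus biases in an n_in - n_hidden - n_out network."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out
```

Under the bias assumption, `parameter_count(315, 2000, 6)` gives 644006 adjustable values, which helps explain the long training time reported for this architecture.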

A hidden layer of 2000 nodes was finally selected after testing different layer sizes for optimum results, whereas the input layer of 315 nodes was selected in view of the average size of the characters produced using the Arial font at size 36. Because the network needs a fixed input size, all characters were resized to a 21x15 array: larger ones were reduced and smaller ones were enlarged using the Matlab imresize function with the nearest parameter.
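Nearest-neighbour resizing to a fixed grid can be sketched as below. This only approximates what imresize does (its exact pixel mapping may differ at the edges), and it assumes 21 rows by 15 columns and row-by-row flattening, neither of which is stated in the text; the function names are hypothetical.

```python
def resize_nearest(image, out_rows=21, out_cols=15):
    """Resize a 2-D list of pixels by picking the nearest source pixel."""
    in_rows, in_cols = len(image), len(image[0])
    resized = []
    for r in range(out_rows):
        # Map the centre of each output pixel back into the source grid.
        src_r = min(in_rows - 1, int((r + 0.5) * in_rows / out_rows))
        row = []
        for c in range(out_cols):
            src_c = min(in_cols - 1, int((c + 0.5) * in_cols / out_cols))
            row.append(image[src_r][src_c])
        resized.append(row)
    return resized


def to_input_vector(image):
    """Flatten a resized character image row by row into one input vector."""
    return [px for row in image for px in row]
```

Because only existing pixel values are copied, a binary character stays binary after resizing, and the flattened result always has 21 x 15 = 315 entries for the input layer.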

Characters were resized, normalized and formed into vectors to feed to the net for training. The FFNN with the above parameters took 2000 epochs to train to the goal of 0.0005, which took about 5-7 hours on a 2 GHz dual-core system with 2 GB of RAM. This makes it hard to train the net with many different parameter settings.

4.3.2.2. Training Set

As discussed in chapter 3, Urdu has 40 basic characters and one dochashmi hye as distinct characters. However, all joinable characters change their shape and size at different positions in a word; therefore 54 different characters/classes, with 100 samples each, are used to train the network. Some of the training samples are shown in Figure 4.3; the remainder can be seen in Appendix A.

Figure 4.3 Training set of single and two classes

The characters sheen and swad are each used as a single class, but tay is divided into two classes; the same is the case with tee. Graphics software was used to cut characters from MS Word and create the set of samples manually.

4.4 Simulation

Segmented characters from the segmentation module are fed to the trained network after being resized to a 21x15 array using the same Matlab imresize function with the nearest parameter that was used earlier to preprocess the characters for training. The Matlab sim function returns a 6-digit binary number, which is matched against the 54-character set (used as targets during training). If the number matches any of the 54 binary numbers, the relevant character image is displayed; otherwise a "No Character Found" message is generated. The "No Character Found" message shows that the character was not identified, but if the data fed to sim is not a valid character but a garbage character (as discussed above), then this is a training success, because a garbage character should not be identified as a character.
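The matching step can be sketched as follows. The sketch assumes that the six outputs are rounded to bits at a threshold of 0.5 and that each class was given the binary form of its index as its target code; neither detail is stated in the text, and the function names are hypothetical.

```python
def decode_output(outputs, class_codes, threshold=0.5):
    """Round the six network outputs to bits and look the code up.

    class_codes maps a 6-bit tuple to a character label. Returns the label,
    or "No Character Found" when the code is not one of the trained targets.
    """
    bits = tuple(1 if o >= threshold else 0 for o in outputs)
    return class_codes.get(bits, "No Character Found")


def make_class_codes(labels):
    """Assign each label the 6-bit binary form of its index (assumed coding)."""
    codes = {}
    for index, label in enumerate(labels):
        bits = tuple((index >> shift) & 1 for shift in range(5, -1, -1))
        codes[bits] = label
    return codes
```

Six bits give 64 possible codes, of which 54 are used, so 10 codes remain unassigned; an output that lands on one of them is reported as "No Character Found", which is the desired outcome for a garbage segment.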

Figure 4.4 Character input to NN (above) and character resulting from NN (below)

4.5 Experimental Results

The segmentation technique applied is very successful. The algorithm produces 85% results, but because of the similar nature of the characters, particularly when they are in their compound form, the neural network results are not as good and the overall result drops to 70%. Figure 4.5 shows character-wise neural network performance. Recognition of the character family of pee, tee, tay, cee and fee is around 80%. The same is the case for the character family of kaf and gaf, as these are the simplest characters and, despite their similarity to each other, they are totally different from the other characters. The character lam, when used in the middle of a word, behaves like an alif, which decreases its recognition percentage, but alif is not mistaken for lam in most cases. The characters waw and choty yee are difficult for the NN to differentiate, as the segment of choty yee that remains after it produces garbage is very similar to waw. The characters fee, mem and ein, when used in their middle form, can be mistaken for each other by the neural network during recognition, which leads to a low recognition percentage. The character noon, when used at the beginning of a word, looks like ze and zal and thus produces low results.

Figure 4.5 Character-wise Recognition Percentage

In the segmentation part, garbage characters are produced during the segmentation of seen, sheen, swad, dwad, noon and noon ghuna, which in most cases pass the character test during segmentation, whereas bee, pee, tee, tay, cee and fee also produce garbage characters, as shown in Table 4.3, but in most cases these are identified as garbage characters. The good thing is that these characters produce garbage only when they are located at the end of a word.

Table 4.3 Characters and Garbage Produced
Columns: Character, Part/Garbage
Characters listed: Noon; Chotee yee; Name of Allah; Seen, Sheen, Swad, Dwad; be, pe, te and tay; Yee (unsegmented)

The combination of lam and alif, in words like the one shown in Figure 4.6, produces what is effectively a new character in the segmentation phase. This needs to be treated carefully.

Figure 4.6 Lam or Alif of Islam

The algorithm produces the same results each time it is run on the same line of text in the same environment. The same holds for a saved neural network applied to the same line of text. The lower percentage of successful character recognition arises because the number of similar characters increases when they form the middle or beginning part of a word. Although the neural network produces 70% results, the segmentation accuracy of the algorithm developed is about 85% as judged by the human eye. This is quite promising and suggests that neural network recognition can be enhanced with more test/sample data and more processing power for training.

4.6 Summary

The segmentation algorithm is based on finding the pixel strengths of a word to detect the joins between characters. The algorithm achieves 85% success, which is quite encouraging even though it produces some undetectable garbage characters. The Feed Forward Neural Network was trained using 54 classes of the alphabet set with 100 samples each. Because of the inherent similarities between characters of the Urdu script, the overall result is 70%, but the results can be improved by introducing more classes within the 41-character working set of the Urdu script.

REFERENCES
[1] H. Bunke and P. S. P. Wang. Handbook of Character Recognition and Document Image Analysis. World Scientific Publishing, Singapore, 1997.
[2] S. Mori, H. Nishida, and H. Yamada. Optical Character Recognition. Wiley Interscience, New Jersey, 1999.
[3] Optical Character Recognition and the Years Ahead. The Business Press, Elmhurst, IL, 1969.
[4] Auerbach on Optical Character Recognition. Auerbach Publishers, Inc., Princeton, 1971.
[5] S. V. Rice, G. Nagy, and T. A. Nartker. Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Publishers, Boston, 1999.
[6] H. F. Schantz. The History of OCR. Recognition Technologies Users Association, Boston, 1982.
[7] C. Y. Suen. Character recognition by computer and applications. In T. Y. Young and K. S. Fu, editors, Handbook of Pattern Recognition and Image Processing. Academic Press, Inc., Orlando, FL, 1986, pp. 569-586.
[8] Proceedings of the following international workshops and conferences: ICPR (International Conference on Pattern Recognition); ICDAR (International Conference on Document Analysis and Recognition); DAS (Document Analysis Systems); IWFHR (International Workshop on Frontiers in Handwriting Recognition).
[9] A. Amin, H. Al-Sadoun and S. Fischer, Hand-Printed Arabic Character Recognition System using An Artificial Network, Pattern Recognition, Vol. 29, No. 4, pp. 663-675, 1996.
[10] Zaheer Ahmad, Jehanzeb Khan, Urdu Nastaleeq OCR (Optical Character Recognition), Proceedings of World Academy of Science, Engineering and Technology, Volume 2, ISSN: 1307-6884, December 2007.
[11] Urdu Alphabet, retrieved on 15/02/08 from http://en.wikipedia.org/wiki/Urdu_alphabet
[12] A. Amin, Arabic Character Recognition, Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, 1997, pp. 398.
[13] Tim Klassen, Towards Neural Network Recognition of Handwritten Arabic Letters, thesis for Master of Computer Science (M.C.Sc.), 2001.
[14] Connectors and non-connectors, retrieved on 12/04/08 from http://www.columbia.edu/itc/mealac/pritchett/00urdu/urduscript/section00.html?urdu#00_02
[15] Devangari and Urdu Alphabets, retrieved on 12/3/08 from http://freenet-homepage.de/prilop/urdu-alphabet.html
[16] Seam Carving, retrieved on 12/3/08 from www.seamcarving.com
[17] Ahmed M. Zeki and Mohamad S. Zakaria, Challenges in Recognizing Arabic Character, International Islamic University Malaysia (IIUM), Kuala Lumpur, Malaysia, National University of Malaysia (UKM), Bangi, Selangor, Malaysia.
[18] A. Amin, Off-line Arabic Character Recognition - the State of the Art, Pattern Recognition, Vol. 31, No. 5, pp. 517-530, 1998.
[19] F. Al-Fakhri, On-Line Computer Recognition of Hand-Written Arabic Text, Masters Thesis, Science University of Malaysia, 1997.
[20] A. Zeki, Plausible Inference Approach to Character Recognition, Masters Thesis, National University of Malaysia, 1999.
[21] A. Amin, H. Al-Sadoun and S. Fischer, Hand-Printed Arabic Character Recognition System using An Artificial Network, Pattern Recognition, Vol. 29, No. 4, pp. 663-675, 1996.
[22] T. Kanungo, G. Marton and O. Bulbul, Performance Evaluation of Two Arabic Products, in Proceeding of AIPR Workshop on Advances in Computer Assisted Recognition, SPIE, Vol. 3584, Washington DC, 1998.
[23] T. Kanungo, G. Marton and O. Bulbul, OmniPage vs. Sakhr: Paired Model Evaluation of Two Arabic OCR Products, in Proceeding of SPIE Conference on Document Recognition and Retrieval (VI), Vol. 3651, San Jose, 1999.
[24] A. Amin, Off line Arabic Character Recognition - A Survey, in Proceeding of the 4th International Conference Document Analysis and Recognition (ICDAR '97), pp. 596-599, 1997.
[25] K. Jumari and M. Ali, A Survey and Comparative Evaluation of Selected off-line Arabic handwritten Character Recognition Systems, Jurnal Teknology, Malaysian University of Technology, 2001.
[26] Inam Shamsheer, Zaheer Ahmad, OCR For Printed Urdu Script Using Feed Forward Neural Network, MLPR 2007: International Conference on Machine Learning and Pattern Recognition, Germany, 2007.
[27] Hyder, S.S., "A System for Generating Urdu/Farsi/Arabic Script", Information Processing 71, North Holland Publishing Co., Amsterdam, pp. 1145-1149, 1972.
[28] Hyder, S.S., Richer, F., "The Theory and Design of a System for Printing and Communicating in Arabic-Urdu-Farsi", Journal of Bio-Sciences Communications, Vol. 3, pp. 181-206, 1977.
[29] ATU-Baghdad Specification for Arabic Teleprinters, June 1980.
[30] A Comparison of Two Handwriting Recognizers for Pen-based Computers, retrieved on 12/3/08 from http://www.yorku.ca/mack/CASCON94.html
[31] National Language Authority, retrieved on 12/3/08 from http://www.nla.gov.pk
[32] Machine Translation, retrieved on 12/3/08 from http://www.nlauit.gov.pk/umt.htm

Appendix A

Data Set Used for NN Training

Below is the data set used for training the neural network. The samples are divided into 54 classes, each consisting of 100 samples. A small variation was kept among the samples to enhance the generalization capability of the neural network.

Appendix B

Publications
The following papers have been published from this thesis in various international conferences.

1. Zaheer Ahmad, Jehanzeb Khan, Inam Shamsheer, Owais Adnan, Urdu Nastaleeq OCR (Optical Character Recognition), Proceedings of World Academy of Science, Engineering and Technology, Volume 2, ISSN: 1307-6884, December 2007.

2. Inam Shamsheer, Zaheer Ahmad, Jehanzeb Khan, Owais Adnan, OCR For Printed Urdu Script Using Feed Forward Neural Network, MLPR 2007: International Conference on Machine Learning and Pattern Recognition, Germany, 2007.
