
International Journal on Natural Language Computing (IJNLC) Vol. 4, No. 4, August 2015

ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPROACH FOR BUSINESS SOLUTIONS

Prajakta Yerpude and Rashmi Jakhotiya and Manoj Chandak


Department of Computer Science and Engineering, RCOEM, Nagpur

Abstract
Text can be analysed by splitting it and extracting keywords. These may be represented as
summaries, tables, graphs, and images. The large amount of information available in textual form
has led to research on extracting text and transforming it from an unstructured form into a
structured format. The paper presents the importance of Natural Language Processing (NLP) and two
of its applications, implemented in Python: 1. Automatic text summarization [Domain: Newspaper
Articles] 2. Text to Graph Conversion [Domain: Stock news]. The main challenge in NLP is natural
language understanding, i.e. deriving meaning from human or natural language input, which is done
using regular expressions, artificial intelligence and database concepts. The Automatic
Summarization tool converts newspaper articles into a summary on the basis of the frequency of
words in the text. The Text to Graph Converter takes a stock article as input, tokenizes it on
various indices (points and percent) and time, and then maps the tokens to a graph. This paper
proposes a business solution that helps users manage their time effectively.
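
The two tools described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword list, the function names `summarize` and `extract_index_moves`, and the regular expressions are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "or", "as", "was", "is", "of", "to", "in", "at"}


def summarize(text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        # Stopwords are absent from freq, so they contribute 0.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


def extract_index_moves(text):
    """Pull (number, unit) pairs such as ('120.5', 'points') out of a stock article."""
    return re.findall(r"(\d+(?:\.\d+)?)\s*(points?|percent)", text, flags=re.IGNORECASE)


article = "Stocks rose today. The weather was mild. Stocks and bonds rose as stocks rallied."
print(summarize(article, max_sentences=1))
print(extract_index_moves("Sensex gained 120.5 points or 0.4 percent at 10:30"))
```

The extracted (number, unit) pairs are the kind of tokens that could then be plotted against time; the plotting step is left out here.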

Keywords
NLP, Automatic Summarizer, Text to Graph Converter, Data Visualization, Regular Expression,
Artificial Intelligence

1. Introduction
The paper deals with applications of natural language processing that draw on its various domains
of textual analysis. Natural language processing (NLP) [1] is a bridge between human
interpretation and computers. It makes use of artificial intelligence and various analysis
techniques to give about 90% accuracy. The term Natural Language Processing [4] covers a wide
range of techniques for the automatic generation, manipulation and analysis of natural or human
languages. It includes categories such as syntactic analysis [22], where sequences of words are
converted to structures that show the relations between the words; semantic analysis [9], where
meanings are assigned to groups of words; pragmatic analysis [24], where differences between
expected and actual interpretations are analysed; and morphological analysis [10], where
punctuation is grouped and removed. The paper demonstrates two applications that use NLP
principles:
• An automatic text summarizer
DOI: 10.5121/ijnlc.2015.4403
CHAPTER 4

LINEAR EQUATIONS IN TWO VARIABLES

(A) Main Concepts and Results


An equation is a statement in which one expression equals another expression. An
equation of the form ax + by + c = 0, where a, b and c are real numbers such that
a ≠ 0 and b ≠ 0, is called a linear equation in two variables. The process of finding
solution(s) is called solving an equation.
The solution of a linear equation is not affected when
(i) the same number is added to (subtracted from) both sides of the equation,
(ii) both sides of the equation are multiplied or divided by the same non-zero
number.
Further, a linear equation in two variables has infinitely many solutions. The graph of
every linear equation in two variables is a straight line and every point on the graph
(straight line) represents a solution of the linear equation. Thus, every solution of the
linear equation can be represented by a unique point on the graph of the equation. The
graphs of x = a and y = a are lines parallel to the y-axis and x-axis, respectively.
(B) Multiple Choice Questions
Write the correct answer:
Sample Question 1 : The linear equation 3x – y = x – 1 has :
(A) A unique solution (B) Two solutions
(C) Infinitely many solutions (D) No solution
Solution : Answer (C)
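
Sample Question 1 can be checked numerically. Rearranging 3x – y = x – 1 gives y = 2x + 1, so every choice of x yields a solution. The short sketch below (the helper name `satisfies` is only illustrative) generates a few such pairs and verifies them against the original equation.

```python
def satisfies(x, y):
    """True when (x, y) is a solution of 3x - y = x - 1."""
    return 3 * x - y == x - 1


# Rearranged form: y = 2x + 1, so each x gives exactly one matching y.
solutions = [(x, 2 * x + 1) for x in range(-2, 3)]

for x, y in solutions:
    print((x, y), satisfies(x, y))
```

Because x can be any real number, the list can be extended without end, which is why the answer is (C).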
Sample Question 2 : A linear equation in two variables is of the form
ax + by + c = 0, where

Text-mining based journal splitting

Xiaofan Lin
Hewlett-Packard Laboratories
1501 Page Mill Road, MS 1126, Palo Alto, CA 94304
Email: xiaofan.lin@hp.com

Abstract
This paper introduces a novel journal splitting algorithm. It takes full advantage of various
kinds of information such as text match, layout and page numbers. The core procedure is a highly
efficient text-mining algorithm, which detects the matched phrases between the content pages and
the title pages of individual articles. Experiments show that this algorithm is robust and able
to split a wide range of journals, magazines and books.

1. Introduction
Printing-on-demand (POD) is an emerging commercial publishing field, which promises to lower the
cost of short-run publishing as well as to bring new revenue streams to out-of-print
publications. For publications existing in paper rather than electronic form, the first step is
re-mastering, in which the paper media is scanned and converted to widely used electronic formats
such as PDF files. For periodicals, people are usually more interested in specific articles than
in whole journals or magazines, so it is desirable to have separate PDF files for individual
articles. Although human operators can split a journal into separate articles, manual processing
tends to be slow and expensive. That is the motivation behind our research: design an algorithm
that enables computers to automatically detect the title pages of journals and magazines.
In Section 2 we analyze the unique problems related to journal splitting. Then we present a
text-mining based JS algorithm in Section 3. Section 4 is devoted to other related issues such as
the combination of other features and the detection of content pages. Section 5 shows the
experimental results and Section 6 summarizes the proposed method and gives directions for future
research.

2. Problem Analysis
In this section we will define the problem more clearly in terms of what information is
available. Fig. 1 is the workflow of the re-mastering process:
The first step is to scan the paper books, journals and magazines and convert them into TIFF
files. Color or gray-scale images are kept to preserve all the significant information. In order
to reduce file size and, more importantly, to enable full text search, image processing and OCR
(Optical Character Recognition) follow, in which the original images are segmented into
homogeneous regions and stored in a compact form. OCR also generates the text information. In the
last step, both the processed image and the text are embedded into the PDF files.

Fig 1: Workflow of the re-mastering process (Paper Media → Scanning → TIFF Files → Image
Processing and OCR → Building PDF Files → PDF Files)

It is apparent that two kinds of information are available for journal splitting: image data and
the OCR result. Although numerous studies have been carried out on document analysis [9][10],
there is no published work on the topic of journal splitting. The closest research is the logical
structure analysis of books or journals by analyzing the OCR results of the table of contents
(TOC) [1][2]. Similarly, one straightforward approach is to extract the page numbers from the TOC
and use them to find the start page of each article [11]. There are some drawbacks with this
TOC-only solution:
Scenario 1: OCR can make errors in recognizing the page numbers in the content pages.
In this situation, we get the wrong page numbers or miss some page numbers and accordingly split
the journals in the wrong way.
Scenario 2: The page numbers are all correctly recognized but we will make false-negative or false-

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003)
0-7695-1960-1/03 $17.00 © 2003 IEEE
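
The journal-splitting abstract above describes detecting matched phrases between the content pages and the title pages of articles. The excerpt does not give the algorithm itself, so the following is only a rough sketch of the general idea under stated assumptions: phrases are approximated by word bigrams, and the names `word_ngrams`, `find_title_pages` and the threshold `min_shared` are hypothetical.

```python
import re


def word_ngrams(text, n=2):
    """Return the set of lowercase word n-grams in a piece of OCR text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_title_pages(toc_text, page_texts, n=2, min_shared=2):
    """Flag pages that share at least `min_shared` n-grams with the table of contents."""
    toc_phrases = word_ngrams(toc_text, n)
    hits = []
    for page_index, page_text in enumerate(page_texts):
        shared = toc_phrases & word_ngrams(page_text, n)
        if len(shared) >= min_shared:
            hits.append(page_index)
    return hits


toc = "Text mining based journal splitting 12\nPage frame detection for double pages 20"
pages = [
    "Editorial board and subscription information",
    "Text-mining based journal splitting Xiaofan Lin",
    "continued body text about experiments",
    "Page frame detection for double pages N. Stamatopoulos",
]
print(find_title_pages(toc, pages))
```

Matching on shared phrases rather than on page numbers alone is what lets this kind of approach tolerate the OCR errors listed under Scenario 1.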
Page Frame Detection for Double Page Document Images
N. Stamatopoulos, B. Gatos
Computational Intelligence Laboratory, Institute of Informatics and Telecommunications,
National Research Center "Demokritos", 153 10 Athens, Greece
nstam@iit.demokritos.gr, bgat@iit.demokritos.gr

T. Georgiou
Department of Informatics and Telecommunications, University of Athens, Greece
t.georgiou@di.uoa.gr

ABSTRACT
Scanning two book pages at the same time helps to accelerate the scanning process but on the
other hand introduces several difficulties if the user needs to have one page per image. A major
difficulty is the appearance of noisy black borders around text areas as well as of noisy black
stripes between the two pages. In this paper, we propose a novel algorithm for detecting the page
frames on double page document images. Our aim is to split the image into the two pages as well
as to remove noisy borders. First we apply a pre-processing which includes binarization, noise
removal and image smoothing. Then, we detect the vertical zones of the two pages. In this stage,
we introduce the vertical white run projections which have been proved efficient for detecting
vertical zones of text areas. Finally, the horizontal zones of the two pages are detected based
on horizontal white run projections. The experimental results on several double page document
images from fifteen different books demonstrate the effectiveness of the proposed technique.

Categories and Subject Descriptors
I.7.5 [DOCUMENT AND TEXT PROCESSING]: Document Capture – Document analysis, Scanning; I.4.3 [IMAGE
PROCESSING AND COMPUTER VISION]: Enhancement

General Terms
Algorithms, Design, Experimentation

Keywords
Document Image Enhancement, Border Removal, Page Splitting

Permission to make digital or hard copies of all or part of this work for personal or classroom
use is granted without fee provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full citation on the first page. To
copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
DAS '10, June 9-11, 2010, Boston, MA, USA
Copyright © 2010 ACM 978-1-60558-773-8/10/06... $10.00

1. INTRODUCTION
Document images are usually produced by scanning books or periodicals. Scanning two pages at the
same time is a very common practice as it helps to accelerate the scanning process. However, it
may affect the performance of subsequent processing such as document analysis and optical
character recognition (OCR) since the majority of approaches are able to process only single page
images. Furthermore, another drawback of scanning two pages at the same time is the appearance of
noisy black borders around text areas as well as of noisy black stripes between the two pages
(see Fig.1).

Figure 1. An example of a double page document image.

There are only few techniques in the literature that address the problem of page splitting.
According to these approaches, double page documents are split in their middle after border
removal or by defining the coordinates of the print space. Yacoub et al. [13] involve a splitting
process as a preprocessing stage for the conversion of a large collection of complex documents
and deployment for online web access to its information rich content. First, the double page
images are cropped to the border of the page, using several criteria, which include searching for
horizontal and vertical projections of straight page edges and checking the computed dimension
versus a database of magazine sizes sampled throughout history. Once the double-pages are
cropped, the centerfold of the double-page is identified and two individual pages are generated
using a splitting operation.
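
The abstract of this paper mentions vertical white run projections for finding the vertical zones of the two pages, but the excerpt does not define them. The sketch below assumes one plausible reading: for each image column, sum the lengths of vertical runs of white pixels that are at least `min_run` pixels long. The function name, the `min_run` parameter and the 0 = white, 1 = black encoding are assumptions, not the authors' definitions.

```python
def vertical_white_run_projection(image, min_run):
    """For each column, total length of vertical white runs of at least min_run pixels.

    image is a list of rows; 0 means a white pixel and 1 a black pixel (assumed encoding).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    projection = [0] * width
    for x in range(width):
        run = 0
        for y in range(height):
            if image[y][x] == 0:
                run += 1
            else:
                if run >= min_run:
                    projection[x] += run
                run = 0
        if run >= min_run:  # close a run that reaches the bottom edge
            projection[x] += run
    return projection


binary_image = [
    [1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
print(vertical_white_run_projection(binary_image, min_run=2))
```

Under this reading, columns inside a black border or the stripe between pages score near zero, while columns of page background score high, so thresholding the projection could separate the two page zones.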
INSTRUCTIONS FOR FILLING REQUEST FOR NEW PAN CARD OR/AND CHANGES OR CORRECTION IN PAN DATA

(a) Form to be filled legibly in BLOCK LETTERS and preferably in BLACK INK. Form should be filled in English only
(b) Mention 10 digit PAN correctly.
(c) Each box, wherever provided, should contain only one character (alphabet /number / punctuation sign) leaving a blank box after
each word.
(d) ‘Individual’ applicants should affix two recent colour photographs with white background (size 3.5 cm x 2.5 cm) in the space
provided on the form. The photographs should not be stapled or clipped to the form. The clarity of image on PAN card will
depend on the quality and clarity of photograph affixed on the form.
(e) Signature / Left hand thumb impression should be provided across the photo affixed on the left side of the form in such a
manner that portion of signature/impression is on photo as well as on form.
(f) Signature /Left hand thumb impression should be within the box provided on the right side of the form. The signature should
not be on the photograph affixed on right side of the form. If there is any mark on this photograph such that it hinders the clear
visibility of the face of the applicant, the application will not be accepted.
(g) Thumb impression, if used, should be attested by a Magistrate or a Notary Public or a Gazetted Officer under official seal and
stamp.
(h) For issue of new PAN card without any changes: in case you have a PAN but no PAN card and wish to get a PAN card, fill
all columns of the form but do not tick any of the boxes on the left margin. In case of loss of PAN card, a copy of FIR may be
submitted along with the form.
(i) For changes or correction in PAN data, fill all columns of the form and tick the box on the left margin of the appropriate row where change/
correction is required.
(j) Having or using more than one PAN is illegal. If you possess more than one PAN, kindly fill the details in Item No. 11 of this form
and surrender the same.

Item No. | Item Details | Guidelines for filling the form

1 Full Name Please select appropriate title.
Do not use abbreviations in the First and the Last name/Surname.

For example RAVIKANT should be written as:
Last Name/Surname: R A V I K A N T
First Name:
Middle Name:

For example SURESH SARDA should be written as:
Last Name/Surname: S A R D A
First Name: S U R E S H
Middle Name:

For example POONAM RAVI NARAYAN should be written as:
Last Name/Surname: N A R A Y A N
First Name: P O O N A M
Middle Name: R A V I

For example SATYAM VENKAT M. K. RAO should be written as:
Last Name/Surname: R A O
First Name: S A T Y A M
Middle Name: V E N K A T M K

For example M. S. KANDASWAMY (MADURAI SOMASUNDRAM KANDASWAMY) should be written as:
Last Name/Surname: K A N D A S W A M Y
First Name: M A D U R A I
Middle Name: S O M A S U N D R A M
Applicants other than ‘Individuals’ may ignore above instructions.

Non-Individuals should write their full name starting from the first block of Last Name/Surname. If the
name is longer than the space provided for the last name, it can be continued in the space provided
for First and Middle Name.
For example XYZ DATA CORPORATION (INDIA) PRIVATE LIMITED should be written as:
Last Name/Surname: X Y Z D A T A C O R P O R A T I O N ( I N D
First Name: I A ) P R I V A T E L I M I T E D
Middle Name:

For example MANOJ MAFATLAL DAVE (HUF) should be written as:
Last Name/Surname: M A N O J M A F A T L A L D A V E ( H U F )
First Name:
Middle Name:

In case of Company, the name should be provided without any abbreviations. For example,
different variations of ‘Private Limited’ viz. Pvt Ltd, Private Ltd, Pvt Limited, P Ltd, P. Ltd., P. Ltd are
not allowed. It should be ‘Private Limited’ only.

In case of sole proprietorship concern, the proprietor should apply for PAN in his/her own name.
Name should not be prefixed with any title such as Shri, Smt, Kumari, Dr., Major, M/s etc.

Abbreviation of the full name to be printed on the PAN card
Individual applicants should provide full/abbreviated name to be printed on the PAN card. Name, if
abbreviated, should necessarily contain the last name. For example:
SATYAM VENKAT M. K. RAO which is written in the Name field as:
Last Name/Surname: R A O
First Name: S A T Y A M
Middle Name: V E N K A T M K

Can be written in ‘Name to be printed on the PAN Card’ column as SATYAM VENKAT M. K. RAO
or S. V. M. K. RAO or SATYAM V. M. K. RAO
For non-individual applicants, this should be same as last name field in item no. 1 above.

2 Details of Parents (Applicable to Individuals only)
Instructions in Item No.1 with respect to name apply here.
Father’s Name: It is mandatory for Individual applicants to provide father’s name. Married woman
applicant should also give father’s name and not husband’s name.
Mother’s Name: This is an optional field.
Appropriate flag should be selected to indicate the name (out of the father’s name and mother’s
name given in the form) to be printed on the PAN card.
If none of the options is selected, then father’s name shall be considered for printing on the PAN card.
3 Date of Birth/Incorporation/Agreement/Partnership or Trust Deed/Formation of Body of Individuals/Association of Persons
Date cannot be a future date. Date: 2nd August 1975 should be written as:
D D M M Y Y Y Y
0 2 0 8 1 9 7 5
Relevant date for different categories of applicants is:
Individual: Actual Date of Birth; Company: Date of Incorporation; Association of Persons: Date of
formation/creation; Trusts: Date of creation of Trust Deed; Partnership Firms: Date of Partnership
Deed; LLPs: Date of Incorporation/Registration; HUFs: Date of creation of HUF and for ancestral
HUF date can be 01-01-0001 where the date of creation is not available.
4 Gender This field is mandatory for Individuals. Field should be left blank in case of other applicants.

5&6 Photo/signature Mismatch Individuals issued a PAN card with incorrect/unclear photograph/signature should tick the box on
the left margin.
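
The date layout required in Item No. 3 (DDMMYYYY, one digit per box, no future dates) can be illustrated with a small sketch. The helper name `pan_date_boxes` and its `today` parameter are hypothetical and are not part of any official tool.

```python
from datetime import date


def pan_date_boxes(d, today=None):
    """Format a date as DDMMYYYY with one digit per box, rejecting future dates."""
    today = today or date.today()
    if d > today:
        raise ValueError("date cannot be a future date")
    # Zero-pad by hand so that early years such as 0001 keep four digits.
    digits = f"{d.day:02d}{d.month:02d}{d.year:04d}"
    return " ".join(digits)


print(pan_date_boxes(date(1975, 8, 2)))  # 2nd August 1975
print(pan_date_boxes(date(1, 1, 1)))     # ancestral HUF placeholder date
```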
GENERAL INFORMATION FOR APPLICANTS

(a) Applicants may obtain the ‘Request for New PAN Card or/and Changes or Correction in PAN Data’ Form in the format prescribed by
Income Tax Department from any IT PAN Service Centres (managed by UTIITSL) or TIN-Facilitation Centres (TIN-FCs)/PAN Centres
(managed by NSDL e-Gov), or any other stationery vendor providing such forms or download from the Income Tax Department
website (www.incometaxindia.gov.in) / UTIITSL website (www.utiitsl.com) / NSDL e-Gov website (www.tin-nsdl.com).

(b) The fee for processing PAN application is ₹ 110/- (including goods & service tax). In case the PAN card is to be dispatched
outside India, then an additional dispatch charge of ₹ 910/- will have to be paid by the applicant.

(c) It is mandatory to attach proof of identity, proof of address and proof of date of birth with PAN application. Changes or corrections
desired in PAN particulars should be supported by any one or combination of the relevant documents mentioned below :

Document acceptable as proof of identity, address and date of birth as per Rule 114 of Income Tax Rules, 1962
Indian Citizens (including those located outside India)
Individuals & HUF

Proof of Identity
(i) Copy of
a. Aadhaar Card issued by the Unique Identification Authority of India; or
b. Elector’s photo identity card; or
c. Driving License; or
d. Passport; or
e. Ration card having photograph of the applicant; or
f. Arm’s license; or
g. Photo identity card issued by the Central Government or State Government or Public Sector Undertaking; or
h. Pensioner card having photograph of the applicant; or
i. Central Government Health Service Scheme Card or Ex-Servicemen Contributory Health Scheme photo card; or
(ii) Certificate of identity in Original signed by a Member of Parliament or Member of Legislative Assembly or Municipal Councilor or a Gazetted officer, as the case may be; or
(iii) Bank certificate in Original on letter head from the branch (alongwith name and stamp of the issuing officer) containing duly attested photograph and bank account number of the applicant

Proof of Address
(i) Copy of
a. Aadhaar Card issued by the Unique Identification Authority of India; or
b. Elector’s photo identity card; or
c. Driving License; or
d. Passport; or
e. Passport of the spouse; or
f. Post office passbook having address of the applicant; or
g. Latest property tax assessment order; or
h. Domicile certificate issued by the Government; or
i. Allotment letter of accommodation issued by Central or State Government of not more than three years old; or
j. Property Registration Document; or
(ii) Copy of following documents of not more than three months old
(a) Electricity Bill; or
(b) Landline Telephone or Broadband connection bill; or
(c) Water Bill; or
(d) Consumer gas connection card or book or piped gas bill; or
(e) Bank account statement or as per Note 2; or
(f) Depository account statement; or
(g) Credit card statement; or
(iii) Certificate of address in Original signed by a Member of Parliament or Member of Legislative Assembly or Municipal Councilor or a Gazetted officer, as the case may be; or
(iv) Employer certificate in original.

Proof of date of birth
Copy of the following documents if they bear the name, date, month and year of birth of the applicant, namely:-
a. Aadhaar Card issued by the Unique Identification Authority of India; or
b. Elector’s photo identity card; or
c. Driving License; or
d. Passport; or
e. Matriculation Certificate or Mark Sheet of recognized board; or
f. Birth Certificate issued by the Municipal Authority or any office authorized to issue Birth and Death Certificate by the Registrar of Birth and Death or the Indian Consulate as defined in clause (d) of sub-section (1) of section 2 of the Citizenship Act, 1955 (57 of 1955); or
g. Photo identity card issued by the Central Government or State Government or Public Sector Undertaking or State Public Sector Undertaking; or
h. Domicile Certificate issued by the Government; or
i. Central Government Health Service Scheme photo Card or Ex-Servicemen Contributory Health Scheme photo card; or
j. Pension payment order; or
k. Marriage certificate issued by Registrar of Marriages; or
l. Affidavit sworn before a magistrate stating the date of birth.
Writing Quadratics Given a Graph with a Vertex

Writing Quadratic Equations from a Graph


To write the equations of a quadratic function when given the graph:

1) Find the vertex (h,k) and one point (x,y)


2) Plug into Vertex Form y = a(x - h)^2 + k

3) Solve for a

4) Plug a and (h,k) back into vertex form
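
The four steps above can be sketched in code. The vertex (1, -4) and the point (3, 0) are made-up values for illustration, and the function name `vertex_form_coefficient` is hypothetical.

```python
def vertex_form_coefficient(vertex, point):
    """Solve y = a(x - h)^2 + k for a, given the vertex (h, k) and one other point (x, y)."""
    h, k = vertex
    x, y = point
    if x == h:
        raise ValueError("the point must not share its x-coordinate with the vertex")
    return (y - k) / (x - h) ** 2


# Step 1: read the vertex and one point off the graph (illustrative values).
vertex, point = (1, -4), (3, 0)
# Steps 2 and 3: plug into vertex form and solve for a.
a = vertex_form_coefficient(vertex, point)
# Step 4: write the equation with a, h and k filled in.
print(f"y = {a}(x - {vertex[0]})^2 + {vertex[1]}")
```

Here 0 = a(3 - 1)^2 - 4 gives 4a = 4, so a = 1 and the equation is y = (x - 1)^2 - 4.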

National Diabetes Statistics Report, 2017

Table 5. Number and rate of emergency department visits among adults aged ≥18 years with diagnosed
diabetes, United States, 2014

Cause of emergency department visit | No. in thousands | Crude rate per 1,000 persons with diabetes (95% CI)
Diabetes as any listed diagnosis | 14,192 | 648.9 (600.9–696.9)
Hypoglycemia | 245 | 11.2 (10.4–12.1)
Hyperglycemic crisis | 207 | 9.5 (8.8–10.2)
CI = confidence interval.
Data source: United States Diabetes Surveillance System.
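
Table 5 reports visit counts in thousands alongside crude rates per 1,000 persons with diabetes. As a rough consistency check, the sketch below back-calculates the denominator implied by each row using rate = visits / population × 1,000. The resulting population figures are derived arithmetic, not numbers stated in the report, and the function name `implied_population` is hypothetical.

```python
def implied_population(visits_in_thousands, rate_per_1000):
    """Population implied by a visit count (in thousands) and a crude rate per 1,000 persons."""
    visits = visits_in_thousands * 1000
    return visits / rate_per_1000 * 1000


rows = {
    "Diabetes as any listed diagnosis": (14192, 648.9),
    "Hypoglycemia": (245, 11.2),
    "Hyperglycemic crisis": (207, 9.5),
}
for cause, (visits_k, rate) in rows.items():
    print(f"{cause}: about {implied_population(visits_k, rate) / 1e6:.1f} million adults")
```

All three rows point to a similar denominator, which is what one would expect if the rates share the same population of adults with diagnosed diabetes.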

Kidney Disease
• Among U.S. adults aged 20 years or older with diagnosed diabetes, the estimated crude prevalence of chronic
kidney disease (stages 1–4) was 36.5% (95% CI, 32.2%–40.8%) during 2011–2012 [2].
• Among those with diabetes and moderate to severe kidney disease (stage 3 or 4), 19.4% (95% CI, 15.5%–23.2%)
were aware of their kidney disease during 1999–2012 [3].
• In 2014, a total of 52,159 people developed end-stage renal disease with diabetes as the primary cause. Adjusted
for age group, sex, and racial or ethnic group, the rate was 154.4 per 1 million persons [4].

Deaths
• Diabetes was the seventh leading cause of death in the United
States in 2015. This finding is based on 79,535 death certificates
in which diabetes was listed as the underlying cause of death
(crude rate, 24.7 per 100,000 persons) [5].
• Diabetes was listed as any cause of death on 252,806 death
certificates in 2015 (crude rate, 78.7 per 100,000 persons) [5].

Cost
• The total direct and indirect estimated cost of diagnosed
diabetes in the United States in 2012 was $245 billion [6].
• Average medical expenditures for people with diagnosed
diabetes were about $13,700 per year. About $7,900 of this
amount was attributed to diabetes [6].
• After adjusting for age group and sex, average medical
expenditures among people with diagnosed diabetes were
about 2.3 times higher than expenditures for people without
diabetes [6].
Denny Gunawan

221 Queen St
Melbourne VIC 3000

123 Somewhere St, Melbourne VIC 3000


(03) 1234 5678
$39.60
Invoice Number: #20130304

Organic Items | Price/kg | Quantity(kg) | Subtotal
Apple | $5.00 | 1 | $5.00
Orange | $1.99 | 2 | $3.98
Watermelon | $1.69 | 3 | $5.07
Mango | $9.56 | 2 | $19.12
Peach | $2.99 | 1 | $2.99

Subtotal $36.00
GST (10%) $3.60
Total $39.60

* Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam sodales
dapibus fermentum. Nunc adipiscing, magna sed scelerisque cursus, erat
lectus dapibus urna, sed facilisis leo dui et ipsum.
Invoice

Invoice Number INV-3337
Order Number 12345
Invoice Date January 25, 2016
Due Date January 31, 2016
Total Due $93.50

From:
DEMO - Sliced Invoices
Suite 5A-1204
123 Somewhere Street
Your City AZ 12345
admin@slicedinvoices.com

To:
Test Business
123 Somewhere St
Melbourne, VIC 3000
test@test.com

Hrs/Qty | Service | Rate/Price | Adjust | Sub Total
1.00 | Web Design (This is a sample description...) | $85.00 | 0.00% | $85.00

Sub Total $85.00
Tax $8.50
Total $93.50
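
The totals on this invoice can be reproduced with a short sketch. Two points are assumptions rather than facts printed on the invoice: the 10% tax rate is inferred from $8.50 on $85.00, and the "Adjust" percentage is treated as a signed adjustment added to the line amount. The function name `invoice_totals` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def invoice_totals(quantity, rate, adjust_percent, tax_rate):
    """Return (sub_total, tax, total), each rounded to cents.

    adjust_percent is assumed to be a signed percentage applied to the line amount.
    """
    sub_total = (quantity * rate * (1 + adjust_percent / 100)).quantize(CENT, ROUND_HALF_UP)
    tax = (sub_total * tax_rate).quantize(CENT, ROUND_HALF_UP)
    return sub_total, tax, sub_total + tax


sub_total, tax, total = invoice_totals(
    quantity=Decimal("1.00"),
    rate=Decimal("85.00"),
    adjust_percent=Decimal("0.00"),
    tax_rate=Decimal("0.10"),  # assumed: inferred from the printed figures
)
print(sub_total, tax, total)
```

`Decimal` is used instead of floats so that currency amounts round predictably to whole cents.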

ANZ Bank
ACC # 1234 1234
BSB # 4321 432

Payment is due within 30 days from date of invoice. Late payment is subject to fees of 5% per month.
Thanks for choosing DEMO - Sliced Invoices | admin@slicedinvoices.com
