
Finding Semantic Relationships Among Associated Medical Terms
Submitted By:
Manisha Singh (111497)
Sneha Bairagi (111717)
Abhinav Rai (511004)

Introduction
With the continuous digitisation of medical knowledge, information extraction tools are becoming increasingly important for practitioners in the medical domain. In this project we tackle the extraction of semantic relationships from medical texts.
We focus on extracting disease-medicine co-occurrence relationships from the text of the literature. A large-scale, accurate list of drug-disease treatment pairs derived from published biomedical literature can be used for drug repurposing.

PROPOSED SYSTEM
Information extraction is the identification of specific information in unstructured data sources, such as natural language text.
The first task identifies and extracts informative sentences on disease and treatment topics.
The second task performs a finer-grained classification of these sentences according to the semantic relations that exist between diseases and treatments.

Implementation

The steps involved are:
1. Obtain documents containing medical data from the web.
2. Perform tokenization.
3. Perform stemming.
4. Perform POS tagging.
5. Perform annotation.
6. Find disease-treatment pairs using pattern matching.

Tokenization

Tokenization is the process of breaking up the given text into units called tokens. A token may be a word, a number or a punctuation mark.
Tokenization does this by locating word boundaries. A word boundary is the point where one word ends and the next word begins. Tokenization is also known as word segmentation.
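
As an illustration of the idea above, here is a minimal tokenizer sketch in Python. The regular expression and the name `tokenize` are assumptions made for this example, not the project's actual implementation; it simply splits text into words, numbers and punctuation marks.

```python
import re
from typing import List

# Assumed token pattern (illustrative only):
#   1. numbers, optionally with a decimal part (e.g. 75, 2.5)
#   2. words, optionally joined by hyphens or apostrophes (e.g. drug-disease)
#   3. any single punctuation character
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Aspirin 75 mg reduces fever."))
    # ['Aspirin', '75', 'mg', 'reduces', 'fever', '.']
```

Whitespace is used only to find boundaries and is not returned as a token.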

Stemming
Stemming is the term used in linguistic morphology and information retrieval for the process of reducing inflected words to their word stem, base or root form, generally a written word form.
Existing stemming algorithms include:
Truncate(n), Lovins stemmer, Dawson stemmer, Porter's stemmer.
We are using Porter's stemmer.
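
To make the notion of a stem concrete, the sketch below shows Truncate(n), the simplest of the algorithms listed above: it keeps only the first n letters of a word. This is not Porter's stemmer, which the project uses and which applies a much richer set of suffix-stripping rules. The function name and the default n = 5 are assumptions chosen for illustration.

```python
def truncate_stem(word: str, n: int = 5) -> str:
    """Truncate(n) stemming: keep the first n letters of the lowercased word.

    Words shorter than n letters are returned unchanged (apart from case).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return word.lower()[:n]


if __name__ == "__main__":
    for w in ["treatment", "treating", "treated", "drug"]:
        print(w, "->", truncate_stem(w))
    # treatment -> treat
    # treating -> treat
    # treated -> treat
    # drug -> drug
```

Truncation is crude: it can merge unrelated words that share a prefix, which is one reason rule-based stemmers such as Porter's are preferred.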

POS tagging
Part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence or paragraph.
In short, it is the process of assigning a part of speech to each word in a sentence.

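VITERBI ALGORITHM
Given
a) Start state: s1
b) Alphabet A = {a1, a2, ..., ap}
c) Set of states S = {s1, s2, ..., sN}
d) Transition probabilities P(sj --ak--> si): the probability of moving from state sj to state si while emitting symbol ak

Data structures
1. An N*T array called SEQSCORE that holds, for each state and each step, the score of the winning sequence ending in that state (N = number of states, T = length of the output sequence plus one, because column 1 is the start, before any symbol is emitted).
2. Another N*T array called BACKPTR, used to recover the path.

Steps
1. Initialization
   SEQSCORE(1,1) = 1.0
   BACKPTR(1,1) = 0
   for (i = 2 to N) do
      SEQSCORE(i,1) = 0.0
   (expressing the fact that the first state is s1)
2. Iteration
   for (t = 2 to T) do
      for (i = 1 to N) do
         SEQSCORE(i,t) = max over j = 1..N of [SEQSCORE(j,t-1) * P(sj --ak--> si)]
         BACKPTR(i,t) = the index j that gives the max above
   (ak is the symbol emitted at step t, i.e. the (t-1)-th symbol of the output sequence)
3. Sequence identification
   C(T) = the i that maximizes SEQSCORE(i,T)
   for i from (T-1) down to 1 do
      C(i) = BACKPTR[C(i+1), (i+1)]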

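Example
A machine with two states, s1 and s2. Each arc is labelled [output symbol, probability]:

s1 -> s1 : [a1,0.1], [a2,0.2]
s1 -> s2 : [a1,0.3], [a2,0.4]
s2 -> s1 : [a1,0.2], [a2,0.3]
s2 -> s2 : [a1,0.3], [a2,0.2]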

Tabular representation (output sequence: a1 a2 a1 a2)

      ε      a1     a2     a1      a2
S1    1.0    0.1    0.09   0.012   0.0081
S2    0.0    0.3    0.06   0.027   0.0054

Probability table
[Table: rows S1, S2; columns ε, a1, a2, a1, a2]

BACKPTR Table
[Table]
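
The Viterbi steps can be sketched in Python as follows. This is a minimal illustration, not the project's code. The `TRANS` dictionary is an assumption: it assigns the example's arc labels to state pairs in the one way that reproduces the SEQSCORE values 0.1, 0.09, 0.012, 0.0081 (S1) and 0.3, 0.06, 0.027, 0.0054 (S2) for the output sequence a1 a2 a1 a2. The function and variable names are likewise chosen for this example.

```python
from typing import Dict, List, Optional, Tuple

# ASSUMPTION: arc assignment inferred from the example's labels.
# Key: (source state, emitted symbol, target state) -> probability.
TRANS: Dict[Tuple[str, str, str], float] = {
    ("S1", "a1", "S1"): 0.1, ("S1", "a2", "S1"): 0.2,
    ("S1", "a1", "S2"): 0.3, ("S1", "a2", "S2"): 0.4,
    ("S2", "a1", "S1"): 0.2, ("S2", "a2", "S1"): 0.3,
    ("S2", "a1", "S2"): 0.3, ("S2", "a2", "S2"): 0.2,
}


def viterbi(states: List[str], start: str, outputs: List[str],
            trans: Dict[Tuple[str, str, str], float]):
    """Return (SEQSCORE, BACKPTR, best state sequence)."""
    n_cols = len(outputs) + 1  # column 0 is the start, before any output
    seqscore = {s: [0.0] * n_cols for s in states}
    backptr: Dict[str, List[Optional[str]]] = {
        s: [None] * n_cols for s in states
    }

    # Initialization: all probability mass sits on the start state.
    seqscore[start][0] = 1.0

    # Iteration: best predecessor for every state at every step.
    for t, symbol in enumerate(outputs, start=1):
        for target in states:
            def score(prev: str) -> float:
                return seqscore[prev][t - 1] * trans.get(
                    (prev, symbol, target), 0.0)
            best_prev = max(states, key=score)
            seqscore[target][t] = score(best_prev)
            backptr[target][t] = best_prev

    # Sequence identification: follow back-pointers from the best final state.
    path = [max(states, key=lambda s: seqscore[s][-1])]
    for t in range(n_cols - 1, 0, -1):
        path.append(backptr[path[-1]][t])
    path.reverse()
    return seqscore, backptr, path


if __name__ == "__main__":
    scores, back, best = viterbi(["S1", "S2"], "S1",
                                 ["a1", "a2", "a1", "a2"], TRANS)
    for state in ("S1", "S2"):
        print(state, [round(v, 6) for v in scores[state]])
    print("best path:", best)
    # S1 [1.0, 0.1, 0.09, 0.012, 0.0081]
    # S2 [0.0, 0.3, 0.06, 0.027, 0.0054]
    # best path: ['S1', 'S2', 'S1', 'S2', 'S1']
```

Columns are indexed from 0 here rather than from 1, so column 0 corresponds to the ε column of the table.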

Annotating Corpora and Searching patterns


Sentences are tagged with disease entities from the clean disease lexicon and with drug entities from the drug list.

The following patterns are searched for between the drug and the disease:
- in,
- in the treatment of,
- for,
- in patients with,
- for the treatment of,
- treatment of,
- therapy for,
- therapy in, etc.
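
A minimal sketch of this pattern search in Python is given below. It assumes that the drug comes first, followed by the pattern and then the disease (for example "DRUG in the treatment of DISEASE"). The lexicon entries, the function name `find_pairs` and the exact-match strategy are hypothetical, chosen only to illustrate the idea.

```python
import re
from typing import Iterable, List, Tuple

# Patterns from the list above.
PATTERNS = [
    "in", "in the treatment of", "for", "in patients with",
    "for the treatment of", "treatment of", "therapy for", "therapy in",
]


def find_pairs(sentence: str, drugs: Iterable[str], diseases: Iterable[str],
               patterns: Iterable[str] = PATTERNS) -> List[Tuple[str, str, str]]:
    """Return (drug, pattern, disease) triples found in the sentence.

    ASSUMPTION: a match is the drug name, the pattern and the disease name
    appearing consecutively, separated only by whitespace.
    """
    text = sentence.lower()
    found = []
    for drug in sorted(drugs):
        for disease in sorted(diseases):
            for pattern in patterns:
                regex = (r"\b" + re.escape(drug.lower()) + r"\s+"
                         + re.escape(pattern) + r"\s+"
                         + re.escape(disease.lower()) + r"\b")
                if re.search(regex, text):
                    found.append((drug, pattern, disease))
    return found


if __name__ == "__main__":
    # Hypothetical lexicon entries, for illustration only.
    drug_list = {"metformin"}
    disease_lexicon = {"type 2 diabetes"}
    print(find_pairs(
        "Metformin in the treatment of type 2 diabetes was effective.",
        drug_list, disease_lexicon))
    # [('metformin', 'in the treatment of', 'type 2 diabetes')]
```

A real system would match against annotated entities rather than raw strings, but the triple it produces has the same shape.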

Algorithm
Input: Disease, Rules.
Output: Medicine, Semantic Relationship.
1. For each disease, extract papers from MEDLINE.
2. Tokenize the documents.
3. Remove all stopwords.
4. Perform stemming.
5. Perform POS tagging to separate out the required parts of speech.
6. Convert the corpus into an annotated corpus.
7. From the annotated sentences, extract those containing at least one medicine and one disease.
8. Search for patterns between the disease and the medicine.
9. Associate the medicines with the disease and rank them by frequency and superiority.
10. Present the semantic relationships to the user.
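
Step 9 can be illustrated with a short Python sketch that ranks medicines by how often they were paired with a disease. Only the frequency criterion is covered; the "superiority" criterion is not defined here, so it is left out. The function name and the sample pairs are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_medicines(pairs: Iterable[Tuple[str, str]],
                   disease: str) -> List[Tuple[str, int]]:
    """Rank medicines for one disease by how often the pair was extracted.

    pairs: (medicine, disease) tuples, one per matching sentence.
    Ties are broken alphabetically so that the ranking is deterministic.
    """
    counts = Counter(med for med, dis in pairs if dis == disease)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical extracted pairs, for illustration only.
    extracted = [
        ("metformin", "diabetes"),
        ("insulin", "diabetes"),
        ("metformin", "diabetes"),
        ("aspirin", "migraine"),
    ]
    print(rank_medicines(extracted, "diabetes"))
    # [('metformin', 2), ('insulin', 1)]
```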

HARDWARE requirements

PROCESSOR : PENTIUM IV
RAM       : 256 MB
HARD DISK : 40 GB

SOFTWARE REQUIREMENTS

FRONT END        : JAVA SWING
OPERATING SYSTEM : WINDOWS XP/7
TOOL             : ECLIPSE

Deliverables
Rapid access to information regarding potential immunizations.
Medicines ranked on the basis of their frequency.
Can be used in medicine repurposing.
Can inform doctors about new drugs available for a disease by processing biomedical literature and clinical trial studies.

Extension Possibility
It can be extended to extract information regarding the cure, symptoms and prevention of a disease.
It can help in finding the root cause of a disease and then, using the patient's history or condition, in recommending a dose accordingly. This would be based on examining the composition of the medicine and checking it against the patient's report to determine whether it suits the patient.

References
Rong Xu and QuanQiu Wang, "Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing", 2013.
Fadi Yamout, "Further Enhancement to the Porter's Stemming Algorithm", 2006.
S. Ray and M. Craven, "Representing sentence structure in Hidden Markov Models for information extraction", Proceedings of IJCAI-2001.
M. S. Ryan and G. R. Nudd, "The Viterbi Algorithm", Department of Computer Science, University of Warwick, Coventry, England, 1993.
Jesse Davis and Mark Goadrich, "The Relationship Between Precision-Recall and ROC Curves", Department of Computer Sciences and Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, USA.

Thank you
