
ABSTRACT

Internet services and applications have become an inextricable part of daily life,
enabling communication and the management of personal information from anywhere. To
accommodate this increase in application and data complexity, web services have moved to a
multitiered design wherein the webserver runs the application front-end logic and data are
outsourced to a database or file server. In this paper, we present Doubleguard, an intrusion detection system (IDS)
that models the network behavior of user sessions across both the front-end webserver and
the back-end database. By monitoring both web and subsequent database requests, we are
able to ferret out attacks that an independent IDS would not be able to identify. Furthermore,
we quantify the limitations of any multitier IDS in terms of training sessions and functionality
coverage. We implemented Doubleguard using an Apache webserver with MySQL and
lightweight virtualization. We then collected and processed real-world traffic over a 15-day
period of system deployment in both dynamic and static web applications. Finally, using
Doubleguard, we were able to expose a wide range of attacks with 100 percent accuracy
while maintaining 0 percent false positives for static web services and 0.6 percent false
positives for dynamic web services.

INTRODUCTION
The immense popularity of Internet and P2P networks has produced a significant
stimulus to P2P file sharing systems, where a file requester's query is forwarded to a file
provider in a distributed manner. The median file size of these P2P systems is 4 MB, which
represents a 1,000-fold increase over the 4 KB median size of typical Web objects. Measurement
studies also show that access to these files is highly repetitive and skewed toward the most
popular ones. In such circumstances, if a server receives many requests at a time, it could
become overloaded and consequently cannot respond to the requests quickly. Therefore,
highly popular files (i.e., hot files) could exhaust the bandwidth capacity of the servers,
leading to low efficiency in file sharing.
File replication is an effective method to deal with the problem of server overload by
distributing load over replica nodes. It helps to achieve high query efficiency by reducing
server response latency and lookup path length (i.e., the number of hops in a lookup path). A
more effective file replication method produces a higher replica hit rate. A replica hit occurs
when a file request is resolved by a replica node rather than the file owner. Replica hit rate
denotes the percentage of file queries that are resolved by replica nodes among all queries.
Recently, numerous file replication methods have been proposed. These methods can be
generally classified into three categories, denoted Server Side, Client Side, and Path. Server
Side replicates a file close to the file owner; Client Side replicates a file close to or at a file
requester; and Path replicates on the nodes along the query path from a requester to a file
owner. However, most of these methods either have low effectiveness in improving query
efficiency or come at the cost of high overhead. By replicating files on the nodes near the
file owners, Server Side enhances replica hit rate and query efficiency. However, it cannot
significantly reduce path length because replicas are close to the file owners, and it may
overload the replica nodes since a node has a limited number of neighbours. On the other
hand, Client Side can dramatically improve query efficiency when a replica node queries for
its own replica files, but such a case is not guaranteed to occur as node interest varies over
time. Moreover, these replicas have little chance of serving other requesters. Thus, Client
Side cannot ensure a high hit rate or replica utilization. Path avoids the problems of Server
Side and Client Side.

It provides a high hit rate and greatly reduces lookup path length. However, its
effectiveness is outweighed by the high overhead of replicating and maintaining many more
replicas. Furthermore, it may produce underutilized replicas. Since more replicas lead to
higher query efficiency but also more maintenance overhead, a challenge for a replication
algorithm is how to minimize replicas while still achieving high query efficiency. From a
technical perspective, the key to the success of a corporate network is choosing the right data
sharing platform: a system which makes the shared data (stored and maintained by different
companies) visible network-wide and supports efficient analytical queries over those data.
Traditionally, data sharing is achieved by building a centralized data warehouse, which
periodically extracts data from the internal production systems (e.g., ERP) of each company
for subsequent querying. Unfortunately, such a warehousing solution has some deficiencies in
real deployment.

Fig. 1.1 A Cloud Peer-to-Peer System


Finally, to maximize revenues, companies often dynamically adjust their business processes
and may change their business partners. Therefore, the participants may join and leave the
corporate network at will. The data warehouse solution has not been designed to handle such
dynamicity. To address the aforementioned problems, this paper presents Doubleguard, a
cloud-enabled data sharing platform designed for corporate network applications.

By integrating cloud computing, database, and peer-to-peer (P2P) technologies,


Doubleguard achieves high query processing efficiency and is a promising approach for
corporate network applications, with the following distinguishing features. Doubleguard is
deployed as a service in the cloud. To form a corporate network, companies simply register
their sites with the Doubleguard service provider, launch Doubleguard instances in the cloud,
and finally export data to those instances for sharing. Doubleguard adopts the pay-as-you-go
business model popularized by cloud computing.

Fig. 1.2 Doubleguard Cloud Cluster


The total cost of ownership is therefore substantially reduced since companies do not
have to buy any hardware or software in advance. Instead, they pay for what they use in terms
of Doubleguard instance hours and storage capacity.
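As a rough illustration of this pay-as-you-go model, a company's monthly charge can be estimated from instance hours and storage used. The following C# sketch is illustration only; the rates and names are assumptions of ours, not actual Doubleguard pricing or APIs.

using System;

class PayAsYouGoEstimate
{
    // Hypothetical rates chosen only for illustration.
    const decimal RatePerInstanceHour = 0.10m;   // assumed $ per instance hour
    const decimal RatePerGbMonth      = 0.05m;   // assumed $ per GB per month

    static decimal MonthlyCost(int instances, int hoursPerInstance, decimal storageGb)
    {
        return instances * hoursPerInstance * RatePerInstanceHour
             + storageGb * RatePerGbMonth;
    }

    static void Main()
    {
        // e.g., 3 instances running 720 hours each, plus 200 GB of shared data
        Console.WriteLine(MonthlyCost(3, 720, 200m));   // about 226 with the assumed rates
    }
}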

LITERATURE REVIEW
1. Spontaneous fluctuations in brain activity observed with functional magnetic
resonance imaging
Author: Michael D. Fox* and Marcus E. Raichle
One common approach is a paradigm that requires subjects to open and close their eyes at
fixed intervals. Modulation of the functional magnetic resonance imaging (fMRI)
blood oxygen level dependent (BOLD) signal attributable to the experimental paradigm can
be observed in distinct brain regions, such as the visual cortex, allowing one to relate brain
topography to function. However, spontaneous modulation of the BOLD signal which cannot
be attributed to the experimental paradigm or any other explicit input or output is also
present. Because it has been viewed as noise in task-response studies, this spontaneous
component of the BOLD signal is usually minimized through averaging.
Regarding methods for analyzing spontaneous BOLD data, spontaneous neuronal activity refers
to activity that is not attributable to specific inputs or outputs; it represents neuronal activity
that is intrinsically generated by the brain. As such, fMRI studies of spontaneous activity
attempt to minimize changes in sensory input and refrain from requiring subjects to make
responses or perform specific cognitive tasks. Most studies are conducted during continuous
resting-state conditions such as fixation on a cross-hair or eyes-closed rest. Subjects are
usually instructed simply to lie still in the scanner and refrain from falling asleep. After data
acquisition, two important data analysis issues must be considered: how to account for non-neuronal noise and how to identify spatial patterns of spontaneous activity.
Advantage

Provides a measure of changes in neuronal activity.

Disadvantage

It does not consider different models for different regions of the time series.
Higher time and computational cost.

2. A Wavelet-Based Anytime Algorithm for K-Means Clustering


Author: Michail Vlachos, Jessica Lin, Eamonn Keogh, Dimitrios Gunopulos

The emergence of the field of data mining in the last decade has sparked an increase
of interest in clustering of time series. Such clustering is useful in its own right as a
method to summarize and visualize massive datasets. Clustering is often used as a
subroutine in other data mining algorithms such as similarity search, classification, and the
discovery of association rules. Applications of these algorithms cover a wide range of
activities found in finance, meteorology, industry, medicine, etc.
Although there has been much research on clustering in general, the unique
structure of time series means that most classic machine learning and data mining
algorithms do not work well for time series. In particular, the high dimensionality,
very high feature correlation, and the (typically) large amount of noise that characterize
time series data present a difficult challenge.
Anytime algorithms are algorithms that trade execution time for quality of
results.

In particular, an anytime algorithm always has a best-so-far answer available,
and the quality of the answer improves with execution time. The user may examine this
answer at any time, and choose to terminate the algorithm, temporarily suspend the
algorithm, or allow the algorithm to run to completion. It would be highly desirable to
implement the algorithm as an anytime algorithm. This would allow a user to examine the
best current answer after an hour or so as a sanity check of all assumptions and parameters.
As a simple example, suppose the user had accidentally set the value of K to 50 instead of the
desired value of 5. Using a batch algorithm the mistake would not be noticed for a week,
whereas using an anytime algorithm the mistake could be noticed early on and the
algorithm restarted with little cost.
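A minimal sketch of the anytime idea follows: k-means runs one Lloyd iteration per call and always keeps a best-so-far assignment that the user may inspect, stop on, or let run to completion. This is a plain (non-wavelet) illustration under our own assumptions; the multi-resolution wavelet part of the cited algorithm is omitted, and all names are hypothetical.

using System;
using System.Linq;

// Sketch of an anytime k-means loop: one Lloyd iteration per Step() call, with a
// best-so-far clustering always available for inspection.
class AnytimeKMeans
{
    private readonly double[][] data;       // each row is a time series (or a coarse version of one)
    private readonly double[][] centroids;  // k cluster centres
    public int[] BestSoFar { get; private set; }   // best-so-far cluster label of every series

    public AnytimeKMeans(double[][] data, int k, int seed)
    {
        this.data = data;
        Random rnd = new Random(seed);
        // initialise the centroids with k randomly chosen series
        centroids = data.OrderBy(p => rnd.Next()).Take(k)
                        .Select(p => (double[])p.Clone()).ToArray();
        BestSoFar = new int[data.Length];
    }

    private static double Dist(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(sum);
    }

    // One Lloyd iteration; returns false once the assignment stops changing.
    public bool Step()
    {
        bool changed = false;
        for (int i = 0; i < data.Length; i++)
        {
            int best = 0;
            for (int c = 1; c < centroids.Length; c++)
                if (Dist(data[i], centroids[c]) < Dist(data[i], centroids[best])) best = c;
            if (best != BestSoFar[i]) { BestSoFar[i] = best; changed = true; }
        }
        for (int c = 0; c < centroids.Length; c++)
        {
            double[][] members = data.Where((p, i) => BestSoFar[i] == c).ToArray();
            if (members.Length == 0) continue;              // keep the old centroid if empty
            for (int d = 0; d < centroids[c].Length; d++)
                centroids[c][d] = members.Average(m => m[d]);
        }
        return changed;
    }
}

// Usage: the caller may stop at any time and read BestSoFar.
//   AnytimeKMeans km = new AnytimeKMeans(series, 5, 0);
//   while (km.Step()) { /* inspect km.BestSoFar; break whenever the user is satisfied */ }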
Advantage

Provides a very coarse-resolution representation of the data quickly.

Robust to noisy data.
Improves the execution time.

Disadvantage

fMRI brain activity data are not directly amenable to wavelet analysis.

3. Model-based Classification of Data with Time Series-valued Attributes

Author: Claudia Plant, Andrew Zherdin, Leonhard Laer


Time series data are collected in many applications including finance, science, natural
language processing, medicine and multimedia. The content of large time series databases
cannot be analyzed manually. To support the knowledge discovery process from time series
databases, effective and efficient data mining methods are required. Often, the primary goal
of the knowledge discovery process is classification, which is the task of automatically assigning
class labels to data objects. To learn rules, strategies or patterns for automatic classification,
the classifier needs to be trained on a set of data objects for which the class labels are known.
Typically, this so-called training data set has been labeled by human experts. Based on the
patterns learned from the training data set, the classifier automatically assigns labels to new,
unseen objects.
The classification of the test objects is now simple. To classify an object, we sum up the mean
square error for all relevant models for all classes. We assign the object to the class with the
smallest mean square error.
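The classification rule just described can be sketched as follows: for each class, sum the mean square error of the object against all of that class's relevant models, and assign the object to the class with the smallest total. In this illustration a trained model is abstracted as a predictor function; the Predictor delegate and the way models are stored are our own assumptions.

using System;
using System.Collections.Generic;
using System.Linq;

// A trained model is abstracted as a function that predicts the observed series.
delegate double[] Predictor(double[] observed);

// Sketch: sum the mean square error over all relevant models per class and assign
// the object to the class with the smallest total.
static class MseClassifier
{
    private static double Mse(double[] actual, double[] predicted)
    {
        double sum = 0;
        for (int i = 0; i < actual.Length; i++)
            sum += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
        return sum / actual.Length;
    }

    // modelsPerClass maps each class label to its list of relevant trained models.
    public static string Classify(double[] obj, Dictionary<string, List<Predictor>> modelsPerClass)
    {
        string bestClass = null;
        double bestError = double.MaxValue;
        foreach (KeyValuePair<string, List<Predictor>> entry in modelsPerClass)
        {
            double total = entry.Value.Sum(model => Mse(obj, model(obj)));
            if (total < bestError) { bestError = total; bestClass = entry.Key; }
        }
        return bestClass;
    }
}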
Advantage

Can handle increasing amounts of captured motion-stream data.

Motion classification automatically assigns movement labels.
Works with time series obtained from different sensors.

Disadvantage

Requires multiple sensors to capture human movements.

Does not consider interaction regions.

4. CoRE: A Context-Aware Relation Extraction Method for Relation Completion


Author: Zhixu Li, Mohamed A. Sharaf, Laurianne Sitbon, Xiaoyong Du, and Xiaofang Zhou,
Senior Member, IEEE
ABSTRACT

Relation completion (RC) is a recurring problem that is central to the success of novel big
data applications such as entity reconstruction and data enrichment. Given a semantic
relation R, RC attempts to link entity pairs between two entity lists under the relation R.
To accomplish the RC goals, we propose to formulate search queries for each query entity a
based on some auxiliary information, so as to detect its target entity b from the set of
retrieved documents. For instance, a pattern-based method (PaRE) uses extracted patterns as
the auxiliary information in formulating search queries. However, relying on high-quality patterns may
decrease the probability of finding suitable target entities. As an alternative, we propose the
CoRE method, which uses context terms learned from the text surrounding the expression of a relation as the
auxiliary information in formulating queries. The experimental results based on several real-world
web data collections demonstrate that CoRE reaches a much higher accuracy than
PaRE for the purpose of RC.

DISADVANTAGES

This data is typically unstructured and naturally lacks any binding information (i.e.,
foreign keys). Linking this data clearly goes beyond the capabilities of current data
integration systems.

This motivated novel frameworks that incorporate information extraction (IE) tasks
such as named entity recognition (NER) and relation extraction (RE).

Those frameworks have been used to enable some of the emerging data linking
applications such as entity reconstruction and data enrichment.

EXISTING SYSTEM

First, the corporate network needs to scale up to support thousands of participants, while the
installation of a large-scale centralized data warehouse system entails nontrivial costs,
including huge hardware/software investments (a.k.a. total cost of ownership) and high
maintenance costs (a.k.a. total cost of operations). In the real world, most companies are not
keen to invest heavily in additional information systems until they can clearly see the
potential return on investment (ROI). Second, companies want to fully customize the access
control policy to determine which business partners can see which part of their shared data.
Unfortunately, most data warehouse solutions fail to offer such flexibility. Finally, to
maximize revenues, companies often dynamically adjust their business processes and may
change their business partners. Therefore, the participants may join and leave the corporate
network at will. The data warehouse solution has not been designed to handle such
dynamicity.
Existing P2P search techniques are based on either unstructured hint-based routing or
structured Distributed Hash Table (DHT)-based routing; neither of these two paradigms can
provide a satisfactory solution to the DPM problem.
Unstructured techniques are not efficient in terms of the generated volume of search
messages; moreover, no guarantee of search completeness is provided.
Structured techniques, on the other hand, strive to build an additional layer on top of a cloud
protocol for supporting partial-prefix matching.
Cloud peer-to-peer mechanisms cluster keys based on numeric distance, but for efficient
subset matching, keys should be clustered based on Hamming distance.

DISADVANTAGES OF EXISTING SYSTEM:

The corporate network needs to scale up to support thousands of participants, while
the installation of a large-scale centralized data warehouse system entails nontrivial
costs, including huge hardware/software investments and high maintenance costs.

Most data warehouse solutions fail to offer the required flexibility.

The warehousing solution has some deficiencies in real deployment.

It is expensive.


PROPOSED SYSTEM

Doubleguard achieves high query processing efficiency and is a promising approach for
corporate network applications, with the following distinguishing features. Doubleguard is
deployed as a service in the cloud. To form a corporate network, companies simply register
their sites with the Doubleguard service provider, launch Doubleguard instances in the cloud and
finally export data to those instances for sharing. Doubleguard adopts the pay-as-you-go
business model popularized by cloud computing. The total cost of ownership is therefore
substantially reduced since companies do not have to buy any hardware or software in advance.
Instead, they pay for what they use in terms of Doubleguard instance hours and storage
capacity. Doubleguard extends role-based access control to the inherently distributed
environment of corporate networks. Through a web console interface, companies can easily
configure their access control policies and prevent undesired business partners from accessing
their shared data. Doubleguard employs P2P technology to retrieve data between business partners.
Doubleguard instances are organized as a structured P2P overlay network named BATON.
The data are indexed by table name, column name and data range for efficient retrieval.
Doubleguard employs a hybrid design for achieving high-performance query processing. The
major workload of a corporate network consists of simple, low-overhead queries.
Such queries typically only involve querying a very small number of business partners
and can be processed in a short time. Doubleguard is mainly optimized for these queries. For
infrequent, time-consuming analytical tasks, we provide an interface for exporting the data
from Doubleguard to Hadoop and allow users to analyze those data using MapReduce.
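To make the indexing idea concrete, the sketch below shows an index entry keyed by table name, column name and the value range it covers, together with a lookup that finds the owners whose ranges overlap a queried range. This is a minimal illustration under our own assumptions: the flat in-memory list stands in for the BATON overlay (whose routing is not shown), and all type and member names are hypothetical.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of a range index entry keyed by (table, column, [low, high]) plus the owner peer.
class IndexEntry
{
    public string Table;
    public string Column;
    public double Low;
    public double High;
    public string OwnerPeer;
}

// A flat list stands in for the BATON overlay used by Doubleguard.
class RangeIndex
{
    private readonly List<IndexEntry> entries = new List<IndexEntry>();

    public void Publish(IndexEntry e)
    {
        entries.Add(e);
    }

    // Returns the peers whose published range overlaps the queried range.
    public IEnumerable<string> Lookup(string table, string column, double low, double high)
    {
        return entries.Where(e => e.Table == table && e.Column == column &&
                                  e.Low <= high && low <= e.High)
                      .Select(e => e.OwnerPeer)
                      .Distinct();
    }
}

// Usage:
//   index.Publish(new IndexEntry { Table = "Orders", Column = "Amount", Low = 0, High = 5000, OwnerPeer = "peerA" });
//   IEnumerable<string> owners = index.Lookup("Orders", "Amount", 1000, 2000);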

The main contribution of this paper is the design of the Doubleguard system, which provides
economical, flexible and scalable solutions for corporate network applications. We
demonstrate the efficiency of Doubleguard by benchmarking it against
HadoopDB, a recently proposed large-scale data processing system, over a set of
queries designed for data sharing applications. The results show that for simple, low-overhead
queries, the performance of Doubleguard is significantly better than HadoopDB.

We examined the unique challenges posed by sharing and processing data in an inter-business
environment and proposed Doubleguard, a system which delivers elastic data sharing
services by integrating cloud computing, database, and peer-to-peer technologies.
ADVANTAGES OF PROPOSED SYSTEM

Our system can efficiently handle typical workloads in a corporate network and can
deliver near-linear query throughput as the number of normal peers grows.

Doubleguard adopts the pay-as-you-go business model popularized by cloud
computing. The total cost of ownership is therefore substantially reduced since
companies do not have to buy any hardware or software in advance. Instead, they pay
for what they use in terms of Doubleguard instance hours and storage capacity.

Doubleguard extends role-based access control to the inherently distributed
environment of corporate networks.

Doubleguard employs P2P technology to retrieve data between business partners.

Doubleguard is a promising solution for efficient data sharing within corporate


networks. It provides economical, flexible and scalable solutions for corporate
network applications.

It is more efficient.

It prevents undesired business partners from accessing the shared data.

SYSTEM SPECIFICATION

HARDWARE REQUIREMENTS
Hard disk          :  40 GB
RAM                :  512 MB
Processor          :  Pentium IV
Monitor            :  17" color monitor
Keyboard, Mouse    :  Multimedia

SOFTWARE REQUIREMENTS

Front End          :  VISUAL STUDIO .NET 2010
Platform           :  ASP.NET
Code Behind        :  C#.NET
Back End           :  SQL SERVER 2008
Operating System   :  Windows XP SP3, Vista, 7

4.3 SOFTWARE DESCRIPTION


4.3.1 .NET FRAMEWORK OVERVIEW
The .NET technology provides a new approach to software development. This is the
first development platform designed from the ground up with the Internet in mind.
Previously, Internet functionality had simply been bolted on to pre-Internet operating
systems like Unix and Windows, which required Internet software developers to
understand a host of technologies and integration issues. .NET is designed and
intended for highly distributed software, making Internet functionality and
interoperability easier and more transparent to include in systems than ever before.
.NET was first introduced in 2002 as .NET 1.0 and was intended to compete
with Sun's Java. .NET is easy to learn, but the basics of the C language family are
required; with those, it can be picked up step by step. Unlike Java, .NET is
not free software, yet the source for the Base Class Library is available under the
Microsoft Reference License. .NET is designed for ease of creation of Windows
programs.
4.3.2 ABOUT DOT NET
Microsoft has invested millions in marketing, advertising and development to
produce what it feels is the foundation of the future Internet. It is a corporate initiative
whose strategy was deemed so important that Bill Gates himself, Microsoft
Chairman and CEO, decided to personally oversee its development. It is a technology
that Microsoft claims will reinvent the way companies carry out business globally for
years to come. In his opening speech at the Professional Developers Conference
(PDC) held in Orlando, Florida in July 2000, Gates stated that a transition of this
magnitude only comes around once every five to six years.
4.3.3 OVERVIEW OF THE .NET FRAMEWORK:
The .NET Framework is a new computing platform that simplifies application
development in the highly distributed environment of the Internet. The .NET
Framework is designed to fulfill the following objectives:
To provide a code-execution environment that minimizes software
deployment and versioning conflicts.

To provide a code-execution environment that guarantees safe execution of


code, including code created by an unknown or semi-trusted third party.
To provide a code-execution environment that eliminates the performance
problems of scripted or interpreted environments.
To make the developer experience consistent across widely varying types of
applications, such as Windows-based applications and Web-based
applications.
To build all communication on industry standards to ensure that code based
on the .NET Framework can integrate with any other code.
The .NET Framework has two main components: the common language
runtime and the .NET Framework class library. The common language runtime is the
foundation of the .NET Framework. You can think of the runtime as an agent that
manages code at execution time, providing core services such as memory
management, thread management, and remoting, while also enforcing strict type
safety and other forms of code accuracy that ensure security and robustness. In fact,
the concept of code management is a fundamental principle of the runtime. Code that
targets the runtime is known as managed code, while code that does not target the
runtime is known as unmanaged code. The class library, the other main component of
the .NET Framework, is a comprehensive, object-oriented collection of reusable types
that you can use to develop applications ranging from traditional command-line or
graphical user interface (GUI) applications to applications based on the latest
innovations provided by ASP.NET, such as Web Forms and XML Web services.
The .NET Framework can be hosted by unmanaged components that load the
common language runtime into their processes and initiate the execution of managed
code, thereby creating a software environment that can exploit both managed and
unmanaged features. The .NET Framework not only provides several runtime hosts,
but also supports the development of third-party runtime hosts.
SQL SERVER
The RDBMS concept is gaining momentum all over the world. Microsoft SQL
Server is an RDBMS for Windows, released in the USA by the Microsoft Corporation.

Since processing calls for extensive data input and processing, retrieval of
required information must be quick and efficient. SQL Server supports the event-driven
nature of the Windows environment and has many event-trapping features such as
On Click, On Open, On Double Click, Before Update, After Update, etc.
Event procedures are coded and tagged to those events according to the
necessity of the application. These procedures are run at those particular events, and
thus the whole coding is based on an event-driven methodology. SQL Server's forms
help to create tables, its query screens aid in creating complicated queries, and
generating informative reports is made an easy task.
SQL Server stores records in organized lists called tables. One or more tables
in SQL Server make up a whole database. A table is just a collection of records with
the same structure; all of the records in a table contain the same type of information.
SQL Server allows setting up tables and linking them to other tables. Microsoft SQL
Server is a relational database. This means that the data in several tables is linked
through one or more fields present in the tables. It is this business of linked tables that
separates database programs like SQL Server from the other type of database, the
flat-file database, which allows only a single table in which to store all information.
Microsoft SQL Server extends the performance, reliability, quality, and ease of use of Microsoft SQL Server version 7.0. Microsoft SQL Server includes several new
features that make it an excellent database platform for large-scale online transactional
processing (OLTP), data warehousing, and e-commerce applications. The OLAP
Services feature available in SQL Server version 7.0 is now called SQL Server
Analysis Services. The term OLAP Services has been replaced with the term Analysis
Services. Analysis Services also includes a new data mining component.
ABOUT C# .NET
C# (pronounced "see sharp" or "C Sharp") is one of many .NET programming
languages. It is object-oriented and allows you to build reusable components for a
wide variety of application types. Microsoft introduced C# on June 26th, 2000 and it
became a v1.0 product on Feb 13th 2002.
C# is an evolution of the C and C++ family of languages. However, it borrows
features from other programming languages, such as Delphi and Java. If you look at

the most basic syntax of both C# and Java, the code looks very similar, but then again,
the code looks a lot like C++ too, which is intentional. Developers often ask questions
about why C# supports certain features or works in a certain way. The answer is often
rooted in its C++ heritage.
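As a small illustration of that family resemblance, the class below would look almost identical in Java or C++ apart from minor keywords. It is a generic example written for this report, not code taken from the project.

using System;

// A basic C# class: the syntax (classes, fields, constructors, methods, Main) mirrors Java and C++.
class Greeter
{
    private string name;

    public Greeter(string name)
    {
        this.name = name;
    }

    public void SayHello()
    {
        Console.WriteLine("Hello, " + name);
    }

    static void Main()
    {
        new Greeter("world").SayHello();
    }
}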
4.4 SOFTWARE TESTING
System testing provides the final assurance that software, once validated, is properly
combined with all other system elements. System testing verifies whether all elements
have been combined properly and that overall system function and performance are
achieved.
Characteristics of a Good Test
Tests are likely to catch bugs
No redundancy
Not too simple or too complex
TYPES OF TESTING
Unit Testing
The antithesis of the big bang approach, unit testing begins at the vertex of the spiral and
concentrates on each unit of the software as implemented in source code. Initially,
tests focus on each module individually, assuring that it functions properly as a unit;
hence the name unit testing. Unit testing makes heavy use of white-box testing
techniques, exercising specific paths in a module's control structure to ensure
complete coverage and maximum error detection.
Unit testing focuses verification effort on the smallest unit of software design
the module. Using the procedural design description as a guide, important control
paths are tested to uncover the errors within the boundary of the module. The relative
complexity of the tests and uncovered errors is limited by the constrained scope
established for unit testing.
Unit Test Procedure
Unit testing is considered an adjunct to the coding step. After source-level code
has been developed, reviewed and verified for correct syntax, unit test case design
begins. A module is not a standalone program; hence, driver or stub software must
be developed for each unit test. Stubs serve to replace modules that are subordinate
to the module to be tested. Drivers and stubs represent overhead. Unit testing is
simplified when a module with high cohesion is designed: when only one function
is addressed by a module, the number of test cases is reduced and errors can be
more easily predicted and uncovered.
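A minimal sketch of the driver/stub idea in plain C# (no test framework assumed): the module under test normally depends on a subordinate module, which the stub replaces with predictable behaviour, while the driver feeds inputs and checks outputs. All names here are hypothetical and not taken from the project code.

using System;

// Subordinate module interface that the unit under test depends on.
interface IPriceLookup { decimal PriceOf(string item); }

// Stub: replaces the real subordinate module with fixed, predictable behaviour.
class PriceLookupStub : IPriceLookup
{
    public decimal PriceOf(string item) { return 10m; }   // every item costs 10
}

// Module (unit) under test.
class OrderCalculator
{
    private readonly IPriceLookup lookup;
    public OrderCalculator(IPriceLookup lookup) { this.lookup = lookup; }

    public decimal Total(string item, int quantity)
    {
        return lookup.PriceOf(item) * quantity;
    }
}

// Driver: exercises the unit with test inputs and reports pass/fail.
class OrderCalculatorDriver
{
    static void Main()
    {
        OrderCalculator unit = new OrderCalculator(new PriceLookupStub());
        decimal result = unit.Total("pen", 3);
        Console.WriteLine(result == 30m ? "PASS" : "FAIL: expected 30, got " + result);
    }
}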
Integration Testing
Integration testing is a systematic technique for constructing a program
structure while conducting tests to uncover errors associated with interfacing. The
objective is to take unit-tested modules and build a program structure that has been
dictated by the design. There is often a tendency to attempt non-incremental integration,
that is, to construct the program using a big bang approach: all modules are combined
in advance and the entire program is tested as a whole. A set of errors is encountered, and
correction is difficult because isolation of causes is complicated by the vast expanse
of the entire program. Incremental integration is the antithesis of the big bang
approach. The program is constructed and tested in small segments, where errors are
easier to isolate and correct, interfaces are more likely to be tested completely, and a
systematic approach may be applied.

Different Incremental Integration Strategies


1. Top-Down Integration.
2. Bottom-up Integration.
3. Regression Testing
Validation Testing
The application is tested to check how it responds to the various kinds of input
given. The user should be informed of any exceptions in an understandable manner so
that debugging becomes easier.
At the culmination of black-box testing, the software is completely assembled
as a package and interfacing errors have been uncovered and corrected. The next stage
is validation testing; it can be defined in many ways, but a simple definition is that
validation succeeds when the software functions in the manner that can be
reasonably expected by the user. When a user enters incorrect input, the tool should not
simply display error messages; instead, it should display helpful messages enabling the
user to use the tool properly.

SYSTEM ARCHITECTURE

[Architecture diagram. Recoverable blocks: P2P shaping; register/login; workgroup; connected systems classified as short-lived or long-lived; keyword search; keyword split; Bloom filter; possible keywords; matching on file name, file size and content; duplicate identification and removal; indexing; database; network tree; performance evaluation; selection of the best peer from the list; retrieved data; trained data; sharing of file.]

MODULES

1. NETWORK FORMATION
2. CLIENT AUTHENTICATION
3. SCAN: A Structural Clustering Algorithm for P2P Networks
4. BESTPEER++ MODULE
5. CARRY & FORWARD APPROACH
6. PERFORMANCE EVALUATION

CLIENT AUTHENTICATION

Users should register and log in before searching.

Each user has to be in the specified workgroup.

The admin can log in directly to view the reports of all data.

Only authenticated users can proceed to the subsequent pages.

NETWORK FORMATION

Retrieve the connected systems in the specified workgroup.

The performance of each system is evaluated.

The evaluation classifies systems into long-lived and short-lived systems so as to
make the search efficient, as sketched below.
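A minimal sketch of this long-lived/short-lived split, assuming uptime is the evaluation metric and using an arbitrary one-hour threshold (both are illustration-only assumptions, not values from the project):

using System;
using System.Collections.Generic;

// Sketch: classify connected workgroup peers as long-lived or short-lived by uptime.
class PeerInfo
{
    public string Host;
    public TimeSpan Uptime;
}

class PeerClassifier
{
    // The threshold is an illustrative assumption, not a value specified in this report.
    static readonly TimeSpan LongLivedThreshold = TimeSpan.FromHours(1);

    public static void Split(IEnumerable<PeerInfo> peers,
                             out List<PeerInfo> longLived, out List<PeerInfo> shortLived)
    {
        longLived = new List<PeerInfo>();
        shortLived = new List<PeerInfo>();
        foreach (PeerInfo p in peers)
            (p.Uptime >= LongLivedThreshold ? longLived : shortLived).Add(p);
    }
}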

SCAN: A STRUCTURAL CLUSTERING ALGORITHM FOR P2P NETWORKS

The list of P2P systems is structured in a tree format.

The long-lived systems are taken for the process.

This module classifies systems by network performance; the data are searched from
the best-performing system and the result is returned to the client.

The BestPeer++ search makes the system very effective.

The result is indexed on the client system for future use.

We also use Bloom filters to get more results from the P2P network (a minimal
Bloom filter sketch follows this list).

Specifically, we develop the notion of path-level and source-level redundancy.

Given the QoS requirements of a query, we identify the optimal path and source
redundancy such that not only are the QoS requirements satisfied, but the lifetime of
the system is also maximized.
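A minimal Bloom filter sketch for keyword membership tests, using double hashing over two string hash functions; the filter size, hash count and hash choices are arbitrary illustration values, not parameters taken from this project.

using System;
using System.Collections;

// Sketch of a Bloom filter (false positives possible, false negatives impossible).
class BloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public BloomFilter(int sizeInBits, int hashCount)
    {
        bits = new BitArray(sizeInBits);
        this.hashCount = hashCount;
    }

    // FNV-1a: a second, independent hash so k hashes can be derived by double hashing.
    private static int Fnv1a(string s)
    {
        unchecked
        {
            int h = (int)2166136261;
            foreach (char c in s) { h ^= c; h *= 16777619; }
            return h;
        }
    }

    private int Index(string key, int i)
    {
        unchecked
        {
            int combined = key.GetHashCode() + i * Fnv1a(key);   // double hashing
            return Math.Abs(combined % bits.Length);
        }
    }

    public void Add(string key)
    {
        for (int i = 0; i < hashCount; i++) bits[Index(key, i)] = true;
    }

    public bool MightContain(string key)
    {
        for (int i = 0; i < hashCount; i++)
            if (!bits[Index(key, i)]) return false;
        return true;
    }
}

// Usage: each peer adds its keywords; a searcher checks MightContain(keyword)
// before forwarding a query to that peer.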

BESTPEER++ MODULE

o Replicated copies of a file can reduce the performance of the system.
o So we make the search structure decentralized.
o The file name, file size and file content are searched. If a file matches on more
than one system, the redundant replicas are deleted automatically (a sketch of this
duplicate-identification step is given below). The data history is maintained on the
client side, so memory efficiency is higher than in the existing approaches.
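A minimal sketch of the duplicate-identification step: files are grouped by a content hash together with their size, and every copy beyond the first in each group is reported for removal. The use of SHA-256 and the flat list of file paths are illustration-only assumptions.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Sketch: group files by (size, SHA-256 of content) and list the redundant replicas.
class DuplicateFinder
{
    static string HashOf(string path)
    {
        using (SHA256 sha = SHA256.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(stream));
        }
    }

    // Returns every copy beyond the first in each identical-content group.
    public static List<string> RedundantCopies(IEnumerable<string> paths)
    {
        return paths
            .GroupBy(p => new FileInfo(p).Length + ":" + HashOf(p))
            .SelectMany(g => g.Skip(1))     // keep one copy, report the rest
            .ToList();
    }
}
// The caller can delete the returned paths and record them in the client-side history.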

Carry & Forward Approach

Carry and forward is a technique in which information is sent to an intermediate
station where it is kept and sent at a later time to the final destination or to another
intermediate station.

The intermediate station, or node in a networking context, verifies the integrity of
the message before forwarding it.

In general, this technique is used in networks with intermittent connectivity,
especially in the wilderness or environments requiring high mobility. It may also
be preferable in situations when there are long delays in transmission and variable
and high error rates, or if a direct, end-to-end connection is not available.
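A minimal store-and-forward sketch of this idea: the intermediate node verifies a checksum before accepting a message, keeps accepted messages in a queue, and forwards whatever it holds whenever a link to the next hop becomes available. The additive checksum and the connectivity callback are illustration-only assumptions.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of carry-and-forward behaviour at an intermediate node.
class Message
{
    public string Destination;
    public byte[] Payload;
    public int Checksum;      // simple additive checksum for the illustration
}

class CarryForwardNode
{
    private readonly Queue<Message> held = new Queue<Message>();

    static int ChecksumOf(byte[] data)
    {
        return data.Aggregate(0, (sum, b) => unchecked(sum + b));
    }

    // Accept and store a message only if its integrity check passes.
    public bool Receive(Message m)
    {
        if (ChecksumOf(m.Payload) != m.Checksum) return false;   // corrupted: reject
        held.Enqueue(m);                                         // carry until a link exists
        return true;
    }

    // Called whenever connectivity to the next hop (or the destination) appears.
    public void ForwardAll(Action<Message> send)
    {
        while (held.Count > 0) send(held.Dequeue());
    }
}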

REPORTS

All reports are maintained by the administrator.

The user information, the network information and the data processing
are controlled by the administrator.

The indexing is initiated by the administrator.

The complete log files are maintained by the reports module.

From this module we are able to check and monitor the network for
future data processing.

ALGORITHM
DOUBLEGUARD ALGORITHM:
The algorithm developed in this paper uses two forms of redundancy.
The first form is path redundancy: instead of using a single path to connect a source
cluster to the processing centre, mp disjoint paths may be used. The second is source
redundancy: instead of having one sensor node in a source cluster return the requested
sensor data, ms sensor nodes may be used to return readings to cope with data transmission
and/or sensor faults. The architecture above illustrates a scenario in which mp = 2 (two paths
going from the CH to the processing centre) and ms = 5 (five SNs returning sensor readings
to the CH). Doubleguard extends role-based access control to the inherently distributed
environment of corporate networks. Through a web console interface, companies can easily
configure their access control policies and prevent undesired business partners from accessing
their shared data. Doubleguard employs P2P technology to retrieve data between business partners.
Doubleguard instances are organized as a structured P2P overlay network named BATON.
The data are indexed by table name, column name and data range for efficient retrieval.
Doubleguard employs a hybrid design for achieving high-performance query processing. The
major workload of a corporate network consists of simple, low-overhead queries. Such queries
typically only involve querying a very small number of business partners and can be
processed in a short time. Doubleguard is mainly optimized for these queries. For infrequent,
time-consuming analytical tasks, we provide an interface for exporting the data from
Doubleguard to Hadoop and allow users to analyse those data using MapReduce. The
analysis performed thus far assumes that a source CH does not aggregate data. The CH may
receive up to ms redundant sensor readings due to source redundancy but will just forward
the first one received to the PC. Thus, the data packet size is the same. For more sophisticated
scenarios, conceivably the CH could also aggregate data for query processing, and the size of
the aggregate packet may be larger than the average data packet size. We extend the analysis
to deal with data aggregation in two ways. The first is to set a larger size for the aggregated
packet that would be transmitted from a source CH to the PC.
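To make the choice of mp and ms concrete, the sketch below picks the cheapest (mp, ms) pair whose estimated query success probability meets a QoS target, using total transmissions as a crude proxy for energy (and hence lifetime). The independence assumption, the probability model and all parameter values are our own illustration-only assumptions, not the paper's analysis.

using System;

// Sketch: choose the smallest (mp, ms) whose estimated reliability meets the QoS target.
class RedundancySelector
{
    // Probability that at least one of n independent attempts succeeds.
    static double AtLeastOne(double pSuccess, int n)
    {
        return 1.0 - Math.Pow(1.0 - pSuccess, n);
    }

    // Returns true and the cheapest (mp, ms) meeting the target, if one exists within the bounds.
    public static bool Choose(double pPath, double pSensor, double qosTarget,
                              int maxMp, int maxMs, out int mp, out int ms)
    {
        int bestCost = int.MaxValue;
        mp = ms = 0;
        for (int p = 1; p <= maxMp; p++)
            for (int s = 1; s <= maxMs; s++)
            {
                double reliability = AtLeastOne(pPath, p) * AtLeastOne(pSensor, s);
                int cost = p * s;                       // crude proxy for energy spent
                if (reliability >= qosTarget && cost < bestCost)
                {
                    bestCost = cost; mp = p; ms = s;
                }
            }
        return bestCost != int.MaxValue;
    }
}
// e.g., Choose(0.8, 0.9, 0.99, 5, 8, out mp, out ms) yields mp = 3, ms = 3 under these assumed probabilities.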

CLUSTERING ALGORITHM:
A clustering algorithm that aims to fairly rotate SNs into the role of CH is used to
organize sensors into clusters for energy conservation purposes. The function of a CH is to
manage the network within the cluster, gather sensor readings from the SNs within the
cluster, and relay data in response to a query. The clustering algorithm is executed throughout
the system lifetime.

Readings are aggregated within each cluster.

Each cluster has a CH.

Users issue queries through any CH.

The CH that receives the query is called the Processing Centre (PC).

Each non-CH node selects the CH candidate with the highest residual energy and sends it
a cluster join message (which includes the non-CH node's location). The CH
acknowledges this message.

The role of CH is rotated randomly among the nodes so that the nodes consume their
energy evenly.
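A minimal sketch of the join step just described: each non-CH node picks the CH candidate with the highest residual energy and sends it a join message containing the node's location, and the CH acknowledges. Message passing is reduced to direct method calls here, and all names are hypothetical.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of cluster formation: a sensor node joins the cluster-head candidate
// with the highest residual energy.
class ClusterHead
{
    public string Id;
    public double ResidualEnergy;
    public readonly List<string> Members = new List<string>();

    // Returns an acknowledgement for the join request.
    public bool Join(string nodeId, double x, double y)
    {
        Members.Add(nodeId);          // the location (x, y) would be stored for routing
        return true;
    }
}

class SensorNode
{
    public string Id;
    public double X, Y;

    public ClusterHead JoinBestCandidate(IEnumerable<ClusterHead> candidates)
    {
        // Pick the candidate with the highest residual energy and send a join message.
        ClusterHead best = candidates.OrderByDescending(ch => ch.ResidualEnergy).First();
        bool acknowledged = best.Join(Id, X, Y);
        return acknowledged ? best : null;
    }
}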

DATA FLOW DIAGRAM


[Data flow diagram. Level 0: cloud users/applications supply a dataset and a workflow to the workflow recommend-system framework, which communicates with a remote server. Level 1: user authentication, resource requests and dataset requests are passed to the workflow scheduler, which communicates with the cluster through inter-process communication and a data broker.]

TESTING
SYSTEM TESTING
Testing is done for each module. After testing all the modules, the modules are
integrated and testing of the final system is done with test data specially designed to
show that the system will operate successfully under all conditions. Procedure-level
testing is done first: by giving improper inputs, the errors that occur are noted and
eliminated. Thus system testing is a confirmation that all is correct and an opportunity to
show the user that the system works. The final step involves validation testing, which
determines whether the software functions as the user expected. The end user, rather than
the system developer, conducts this test; most software developers use a process called alpha
and beta testing to uncover errors that only the end user seems able to find.
This is the final step in the system life cycle. Here we implement the tested, error-free
system in a real-life environment and make the necessary changes so that it runs in an online
fashion. System maintenance is done every month or year based on company policies, and the
system is checked for errors such as runtime errors and long-run errors, and for other
maintenance tasks like table verification and reports.

UNIT TESTING
Unit testing focuses verification efforts on the smallest unit of software design, the
module; this is known as module testing. The modules are tested separately. This testing
is carried out during the programming stage itself. In these testing steps, each module is found
to be working satisfactorily with regard to the expected output from the module.

INTEGRATION TESTING
Integration testing is a systematic technique for constructing tests to
uncover errors associated with interfacing. In the project, all the modules are combined and
then the entire program is tested as a whole. In the integration testing step, all the errors
uncovered are corrected before the next testing steps.

VALIDATION TESTING
To uncover functional errors, that is, to check whether the functional characteristics
conform to the specification or not.

CONCLUSION
The unique challenges posed by sharing and processing data in an inter-business
environment were efficiently addressed by the proposed Doubleguard method, a
system which delivers elastic data sharing services by integrating cloud computing, database,
and peer-to-peer technologies. Therefore, Doubleguard is a promising solution for efficient
data sharing within corporate networks. Traditional file replication methods for P2P file
sharing systems replicate files close to file owners, file requesters, or the query path to relieve
the owner's load and, at the same time, improve file query efficiency. However, replicating
files close to the file owner may overload the nodes in the close proximity of the owner, and
cannot significantly improve query efficiency since replica nodes are close to the owners.
Replicating files close to or at the file requesters only brings benefits when the requester or
its nearby nodes always query for the file. In addition, due to non-uniform and time-varying
file popularity and node interest variation, the replicas cannot be fully utilized and the query
efficiency cannot be improved significantly. Replicating files along the query path improves
the efficiency of file query, but it incurs significant overhead.
The proposed Doubleguard file replication algorithm chooses query traffic hubs and
frequent requesters as replica nodes to guarantee high utilization of replicas and high query
efficiency. Unlike current methods in which file servers keep track of replicas, it creates and
deletes file replicas by dynamically adapting to non-uniform and time-varying file popularity
and node interest in a decentralized manner based on experienced query traffic. This leads to
higher scalability and ensures high replica utilization.


SAMPLE SCREENS

[Screenshot: Server]

SAMPLE CODINGS
File1.cs
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Data.SqlClient;
using System.Configuration;

namespace SourceMain
{
public partial class Form1 : Form
{
string constring =
Convert.ToString(ConfigurationSettings.AppSettings["ConnectionString"]);
int totreqcount;
string rproceedstatus = "Proceed", empty = "";

public Form1()
{
InitializeComponent();
}

private void Form1_Load(object sender, EventArgs e)


{
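// Form1_Load: attach tooltips to the main controls, clear any stale upload
// records from the FileUpload table, and keep the admin-only controls disabled
// until a successful admin login.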
ToolTip toolTip1 = new ToolTip();

toolTip1.AutoPopDelay = 5000;
toolTip1.InitialDelay = 500;
toolTip1.ReshowDelay = 500;
toolTip1.ShowAlways = true;
toolTip1.SetToolTip(this.pictureBox7, "Click To Reload");
toolTip1.SetToolTip(this.pictureBox1, "Click To Admin Login");
toolTip1.SetToolTip(this.label3, "Click To Admin Login");
toolTip1.SetToolTip(this.pictureBox2, "Click To Upload Files");
toolTip1.SetToolTip(this.label4, "Click To Upload Files");
toolTip1.SetToolTip(this.pictureBox8, "Click To Verify Requested Files");
toolTip1.SetToolTip(this.label9, "Click To Verify Requested Files");
toolTip1.SetToolTip(this.pictureBox3, "Click To Start Transaction");
toolTip1.SetToolTip(this.label5, "Click To File Start Transaction");

SqlConnection con = new SqlConnection(constring);


con.Open();
SqlCommand cmd = new SqlCommand("Delete From FileUpload", con);
cmd.ExecuteNonQuery();
con.Close();

groupBox1.Visible = false;
pictureBox2.Enabled = false;
label4.Enabled = false;
//pictureBox3.Enabled = false;
//label5.Enabled = false;
pictureBox8.Enabled = false;
label9.Enabled = false;
pictureBox3.Enabled = false;
label5.Enabled = false;
}

private void pictureBox1_Click(object sender, EventArgs e)


{
pictureBox4.Visible = false;
pictureBox5.Visible = false;
groupBox1.Visible = true;
}

private void label3_Click(object sender, EventArgs e)


{
pictureBox4.Visible = false;
pictureBox5.Visible = false;
groupBox1.Visible = true;
}

private void pictureBox7_Click(object sender, EventArgs e)


{
pictureBox1.Enabled = true;
label3.Enabled = true;
pictureBox4.Visible = true;
pictureBox5.Visible = true;
groupBox1.Visible = false;
pictureBox2.Enabled = false;
label4.Enabled = false;
pictureBox3.Enabled = false;
label5.Enabled = false;
pictureBox8.Enabled = false;
label9.Enabled = false;
}

private void button2_Click(object sender, EventArgs e)


{
textBox1.Text = "";

textBox2.Text = "";
}

private void button1_Click(object sender, EventArgs e)


{
string txt1 = textBox1.Text.ToUpper();
string txt2 = textBox2.Text.ToUpper();

if (txt1 == "ADMIN" && txt2 == "ADMIN")


{
groupBox1.Visible = false;
pictureBox1.Enabled = false;
label3.Enabled = false;
pictureBox2.Enabled = true;
label4.Enabled = true;
pictureBox3.Enabled = true;
label5.Enabled = true;
pictureBox8.Enabled = true;
label9.Enabled = true;
pictureBox9.Enabled = true;
label10.Enabled = true;
pictureBox3.Enabled = true;
label5.Enabled = true;
}
else
{

}
textBox1.Text = "";
textBox2.Text = "";
}

private void pictureBox2_Click(object sender, EventArgs e)


{
FileUpload fu = new FileUpload();
fu.ShowDialog();
}

private void label4_Click(object sender, EventArgs e)


{
FileUpload fu = new FileUpload();
fu.ShowDialog();
}

private void pictureBox8_Click(object sender, EventArgs e)


{
Transaction tr = new Transaction();
tr.ShowDialog();
}

private void label9_Click(object sender, EventArgs e)


{
Transaction tr = new Transaction();
tr.ShowDialog();
}

private void pictureBox3_Click(object sender, EventArgs e)


{
SqlConnection con = new SqlConnection(constring);

con.Open();

// Count how many uploaded files are marked 'Proceed'; the transaction starts only when all three are ready.
SqlDataAdapter adp1 = new SqlDataAdapter("Select COUNT(rstatus) as reqstatus from FileUpload where rstatus='" + rproceedstatus + "'", con);
DataSet ds1 = new DataSet();

adp1.Fill(ds1);
totreqcount = Convert.ToInt32(ds1.Tables[0].Rows[0]
["reqstatus"].ToString());
if (totreqcount == 3)
{
//START TRANSACTION
Transaction tr = new Transaction();
tr.ShowDialog();
}
else
{
MessageBox.Show("ERROR - DO NOT PROCEED FILES.", "Message
Box", MessageBoxButtons.OK, MessageBoxIcon.Warning);
}

con.Close();
}

private void label5_Click(object sender, EventArgs e)


{
SqlConnection con = new SqlConnection(constring);

con.Open();

// Same check as pictureBox3_Click: count the files marked 'Proceed' before starting the transaction.
SqlDataAdapter adp1 = new SqlDataAdapter("Select COUNT(rstatus) as reqstatus from FileUpload where rstatus='" + rproceedstatus + "'", con);
DataSet ds1 = new DataSet();
adp1.Fill(ds1);
totreqcount = Convert.ToInt32(ds1.Tables[0].Rows[0]
["reqstatus"].ToString());
if (totreqcount == 3)
{
//START TRANSACTION
Transaction tr = new Transaction();

tr.ShowDialog();
}
else
{
MessageBox.Show("ERROR - DO NOT PROCEED FILES.", "Message
Box", MessageBoxButtons.OK, MessageBoxIcon.Warning);
}

con.Close();
}

private void pictureBox9_Click(object sender, EventArgs e)


{
RequestFiles rf = new RequestFiles();
rf.ShowDialog();
}

private void label10_Click(object sender, EventArgs e)


{
RequestFiles rf = new RequestFiles();
rf.ShowDialog();
}

//private void button3_Click(object sender, EventArgs e)
//{
//    Form2 frm2 = new Form2();
//    frm2.ShowDialog();
//}
}
}

Fileupload.cs
using System;

using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Data.SqlClient;
using System.Configuration;
using System.IO;

namespace SourceMain
{
public partial class FileUpload : Form
{
string constring =
Convert.ToString(ConfigurationSettings.AppSettings["ConnectionString"]);
Class1 cs = new Class1();
string fileDes, fileini, empty = "", rstatus = "Start";
int len;
string yes, yes1, yes2;

public FileUpload()
{
InitializeComponent();
}

private void FileUpload_Load(object sender, EventArgs e)


{
//textBox1.Text = Convert.ToString(cs.fileidgeneration());
}

private void btnbrowse_Click(object sender, EventArgs e)

{
textBox2.Text = "";
openFileDialog1.ShowDialog();
fileDes = openFileDialog1.FileName;

if (fileDes == "openFileDialog1")
{
//lblError.Text = "";
//lblError.Text = "Select a File first";
MessageBox.Show("Select any one File.", "Message Box",
MessageBoxButtons.OK, MessageBoxIcon.Warning);
textBox2.Text = "";
button1.Enabled = false;
yes = null;
}
else
{
yes = "yes";
textBox2.Text = openFileDialog1.FileName;
len = fileDes.Length;
fileini = fileDes.Substring(fileDes.IndexOf("\\") + 1);
button1.Enabled = true;

FileInfo fi = new FileInfo(openFileDialog1.FileName);


//byte[] data = new byte[fi.Length];
DateTime dt = fi.CreationTime;
FileStream fs = fi.Open(FileMode.Open, FileAccess.Read,
FileShare.Read);
//fs.Position = 0;
//fs.Read(data, 0, Convert.ToInt32(fi.Length));
label2.Text = fi.Name;
label3.Text = Convert.ToString(fi.Length) + " bytes";
label5.Text = Path.GetExtension(openFileDialog1.FileName);

}
}

private void button3_Click(object sender, EventArgs e)


{
textBox3.Text = "";
openFileDialog2.ShowDialog();
fileDes = openFileDialog2.FileName;

if (fileDes == "openFileDialog2")
{
//lblError.Text = "";
//lblError.Text = "Select a File first";
MessageBox.Show("Select any one File.", "Message Box",
MessageBoxButtons.OK, MessageBoxIcon.Warning);
textBox3.Text = "";
button1.Enabled = false;
yes1 = null;
}
else
{
yes1 = "yes";
textBox3.Text = openFileDialog2.FileName;
len = fileDes.Length;
fileini = fileDes.Substring(fileDes.IndexOf("\\") + 1);
button1.Enabled = true;

FileInfo fi = new FileInfo(openFileDialog2.FileName);


//byte[] data = new byte[fi.Length];
DateTime dt = fi.CreationTime;
FileStream fs = fi.Open(FileMode.Open, FileAccess.Read,
FileShare.Read);
//fs.Position = 0;

//fs.Read(data, 0, Convert.ToInt32(fi.Length));
label16.Text = fi.Name;
label13.Text = Convert.ToString(fi.Length) + " bytes";
label10.Text = Path.GetExtension(openFileDialog2.FileName);
}
}
