
DWH techniques to formulate a Counter Terror roadmap for the Country
Col. Mukesh Rastogi, Rohit Agarwal & Shreyash Dugar

1. Aim of The Paper


To evolve a Country Counter Terrorism Roadmap by integrating the existing and forthcoming data of government, regulatory authorities and non-governmental organizations, using data warehousing and data mining tools and techniques.

2. Introduction
Data warehouse techniques will be used to track the touch points of all citizens of the country as well as of visitors who have entered it.
Data Sources: data from the National Unique Identity Project, data pertaining to air passengers, data pertaining to visitors who arrive by land routes, and data pertaining to visitors arriving by sea routes will act as the sources of the database on which the process will be based.
The data will be integrated and then warehoused in requirement-specific locations. Data marts will act as the end terminals that deliver query-specific results at national, regional and city-specific terminals.

3. Literature Review
Defeating terrorism requires a more nimble intelligence apparatus that operates more actively
within the United States and makes use of advanced information technology. Data-mining and
automated data-analysis techniques are powerful tools for intelligence and law enforcement
officials fighting terrorism.
“Data mining” is a process that uses algorithms to discover predictive patterns in data sets. A related technique is “automated data analysis”, which applies models to data to predict behaviour, assess risk, determine associations, or perform other types of analysis. The models used for automated data analysis can be pattern based (patterns found by data mining or discovered by other methods) or subject based, starting from a specific known subject.
Subject-based “link analysis” uses public records or other large collections of data to find links
between a subject—a suspect, an address, or other piece of relevant information—and other
people, places, or things. This technique is already being used for, among other things,
background investigations and as an investigatory tool in national security and law enforcement
investigations. [1]

Data mining can be used to model crime detection problems. Crimes are a social nuisance and cost society dearly in several ways, so any research that helps solve crimes faster pays for itself; about 10% of criminals commit about 50% of crimes. Clustering algorithms offer a data mining approach that helps detect crime patterns and speed up the process of solving crime. The cited work applies k-means clustering, with some enhancements, to identify crime patterns; validates the technique on real crime data from a sheriff's office; uses a semi-supervised learning technique for knowledge discovery from crime records to increase predictive accuracy; and develops an attribute weighting scheme to deal with limitations of off-the-shelf clustering tools and techniques. This easy-to-implement data mining framework works with a geospatial plot of crime and helps improve the productivity of detectives and other law enforcement officers. It can also be applied to counter terrorism for homeland security. [2]
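As a minimal sketch of this kind of approach, the snippet below applies weighted k-means to a handful of synthetic incident records; the attribute names, weights and data are illustrative assumptions, not the weighting scheme of the cited study.

```python
# Sketch of weighted k-means clustering of crime records, as described above.
# The attribute names, weights and data are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [latitude, longitude, hour_of_day, weapon_code] for one incident.
crimes = np.array([
    [28.61, 77.20, 23, 1],
    [28.62, 77.21, 22, 1],
    [19.07, 72.87, 14, 0],
    [19.08, 72.88, 15, 0],
    [28.60, 77.19,  1, 1],
])

# Attribute weighting: emphasise location and weapon over time of day,
# a simple stand-in for a domain-driven weighting scheme.
weights = np.array([10.0, 10.0, 0.1, 5.0])
scaled = (crimes - crimes.mean(axis=0)) / crimes.std(axis=0)  # z-score first
weighted = scaled * weights

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(weighted)
for incident, cluster in zip(crimes, labels):
    print(f"incident {incident} -> pattern cluster {cluster}")
```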

There has been an enormous increase in crime in the recent past, and crime deterrence has become an uphill task. Police, in their role of catching criminals, are required to remain convincingly ahead in the perennial race between law breakers and law enforcers, and one of the key concerns of law enforcers is how to enhance the investigative effectiveness of the police. User-interactive interfaces based on current technologies are needed to give them the much-needed edge and to fulfil the emerging responsibilities of the police. The cited paper highlights the existing systems used by the Indian police as e-governance initiatives and proposes an interactive query-based interface as a crime analysis tool to assist the police in their activities. The proposed interface is used to extract useful information from the vast crime database maintained by the National Crime Records Bureau (NCRB) and to find crime hot spots using crime data mining techniques such as clustering. The effectiveness of the proposed interface has been illustrated on Indian crime records. [3]

Association mining is another technique, used to search for relationships among attributes and tuples by discovering frequently occurring item sets in a database. The result is a set of patterns described as rules that represent one-way relationships. Each rule carries a support value and a confidence value, which are used to assess the pattern: the support is the number of instances that comply with the rule, whereas the confidence is the percentage of instances containing the rule's antecedent that also comply with it. In market basket analysis, for example, association mining is used to analyse customer behaviour and determine which products customers frequently buy together.
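As a toy illustration of the support and confidence measures just described, the snippet below computes both for a single rule over made-up transactions; the item names are invented.

```python
# Toy illustration of support and confidence for the rule {A} -> {B},
# using made-up "basket" transactions.
transactions = [
    {"sim_card", "prepaid_phone"},
    {"sim_card", "prepaid_phone", "bus_ticket"},
    {"sim_card", "train_ticket"},
    {"prepaid_phone"},
]

antecedent, consequent = {"sim_card"}, {"prepaid_phone"}

both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)   # fraction of all transactions containing A and B
confidence = both / ante             # fraction of transactions with A that also contain B

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```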
The results of a clustering technique can also be used to identify an escaped criminal. Incidents in the same cluster may be related because the suspects conducted the crimes in similar ways, so evidence from those incidents can indicate that they originate from the same group. For instance, an injured person in one incident stated that the offender had black hair, a witness said that the offender was middle-aged, and the injured person confirmed seeing a tattoo on the offender's left arm. Law enforcement can therefore use this technique to analyse patterns in crime data, and the resulting knowledge can help catch the escaped offender. [4]

Modern data collection and analysis techniques have had remarkable success in solving
information-related problems in the commercial sector; for example, they have been successfully
applied to detect consumer fraud. But such highly automated tools and techniques cannot be
easily applied to the much more difficult problem of detecting and pre-empting a terrorist attack,
and success in doing so may not be possible at all. Success, if it is indeed achievable, will require
a determined research and development effort focused on this particular problem. Detecting
indications of ongoing terrorist activity in vast amounts of communications, transactions, and
behavioral records will require technology-based counterterrorism tools. But even in well-
managed programs such tools are likely to return significant rates of false positives, especially if
the tools are highly automated. Because the data being analysed are primarily about ordinary, law-abiding citizens and businesses, false positives can result in invasion of their privacy. Such intrusions raise valid concerns about the misuse and abuse of data and about its accuracy. [5]

Data mining and machine learning can be used to analyze data from numerous sources of high-
complexity for the purpose of preventing future terrorist activity. This is inherently a
multidisciplinary activity, drawing from areas such as intelligence, international relations, and
security methodology. From the data mining and machine-learning world it draws on scalable text mining, data fusion, data visualization and data warehousing methods.

In terms of data management there are three different stages:

1. Data collection,
2. Data retrieval, and
3. Data processing.

Data collection refers to network- or host-based sensors generating audit logs to record activity. Data retrieval refers to the virtual movement of log data to serve as input to a data mining application. Data processing refers to using data mining algorithms to discover computer network attacks within the logs. [6]
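A compressed, purely illustrative sketch of these three stages is shown below: the hard-coded log lines stand in for collected audit data, parsing them is the retrieval step, and a simple frequency rule stands in for the data mining algorithm; the log format and threshold are assumptions.

```python
# Toy end-to-end sketch of the three stages above: the audit log lines stand in
# for data collection, reading them is data retrieval, and the frequency count
# is a (deliberately simplistic) data-processing step.
from collections import Counter

audit_log = [
    "2021-03-01 10:00:01 LOGIN_FAIL host=10.0.0.5",
    "2021-03-01 10:00:02 LOGIN_FAIL host=10.0.0.5",
    "2021-03-01 10:00:03 LOGIN_FAIL host=10.0.0.5",
    "2021-03-01 10:05:00 LOGIN_OK   host=10.0.0.7",
]

# Data retrieval: parse each record into (event, host).
events = []
for line in audit_log:
    parts = line.split()
    events.append((parts[2], parts[3].split("=")[1]))

# Data processing: flag hosts with repeated failed logins.
fail_counts = Counter(host for event, host in events if event == "LOGIN_FAIL")
suspicious = [host for host, n in fail_counts.items() if n >= 3]
print("hosts to review:", suspicious)
```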

Investigative data mining is a newly emerging research area at the intersection of link analysis, web mining (harvesting the web), graph mining, and social network analysis. Graph mining and social network analysis in particular have attracted attention from a wide audience in law enforcement and intelligence investigations. [7]
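The snippet below is a small link-analysis sketch of the kind referred to above, using the open-source networkx library; the entities, shared records and relationships are invented for illustration.

```python
# Small link-analysis sketch using networkx; all names and relationships are invented.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Suspect A", "Phone record 1"),
    ("Suspect B", "Phone record 1"),     # shared phone record
    ("Suspect B", "Address: Flat 12"),
    ("Informant C", "Address: Flat 12"),
])

# Indirect link between Suspect A and Informant C via shared records.
path = nx.shortest_path(g, "Suspect A", "Informant C")
print("link chain:", " -> ".join(path))

# Simple centrality score: entities that tie the network together.
for node, score in sorted(nx.degree_centrality(g).items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.2f}")
```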
4. Architecture of the Counter Terrorism Process

5. Methodology
a. Data Source & Data Integration
Introduction

South Asia is one of the geopolitical hotspots in the world today. All major risks to India emanate from its geopolitical and environmental complexities. India is a country with continental dimensions. Its neighborhood remains the epicenter of global terrorism, which creates a variety of internal and external security challenges, not only for India but for the entire world. Flanked by China, Afghanistan, Pakistan, Bangladesh and Nepal, India's borders are subject to constant violence and instability that often spill over to the mainland. China and Pakistan remain its long-term adversaries, virtually sitting on the nuclear threshold. Whether due to lost opportunities or to circumstances, India has not been able to forge warm and cordial relations with its neighbors, and paradoxically the smaller nations on the periphery also leave no opportunity to score a point or two over India when the situation permits. India is a country of unparalleled dimensions; as a federal, multi-party democracy it has its own share of contribution in shaping the affairs of the world in the 21st century. It is the seventh largest landmass on the globe and supports a population of approximately 1.14 billion. It occupies a strategic position in South Asia and has over 15,000 km of land borders and 7,500 km of coastline. Flanked by eight neighboring countries in Asia, it has Afghanistan and Pakistan to its west. Growing Chinese hegemony, fragile ethnic diversity, and diverse forms of insurgency and terrorism add external and internal challenges to the country. The Indian sub-continent is reeling under disturbances of major proportions which have a direct impact on the aspirations of the country as an entity. The 26/11 terror attacks in Mumbai have further altered the spectrum through which India needs to perceive and combat physical risks. Terrorism from across the border is going to remain a major destabilizing factor in the times to come. India as a nation has proved its mettle in information technology, and the country has some of the best IT solution providers in the world. Data warehousing and data mining are among the latest techniques that can assist in tracking terrorists' footprints in the country.

Aim

To evolve a Country Counter Terrorism Roadmap by integrating the existing and forthcoming
data of government, regulatory authorities and non-governmental organizations, by using the
Data warehousing and Data mining tools and techniques.

Data Sources

As per the methodology outlined in the paper, the initial stage of data warehousing is to organize the sources of data and thereafter ensure their integrity. The data to be used for tracking the footprints of terrorists will be based on authentic data from all agencies involved in the movement of people in and out of the country by land, sea and air. The essential agencies that will act as the primary sources of data are the National Unique Identity Project (NUID), border crossing check posts, passenger details at sea ports, air passenger details of national and international airports, the control and reporting system of Indian airspace, and other agencies on an as-required basis.
Sourcing of Data

A data warehousing and business intelligence effort for counter terrorism will only be as good as the data that is put into it; the saying "Garbage In, Garbage Out" is all too true. A leading cause of data warehousing and business intelligence project failures is obtaining the wrong or poor-quality data, which is why the source identification process is critical to the success of such projects. Managing data warehouse input sources includes a number of steps organized into two phases. In the first phase the following activities are undertaken:

• Manage the Data Source Identification Process
• Identify Subject Matter Experts (SMEs)
• Identify Dimension Data Sources
• Identify Fact Data Sources

When the major data sources have been identified it is time to quickly gain a detailed understanding and commence the following:

• Obtain Existing Documentation
• Model and Define the Input
• Profile the Input
• Improve Data Quality
• Save Results for Further Reuse

Important aspects are elaborated in detail in the next paragraphs.


Data Source Identification Process

The source identification process is critical to the success of data warehousing and business
intelligence projects. It is important to move through this effort quickly, obtaining enough
information about the data sources without being bogged down in excess detail while still
obtaining the needed information.

Identify Data Source Subject Matter Experts

The following questions need to be considered when determining the sources and costs of data for the data warehouse:

• Where does the data come from?
• What processes are used to obtain the data?
• What does it cost to obtain the data?
• What does it cost to store the data?
• What does it cost to maintain the data?

Identify Dimension Data Sources

Dimensions enable business intelligence users to put information in context. They focus on the questions of who, when, where and what. Typical dimensions include:

• Time period / calendar
• Product, e.g. movement of people
• Indians and foreign travelers to the country
• Population areas
• People segment (land, sea or air)
• Geographic area

Master data is a complementary concept and may provide the best source of dimensional data for the data warehouse. Master data is data shared between systems that describes entities such as product (e.g. movement of people), Indians and foreign travelers to the country, population areas and people segment (land, sea or air). Master data is managed using a Master Data Management (MDM) system and stored in an MDM hub. Benefits of this approach include:

• It is less expensive to access data from a single source (the MDM hub) than to extract it from multiple sources.
• MDM data is rationalized.
• MDM data is of high quality.

Identify Fact Data Sources

The fact contains quantitative measurements while the dimension contains classification information. The data sources for facts tend to be transactional software systems. Larger entities involved in terror tracking will have multiple systems for the same kind of data; in that case, one will need to determine the best source of data, the System of Record (SOR), to serve as the source of data warehousing data.
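To make the fact/dimension distinction concrete, the sketch below builds a minimal star schema in SQLite, with movement events as facts surrounded by person, time and location dimensions; the table and column names are illustrative assumptions, not a prescribed national schema.

```python
# Minimal star-schema sketch: the dimensions discussed above (person, time,
# geography) surround a movement "fact" table. Names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_person   (person_id INTEGER PRIMARY KEY, nuid TEXT, nationality TEXT);
CREATE TABLE dim_time     (time_id   INTEGER PRIMARY KEY, date TEXT, year INTEGER);
CREATE TABLE dim_location (loc_id    INTEGER PRIMARY KEY, checkpost TEXT, mode TEXT);  -- land/sea/air
CREATE TABLE fact_movement (
    person_id INTEGER REFERENCES dim_person(person_id),
    time_id   INTEGER REFERENCES dim_time(time_id),
    loc_id    INTEGER REFERENCES dim_location(loc_id),
    direction TEXT                      -- 'entry' or 'exit'
);
""")

conn.execute("INSERT INTO dim_person VALUES (1, 'NUID-001', 'IN')")
conn.execute("INSERT INTO dim_time VALUES (1, '2021-03-01', 2021)")
conn.execute("INSERT INTO dim_location VALUES (1, 'Wagah', 'land')")
conn.execute("INSERT INTO fact_movement VALUES (1, 1, 1, 'entry')")

# A typical dimensional query: entries per mode of travel.
for row in conn.execute("""
        SELECT l.mode, COUNT(*) FROM fact_movement f
        JOIN dim_location l ON l.loc_id = f.loc_id
        WHERE f.direction = 'entry' GROUP BY l.mode"""):
    print(row)
```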

Detailed Data Source Understanding for Data Warehousing

When the major data sources have been identified it is time to quickly gain a detailed understanding of each one. Consolidate the spreadsheet developed in the identification phase by data source, and then create a new spreadsheet to track and control detailed understanding:

• Data Subject Name
• Obtain Documentation Date
• Define Input Date
• Profile Input Date
• Map Date
• Data Quality Date
• Save Results
• Analyst Name
• SME Name(s)
• Status

This approach provides an effective workflow as well as a project planning and control method. Due dates are assigned, and actual completion dates and status are tracked.

Obtain Existing Documentation

When seeking to understand a data source, the first thing to do is look at existing
documentation. This avoids "re-inventing the wheel". If a data source is fully documented,
profiled and is of high quality, most of the job of data source discovery is complete. Existing
documentation may include:

• Data models
• Data dictionary
• Internal / technical documentation
• User guides
• Data profiles and data quality assessments

Model and Define the Input

The data model is a graphic representation of data structures that improves understanding and
provides automation linking database design to physical implementation. This section
assumes that the data source is stored in a relational database that modeled using typical
relational data modeling tools. If there is an existing data model, start with that, otherwise use
the reverse engineering capability of the data modeling to build a physical data model. Next,
group the tables that are of interest into a subject area for analysis.

For each selected data source table needs to be defined, which should include:

• Physical Name
• Logical Name
• Definition
• Notes
For each selected data source column define:

• Physical Name
• Logical Name
• Order in Table
• Data type
• Length
• Decimal Positions
• Nullable/Required
• Default Value
• Edit Rules
• Definition
• Notes
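As a brief sketch of the reverse-engineering step, the snippet below pulls several of the column properties listed above (order, physical name, data type, nullability, default value) out of an existing SQLite table; the table itself is an invented example.

```python
# Sketch of reverse-engineering column definitions from an existing table.
# Shown for SQLite; other databases expose the same facts via their catalogs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visitor (
        passport_no TEXT NOT NULL,
        name        TEXT NOT NULL,
        gender      TEXT DEFAULT 'U',
        entry_date  TEXT
    )""")

columns = []
for cid, name, dtype, notnull, default, pk in conn.execute("PRAGMA table_info(visitor)"):
    columns.append({
        "order_in_table": cid,
        "physical_name": name,
        "data_type": dtype,
        "nullable": not notnull,
        "default_value": default,
    })

for col in columns:
    print(col)
```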

Profile the Data Source

The actual use and behavior of data sources often does not match the name or definition of the data. Sometimes this is called "dirty data" or "unrefined data", which may have problems such as:

• Invalid code values
• Missing data values
• Multiple uses of a single data item
• Inconsistent code values
• Incorrect values such as sales revenue amounts

Data profiling is an organized approach to examining data in order to better understand it and later use it. This can be accomplished by querying the data using tools such as:

• SQL queries
• Reporting tools
• Data quality tools
• Data exploration tools

When data from multiple sources is integrated into the data warehouse it is expected to be standardized and integrated. Consistency within a database is another important factor to determine through data profiling; perform queries to determine whether this is true.
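The snippet below is a small profiling sketch of the kind of query described above, counting code-value frequencies and missing values for one column of an invented table.

```python
# Short data-profiling sketch: distinct code values and missing entries for one
# column, the kind of query a profiling pass would run repeatedly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitor (passport_no TEXT, gender TEXT)")
conn.executemany("INSERT INTO visitor VALUES (?, ?)",
                 [("P1", "M"), ("P2", "F"), ("P3", "Z"), ("P4", None)])

print("value frequencies:")
for value, n in conn.execute(
        "SELECT gender, COUNT(*) FROM visitor GROUP BY gender ORDER BY 2 DESC"):
    print(f"  {value!r}: {n}")

missing = conn.execute(
    "SELECT COUNT(*) FROM visitor WHERE gender IS NULL").fetchone()[0]
print("missing gender values:", missing)
```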

Improve Data Quality

Data profiling may reveal problems in data quality. For example, it might show that invalid values have been entered for a particular column, such as 'Z' for gender when 'F' and 'M' are the valid values. Some steps that could be taken to improve data quality include:

• Work with data owners to define the appropriate level of data quality, and build this into a data governance program.
• Determine why there are data quality problems by performing a root cause analysis.
• Correct the data in the source system through manual or automated efforts.
• Add edits or database rules to prevent the problem (see the sketch after this list).
• Change business processes to enter correct data.
• Make data quality visible to the business through scorecards, dashboards and reports.
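One possible sketch of the "add edits or database rules" step is shown below, using a SQLite CHECK constraint to reject the invalid gender code from the example above; the table is illustrative only.

```python
# Sketch of a database rule preventing the invalid gender code described above:
# a CHECK constraint rejects anything other than 'F' or 'M'. Illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visitor (
        passport_no TEXT NOT NULL,
        gender      TEXT CHECK (gender IN ('F', 'M'))
    )""")

conn.execute("INSERT INTO visitor VALUES ('P1', 'F')")      # accepted
try:
    conn.execute("INSERT INTO visitor VALUES ('P2', 'Z')")  # rejected by the rule
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```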

Save Results for Further Reuse

The information gathered during the data source discovery process is valuable metadata that can be useful for future data warehousing or other projects. Be sure to save the results and make them available for future efforts; this work can be a great step toward building an improved data resource.

Now let us examine data quality in detail.

Data Quality

According to the industry analyst firm Gartner, more than 50 percent of business intelligence and customer relationship management deployments will suffer limited acceptance, if not outright failure, due to lack of attention to data quality issues. The impact of poor data quality is far reaching, and its effects are both tangible and intangible. If data quality problems are allowed to persist, users grow to mistrust the information in the data warehouse and become reluctant to use it for decision-making. Data quality is an increasingly serious issue for organizations large and small, and it is central to all data integration initiatives. Before data can be used effectively in a data warehouse, it needs to be analyzed and cleansed. To ensure that high-quality data is sustained, counter terror organizations need to apply ongoing data cleansing processes and procedures, and to monitor and track data quality levels over time. Otherwise poor data quality will lead to increased costs, breakdowns in the analysis process and an inferior user experience. Defective data also hampers internal decision making and efforts to meet regulatory compliance responsibilities. The key to successfully addressing data quality is to get counter terrorism professionals centrally involved in the process. Recognized and easy-to-use data quality software products, specifically designed to bridge the gap and better align IT and the process, will fulfill the needs of controlling data quality processes in order to reduce costs, increase effectiveness and improve decision-making at point, region and national levels.

Focal Point for Data Quality

Once the data warehouse is built and established, the issue of data quality will not go away. The data warehouse is downstream of most other systems and applications, and as such suffers from the data quality problems that accumulate along the information chain. If the set-up is to continue to reap the benefits of reliable business intelligence, forecasting and analysis, ongoing data quality management and monitoring have to be high on the agenda. With this in mind, the data warehouse offers an ideal platform from which to initiate an enterprise data quality program that will benefit the entire organization.

Structure of Investigative Data Warehouse


I. Overview of the Investigative Data Warehouse.

The Investigative Data Warehouse (IDW) will be a massive data warehouse, which can be described as "the agency's single largest repository of operational and intelligence information." The IDW is a centralized, web-enabled, closed-system repository for intelligence and investigative data, and the IDW system provides data storage, database management, search, information presentation, and security services.

In addition to storing vast quantities of data, the IDW provides a content management and data mining system that is designed to permit a wide range of investigative, analytical, administrative, and intelligence agencies to access and analyze aggregated data from previously separate datasets included in the warehouse. Moving forward, the agency should intend to increase its use of the IDW for "link analysis" (looking for links between suspects and other people) and to start "pattern analysis" (defining a "predictive pattern of behavior" and searching for that pattern in the IDW's datasets before any criminal offence is committed, i.e. pre-crime).

II. IDW Systems Architecture

"The IDW system environment should consist of a collection of UNIX and NT servers that
provide secure access to a family of very large-scale storage devices. The servers provide
application, web servers, relational database servers, and security filtering servers. User
desktop units that have access to agency Net can access the IDW web application. This
provides browser-based access to the central databases and their access control units. The
environment is designed to allow the agency analytic and investigative users to access any of
the data sources and analytic capabilities of the system for which they are authorized. The
entire configuration is designed to be scalable to enable expansion as more data sources and
capabilities are added." "Data processing is to be conducted by a combination of
Commercial-Off-the-Shelf (COTS) applications, interpreted scripts, and open-source
software applications. Data storage is provided by several Oracle Relational Database
Management Systems (DBMS) and in proprietary data formats. Physical storage is
contained in Network Attached Storage (NAS) devices and component hard disks. Ethernet
switches provide connectivity between components and to agency LAN/WAN. An
integrated firewall appliance in the switch provides network filtering."
b. Data Warehousing
Data warehouses are computer-based information systems that are home for "secondhand" data that originated either from another application or from an external system or source. Warehouses optimize database query and reporting because of their ability to analyze data, often from disparate databases, in interesting ways. They are a way for managers and decision makers to extract information quickly and easily in order to answer questions about their business. In other words, data warehouses are read-only, integrated databases designed to answer comparative and "what if" questions. Unlike operational databases, which are set up to handle transactions and are kept up to date as of the last transaction, data warehouses are analytical, subject-oriented, and structured to aggregate transactions as a snapshot in time.

The steps in planning a data warehouse are identical to the steps for any other type of
computer application. Users must be involved to determine the scope of the warehouse and
what business requirements need to be met. After selecting a focus area, for example,
analyzing the use of state government records over time, a data warehouse team of business
users and information professionals compiles a list of different types of data that should go
into the warehouse. (See data acquisition/collection). After business requirements have been
gathered and validated, data elements are organized into a conceptual data model. The
conceptual model is used as a blueprint to develop a physical database design. As in all
systems design projects, there are a number of iterations, prototypes, and technical decisions
that need to be made between the steps of systems analysis, design, development,
implementation, and support.

The data warehouse team must determine what data should go into the warehouse and where those particular pieces of information can be found. Some of the data will be internal to an organization; in other cases, it can be obtained from another source. Another team of analysts and programmers creates extraction programs to collect data from the various databases, files, and legacy systems that have been identified, copying certain data to a staging area outside the warehouse. At this point, they ensure that the data has no errors (cleansing) and then copy it all into the data warehouse. This source data extraction, selection, and transformation process is unique to data warehousing. Source data analysis and the efficient and accurate movement of source data into the warehouse environment are critical to the success of a data warehouse project.
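A compressed sketch of that extract-to-staging, cleanse, and load flow is shown below; the source records, the staging structure and the cleansing rules are invented for illustration.

```python
# Compressed extract / stage / cleanse / load sketch mirroring the flow above.
# The source records and the cleansing rules are invented.
import sqlite3

source_records = [  # "extracted" from an operational system
    {"passport_no": "p1 ", "port": "Mumbai", "date": "2021-03-01"},
    {"passport_no": "P2",  "port": "mumbai", "date": "2021-03-01"},
    {"passport_no": None,  "port": "Delhi",  "date": "2021-03-02"},  # bad record
]

# Staging area: hold raw rows outside the warehouse before cleansing.
staging = list(source_records)

# Cleansing: standardise case, trim whitespace, drop rows missing the key.
clean = [
    {"passport_no": r["passport_no"].strip().upper(),
     "port": r["port"].title(),
     "date": r["date"]}
    for r in staging if r["passport_no"]
]

# Load into the warehouse table (read-only to end users).
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE entry_fact (passport_no TEXT, port TEXT, date TEXT)")
dwh.executemany("INSERT INTO entry_fact VALUES (:passport_no, :port, :date)", clean)
print(dwh.execute("SELECT * FROM entry_fact").fetchall())
```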

Metadata
Good metadata is essential to the effective operation of a data warehouse and it is used in
data acquisition/collection, data transformation, and data access. Acquisition metadata maps
the translation of information from the operational system to the analytical system. This
includes an extract history describing data origins, updates, algorithms used to summarize
data, and frequency of extractions from operational systems. Transformation metadata
includes a history of data transformations, changes in names, and other physical
characteristics. Access metadata provides navigation and graphical user interfaces that allow
non-technical business users to interact intuitively with the contents of the warehouse. And
on top of these three types of metadata, a warehouse needs basic operational metadata, such
as procedures on how a data warehouse is used and accessed, procedures on monitoring the
growth of the data warehouse relative to the available storage space, and authorizations on
who is responsible for and who has access to the data in the data warehouse and data in the
operational system.
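As a small illustration, the snippet below records acquisition, transformation and access metadata for one hypothetical warehouse table; the field names and values are examples only.

```python
# Example metadata records for one hypothetical warehouse table; values are
# illustrative only, not a prescribed metadata standard.
acquisition_meta = {
    "table": "entry_fact",
    "source_system": "airport immigration log",
    "extract_frequency": "daily",
    "summarisation": "one row per entry event",
}
transformation_meta = {
    "table": "entry_fact",
    "renames": {"pp_no": "passport_no"},
    "transform_history": ["2021-03-01: trimmed whitespace, upper-cased passport_no"],
}
access_meta = {
    "table": "entry_fact",
    "business_name": "Visitor entries",
    "authorised_roles": ["analyst", "investigator"],
}

for block in (acquisition_meta, transformation_meta, access_meta):
    print(block)
```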

Steps of Implementing Data Warehousing

The primary objective of Data Warehousing is to bring together information from disparate
sources and put the information into a format that is conducive to making business
decisions. This objective necessitates a set of activities that are far more complex than just
collecting data and reporting against it. Data Warehousing requires both business and
technical expertise and involves the following activities:

• Accurately identifying the business information that must be contained in the Warehouse

• Identifying and prioritizing subject areas to be included in the Data Warehouse

• Managing the scope of each subject area which will be implemented into the Warehouse on an iterative basis

• Developing a scalable architecture to serve as the Warehouse's technical and application foundation, and identifying and selecting the hardware/software/middleware components to implement it

• Extracting, cleansing, aggregating, transforming and validating the data to ensure accuracy and consistency

• Defining the correct level of summarization to support business decision making

• Establishing a refresh program that is consistent with business needs, timing and cycles

• Providing user-friendly, powerful tools at the desktop to access the data in the Warehouse

• Educating the business community about the realm of possibilities that are available to them through Data Warehousing

• Establishing a Data Warehouse Help Desk and training users to effectively utilize the desktop tools

• Establishing processes for maintaining, enhancing, and ensuring the ongoing success and applicability of the Warehouse

c. Data Mart
A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs. Data marts are
analytical data stores designed to focus on specific business functions for a specific
community within an organization. Data marts are often derived from subsets of data in a
data warehouse, though in the bottom-up data warehouse design methodology the data
warehouse is created from the union of organizational data marts.

In practice, the terms data mart and data warehouse each tend to imply the presence of the
other in some form. However, most writers using the term seem to agree that the design of a
data mart tends to start from an analysis of user needs and that a data warehouse tends to
start from an analysis of what data already exists and how it can be collected in such a way
that the data can later be used. A data warehouse is a central aggregation of data (which can
be distributed physically); a data mart is a data repository that may or may not derive from a
data warehouse and that emphasizes ease of access and usability for a particular designed
purpose. In general, a data warehouse tends to be a strategic but somewhat unfinished
concept; a data mart tends to be tactical and aimed at meeting an immediate need.

There can be multiple data marts inside a single corporation, each one relevant to one or more business units for which it was designed. Data marts may or may not be dependent on or related to other data marts in a single corporation. If the data marts are designed using conformed facts and dimensions, then they will be related. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data. This enables each department to use, manipulate and develop its data any way it sees fit, without altering information inside other data marts or the data warehouse. In other deployments, where conformed dimensions are used, this business-unit ownership does not hold true for shared dimensions like customer, product, etc. The data mart for our functional purview must be able to provide district-, state- and country-level information about terrorist footprints.
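As a sketch of that district/state/country roll-up, the snippet below groups an invented data-mart table at each level; the names and counts are made up.

```python
# Sketch of the district / state / country roll-up described above, over an
# illustrative data-mart table; names and counts are made up.
import sqlite3

mart = sqlite3.connect(":memory:")
mart.execute("""CREATE TABLE footprint_mart
                (district TEXT, state TEXT, country TEXT, alerts INTEGER)""")
mart.executemany("INSERT INTO footprint_mart VALUES (?, ?, ?, ?)", [
    ("Mumbai City", "Maharashtra", "India", 4),
    ("Pune",        "Maharashtra", "India", 1),
    ("New Delhi",   "Delhi",       "India", 2),
])

for level in ("district", "state", "country"):
    print(f"--- alerts by {level} ---")
    for row in mart.execute(
            f"SELECT {level}, SUM(alerts) FROM footprint_mart GROUP BY {level}"):
        print(row)
```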

6. Conclusion
Thus, through the proposed architecture, data warehouse techniques will be used to track the touch points of all citizens of the country as well as of visitors who have entered it. Data from the National Unique Identity Project, data pertaining to air passengers, and data pertaining to visitors arriving by land and sea routes will act as the sources of the database on which the process will be based.

7. References
[1] Mary DeRosa, Data Mining and Data Analysis for Counterterrorism.
[2] Shyam Varan Nath, Crime Pattern Detection Using Data Mining.
[3] Manish Gupta, B. Chandra and M. P. Gupta, Crime Data Mining for Indian Police Information System.
[4] P. Thongtae and S. Srisuk, An Analysis of Data Mining Applications in Crime Domain.
[5] Charles M. Vest and William J. Perry, 2008, in 'Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Program Assessment'.
[6] Workshop on Data Mining for Counter Terrorism and Security, May 3, 2003, Cathedral Hill Hotel, San Francisco, CA.
[7] Nasrullah Memon, Abdul Rasool Qureshi, Uffe Kock Wiil and David L. Hicks, 2009, in Novel Algorithms for Subgroup Detection in Terrorist Networks.
