A Big Data Case Study: Optimizing Seoul's Late-Night Bus Routes

ZERONOVA

Probably one of the most active, and most successful, users of public data in Korea is the Seoul Metropolitan Government. Through its Open Data Plaza it has been opening public data as open APIs for the past two to three years, and more recently it has been partnering with private companies to tackle the city's problems with big data. The most visible success among these efforts is the use of big data to optimize the city's late-night bus routes. Anyone who has fought to catch a taxi after public transit stops running can appreciate a night bus that operates from midnight to 5 a.m. The problem, though, is whether that bus actually stops where you want to board. In other words, however good the intention, the service is useless if citizens cannot actually make use of it.

That is exactly what the night-bus route optimization problem decides. Two routes have been running as a three-month pilot since April 19, with plans to expand to six routes afterward. The question, then, is how to choose those routes, and it ultimately comes down to stringing together the segments with the heaviest late-night floating population. So how do you identify those segments? Approached conventionally, without much thought, you would probably take the bus routes…
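
One naive way to rank candidate segments, hinted at above, is to bucket late-night demand points into grid cells and keep the busiest cells. A minimal Python sketch of that idea; the coordinates, grid resolution, and demand points are all invented for illustration:

```python
from collections import Counter

def busiest_cells(points, decimals=2, n=2):
    """Bucket (lon, lat) demand points into grid cells (~1 km at 2 decimals)
    and return the n busiest cells."""
    counts = Counter((round(x, decimals), round(y, decimals)) for x, y in points)
    return [cell for cell, _ in counts.most_common(n)]

# Invented late-night demand points clustered around two hot spots:
demand = [(127.03, 37.50)] * 30 + [(126.92, 37.55)] * 20 + [(127.10, 37.60)] * 5
print(busiest_cells(demand))
# → [(127.03, 37.5), (126.92, 37.55)]
```

A real analysis would of course weigh many more signals (taxi drop-offs, phone-record density, existing stop locations) before threading a route through the winning cells.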


Hadoop: Your Partner in Crime

August 24th, 2012

Pre-crime? Pretty close…

If you have seen the futuristic movie Minority Report, you most likely have an idea of how many factors and decisions go into crime prevention. Yes, Pre-crime is an aspect of the future but even today it is clear that many social, economic, psychological, racial, and geographical circumstances must be thoroughly considered in order to make crime prediction even partially possible and accurate. The predictive analytics made possible with Apache Hadoop can significantly benefit this area of government security.

The essence of crime prevention is to understand and narrow down thousands of “what if” cases to a manageable and plausible handful of scenarios. Crime can happen anywhere and can be categorized as anything from cyber fraud to kidnapping, which provides a lot of combinations for possible misdemeanors or felonies. With the help of big data analytics, government agencies can zero in on certain areas, demographics, and age groups to pick out specific types of crimes and move towards decreasing the one trillion dollar annual cost of crime in the United States.

Zach Friend, a crime analyst for the Santa Cruz Police Department, explained that there aren’t enough cops on the streets due to insufficient funds. Not only that, but many police departments are still technologically behind in the crime-monitoring field, so big data analytics tools could be a huge step forward for police all over the country. Evidence and information about cases could be stored much more efficiently, police action could be more proactive, and crime awareness could be much more prevalent.

Who’s on the case?

The Crime and Corruption Observatory (created by the European company, FuturICT) is pushing for this kind of development and aims to predict the dynamics of criminal phenomena by running massive data mining and large-scale computer simulations. The Observatory is structured as a network that involves scientists from varying fields – “from cognitive and social science to criminology, from artificial intelligence to complexity science, from statistics to economics and psychology”.

This Observatory will be used through the framework of the developing Living Earth Simulator project – “a big data and supercomputing project that will attempt to uncover the underlying sociological and psychological laws that underpin human civilization.” The project, funded by the European Union, is an impressive advancement in technology, which will not only aid in pinpointing crime but will also effectively utilize the big data of today’s world.

PredPol has made predictive crime analytics available to police departments so that “pre-crime”, in a sense, could be put into action. Zach Friend explains, “We’re facing a situation where we have 30 percent more calls for service but 20 percent less staff than in the year 2000, and that is going to continue to be our reality. So we have to deploy our resources in a more effective way. This model does that.” PredPol allows law enforcement agencies to collect and organize data about crimes that have already happened and to use this data to predict future incidents in blocks of roughly 500 feet square. It may not be the same as knowing the exact perpetrator, victim, and cause of the crime ahead of time as was possible in Minority Report but it is an impressive step towards perfecting crime prediction.

The Santa Cruz Police Department, which is using PredPol’s software, has already seen significant improvements in police work. SCPD began by locating areas of possible burglaries, battery, and assault and handing out maps of these areas to officers so they could patrol them. Since then, the department has seen a 19% decrease in these types of crimes.

PredPol software is able to make calculations about crimes based on previous times and locations of other incidents while cross-referencing these with criminal behavior and patterns. Here is an example of how large-scale this could get: George Mohler, a UCLA mathematician who was testing the effectiveness of PredPol, looked at 5,000 crimes which required 5,000! comparisons (i.e. 5,000 x 4,999 x 4,998…). With impressive results already materializing from calculations like these, it is exciting to think how much more accurate predictive crime analytics could become.
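
A toy version of the underlying idea, that recent, nearby incidents raise a cell's risk, with contributions decaying over distance and time, can be sketched as follows. This is loosely inspired by the self-exciting point-process models associated with tools like PredPol; every parameter and data point here is invented:

```python
import math

def hotspot_score(cell, incidents, now, length_scale=1.0, time_scale=7.0):
    """Score a grid cell by summing decayed contributions from past incidents:
    a Gaussian falloff in distance times an exponential falloff in time."""
    score = 0.0
    for x, y, t in incidents:
        d2 = (x - cell[0]) ** 2 + (y - cell[1]) ** 2
        score += math.exp(-d2 / (2 * length_scale ** 2)) * math.exp(-(now - t) / time_scale)
    return score

incidents = [(0, 0, 1), (0, 1, 5), (10, 10, 6)]   # invented (x, y, day) of past crimes
print(hotspot_score((0, 0), incidents, now=7))    # high: two recent incidents nearby
print(hotspot_score((10, 0), incidents, now=7))   # near zero: nothing close
```

Ranking all cells by such a score each shift, rather than comparing every incident against every other, is what keeps this kind of computation tractable.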

Hadoop lays down the law

With Apache Hadoop, perfecting crime prevention becomes an attainable goal. CTOlabs presented some very important points in a recent white paper about big data and law enforcement, showing how Hadoop could be beneficial to smaller police departments that don’t have very much financial leeway. The LAPD for example, is very well-funded and can afford to work with companies such as IBM to develop crime predicting techniques.

Smaller or less advanced departments, however, do not have the financial advantage to use supercomputers or extensive command centers and will use less efficient techniques (such as simple spreadsheets and homegrown databases) to keep track of all of the information involved in law enforcement. “Nationwide, agencies and departments have to reduce their resources and even their manpower but are expected to continue the trend of a decreasing crime rate. To do so requires better service with fewer resources.” Open source presents an extremely effective and less expensive option – Apache Hadoop is the superhero that can save the day, one cluster at a time.

With Hadoop’s capability to store and organize data, police departments can filter through unnecessary information in order to focus on the aspects of crime that are more important. By applying advanced analytics to historical crime patterns, weather trends, traffic sensor data, and a wealth of other sources, police can place patrol cops in areas with higher crime probability instead of evenly distributing manpower throughout quiet and dangerous neighborhoods. This conserves money, effort, and time. Hadoop can also help organize a number of other factors such as police back up, calls for service, or screening for biases and confounding variables. Phone calls, videos, historical records, suspect profiles, or any other important information that is necessary for law agencies to keep for a long time can be systematized and referenced whenever need be.

Increasing public safety through effective use of technology is not a panacea but it is here and is an effective tool in combating crime. Apache Hadoop serves as a foundation for this new approach and, most importantly, it is accessible to a wider range of police departments all over the country and the world. Yes, predictive policing and crime prevention still have a lot of room for development and have yet to tackle issues like specific crimes that depend on interpersonal relationships or random events. However, it is all very possible, especially with the use of Hadoop as a predictive analytics platform. Crime can be stopped. No PreCogs necessary.

What is Apache Hadoop?

Apache Hadoop has been the driving force behind the growth of the big data industry. You’ll hear it mentioned often, along with associated technologies such as Hive and Pig. But what does it do, and why do you need all its strangely-named friends, such as Oozie, Zookeeper and Flume?

Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure. By large, we mean from 10-100 gigabytes and above. How is this different from what went before?

Existing enterprise data warehouses and relational databases excel at processing structured data and can store massive amounts of data, though at a cost: This requirement for structure restricts the kinds of data that can be processed, and it imposes an inertia that makes data warehouses unsuited for agile exploration of massive heterogeneous data. The amount of effort required to warehouse data often means that valuable data sources in organizations are never mined. This is where Hadoop can make a big difference.

This article examines the components of the Hadoop ecosystem and explains the functions of each.

The core of Hadoop: MapReduce

Created at Google in response to the problem of creating web search indexes, the MapReduce framework is the powerhouse behind most of today’s big data processing. In addition to Hadoop, you’ll find MapReduce inside MPP and NoSQL databases, such as Vertica or MongoDB.

The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing the computation solves the issue of data too large to fit onto a single machine. Combine this technique with commodity Linux servers and you have a cost-effective alternative to massive computing arrays.

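The pattern just described (map over partitions, shuffle by key, reduce each group) can be sketched in-process. This is a toy illustration of the programming model only, not of how Hadoop itself is implemented:

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    """Map step: emit (key, value) pairs from one partition's records."""
    for line in partition:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: combine each key's values into a result."""
    return {key: sum(values) for key, values in groups.items()}

# Data "distributed" across two partitions (standing in for two nodes):
partitions = [["big data", "big wins"], ["data wins", "big big"]]
mapped = chain.from_iterable(map_phase(p) for p in partitions)
counts = reduce_phase(shuffle(mapped))
# counts == {"big": 4, "data": 2, "wins": 2}
```

Because each map call touches only its own partition, the map work can run on as many machines as there are partitions, which is the point of the paragraph above.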
At its core, Hadoop is an open source MapReduce implementation. Funded by Yahoo, it emerged in 2006 and, according to its creator Doug Cutting, reached “web scale” capability in early 2008.

As the Hadoop project matured, it acquired further components to enhance its usability and functionality. The name “Hadoop” has come to represent this entire ecosystem. There are parallels with the emergence of Linux: The name refers strictly to the Linux kernel, but it has gained acceptance as referring to a complete operating system.

Hadoop’s lower levels: HDFS and MapReduce

Above, we discussed the ability of MapReduce to distribute computation over multiple servers. For that computation to take place, each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System.

HDFS and MapReduce are robust. Servers in a Hadoop cluster can fail and not abort the computation process. HDFS ensures data is replicated with redundancy across the cluster. On completion of a calculation, a node will write its results back into HDFS.

There are no restrictions on the data that HDFS stores. Data may be unstructured and schemaless. By contrast, relational databases require that data be structured and schemas be defined before storing the data. With HDFS, making sense of the data is the responsibility of the developer’s code.

Programming Hadoop at the MapReduce level is a case of working with the Java APIs, and manually loading data files into HDFS.

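That said, Hadoop's Streaming interface also lets non-Java executables act as mapper and reducer by reading and writing tab-separated "key\tvalue" lines; the framework sorts the mapper's output by key before the reducer sees it. A rough single-process simulation of that contract for a word-count job (on a real cluster the mapper and reducer would be separate scripts passed to the streaming jar):

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word\t1' line per word, as a streaming mapper would to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Consume key-sorted lines; each key's lines arrive grouped together."""
    keyed = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# sorted() stands in for the framework's sort-and-shuffle between the phases:
shuffled = sorted(mapper(["hadoop streams", "hadoop scales"]))
result = list(reducer(shuffled))
# result == ["hadoop\t2", "scales\t1", "streams\t1"]
```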
Improving programmability: Pig and Hive

Working directly with Java APIs can be tedious and error prone. It also restricts usage of Hadoop to Java programmers. Hadoop offers two solutions for making Hadoop programming easier.

Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results. Pig’s built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.

Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax. As with Pig, Hive’s core capabilities are extensible.

Choosing between Hive and Pig can be confusing. Hive is more suitable for data warehousing tasks, with predominantly static structure and the need for frequent analysis. Hive’s closeness to SQL makes it an ideal point of integration between Hadoop and other business intelligence tools.

Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming data flows for incorporation into larger applications. Pig is a thinner layer over Hadoop than Hive, and its main advantage is to drastically cut the amount of code needed compared to direct use of Hadoop’s Java APIs. As such, Pig’s intended audience remains primarily the software developer.

Improving data access: HBase, Sqoop and Flume

At its heart, Hadoop is a batch-oriented system. Data are loaded into HDFS, processed, and then retrieved. This is somewhat of a computing throwback, and often, interactive and random access to data is required.

Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google’s BigTable, the project’s goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase.

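A drastically simplified model of the storage layout just described, with cells addressed by row key and column and carrying timestamped versions, might look like this. It is purely illustrative; real HBase persists to HDFS and scales across a cluster:

```python
from collections import defaultdict

class ToyTable:
    """Toy model of HBase's data model: (row key, column) -> versioned cells."""
    def __init__(self):
        self.cells = defaultdict(list)            # (row, column) -> [(ts, value), ...]

    def put(self, row, column, value, ts):
        self.cells[(row, column)].append((ts, value))
        self.cells[(row, column)].sort(reverse=True)   # keep newest version first

    def get(self, row, column):
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None    # newest version wins

t = ToyTable()
t.put("user42", "info:city", "Boston", ts=1)
t.put("user42", "info:city", "Santa Cruz", ts=2)
# t.get("user42", "info:city") == "Santa Cruz"
```

The "family:qualifier" column naming mirrors HBase's convention of grouping columns into families.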
In order to grant random access to the data, HBase does impose a few restrictions: Hive performance with HBase is 4-5 times slower than with plain HDFS, and the maximum amount of data you can store in HBase is approximately a petabyte, versus HDFS’ limit of over 30PB.

HBase is ill-suited to ad-hoc analytics and more appropriate for integrating big data as part of a larger application. Use cases include logging, counting and storing time-series data.

The Hadoop Bestiary

  • Ambari: Deployment, configuration and monitoring
  • Flume: Collection and import of log and event data
  • HBase: Column-oriented database scaling to billions of rows
  • HCatalog: Schema and data type sharing over Pig, Hive and MapReduce
  • HDFS: Distributed redundant file system for Hadoop
  • Hive: Data warehouse with SQL-like access
  • Mahout: Library of machine learning and data mining algorithms
  • MapReduce: Parallel computation on server clusters
  • Pig: High-level programming language for Hadoop computations
  • Oozie: Orchestration and workflow management
  • Sqoop: Imports data from relational databases
  • Whirr: Cloud-agnostic deployment of clusters
  • Zookeeper: Configuration management and coordination

Getting data in and out

Improved interoperability with the rest of the data world is provided by Sqoop and Flume. Sqoop is a tool designed to import data from relational databases into Hadoop, either directly into HDFS or into Hive. Flume is designed to import streaming flows of log data directly into HDFS.

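Conceptually, a Sqoop import reads rows from a relational table and lands them as delimited text records (which Sqoop would write into HDFS). A minimal sketch of that idea, using an in-memory SQLite table invented for illustration:

```python
import sqlite3

# Invented source table standing in for an operational relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER, kind TEXT)")
conn.executemany("INSERT INTO incidents VALUES (?, ?)",
                 [(1, "burglary"), (2, "fraud")])

# Serialize each row as a tab-delimited record, the shape Sqoop lands in HDFS:
records = ["\t".join(str(col) for col in row)
           for row in conn.execute("SELECT id, kind FROM incidents ORDER BY id")]
# records == ["1\tburglary", "2\tfraud"]
```

Real Sqoop parallelizes this by splitting the table on a key column and running one map task per split.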
Hive’s SQL friendliness means that it can be used as a point of integration with the vast universe of database tools capable of making connections via JDBC or ODBC database drivers.

Coordination and workflow: Zookeeper and Oozie

With a growing family of services running as part of a Hadoop cluster, there’s a need for coordination and naming services. As computing nodes can come and go, members of the cluster need to synchronize with each other, know where to access services, and know how they should be configured. This is the purpose of Zookeeper.

Production systems utilizing Hadoop can often contain complex pipelines of transformations, each with dependencies on each other. For example, the arrival of a new batch of data will trigger an import, which must then trigger recalculations in dependent datasets. The Oozie component provides features to manage the workflow and dependencies, removing the need for developers to code custom solutions.

Management and deployment: Ambari and Whirr

One of the commonly added features incorporated into Hadoop by distributors such as IBM and Microsoft is monitoring and administration. Though in an early stage, Ambari aims to add these features to the core Hadoop project. Ambari is intended to help system administrators deploy and configure Hadoop, upgrade clusters, and monitor services. Through an API, it may be integrated with other system management tools.

Though not strictly part of Hadoop, Whirr is a highly complementary component. It offers a way of running services, including Hadoop, on cloud platforms. Whirr is cloud neutral and currently supports the Amazon EC2 and Rackspace services.

Machine learning: Mahout

Every organization’s data are diverse and particular to their needs. However, there is much less diversity in the kinds of analyses performed on that data. The Mahout project is a library of Hadoop implementations of common analytical computations. Use cases include user collaborative filtering, user recommendations, clustering and classification.

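A toy version of one such use case, user-based collaborative filtering, can be sketched as follows. Users, items, and ratings are invented; Mahout's value is running this kind of computation over millions of users via MapReduce:

```python
import math

ratings = {
    "ann": {"hive": 5, "pig": 3, "hbase": 4},
    "bob": {"hive": 4, "pig": 3},
    "cat": {"pig": 1, "flume": 5},
}

def cosine(u, v):
    """Cosine similarity between two users' rating vectors."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return max(scores, key=scores.get) if scores else None

print(recommend("bob"))
```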
Using Hadoop

Normally, you will use Hadoop in the form of a distribution. Much as with Linux before it, vendors integrate and test the components of the Apache Hadoop ecosystem and add in tools and administrative features of their own.

Though not per se a distribution, a managed cloud installation of Hadoop’s MapReduce is also available through Amazon’s Elastic MapReduce service.

After Boston: Terrorism and the Technology Gap

The Boston Marathon bombing, subsequent manhunt and current investigation are unprecedented – not only due to the nature of the attack but because of how much information has been available to law enforcement, the public and the suspects. Unlike any previous large-scale attack, data came in at a staggering velocity within seconds of the twin explosions, yielding constant changes and misreporting, but also the timely apprehension of the suspects.

For all its evident successes, however, this “big data” event exposed many limitations in existing technologies, demonstrating the need for new capabilities and providing new collaborative opportunities for law enforcement and technology developers. This article is about the technological capabilities Boston demonstrated we need, rather than about the victims, the heroism of the Boston responders or cooperation among the agencies involved. Some of the technologies discussed below may already exist in some form, but still are not ideally suited to the needs of this kind of event.

Here’s my punch list:
Avid for Intel: The FBI and Boston Police Department (BPD) requested and received video and photos from witnesses to the blast and from private security cameras in the vicinity. Capabilities in this area are not ready for “prime time”. To my knowledge there is no image and video management system that can operate at scale to quickly stitch together the various images in both space and time. The best candidates seem to come from the entertainment industry – something akin to the Avid video editing and production suite – rather than the intelligence community. Each image from a smart phone carries telemetry data that can be used to orient it in space and time. Add hundreds or even thousands of those images together, taken from different vantage points and different times, and you have an amazingly detailed mosaic of the environment. Being able to ‘play it back’ to particular time stamps, say to see who put a package where, is an enormous challenge and opportunity. I suspect that Boston pulled off this feat with brute force but a technology solution to this type of image management capability seems to be in order. Similar ideas can be seen in movies, but these haven’t yet made the trip from the big screen to the real world. No matter how many video cameras a city installs, we should expect that there will be increasing amounts of consumer imagery and video available and we must develop the technology to harness it.
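
The "play it back" capability described above amounts, at its simplest, to a space-time index over photo telemetry. A minimal sketch; all records and coordinates below are invented:

```python
# Each record: (timestamp, (lat, lon), filename) pulled from invented telemetry.
photos = sorted([
    (100, (42.349, -71.078), "phoneA.jpg"),
    (105, (42.349, -71.079), "phoneB.jpg"),
    (200, (42.350, -71.078), "phoneC.jpg"),
])

def frames_near(t, spot, window=10, radius=0.01):
    """Photos taken within `window` seconds of t and `radius` degrees of spot."""
    return [name for ts, (lat, lon), name in photos
            if abs(ts - t) <= window
            and abs(lat - spot[0]) <= radius and abs(lon - spot[1]) <= radius]

frames = frames_near(102, (42.349, -71.078))
# frames == ["phoneA.jpg", "phoneB.jpg"]
```

Stitching the returned frames into a coherent view is the genuinely hard part; the indexing shown here is the easy one.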

Complex Event Processing (CEP): It’s difficult to imagine the barrage of information flying at the Boston law enforcement team on April 15th: citizen tips, social media posts, 911 calls and forensic evidence to name a few. But I could imagine that the primary information management system was email, and that it wouldn’t take long in a rapidly evolving event such as this to be drowned in message traffic and miss key pieces of information. CEP is an idea typically found in machine automation, but automating alerts based on key events could ensure that the right message gets to the right people automatically. That might mean any small event in a key location (or a certain type of activity anywhere) generates an alert. In order for CEP to be effective for a rapidly evolving situation, it would require a very simple configuration interface and easy integration into data streams and messaging systems. In the next Boston-type event there will be no time to call a support contractor for help configuring rules; this has to be almost consumer-friendly out of the box.
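
A minimal sketch of the rule-driven alerting idea, with invented events and rules; a real CEP engine would add windowing, correlation across events, and pattern matching:

```python
# Declarative rules: a name plus a predicate over a single event.
rules = [
    {"name": "key-location activity", "where": lambda e: e["location"] == "finish line"},
    {"name": "bomb-related tip", "where": lambda e: "device" in e["text"].lower()},
]

def process(stream):
    """Match each incoming event against every rule; collect fired alerts."""
    alerts = []
    for event in stream:
        for rule in rules:
            if rule["where"](event):
                alerts.append((rule["name"], event["id"]))
    return alerts

stream = [
    {"id": 1, "location": "finish line", "text": "runner down"},
    {"id": 2, "location": "back bay", "text": "saw a suspicious device"},
    {"id": 3, "location": "back bay", "text": "traffic question"},
]
alerts = process(stream)
# alerts == [("key-location activity", 1), ("bomb-related tip", 2)]
```

The point of the paragraph above is that rules like these must be writable by an analyst in the middle of a crisis, not by a support contractor a week later.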

Link Analysis: There is a critical need in rapidly unfolding situations to organize the information you have and tie it together in a way that allows you to tell a story or build a case. As the Boston authorities tried to figure out who the suspects were, little pieces of information came in all the time, answering critical questions like: How many suspects are there? Where do they live? Where do they work? How are they tied together? This is certainly the promise of link analysis software from vendors like Palantir, IBM/i2, Visual Analytics, Centrifuge and others. Unfortunately, without a room full of engineers from the vendor, customers don’t have the capability to use these tools rapidly enough and with the level of sophistication this type of event requires, and most agencies end up using these tools for a few simple activities and as basic drawing packages. The products, business models and capabilities destined for use in crises must evolve in order to make the kind of headway needed during a fast-moving event. Even a city the size of Boston doesn’t have the budget or the day-to-day need for the level of investment that would be required to have those capabilities using today’s solutions.
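
At its simplest, the tie-it-together idea behind link analysis reduces to building a graph of entities and walking the cluster connected to a subject; commercial tools layer visualization and entity resolution on top. A sketch with invented entities:

```python
from collections import defaultdict

# Each edge is one fragment of evidence linking two entities (all invented).
edges = [
    ("suspect A", "suspect B"),
    ("suspect A", "shared apartment"),
    ("suspect B", "gym"),
    ("witness tip", "unrelated address"),
]

graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def cluster(start):
    """Depth-first walk of everything reachable from `start`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node] - seen)
    return seen

linked = cluster("suspect A")
# linked covers both suspects, the apartment, and the gym, but not the witness tip
```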

Geographic Information Systems (GIS): Every law enforcement and homeland security agency has GIS tools. But let’s face it: nobody can use them at the pace and level of complexity that Boston required. And that’s not Boston’s fault; it’s the tools’. Modern GIS systems are built on old software architectures to support geographers. But they need to be rebuilt for the velocity of social media data, for easy and rapid data entry, for simple analysis, and for quick information sharing and reporting. The needs of law enforcement to see the locations of detonations, devices that were discovered, suspect homes and other parts of the crime scenes and then correlate that data with reporting from social media, random tips and their own personnel was just out of reach. They had the tools and they knew how to use them, but the tools are not up to the task. Given the revolution in geo-enabled consumer apps such as Foursquare, Google Maps, Yelp and Find My iPhone, it’s disappointing that the professional tools are so lacking in capability.

Crowd Analytics: From the DARPA Challenge to the recent Intelligence Advanced Research Projects Activity (IARPA) crowd forecasting program, this has been a pretty hot topic for research. The FBI’s release of suspect photos proved that the crowd was able to identify the suspects better than facial recognition algorithms were apparently able to do against their drivers’ licenses and other publicly available photos. In addition to allowing witnesses who saw or knew the suspects to identify them, the crowd presents a massive computational reasoning capability with the entire Internet at its disposal. The crowd was able to find the suspects’ Russian-language social network VKontakte (VK), Twitter and other social media accounts faster than the government. Leveraging the crowd for search, translation, information dissemination and such bears much promise and much peril. More will be written, I’m sure, about the ill-fated reddit community attempt to analyze crime scene imagery, but make no mistake: a well-organized crowd can be a powerful tool.

Social Identity: Identity resolution and identity management capabilities are used every day by law enforcement and intelligence agencies. But these capabilities struggle with low-quality data sources. It’s one thing to find an identity match with a name, date of birth and social security number; it’s something else entirely when the name has multiple spellings and there’s no other good information. It’s particularly hard to find that person’s social media identity, perhaps the first place you’ll see their extreme views or other information that may provide additional leads or explanations of motives. And, in this case like many others, fraudulent websites are created as quickly as the event unfolds, further confusing the search for suspect identities. High quality but rapid social identity solutions are needed to understand a person’s identity when their official government identity is either unknown or insufficient. And these tools must not only be timely in order to have any value to law enforcement, they must also be accurate.

Social TTL: The concept of tagging, tracking, and locating (TTL) is well known in the intel and special operations communities. But seeing that one of the suspects was logging into his VK and Twitter accounts from his smart phone during the event exposed the need for a different kind of TTL. All of the technology capabilities to identify the user and track the location of his mobile phone exist, but were not readily available in a timely manner in Boston.

Phone Neutralization/Intercept: The explosive devices used during the marathon were apparently triggered with controllers from a radio-operated toy, though at first they appeared to have been detonated by mobile calls or messages, as with many other attacks of this nature. After the suspects were identified there was concern that they possessed additional devices and that those devices could be remotely detonated using mobile phones as well. Along with the Social TTL idea, there is a need to either neutralize, intercept or exploit the mobile phones of the suspects. This would have been even more essential with more assailants or a protracted standoff. Products exist that would allow law enforcement to disable a phone from communicating on the network, track it precisely and even send it direct messages.

Digital Canvassing: Digital cameras and video were not the only sources of information available at the time of, or leading up to, the explosions. There was also a high volume of Tweets, Facebook updates, Yelp check-ins, Instagram posts and even YouTube uploads. One idea for identifying potential witnesses or suspects is to play back all of those time-stamped posts to determine who was in the vicinity, and when. Similar to deploying policemen to canvass a neighborhood, a digital canvass would allow investigators to review what was in the public social space that might yield clues.
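A digital canvass reduces to a spatio-temporal filter: keep only the posts whose timestamp falls near the event and whose geotag falls near the scene. A minimal sketch, assuming posts already carry coordinates and timestamps (the coordinates, window, and radius below are illustrative, not real case data):

```python
import math
from datetime import datetime, timedelta

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def digital_canvass(posts, scene_lat, scene_lon, event_time,
                    radius_m=300, window_min=30):
    """Return posts geotagged within radius_m of the scene and timestamped
    within +/- window_min minutes of the event: candidate witnesses."""
    lo = event_time - timedelta(minutes=window_min)
    hi = event_time + timedelta(minutes=window_min)
    return [p for p in posts
            if lo <= p["time"] <= hi
            and haversine_m(p["lat"], p["lon"], scene_lat, scene_lon) <= radius_m]

# Hypothetical geotagged posts (coordinates and handles are made up).
event = datetime(2013, 4, 15, 14, 49)
posts = [
    {"user": "a", "lat": 42.3497, "lon": -71.0785, "time": datetime(2013, 4, 15, 14, 45)},
    {"user": "b", "lat": 42.3601, "lon": -71.0589, "time": datetime(2013, 4, 15, 14, 50)},  # too far away
    {"user": "c", "lat": 42.3496, "lon": -71.0783, "time": datetime(2013, 4, 15, 9, 0)},    # too early
]
nearby = digital_canvass(posts, 42.3497, -71.0785, event)
```

In practice the hard part is not the filter but assembling the posts: pulling time-stamped, geotagged content from many platforms quickly enough to matter.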

Behavioral Markers: Every friend of the suspects interviewed by the media said they were shocked by the attack: their friends had been normal Americans, and something must have triggered a fundamental change. Each time there's an event like Boston or Sandy Hook or the Gabrielle Giffords attack or the Aurora movie theater shooting, we seem surprised that these acts occurred and that we could only see the evidence after the fact. In reality, the behavioral 'markers' were there more often than not. But any attempt at analytical prevention or detection quickly encroaches on the privacy and civil liberties of people with psychological disorders, or of a given race or chosen religion. In light of the potential to save many lives, we must have the courage to do responsible research on the behavioral markers of people who are mentally or ideologically capable of committing mass murder. We must address the root causes and find signals we can detect in advance so that we can prevent these events from happening.

Smart Phones for Law Enforcement: Governments, from the Pentagon to local police departments, have been slow to embrace smart phones. This mainly stems from legitimate concerns about protecting sensitive information, determining acceptable use, and the high cost of migrating to a new device, and even from the uncertainty of choosing the right vendor. But it seems obvious that the Boston suspects had a real-time information advantage over those responsible for tracking them down. The smart phones the suspects carried would have allowed them to listen to the police scanners (I'm not sure they did, but I did, so they could have), tweet to their growing list of followers, monitor the news and call their mother. This "net centric warfare" gave them a time and information advantage over the chain-of-command information flow to radios and outmoded Blackberry email devices. Equipping cops with smart phones, connected to some of the information sources described above, would tip the playing field back in favor of law enforcement.

Information Security: The International Association of Chiefs of Police (IACP) and others have reported recently that law enforcement's use of social media is primarily to disseminate information rather than to monitor or engage. As former Homeland Security Secretary Michael Chertoff wrote in The Wall Street Journal recently, the Boston Police Department (BPD) did a fantastic job of using Twitter as an authoritative information source to quell rumors and enlist the public's help. However, this event also showed the need to control publicly available information that may be used by the adversary. I suspect BPD had forgotten, or didn't know, that its police scanners with detailed operational information were being streamed over the Internet. The rapid flow of information that is easily accessed by even the simplest smart phone raises the stakes for information- and cyber-security during events like Boston.
To close, I welcome your ideas, your comments, your additions, and your opposing viewpoints. In such a dialogue lies a tremendous opportunity for refinement and innovation of the tools and products that support our public safety and intelligence agencies.
# # #

Disclaimer: These observations are made from a distance; I was not part of the Boston response nor do I have input on these technologies from anyone who was. Moreover, this is being written while the event is still unfolding and nothing has yet been published about the tools and technologies that were actually used during the event. These observations and opinions, and any errors, are my own.

Bryan Ware is the CTO of Haystax Technology, a new analytics company focused on the defense and intelligence sector. Mr. Ware was the co-Founder and chief technology strategist for Digital Sandbox until its acquisition by Haystax. His current work is focused on intelligence, law enforcement, and financial industry applications particularly in real-time analytics, social media intelligence, and mobility.

U.S. eyes big data technology as a tool for preventing gun crime

□ U.S. eyes big data technology as a tool for preventing gun crime

○ With gun crime on the rise and emerging as a serious social problem in the United States, some argue that big data technology can be used to help prevent such crimes
– The U.S. has recently seen a surge in mass shootings with many victims, such as those at Virginia Tech, the Aurora movie theater, and Sandy Hook Elementary School in Newtown
– Despite growing calls for legal countermeasures, including opposition to gun ownership and tighter gun regulation, gun purchases continue to increase
– In particular, guns can be bought easily in the U.S. at general retailers such as Best Buy and Walmart, and in some areas can be purchased even without a gun license
– As a result, there are not even accurate official statistics on gun-related transactions in the U.S.

※ According to the U.S. Treasury Department, roughly 2 million of the 4.5 million small arms sold in the country each year are estimated to be personal firearms, and the Department of Defense reported that about 1.8 billion rounds of ammunition were sold as of 2005, illustrating how widespread gun ownership and use have become

○ Experts advise that the government should address the problem by building a big-data-based system that compiles gun and ammunition purchase records into a database and manages them systematically
– Experts point out that before committing large-scale gun crimes, most perpetrators exhibit similar behavior, such as buying large quantities of firearms from multiple gun shops over a period of time
– Perpetrators have also bought weapons through online and offline retailers such as big-box stores and eBay; James Holmes, the suspect in the Aurora theater shooting, was confirmed to have received his weapons via UPS (United Parcel Service), the largest delivery company in the U.S.
– They therefore argue that all gun-related purchase records should be stored in a database and monitored in real time, with surveillance and investigation triggered only when a sudden bulk purchase is detected
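The monitoring rule the experts describe (flag a buyer whose purchase count within some rolling window reaches a threshold) can be sketched as a simple sliding-window check. The threshold, window length, and records below are illustrative assumptions, not figures from the article:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def flag_bulk_buyers(purchases, threshold=4, window_days=30):
    """Return the buyer IDs with at least `threshold` purchases inside
    any rolling `window_days` window. Each purchase is (buyer_id, datetime)."""
    by_buyer = defaultdict(list)
    for buyer, ts in purchases:
        by_buyer[buyer].append(ts)

    flagged = set()
    for buyer, times in by_buyer.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window until it spans at most window_days.
            while times[end] - times[start] > timedelta(days=window_days):
                start += 1
            if end - start + 1 >= threshold:
                flagged.add(buyer)
                break
    return flagged

# Hypothetical purchase log: b1 buys four guns in under a month, b2 does not.
purchases = [
    ("b1", datetime(2012, 6, 1)), ("b1", datetime(2012, 6, 5)),
    ("b1", datetime(2012, 6, 12)), ("b1", datetime(2012, 6, 20)),
    ("b2", datetime(2012, 1, 1)), ("b2", datetime(2012, 7, 1)),
]
flagged = flag_bulk_buyers(purchases)
```

The design matches the policy intent: nothing is inspected by a human unless the automated rule fires, which is what limits the monitoring to "sudden bulk purchase" cases.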

□ Real-time situational analysis improves crime prediction and response... privacy concerns also raised

○ As big data technology, which analyzes behavior patterns from large volumes of structured and unstructured data, draws attention for its broad applicability, crime fighting has recently emerged as another promising application area
– Situational awareness solutions recently developed by data analytics software vendors such as SAP can be applied to public safety and policing
– Visualization technologies that convey big-data-based analysis more effectively have also been developed and put to use
※ SAP's situational awareness solutions:
① SAP HANA – produces the desired analytical results from public safety data
② SAP BusinessObjects BI – delivers specific information to command-and-control centers
③ SAP Sybase Mobile Platform – delivers crime-related data to the mobile devices carried by police officers and other personnel
– Such big data technologies are expected to improve surveillance, prediction, and responsiveness by providing real-time information to defense, police, and other public safety organizations

○ Systems for the real-time collection, analysis, and control of large volumes of information also help build the rapid alert networks needed when emergencies or incidents occur
– To minimize the damage from an incident, prompt reporting and alerts are essential so that personnel can be deployed to the scene as quickly as possible
– Andrew Watson of Greater London's Metropolitan Police Service notes that because the police's duty lies in crime prevention and surveillance, real-time situational awareness is essential to fulfilling that duty
– Chuck Canterbury, president of the Fraternal Order of Police, likewise stresses that real-time access to information is indispensable for police preparedness and response to crime

○ On the other hand, some oppose building such systems on grounds such as the infringement of citizens' privacy, suggesting that deployment will not be easy
– The National Rifle Association (NRA) has strongly opposed the collection of gun purchase and ownership information, arguing that it violates the privacy of ordinary citizens who buy guns for personal safety and of existing gun owners

[Sources]

1. Forbes, "Situational Awareness Technology Uses Big Data to Fight Terrorism", Dec. 27, 2012
2. The Atlantic, "How Big Data Can Solve America's Gun Problem", Dec. 27, 2012