The Closing of the Internet’s Open-Source Data

A few weeks ago I overheard a colleague, one of Thetus’ resident data scientists, as he wrestled with the interface for creating an account on a Chinese geocoding website.

Mandarin-language CAPTCHA dialogue

“Every stroke is perfect,” he told me as he stared at a Mandarin-language CAPTCHA dialogue. “I spent two hours getting it right.” All he wanted to do was find map coordinates for a large group of factories in mainland China, but Google didn’t recognize the addresses, and to Baidu Maps he might as well have been a poorly programmed bot. To this day we don’t know exactly where those buildings are.

But the language difficulties he faced then are just a small and naturally occurring symptom of a larger phenomenon that stands between open-source analysts and the data their duties require: a balkanization of information. By design the Internet is an open book, equally accessible to anyone with a connected device. Today, however, it is in the process of fracturing into a collection of tenuously connected shards.

The first cracks appeared in the 1990s as China’s ruling party began to institute controls on its citizens’ access to outside information. Implemented through content filters, IP blocking, and widespread monitoring of Internet users, the Great Firewall of China’s primary goal is to suppress outside data on its way to the viewer. Of greater interest to an outside analyst are its secondary effects: stricter controls on outside sites cause users to prefer China’s native alternatives for social media, searching, and commerce. The result is that outside analysts must exert greater efforts across a wider variety of sites to see the same information that their subjects do. Another challenge is the chilling effect that Internet monitoring has on speech – the more users feel obliged to self-censor, the less likely it is that observers can expect to find or exploit true and complete information.

Deep Maze (Flickr image by Paul Downey)

China is not the only authority attempting to filter or isolate its national networks. World governments have only grown more interested in similar policies, especially since the 2013 disclosures by former National Security Agency contractor Edward Snowden about the extent of the NSA’s electronic spying. Iran’s government has taken concrete steps to lay the foundation of an isolated “halal Internet”; most tech-savvy North Koreans have only ever seen the inside of the DPRK’s “walled garden” national intranet. Even Brazil has expressed serious interest in taking greater ownership of its Internet infrastructure, laying plans for its own undersea cable to Europe to avoid interception by America and flirting briefly with the idea of requiring web-based companies to store data on Brazilian clients inside the country.

At the same time, the Internet as a whole is expanding rapidly: almost three hundred million new websites emerged between 2013 and 2014, though most of these are probably of no interest to observers. Even the tiny minority that could have useful information expands an analyst’s search space massively. With access to certain viewpoints simultaneously becoming more difficult to gain, we risk becoming trapped in an echo chamber repeating homogeneous opinions and facts.

The takeaway for analysts working in the open-source environment is that we have to be better than ever in order to stay competitive. We have to seek out opposing viewpoints, verify our sources, and work harder to keep our biases in check. Perhaps the most important edge we can seize is technology – tools that can sift the ocean of unsorted data at our fingertips, bringing it some efficiency and structure. At Thetus, we are working to build better nets.

~ Colin McWrightman, Analyst

What’s in a Word?

You’re Probably Using the Wrong Dictionary is a delightful and stimulating essay by writer and programmer James Somers. Somers’ essay explores one’s relationship with words and their meaning, and the surprising influence the source of those meanings can have on that relationship.  In it, he discusses his search for the unusual dictionary described – but not named – by John McPhee.  A single definition proves enough to discover the rich and evocative definitions and usage notes from Noah Webster’s original 1828 dictionary of American English, which was the first of its kind. Somers then describes how those definitions have changed his relationship with even the most mundane of words when compared to the pedestrian entries found in modern dictionaries, which he describes as “desiccated little husks of technocratic meaningese, as if a word were no more than its coordinates in semantic space.”

Webster’s Revised Unabridged Dictionary (1913 + 1828) can be searched online.  I began looking up the words we use for the top-level modeling concepts at Thetus before I even finished reading the essay.  Some sample entries from the 1913 edition:

  • Thing: Whatever exists, or is conceived to exist, as a separate entity, whether animate or inanimate; any separable or distinguishable object of thought.
  • Person: self-conscious being, as distinct from an animal or a thing; a moral agent; a human being; a man, woman, or child.
  • Organization: The state of being organized; also, the relations included in such a state or condition.
  • Place: Any portion of space regarded as measured off or distinct from all other space, or appropriated to some definite object or use; position; ground; site; spot; rarely, unbounded space.
  • Event: That which comes, arrives, or happens; that which falls out; any incident, good or bad.

Compared to, say, the built-in dictionary found in OS X, these definitions are much more subtle and far less sterile.  Also, Event has a particularly nice example of usage notes.

Somers recommends you go look up some words in Webster, and I do too.  It may just change the way you feel about the words you use in everyday life.  It’s already changing the way I think about the words we use to describe models at Thetus.

Finally, the post has an appendix where Somers explains how to use the 1913 edition on a Mac, iPhone, or Kindle if you want to be able to use this dictionary every day.  Enjoy exploring!

~ Marijane White, Data Scientist


“You’re probably using the wrong dictionary,” May 18, 2014,

Unintended Consequences: Cause and Effect

American sociologist Robert K. Merton popularized the term unintended consequences, meaning the unanticipated or unforeseen outcomes of purposeful actions, in the twentieth century. Unintended consequences can be positive, but many times they are disastrous.

These consequences can be seen in all aspects of life. In nature, there are examples like the introduction of the giant African snail to Hawaii. The snails, at first prized, quickly became major pests throughout the Hawaiian Islands. The Hawaii State Department of Agriculture introduced the rosy wolf snail to curb the giant African snail population, only to find that the carnivorous rosy wolf snail had hunted all of the local snail species almost to extinction, including the indigenous Oahu tree snail.[1]

For analysts, unintended consequences are one of the biggest hurdles they face while doing their jobs. For any decision they make, analysts must not only speculate on what the unintended consequences of an event might be, but must also determine how to lessen or circumvent their impact.

Recent news articles have touched on the unintended consequences of terrorist violence, which has pushed refugees into countries that cannot support these immigrant populations, causing not only a strain on the refugees but also on the hosting countries’ populations.[2]

Another recent example of unintended consequences in the news is the Ebola epidemic. Ebola was a hot issue in the United States for about a month, in which time we saw only a few cases of Ebola in the US. However, in Africa, the unintended consequences are just starting to emerge. When Ebola was shown to be spreading beyond its original borders, many countries refused to allow traffic to and from these places. This isolation hampered the medical community, making it harder to get medical supplies and practitioners to the places that needed them the most. It also impacted the economies of the countries most affected by Ebola. The African Development Bank expects the disruption in trade caused by these isolation measures to cut GDP growth in the three hardest-hit countries (Guinea, Sierra Leone and Liberia) by 1.5 – 3.4 percent.[3]

The UN Food and Agriculture Organization says the epidemic is endangering harvests[4] and raising food prices. The creation of quarantine zones has created labor shortages, hampered cash-crop production and led to panic buying. Countries that already have less money are now unable to import resources they need for their populations—populations that now have a higher unemployment rate along with the inability to secure the resources that they need to live.[5]

Every day, we see news reports about events like these all over the world. While some of them are interesting and sometimes scary, a lot of the time it’s hard to care because these events are happening so far away. Many times it’s hard to see how these things might affect us, but it’s important to remember that the chain of events leading from a single decision can affect not only the originally intended populace, but you as well.

~ Sierra Payne, Analyst Lead







Cartel Alliances and Tactics Through the Eyes of Savanna

Say hello to my little friend!

When envisioning cartels, Tony Montana’s violent, drug-lord lifestyle is what comes to mind for many: a distant stylized life of crime and deception far removed from us. And while Scarface is an entertaining film, its depiction of cartels is a perfect example of Hollywood’s misleading portrayal of an increasingly real problem throughout the world.

In the last decade, Mexico has been torn apart by the rise of numerous cartels. Over 60,000 people have been killed as the Mexican military, police force, cartels and vigilantes continue to fight one another (“Mexico’s 7 Most Notorious Drug Cartels”). With the introduction of a new President three years ago, Mexico saw a shift in cartel power when two major cartel leaders were arrested. However, with the leaders gone, cartels now fight for territory, and a byproduct has been an increase in extortion and kidnapping. So what can the Mexican government do to prevent cartel crimes?

Thetus decided to take a closer look at the nature of cartels and how they operate. With Savanna, our all-source analysis software, and the powerful data analysis capability of IBM’s Analyst’s Notebook, we investigated cartel operations and tactics to determine potential prevention strategies.

Crumbnet of cartel alliances on the left with a Map of Instagram activity on the right.

To begin, we built a Crumbnet, Savanna’s narrative analysis tool, to outline cartel alliances in Mexico, focusing on relationships and connections between the 7 major cartels. Savanna’s dynamic Occurrence dossiers helped us gather existing information on each Mexican cartel and its relationships, with connections to related events and participants. Through Savanna’s powerful Search tool, we found an existing Analyst’s Notebook Chart that outlined the leadership change in the Los Antrax cartel and its connections to the Sinaloa cartel. We then uploaded Instagram hashtag data to the Map tool to get a geospatial view of the presence and influence of Los Antrax in Mexico. At this point, it was clear that the connections between cartels and the consequences of these relationships required further investigation, so we built a report in Savanna’s Note tool to share with team members and send to investigators for further action.

View each step of our cartel analysis in the demo below.


“Mexico’s 7 Most Notorious Drug Cartels,”, last modified October 20, 2014,

Introducing Analyst’s Notebook in Savanna 4.2

It seems like only yesterday that we released Savanna 4.1, and here we are with a new release and exciting news: IBM Analyst’s Notebook is here! Savanna 4.2, our all-source analysis solution, adds many new features and enhancements, including integration with Analyst’s Notebook, giving you comprehensive, easy-to-use tools to make your analysis experience intuitive, fun and fast.

While there were a variety of added features and enhancements, we will focus on three key developments: integration with IBM Analyst’s Notebook, enhanced Dashboard and temporal Map filters.

IBM Analyst’s Notebook integration

The integration of Analyst’s Notebook’s data analysis capabilities and Savanna’s dynamic concept modeling tools offers a holistic suite of tools catering to all areas of investigation.

4.2’s integration with Analyst’s Notebook enables analysts to work seamlessly with Analyst’s Notebook Charts (ANB Charts) within the Savanna environment. After compiling data in Analyst’s Notebook, you can easily upload, search and view ANB Charts in Savanna, with all data indexed to allow Savanna’s Search tool to pull key terms from within a Chart for quick discovery and analysis.

View Charts in Savanna

For example, you might be investigating cartel movements in Mexico and have a Chart file showing connections between numerous cartel members. After uploading the Chart to Savanna, you can Search, view and interact with the Chart. For instance, you might want to zoom in on key relationships between cartel members and take a screenshot to be used later in your analysis.

Dashboards: Customizable, Collaborative and Really Cool

The new Dashboard feature offers customizable, problem-specific hubs from which to launch your analysis workflow.

Analysts have long used Savanna to create problem areas (Spaces) where they can house and organize the information that they have collected and created. For example, an analyst might create a Space to house content and findings related to their analysis work about Mexican cartel movements.

Now, Dashboards act as the home page of each Space, giving you easy access to important information and providing alerts for recent activities, uploads, and models created within a Space. For example, a fellow team member may have uploaded an ANB Chart related to your current analysis. The Dashboard provides an alert to help you stay on top of the most current information available.

4.2's new Dashboard is easy to customize for quick collaboration.

Temporal Filters for Timely Information

With Map’s new temporal filter, you can quickly filter data to reveal date and location patterns at a glance. Simply drag the filter over a specific period of time to view data points on Map relevant only to the selected period. The temporal filter also helps you view data as part of a larger historical whole, making it easy to discover trends that reveal themselves over time in a geospatial context.
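For readers who think in code, the idea behind a temporal filter can be sketched in a few lines of Python. This is only an illustration of the concept, not Savanna's implementation: the point records, their field layout, and the `filter_by_period` helper are all hypothetical.

```python
from datetime import date

# Hypothetical map points: (label, latitude, longitude, date observed).
# These values are invented for illustration only.
POINTS = [
    ("post-1", 24.80, -107.39, date(2014, 3, 2)),
    ("post-2", 25.79, -108.99, date(2014, 6, 15)),
    ("post-3", 24.02, -104.65, date(2014, 9, 30)),
]


def filter_by_period(points, start, end):
    """Keep only the points whose date falls within [start, end], inclusive."""
    return [p for p in points if start <= p[3] <= end]


# Dragging the filter over June through December 2014 would correspond to:
selected = filter_by_period(POINTS, date(2014, 6, 1), date(2014, 12, 31))
print(len(selected), [p[0] for p in selected])
```

The filtered list carries both pieces of information the filter surfaces: which points fall in the selected period and how many there are.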

Filter Map data temporally to quickly view date range and count.

To keep up with upcoming releases and new features, follow our blog. In the meantime, you can keep yourself entertained by watching Savanna in action on our YouTube page. Until next time.

Investigating Visa Fraud with Savanna

Although it’s not a story often portrayed in the news, visa fraud is a widespread problem with a multitude of variables, making it tricky to track and prevent. One type of visa that is of particular interest is the H-2B visa for temporary or seasonal nonagricultural labor. The US plans to admit 66,000 workers under H-2B visas in 2015, and the cap of 33,000 for the first half of the year was reached on January 26th (“Cap Count for H-2B Nonimmigrants”). While many applications are legitimate, and criminal prosecutions for H-2B violations are rare, abuse of the program is common (“Officials at N.C. company International Labor Management are charged with visa fraud”). Employers ask for more workers than they need, or ask for workers for longer than the standard seasonal period. Unlike the H-2A program, the H-2B program does not make employers responsible for housing costs, so H-2B workers cost employers less. This makes fraud and abuse of the H-2B visa program more prevalent.

Analyst's Notebook Chart finds connections between pending H-2B applicant and suspicious sponsor employer

Combining Savanna, our all-source analysis software, with the powerful data analysis of IBM Analyst’s Notebook, Thetus decided to take a closer look at H-2B visa fraud to determine potential prevention strategies.

With IBM’s Identity Insight and Analyst’s Notebook, we were able to visualize a suspicious pending H-2B application and all of its related connections. This helped us identify and flag several pending H-2B applications of concern that have multiple sponsor contact names but only one sponsor employer, which is a common pattern of fraudulent activity.
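The pattern itself (many sponsor contact names, one sponsor employer) is simple enough to sketch in Python. To be clear, this is not how Identity Insight or Analyst’s Notebook work internally. It is a minimal illustration of the check, and the application records, field names, and the `flag_employers` helper are all invented for the example.

```python
from collections import defaultdict

# Hypothetical pending applications; every name and value here is invented.
APPLICATIONS = [
    {"id": "A-1", "employer": "Acme Landscaping", "contact": "J. Smith"},
    {"id": "A-2", "employer": "Acme Landscaping", "contact": "R. Jones"},
    {"id": "A-3", "employer": "Acme Landscaping", "contact": "T. Lee"},
    {"id": "A-4", "employer": "Harbor Seafood", "contact": "M. Diaz"},
    {"id": "A-5", "employer": "Harbor Seafood", "contact": "M. Diaz"},
]


def flag_employers(applications, min_contacts=2):
    """Return employers whose applications list several distinct contact names."""
    contacts = defaultdict(set)
    for app in applications:
        contacts[app["employer"]].add(app["contact"])
    # Keep only employers with at least `min_contacts` distinct contact names.
    return {
        employer: sorted(names)
        for employer, names in contacts.items()
        if len(names) >= min_contacts
    }


flagged = flag_employers(APPLICATIONS)
print(flagged)
```

A flag like this is only a starting point for an analyst. The threshold of two distinct contacts is an arbitrary assumption, and a real investigation would still need the surrounding connections to decide whether an application is actually of concern.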

After importing the Analyst’s Notebook Chart built around the suspicious H-2B applicant and his connections into Savanna, we can capture and expand on discoveries made in Analyst’s Notebook and find out more information on the suspicious sponsor employer.

We build a Crumbnet, Savanna’s mind-mapping tool, to outline the discoveries from Analyst’s Notebook and frame our analysis in narrative form. Savanna’s dynamic Occurrence documents help us compile existing knowledge and connections on each pending H-2B applicant and the suspicious sponsor employer. A quick keyword Search reveals previously built Charts uploaded by another Savanna user identifying the suspicious sponsor employer as a previously investigated company with ties to drug cartels. At this point, it is clear a larger visa fraud investigation is needed and we compile our findings in a Note to share with team members and send to investigators for further action.

View each step of the visa fraud analysis in further detail in the full demo video below:


“Cap Count for H-2B Nonimmigrants,”, last modified March 4, 2015,

Ken Otterbourg, “Officials at N.C. Company International Labor Management are Charged with Visa Fraud,”, last modified February 20, 2014,

Artisanal Handcrafted Ontologies

Here at Thetus, we use the OWL Web Ontology Language to create our semantic knowledge models. Engineers working with OWL typically edit their models with tools like Protégé or TopBraid Composer, but sometimes you want or need to create a model by hand.  Semantic Modelers at Thetus often edit ontologies by hand because our knowledge modeling engine, Publisher, uses our own in-house OWL serialization syntax, known as Thetus Markup Language (TML).  TML is an XML syntax that was created as a friendlier alternative to RDF/XML.

Maintaining a proprietary serialization syntax and the toolchain to support it is a lot of work, so we’ve begun to consider alternatives.  As the Semantic Web has matured, many serialization syntaxes for RDF and OWL have been proposed, and choosing one can be a little overwhelming.  For handcrafted ontologies, many Semantic Web veterans recommend using Turtle, but they don’t often explain why it’s the best choice.  Working with our own syntax has given modelers at Thetus some strong opinions about what we were looking for in a new one, and since we weren’t very familiar with many of the available syntaxes, we decided to do some research and draw our own conclusions.

The modeling team at Thetus reviewed about a dozen syntaxes for this effort: the OWL Functional Syntax, OWL/XML, RDF/XML, Manchester Syntax, Turtle, N-Triples, JSON-LD, Notation 3 (N3), TriG, TriX, and N-Quads.  With the help of various syntax conversion tools, we converted some ontologies we’ve worked with to each of these syntaxes so we could assess their readability.  We also spent a lot of time reviewing the documentation for each syntax and scouring the web for expert opinions on them.  We developed a set of criteria for evaluating syntaxes, weighted the criteria, and then ranked the syntaxes against them.


Our criteria for editing models by hand fell into two main categories. First, the syntax should be easy for humans to read and write. During the review, four human-friendly qualities stood out in particular:

  1. The syntax provides an alternative to using absolute IRIs, such as namespace prefixes or relative IRIs.
  2. The syntax doesn’t force the author to write flat triples, offering features such as inline or nested blank nodes, collections, or lists.
  3. The syntax provides a shorthand for literals and common predicates.
  4. The syntax is not XML, as many people do not enjoy typing numerous angle brackets.

Second, the syntax should have good support from existing Semantic Web tools, such as Apache Jena, the OWL API and various RDF and OWL conversion tools, because you don’t want to invest a lot of effort into building a model only to find that existing tools can’t read it. Since we were focused on hand-editing models, we didn’t take into consideration whether Protégé or TopBraid Composer supported the syntaxes.

Given the human-friendly qualities above, it became easy to rule out several syntaxes immediately.  We learned that four of the syntaxes are closely related: Turtle, N3, N-Triples and TriG, which we began to refer to collectively as the Turtle family. Of the Turtle family, N-Triples lacks the first three qualities, so it was easy to rule out.
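To make the contrast concrete, here is a minimal Python sketch that holds one small class definition in Turtle and then spells out the same statements as the flat triples N-Triples requires. The `ex:` namespace and the tiny Person/Organization model are invented for illustration and do not come from a Thetus ontology. The `expand` and `to_ntriples` helpers are hypothetical and handle only the handful of terms used here, so this is not a general converter.

```python
# Hypothetical namespace plus the standard RDF, RDFS and OWL namespaces.
PREFIXES = {
    "ex": "http://example.org/model#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}

# Turtle: namespace prefixes (quality 1), a nested blank node (quality 2),
# and the `a` and `;` shorthands (quality 3).
TURTLE = """\
@prefix ex:   <http://example.org/model#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

ex:Person a owl:Class ;
    rdfs:label "Person" ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:memberOf ;
        owl:someValuesFrom ex:Organization
    ] .
"""

# The same statements as flat triples. The nested blank node has to be
# given a label (_:b0) and repeated as the subject of its own triples.
TRIPLES = [
    ("ex:Person", "rdf:type", "owl:Class"),
    ("ex:Person", "rdfs:label", '"Person"'),
    ("ex:Person", "rdfs:subClassOf", "_:b0"),
    ("_:b0", "rdf:type", "owl:Restriction"),
    ("_:b0", "owl:onProperty", "ex:memberOf"),
    ("_:b0", "owl:someValuesFrom", "ex:Organization"),
]


def expand(term):
    """Expand a prefixed name such as ex:Person into an absolute IRI."""
    prefix, _, local = term.partition(":")
    return f"<{PREFIXES[prefix]}{local}>"


def to_ntriples(triples):
    """Write each triple on its own line with absolute IRIs, N-Triples style."""
    lines = []
    for triple in triples:
        # Blank node labels and quoted literals are written as-is.
        terms = [t if t.startswith(("_:", '"')) else expand(t) for t in triple]
        lines.append(" ".join(terms) + " .")
    return "\n".join(lines)


print(TURTLE)
print(to_ntriples(TRIPLES))
```

Printing both side by side shows why N-Triples is tedious to write by hand: every IRI is spelled out in full on every line, and the structure that Turtle expresses with brackets has to be reconstructed by matching blank node labels.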

Of the other syntaxes, N-Quads also lacks the first three qualities. OWL/XML and TriX don’t do so well on the second quality because they lack nesting features.  All three of the normative syntaxes (OWL Functional, OWL/XML and RDF/XML) lack shorthand features.  OWL/XML, RDF/XML, and TriX were further ruled out because they are all XML formats. And if avoiding XML is important, JSON-LD also starts to look less attractive, because it requires typing lots of curly braces and is somewhat verbose.

This leaves us with three of the four Turtle family syntaxes—Turtle, N3, TriG—and the Manchester Syntax.  In our weighted ranking, they were in a four-way tie for first place across the human-friendly ease of use qualities, with good support for all four of our desired qualities and only minor differences in their coverage of the third quality.
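The elimination described above can be sketched as a small scoring table in Python. The sets of missing qualities come straight from the preceding paragraphs, but everything else is an assumption: the equal weights are hypothetical (our actual weighting is not reproduced here), any quality the text does not call out as missing is assumed to be present, and JSON-LD is left out because the objection to it is about verbosity rather than a yes-or-no quality.

```python
# Qualities: 1 = alternative to absolute IRIs, 2 = nesting instead of
# flat triples, 3 = shorthand for literals and predicates, 4 = not XML.
# Equal weights are a hypothetical stand-in for the real weighting.
WEIGHTS = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}

# Qualities each syntax is described as lacking in the text above.
# Anything not listed is ASSUMED to be present.
LACKS = {
    "Turtle": set(),
    "N3": set(),
    "TriG": set(),
    "Manchester Syntax": set(),
    "N-Triples": {1, 2, 3},
    "N-Quads": {1, 2, 3},
    "OWL Functional Syntax": {3},
    "OWL/XML": {2, 3, 4},
    "RDF/XML": {3, 4},
    "TriX": {2, 4},
}


def score(lacks):
    """Sum the weights of every quality the syntax does not lack."""
    return sum(weight for quality, weight in WEIGHTS.items() if quality not in lacks)


def rank(lacks_by_syntax):
    """Return (syntax, score) pairs, best score first, ties broken by name."""
    scored = {name: score(lacks) for name, lacks in lacks_by_syntax.items()}
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


ranking = rank(LACKS)
best = ranking[0][1]
leaders = sorted(name for name, value in ranking if value == best)
print(leaders)
```

With any positive weights, the syntaxes that lack nothing share the top score, which is the four-way tie described above. Breaking that tie takes a criterion outside this table.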

We used our second main category of criteria, tool support, to break this tie, which is how Turtle ended up leading the pack.  It has the most support of any serialization syntax other than RDF/XML, which is the only syntax that OWL 2 tools are required to support.  Additionally, if you are a SPARQL user, familiarity with Turtle is useful because it has a great deal of overlap with the syntax of SPARQL’s WHERE clause.

So now you know why Thetus thinks Turtle is the best syntax for writing ontologies by hand! What do you think?

~ Marijane White, Principal Engineer