graph data science Archives

In the rapidly evolving landscape of data analytics and artificial intelligence (AI), the recent talk by Dan McCreary, Head of AI at TigerGraph, at the NVIDIA GTC event stands out as a significant milestone. His presentation, titled “Enhanced Data Analytics: Integrating NVIDIA Rapids cuGraph with TigerGraph,” shed light on the critical importance of graph databases in AI and the groundbreaking work TigerGraph is doing in collaboration with NVIDIA. This blog dives into the key insights from Dan’s talk and the implications for the future of AI and data analytics.

The Critical Role of Graph Databases in AI

Dan McCreary kicked off his presentation by emphasizing the crucial role of graph databases in the realm of AI. Graph databases, unlike relational databases and other non-relational stores, are designed to handle highly interconnected data efficiently. This characteristic makes them particularly suited for applications that require the analysis of complex relationships between data points, such as fraud detection in banking, a field where TigerGraph has already proven itself with several successful implementations.

Drawing inspiration from Jeff Hawkins’ theories on the brain, as outlined in his books, Dan used a poignant quote to set the stage: “The key to artificial intelligence has always been the representation.” This statement highlights a fundamental challenge in AI: accurately modeling and representing the data in a way that machines can effectively process and learn from.

Navigating the Representation Problem in AI

Dan’s talk delved into the representation problem in AI, a crucial hurdle to achieving more advanced and efficient AI systems. He identified four key types of data representations used in AI today: images, sequences, tables, and graphs. Each of these representations has its domain of applicability and associated challenges, but Dan’s focus was on graph representations due to their ability to model complex relationships and dynamics.

One of the main challenges with graph data is its inherent sparsity and the difficulty of optimizing these representations for hardware. This is where the collaboration between TigerGraph and NVIDIA becomes pivotal. Dan walked the audience through the complexities of dense and sparse matrix representations and discussed the journey towards achieving a fully hardware-optimized graph system.
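To make the dense-versus-sparse distinction concrete, here is a minimal sketch in plain Python. It is not the cuGraph or TigerGraph API, and the four-vertex edge list and helper names are made up for illustration. It stores the same graph twice: as a dense adjacency matrix, and in compressed sparse row (CSR) form, one common sparse layout.

```python
# Illustrative only: a tiny directed graph as (source, target) pairs.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
num_vertices = 4

def to_dense(edges, n):
    """Dense adjacency matrix: n * n cells, most of them zero."""
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix

def to_csr(edges, n):
    """CSR: row_offsets[v] .. row_offsets[v + 1] indexes v's neighbors."""
    row_offsets = [0] * (n + 1)
    for src, _ in edges:
        row_offsets[src + 1] += 1
    for v in range(n):
        row_offsets[v + 1] += row_offsets[v]
    col_indices = [dst for _, dst in sorted(edges)]
    return row_offsets, col_indices

def neighbors(row_offsets, col_indices, v):
    """Slice out the neighbor list of vertex v from the CSR arrays."""
    return col_indices[row_offsets[v]:row_offsets[v + 1]]

dense = to_dense(edges, num_vertices)
row_offsets, col_indices = to_csr(edges, num_vertices)
dense_cells = num_vertices * num_vertices          # grows with n squared
csr_cells = len(row_offsets) + len(col_indices)    # grows with n + edges
```

The dense form spends a cell on every possible vertex pair, while the CSR form only stores the edges that exist. The trade-off is that CSR neighbor lists vary in length, which is part of what makes sparse graph data harder to map onto hardware built for regular, dense arrays.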

Leveraging NVIDIA’s RAPIDS cuGraph for Breakthroughs in Performance

The partnership between TigerGraph and NVIDIA has been instrumental in addressing the challenges of graph data analytics. Dan highlighted how TigerGraph is leveraging NVIDIA’s RAPIDS cuGraph libraries to tackle the problems associated with sparse matrix representations. The discussion touched upon the pros and cons of using Python for these tasks but underscored the substantial performance improvements enabled by NVIDIA’s RAPIDS libraries.

A highlight of Dan’s presentation was the demonstration of up to 100x speedups in performance when utilizing NVIDIA GPUs for algorithms like PageRank. This impressive achievement underscores the potential of graph analytics when combined with powerful hardware acceleration, offering a glimpse into the future of AI where graph representations play a central role.
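For readers who have not seen PageRank before, the following is a minimal CPU-only sketch of the power-iteration form of the algorithm in plain Python. It is not the cuGraph implementation and says nothing about the speedups from the talk. The damping factor of 0.85, the iteration count, and the four-page graph are assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of vertex -> list of targets."""
    vertices = list(out_links)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex starts each round with the "random jump" share.
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                share = damping * rank[v] / n
                for t in vertices:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical four-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
top_page = max(ranks, key=ranks.get)
```

Each iteration touches every edge once, so the work is dominated by many small, independent additions. That access pattern is the kind of workload that can be spread across the many cores of a GPU.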

The Synergy Between TigerGraph and NVIDIA: Pioneering the Future of AI Hardware

In closing, Dan McCreary expressed his gratitude towards NVIDIA for their partnership. This collaboration is not just about achieving short-term gains in performance but about jointly paving the way for the next generation of graph-optimized hardware. By combining TigerGraph’s expertise in graph database technology with NVIDIA’s leadership in GPU technology, the two companies are at the forefront of creating solutions that can handle the complexity and scale of tomorrow’s AI challenges.

The significance of Dan McCreary’s talk at NVIDIA GTC extends beyond the technical details of integrating cuGraph with TigerGraph. It represents a pivotal moment in the evolution of AI and data analytics, highlighting the shift towards graph representations as a key enabler of more sophisticated and effective AI systems. As companies increasingly migrate to graph representations to enhance their predictive capabilities, the work being done by TigerGraph and NVIDIA will undoubtedly play a crucial role in shaping the future of AI.

In an era where the ability to analyze and leverage complex relationships in data can provide a competitive edge, the advancements discussed in Dan’s presentation offer exciting possibilities. Whether in detecting banking fraud more accurately or in understanding customer behaviors and product dynamics, the integration of NVIDIA RAPIDS cuGraph with TigerGraph is setting new benchmarks for what is possible in AI and data analytics.

The journey towards a future where AI can more closely mimic the intricacies of human intelligence and decision-making is fraught with challenges. Yet, with visionaries like Dan McCreary leading the charge and fostering collaborations between industry giants like TigerGraph and NVIDIA, the path forward seems not only clearer but also significantly more promising. As we look ahead, the continued innovation in graph database technology and hardware optimization heralds a new era for AI—one that is more intelligent, efficient, and capable of understanding the complex web of relationships that define our world.

As the Head of Marketing at TigerGraph, I’m thrilled to extend a warm invitation to all enthusiasts, professionals, and curious minds to join us at the upcoming Graph + AI Summit 2024. This event is an absolute must for anyone interested in leveraging the transformative potential of graph technology and artificial intelligence. Here is why:

1. Exclusive Announcements and Sneak Peeks: As a participant of Graph + AI, you’ll be among the first to hear about our latest product announcements, updates, and future roadmap plans. Get exclusive access to sneak peeks, beta releases, and insider information straight from the source.

2. Great Networking Opportunities: At Graph + AI Summit, you’ll have the chance to connect with the industry leaders, innovators, and experts in the fields of graph databases and artificial intelligence. Rub virtual shoulders with professionals from organizations like Mastercard, KPMG, and JPMorgan Chase & Co, among others. Whether you’re a seasoned professional or just starting your journey, networking with like-minded individuals can open doors to collaborations, partnerships, and invaluable insights.

3. Cutting-Edge Insights and Case Studies: Our event will feature keynote speeches, panel discussions, and workshops led by top thought leaders and practitioners. Gain firsthand knowledge from real-world case studies showcasing how leading organizations are leveraging the synergy between graph databases and AI to drive innovation, solve complex problems, and unlock new opportunities.

4. Exclusive Insights from Industry Visionaries: We are honored to have Hamid Azzawe, TigerGraph’s CEO, presenting at the event. With a wealth of experience from Meta, Amazon, Microsoft, Bloomberg, RBC, AMFAM, and IBM, Hamid brings a unique perspective to the table.

5. Hands-On Workshops and Demos: Explore the practical applications of graph technology and AI through interactive workshops and live demonstrations. Whether you’re interested in building recommendation systems, fraud detection algorithms, or knowledge graphs, our workshops will provide you with the tools, techniques, and best practices you need to succeed.

Graph + AI Summit isn’t just another event—it’s a gathering of passionate individuals united by a common goal: to unlock the full potential of graph technology and artificial intelligence. Join us on this exciting journey of discovery, collaboration, and innovation. We can’t wait to see you there!

Save the Date: May 1-2, 2024

Location: virtual

Many executives are pondering difficult decisions about making large investments in AI. For many of them, their lack of a technical background makes it difficult for them to visualize the impact of AI on their customers, their products, and their employees. To help executives make the right strategic decisions, we need powerful storytelling in terms they can understand and remember.

I have been creating a set of stories and metaphors to guide executives when they need to make strategic decisions about AI investment. In testing, my Jellyfish and Flatworm story has been remarkably effective at helping them guide their peers. I would appreciate feedback from my readers on whether this story is sticky enough to guide your leaders.

At its core, this story is about why Knowledge Representation (KR) must be central to any cost-effective long-term AI strategy. We will see how Large Language Models (LLMs), Knowledge Graphs (KGs), and Reference Frames (RFs) are moving us closer to general AI and how building hybrids of these three knowledge representation strategies is the best path.

At the end of this story, you can start to ask if your organization is more like a jellyfish or a flatworm. Clues about how much you need to invest in AI will be clear. Let’s begin our story.

The Evolution of Animal Intelligence

About 600 million years ago, animals evolved cells that helped them react to environmental changes. Let’s start with the elegant jellyfish. Jellyfish live in the open ocean, far away from complex structures. A jellyfish only needs simple rules to navigate its environment. Jellyfish might move to depths that allow them to capture more prey and avoid their predators. But they are not hunters. They depend on fish drifting into their tentacles.

Jellyfish live in a relatively simple environment and need to be efficient with their energy use. They really don’t need a complex centralized nervous system to help them navigate the ocean. If jellyfish had a big complex brain that required energy, they would quickly starve. To survive, they needed to keep things simple.

In contrast, on the ocean floor, things were getting much more complicated. To seek their prey and avoid predators, animals like flatworms started to develop muscles to help them move around. They also developed more cells on their skin that could process complex signals such as light, temperature, and smell. They used these sensory systems to get detailed information about their environment. And like the jellyfish, they also developed rules to survive. But not all the rules stayed simple. Knowing both what rules to follow and when to follow them became more complex.

Flatworms are thought to be the first hunters.

Movement and the Evolution of the Central Nervous System in Flatworms

Movement requires animals to sense where they are in their environment and to remember when to execute rules. Image: DALL-E 3.

Then something really interesting started to happen. Flatworms started to centralize where these rules were executed. Putting them all near their front-facing sensors made sense. We now call that the “head” of our animals. They started to evolve a complex network of centralized nerve cells in their heads, which we now call a central nervous system, or CNS. These networks of communicating nerve cells evolved to become the brains of animals that move about in the world.

So why did they need to build such complex and energy-consuming cells? The key thing to understand is that movement makes executing rules complicated. A sea anemone, like a plant, sits in a single location on the ocean floor. It does not need to understand how things change if it moves. But any animal that moves needs to start to learn the structure of its environment. If it turns around 180 degrees, it needs to know that this helps it move away from predators. The bottom line is that we can’t really understand animal intelligence without a deep appreciation of how intelligence and models of our world are tied to motion and, importantly, maps and structure.

The Evolution of World Models in Brains

The flatworm had one of the first centralized nervous systems to help it navigate its world. Image credit: https://vimeo.com/37417377

Let’s explore why storing models of the world around it gave these flatworms a competitive advantage over their siblings. We ask, how can we have more precise ways to know what rules to execute and when to execute these rules in order to survive?

Imagine two flatworms: one with a precise model of the world around it in its brain, and another without one. As these animals crawled out of their holes to seek their prey, those with a more precise model would remember where the best food was. They could also remember where predators threatened them. You can think of these models as internal maps of a flatworm’s surroundings. They used these models to give themselves a competitive advantage. They had more offspring, and these offspring also built more precise models of the world around them. We call these models “world models” because their structure represents the world around them.

The key point here is that these early nervous systems evolved into many other much more complex systems that have become our brains. Humans out-competed our extinct ancestors because we could model the world and predict how our actions could help us survive and out-compete our rivals. Modeling what is in our prey or predator’s brain can also be very helpful for survival. Does that mammoth think strategically about the consequences of being headed toward a cliff?

In summary, animals have brains that are predictive organs that must model their world and build mental maps of their world. These models advise us on what actions to take to help us survive. They also give us advice on the consequences if we don’t think strategically about the complex systems around us.

Let’s apply what we learned about jellyfish and flatworms to our organization.

Language Models Are Not World Models

Now, you might ask, “What does this all have to do with AI?” Much discussion has been about LLMs and how they are used to generate text. But these language models are fundamentally different from the world models in our brains. Let’s consider how they are different.

LLMs are used to predict the next word given a sequence of preceding words. They were never designed to store accurate models of the real world. Language is a collection of symbols we use to describe our world. When we need to communicate ideas between people, we generate sequences of words that fit within patterns called grammar and syntax. Although tools like ChatGPT and Llama 2 are incredibly useful, they were never designed to model the world and simulate the impact that our actions would have on the future states of our worlds.

Don’t get me wrong here. I love my GPT-4! But we must be clear. Modeling language is only a shadow of how we communicate about the world. It really is not a precise model of the world. It can be complemented with actual models of the world, but fundamentally, the knowledge representation distributed through billions of weights in a neural network has severe limitations with precision, reproducibility, truthfulness, performance, and explainability.

Knowledge Graphs ARE World Models

Many of my readers know that I have been deeply involved in building large-scale Enterprise Knowledge Graphs (EKGs) for the last six years. Before that, I wrote books on the tradeoffs of using various NoSQL databases. I am a person who can quickly visualize how knowledge is represented in computers, and my goal is to explain the tradeoffs of these alternative representations.

Knowledge graphs are the closest thing we have today to modeling our world in computers. Oh, and by the way, if you pick the right graph database, you can get it to scale out over hundreds of servers. Google, Amazon, LinkedIn (Microsoft), and even Pinterest have proven this for over ten years.

Just as the flatworm needed to model the structure of its environment by building precise maps, knowledge graphs are the best way for us to manage structure. This can be the structure of our customers, our products, and our competition.

Animal Brains Use Reference Frames

Now, we come to the most interesting fact. Our brains don’t really store data like large language models or knowledge graphs. We store knowledge in a form that evolved from building maps of our world. These are called Reference Frames and are described clearly in Jeff Hawkins’ book A Thousand Brains. Unlike an LLM, their knowledge can be continually updated. And just like scale-out distributed knowledge graphs, their processing is done in parallel. I won’t go into too much detail on reference frames here, but look to innovative companies like Numenta to combine reference frames with LLMs to build new AI systems.

The take-home point is that reference frames can teach us many things about intelligence and how we need to use maps and structure to help us make better predictions. There will be more to come on this topic in future blog posts.

Measuring Complexity In Your Organization

Imagine a simple company. You make a single product for a single customer and have no competition. Image by the author.

So, should you be building a model of the world in your internal computer systems? Let’s take a look at what a simple company might be.

Imagine you supply a single specialized part to another manufacturer. You are good at what you do and get the same contract every year. You don’t really have any competition. I would describe this company as living in a simple environment, much like the jellyfish living in the open ocean. We can call this company a “jellyfish company.” You can probably model your organization using a spreadsheet or a relational database that uses a flat-file representation of the world, with a few very slow JOIN operations if things get complicated. Your IT department doesn’t need a huge budget.

Now, let’s look at a more complex company: one that has many customers and sells many products, each of which has many competitors. Its structure might look like the following:

You can see that you need a complex model of your world to sell your products in a highly competitive landscape to many types of consumers. You are more like a flatworm company than a jellyfish company. You need complex models that include structure, relationships, precision, explainability, and the ability to add new complexity at will.

How you manufacture and market your products can be dauntingly complex. Can you simulate the impact of a price increase on one of your products? Are you modeling customer behavior? Can you predict the impact of a new marketing campaign? Can you explain why sales of some items are dropping off? If you can’t do this today, it might be that your model of the real world is too simple and too flat without structure. You might need to invest in using a combination of knowledge graphs and LLMs to accelerate your ability to predict the future.

Conclusion

Today, we are seeing unprecedented investments in artificial intelligence. The first wave is mostly investment in tools to make it easier for firms to build intelligent agents that help worker productivity. But all the agent software in the world might not help if your data is trapped in spreadsheets and data silos. Knowledge needs to be centralized and connected.

Today, jellyfish companies are exceedingly rare. Most companies must deal with rapidly evolving complexity and make precise predictions that require accurate models of the world around them. Companies must focus on building the foundations that will power thousands of intelligent agents working together on centralized knowledge. And remember, going to the cloud will not save you if you have 1,000 silos.

Let me know if this story works for you. Can you tell this story to executives and ask them, “Are we a jellyfish or a flatworm company?” Ask them if a centralized knowledge graph would help them answer hard questions about their customers, products, and competitors.

If you would like to hear how TigerGraph can help your organization build a centralized nervous system, contact us at info@tigergraph.abstage.xyz.

In the rapidly evolving world of customer-facing businesses, providing an exceptional omnichannel customer experience has become the key to success. As online retail sales have soared over the last decade, it has become evident that connecting data across various silos is essential for a true omnichannel approach. In this blog post, we will explore how TigerGraph, a powerful graph database platform, is helping large customer-facing businesses create a connected customer platform, enabling them to leverage data effectively, improve customer interactions, and boost profits.

The Challenge of Consolidating Customer Data

Creating a comprehensive and coherent dataset that integrates everything known about customers, their purchasing behavior, and service usage is the foundation of a connected customer platform. However, consolidating these datasets is often a daunting task, and many businesses have struggled to achieve it successfully.

Retailers often face the challenge of dealing with messy customer data, multiple accounts for a single customer, and inconsistencies when they have grown through acquisitions. Moreover, purchase decisions are made at the customer or household level, but the data is often at a device or account level, leading to potential inaccuracies in models and insights.

The Power of Graph Databases

While traditional databases have failed to effectively connect data across silos, graph databases have emerged as a game-changer. Unlike traditional tabular databases, graph databases work on networks of connected data, allowing businesses to structure their databases as vast networks of customer-related information.

Graph databases offer several advantages in consolidating data, including:

Flexibility: Graphs don’t impose a single structure on all records, enabling connections between data from different silos, even with inconsistent structures and many-to-many relationships.
Graph Entity Resolution: Using this technique, the graph can quickly identify likely matches between data from different systems, resolving them through fuzzy matching, address standardization, and machine learning.
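The graph entity resolution idea above can be sketched in a few lines of plain Python. This is a simplified illustration, not TigerGraph's implementation: it links records only on exact matches of normalized email or phone values and groups them with a union-find structure, whereas a production system would add fuzzy matching, address standardization, and machine learning as described. The records and field names are hypothetical.

```python
# Hypothetical customer records from several source systems.
records = [
    {"id": "crm-1", "email": "Ann.Lee@example.com", "phone": "555-0101"},
    {"id": "web-7", "email": "ann.lee@example.com", "phone": None},
    {"id": "pos-3", "email": None, "phone": "555-0101"},
    {"id": "crm-2", "email": "bob@example.com", "phone": "555-0199"},
]

parent = {r["id"]: r["id"] for r in records}

def find(x):
    # Follow parent links to the representative, compressing the path.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Two records are connected when they share a normalized attribute value.
first_seen = {}
for r in records:
    for key in ("email", "phone"):
        value = r[key]
        if value is None:
            continue
        token = (key, value.strip().lower())
        if token in first_seen:
            union(r["id"], first_seen[token])
        else:
            first_seen[token] = r["id"]

# Each connected component is treated as one resolved customer.
clusters = {}
for r in records:
    clusters.setdefault(find(r["id"]), set()).add(r["id"])
resolved = sorted(sorted(c) for c in clusters.values())
```

Note that "web-7" and "pos-3" share no attribute with each other. They end up in the same cluster only because both connect to "crm-1", which is the kind of indirect link a graph makes easy to follow.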

TigerGraph: Transforming Customer Data and Driving Omnichannel Profits

TigerGraph has emerged as a leading graph database platform, delivering unparalleled performance and scalability. Its ability to handle real-life retail and banking datasets up to 30 times larger than its closest competitor and its remarkable speed, up to 1000 times faster, make it a perfect fit for large customer-facing businesses. Two enterprises illustrate how these capabilities have been put to use:

Multichannel Retailer: By leveraging TigerGraph, a large multichannel retailer was able to bring together data from five legacy acquisitions and connect family units of customers using multiple devices, payment cards, and addresses. This allowed them to market consistently across all customer touch points and resulted in a 17% increase in customer engagement.
Global Media Conglomerate: Another success story involves a global multichannel media conglomerate that merged data from 15 independent divisions to create the first and largest identity graph in the advertising industry. This enabled them to target audiences with personalized commercials aligned with their interests, leading to improved advertising performance.

The importance of a connected customer platform cannot be overstated in today’s customer-centric business landscape. TigerGraph is empowering large customer-facing businesses to consolidate data across silos, improve customer interactions, and drive omnichannel profits. As the platform continues to gain recognition and accolades, it remains a valuable asset for businesses seeking to deliver a true omnichannel customer experience.

If you’re interested in exploring how TigerGraph can transform your customer data and drive profits in your business, you can sign up for a free instance of TigerGraph Cloud at tgcloud.io or contact us at info@tigergraph.abstage.xyz.

The world has always been built around connections, but the world today is more hyper-connected than ever before.

Tapping into the power of these rich, growing networks – whether that be financial transactions, social media networks, recommendation engines or global supply chains – will make or break the bottom-line of tomorrow’s leading enterprises.

Given this critical importance of connections in the modern business environment, it’s about time that our database technology kept up.

Legacy databases (known as relational databases or RDBMS) were built for well-mapped, stable and predictable processes like finance and accounting. These databases use rigid rows, columns and tables that don’t require frequent modifications, but when the database model does need to change, it’s an expensive hassle.

But today’s business world is in regular flux – change is the only constant. When building software applications, business and user requirements change all the time. And yet, most legacy database software fights against these changes rather than evolving with them.

Enter graph databases. The graph database model is built to store and retrieve connections from the ground up. It’s more flexible, scalable and agile than RDBMS, and it’s the optimal data model for applications that harness artificial intelligence and machine learning. AI and ML thrive on connected data, and that’s exactly what graph technology delivers.

So, what’s a graph database and what’s it good for? I’m so glad you asked.

What Is a Graph Database?

A graph database stores two kinds of data: entities and the relationships between them.

Data entities are stored as vertices (or sometimes nodes) and data relationships are stored as edges. Vertices represent nouns: people, places, products, locations, payments, and more. Edges represent the verbs or relationships that connect various vertices. This network of interconnected vertices and edges is called a graph.

For example, a customer (vertex) has (edge) a shopping cart (vertex). The “has” edge connects the customer vertex and the shopping cart vertex.

Here’s another example: An app user (vertex) sends (edge) a payment (vertex) directed to (edge) another app user (vertex). The two app user vertices are connected to the payment vertex via the sends edge and the directed to edge, respectively.

In addition, vertices can have attributes which add more details to each record within a vertex. For instance, a customer vertex might have attributes like name, phone number and credit card number.
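Here is a minimal sketch of the payment example as data structures in plain Python. The class names, field names, and sample values are illustrative assumptions, not a TigerGraph schema. Vertices carry a type and a dictionary of attributes, and edges carry a type plus the IDs of the two vertices they connect.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vertex_id: str
    vertex_type: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Edge:
    edge_type: str
    source: str
    target: str

# An app user sends a payment directed to another app user.
vertices = {
    "u1": Vertex("u1", "AppUser", {"name": "Ana"}),
    "u2": Vertex("u2", "AppUser", {"name": "Ben"}),
    "p1": Vertex("p1", "Payment", {"amount": 25.0}),
}
edges = [
    Edge("sends", "u1", "p1"),
    Edge("directed_to", "p1", "u2"),
]

def payees_of(user_id):
    """Follow sends -> directed_to to find who a user has paid."""
    payments = [e.target for e in edges
                if e.edge_type == "sends" and e.source == user_id]
    return [vertices[e.target].attributes["name"] for e in edges
            if e.edge_type == "directed_to" and e.source in payments]
```

Calling payees_of("u1") walks two edges, from the user to the payment and from the payment to the recipient, and returns the recipient's name attribute.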

Graphs are often best understood visually. The images below are all graphs of vertices and edges that are stored in a graph database.

Graph database software stores all the records of these interconnected vertices, attributes, and edges so that they can be harnessed by various software applications. In other words, graph databases store networks of interrelated data.

What Is a Native Graph Database?

As graph technology grows in popularity, more and more database vendors offer “graph” capabilities alongside their existing data models (such as relational, document, wide column, key-value or other NoSQL stores). But the trouble with these graph add-on offerings is that they’re not optimized to store and query the connections between data entities.

If your application frequently needs to store and query data relationships, then you need a native graph database.

The key difference between native and non-native graph technology is what it’s created for. A native graph database – like TigerGraph – uses something called index-free adjacency to physically point between connected vertices in the database. This ensures connected data queries are highly performant.
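As a rough conceptual illustration of index-free adjacency, and not a description of TigerGraph's actual storage engine, the sketch below contrasts two ways of finding a vertex's neighbors in plain Python. One scans a separate edge table. The other follows references stored directly on the vertex. All names are hypothetical.

```python
class Node:
    """A vertex that holds direct references to its neighbors."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct pointers, no lookup structure needed

def neighbors_via_table(edge_table, name):
    # Non-native style: search a separate edge table for matching rows.
    return [dst for src, dst in edge_table if src == name]

def neighbors_via_pointers(node):
    # Native style: follow the references stored on the vertex itself.
    return [n.name for n in node.neighbors]

edge_table = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
nodes = {name: Node(name) for name in ("alice", "bob", "carol")}
for src, dst in edge_table:
    nodes[src].neighbors.append(nodes[dst])

from_table = neighbors_via_table(edge_table, "alice")
from_pointers = neighbors_via_pointers(nodes["alice"])
```

Both calls return the same neighbors. The difference is cost: the table scan grows with the total number of edges in the graph (or with the depth of an index over them), while following pointers depends only on how many neighbors that one vertex has.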

Essentially, if a database model is specifically engineered to store and query connected data then it’s a native graph database. If the database was first engineered for a different data model and only added “graph” capabilities later, then it’s a non-native graph database.

Non-native graph data storage is often slower because all of the relationships in the graph have to be translated into a different data model (and then back again) for every graph query.

While these differences might not appear critically important, it all comes down to why you’re using a graph database in the first place.

Why Use a Graph Database?

If your application frequently queries and harnesses the relationships between users, products, locations, or any other entities, then you’re better off using a best-in-class native graph database. The same is true if your use case leverages network effects or requires multiple-hop queries across your data.
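A multiple-hop query can be sketched as a breadth-first traversal that stops after a fixed number of hops. The plain Python below is an illustration under assumed data, not a graph query language example. The adjacency dictionary of "knows" relationships is made up.

```python
from collections import deque

def within_hops(adjacency, start, max_hops):
    """Return {vertex: hop distance} for vertices reachable in <= max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(current, []):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance

# Hypothetical "knows" relationships.
adjacency = {
    "ann": ["bo"],
    "bo": ["cy", "di"],
    "cy": ["ed"],
    "di": [],
    "ed": [],
}
two_hops = within_hops(adjacency, "ann", 2)
```

In a relational model, each additional hop typically means another self-join on an edge table. In a traversal like this one, each hop is another round of following neighbor lists.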

A graph database is quicker for your development team to modify and quicker for your application to query. Graph database technology also grows and evolves alongside your business and application requirements – it never lags behind or gets stuck in the past.

And it almost goes without saying that if your enterprise relies on graph analytics or graph data science, then you need a native graph database to ensure real-time performance for mission-critical applications.

What Are Graph Databases Used for?

The real question is what are graph databases not used for? The use cases for graph technology are vast, diverse and growing. Here’s a rundown of some of the most popular graph database use cases out there today:

Most Popular Graph Database Use Cases:

Artificial Intelligence & Machine Learning

Fraud Detection

Recommendation Engines

Increase Revenue:

Customer 360 / Master Data Management

Entity Resolution

Product & Service Marketing

Reduce Costs & Manage Risks:

Anti-Money Laundering

Risk Assessment & Monitoring

Cybersecurity Threat Detection

Improve Operational Efficiency:

Supply Chain Analysis & Management

Energy Management System & Analytics

Network Resources Optimization

Foundational Technology:

Graph Data Science

Time-Series Analysis

Geospatial Analysis

…and a lot more! Graph technology is a tool to build the future, so there’s no limit to the use cases you might discover.

Who’s Already Using Graph Databases?

Graph databases have been skyrocketing in popularity for more than a decade, and everyone from enterprise organizations to innovative startups is tapping into the power of graph technology.

Here are just some of the leading companies who are already using graph database technology to deliver value to end-users and dominate their industries:

Intuit: AI-powered knowledge graph
JPMorgan Chase: fraud detection
Microsoft Xbox: customer experience
Ford: entity resolution
Amgen: social network analysis for healthcare

Of course, these are only a few of the many cutting-edge organizations using graph databases to harness connected data. Discover more graph database users and use cases on the TigerGraph Customers page.

Conclusion

Our world is shaped – and powered – by connections, so it’s time your database software catches up to reality. In fact, graph databases mimic the pattern-matching functions of how the human brain maps the world through neurons (vertices) and synapses (edges). It’s this human-intuitive data model that makes graph technology so unique and powerful.

No matter what your enterprise’s core business, it can be enhanced with the power of connected data. And if your team can tap into the power of data relationships today, you’ll be well ahead of the competition come tomorrow.

If you have any questions, or would like a demo, please contact us at info@tigergraph.abstage.xyz.

by Parker Erickson and Victor Lee

Anti-Money Laundering (AML) refers to the set of laws, regulations, and procedures aimed at preventing and detecting money laundering – the concealment of illegally obtained funds through the creative use of transactions. A common money laundering pattern called “layering” uses a mesh of mostly ordinary-looking transactions among colluding businesses and individuals to funnel money from one source to a target destination.

Money laundering through layering

To comply with government regulations, financial institutions build and operate systems that flag potential money laundering behavior, typically using hard-coded rules. However, due to the complexity of financial transactions and the various legitimate reasons for certain patterns of behavior, these systems often generate a significant number of false positive alerts. As the number of transactions grows worldwide, financial institutions are struggling to keep up with these alerts in a scalable manner, all while having to identify the changing patterns criminals adopt to avoid getting caught. To meet this challenge, more financial institutions are turning to machine learning to help sift through the mountain of alerts and find suspicious activity.

The AML Investigation Process

The anti-money laundering process starts with alerting. In conventional AML systems, a set of alert rules is applied to transactions and the parties involved. A human investigator may need to review flagged activity to judge whether it is unusual for the party in question. If it is unusual, the party is elevated to case status, where a further human review determines whether the activity is suspicious and should be reported to the appropriate government authorities. If so, a Suspicious Activity Report (SAR) is filed, and the financial institution has met its regulatory requirements. A financial institution that fails to report suspicious activity is open to investigation and fines by the regulatory bodies.

The AML investigation process

Financial institutions need to be as proficient and accurate as possible at detecting money laundering to satisfy regulators. The most efficient method is to improve the alert rules so that there are fewer false positives (alerts that turn out not to be real money laundering) and fewer false negatives (real money laundering that wasn’t detected). In practice, false positives are the bigger problem. Imperfect screening tends to flag too many transactions, and all those extra alerts (false positives) translate to higher human investigation costs.

Why Graph Machine Learning?

To reduce these false positive alerts while maintaining the rate at which suspicious activity is flagged, many institutions are turning to machine learning (ML) to classify transactions and prioritize the alerts to be investigated. Machine learning can take a large number of data characteristics (or “features”) and find the balance of those factors that optimizes a goal, such as catching 98% of money laundering with a minimal number of false positives. Conventional machine learning has proven helpful, but it is not enough.
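As a rough illustration of the goal described above, the sketch below picks the highest alert threshold that still catches a target share of known laundering cases, then reports how many false positives come with it. The function name `threshold_for_recall`, the scores, and the labels are all invented for illustration; in practice the scores would come from a trained model, and the data must contain at least one true case.

```python
def threshold_for_recall(scores, labels, target_recall=0.98):
    """Return (threshold, true_positives, false_positives) for the highest
    score threshold whose recall meets target_recall, or None if none does.

    scores: model risk scores (hypothetical values), one per alert.
    labels: 1 for confirmed laundering, 0 otherwise (at least one 1 expected).
    """
    positives = sum(labels)
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        if tp / positives >= target_recall:
            return threshold, tp, fp
    return None


# Toy data: six alerts, three of which were real laundering cases.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]
result = threshold_for_recall(scores, labels)
print(result)
```

Lowering the threshold raises recall but also admits more false positives, which is the cost that better features aim to reduce.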

Institutions are turning to graph machine learning to reduce their false positives and thereby lower their operating costs by 10%, 20%, or more. Graph provides this boost because it analyzes relationships. Money laundering becomes apparent only when the connections between parties and events are analyzed. Two types of connections are important:

Connections between individuals (e.g., beneficial ownerships, politically exposed persons, etc.)
Flows of money between accounts (e.g., cycles of transactions, mule accounts, etc.)
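One of the money-flow patterns listed above, a cycle of transactions, can be searched for with a small depth-limited traversal. This is a minimal sketch in plain Python, assuming transfers arrive as hypothetical (sender, receiver) pairs; the function name `find_cycles` and the account labels are made up, and a production system would more likely run such a query inside the graph database.

```python
from collections import defaultdict


def find_cycles(transfers, start, max_hops=4):
    """Return simple cycles that leave `start` and come back to it
    using at most `max_hops` transfers."""
    out = defaultdict(list)
    for src, dst in transfers:
        out[src].append(dst)

    cycles = []

    def walk(node, path):
        for nxt in out[node]:
            if nxt == start:
                # Money has returned to the account it started from.
                cycles.append(path + [start])
            elif nxt not in path and len(path) < max_hops:
                walk(nxt, path + [nxt])

    walk(start, [start])
    return cycles


transfers = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]
cycles = find_cycles(transfers, "A")
print(cycles)
```

The number and length of cycles through an account could then be passed to a model as features.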

Graph machine learning is a subfield of machine learning that uses a graph data structure either to generate features for traditional ML models such as XGBoost or to train native graph ML models such as Graph Neural Networks. The core idea is to incorporate information about the relationships between different entities in a complex system. Financial services is a natural place to apply these techniques, because its core data, financial transactions, is inherently a relationship between entities: accounts and parties. As AML is all about determining whether relationships between accounts indicate suspicious activity, graph machine learning is a natural fit.

Graph Machine Learning Techniques

In the anti-money laundering space, there are a few different techniques for incorporating graph structure into ML pipelines. The first is to calculate graph features for existing ML models such as XGBoost, Random Forests, and more. These graph features can be standard graph algorithms, such as PageRank or Shortest Path, or they can be patterns that are tailored to the particular bank’s data model and customer behavior, such as “the number of previously flagged Parties within 3 degrees of separation from the Party in question, weighted by the size of the transactions.” Another method is to use graph neural networks, which natively incorporate relationships between entities in the machine learning model, without the need for explicit feature extraction.
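The tailored feature quoted above can be sketched with a breadth-first search limited to three hops. The weighting is an assumption, since the text does not define it: each flagged party contributes the amount of the transaction through which the search first reached it. The function name `flagged_neighbor_score` and the sample data are hypothetical.

```python
from collections import defaultdict, deque


def flagged_neighbor_score(transactions, flagged, party, max_hops=3):
    """Sum transaction amounts leading to flagged parties within max_hops."""
    # Undirected adjacency; keep the largest amount seen between each pair.
    adj = defaultdict(dict)
    for a, b, amount in transactions:
        adj[a][b] = max(amount, adj[a].get(b, 0.0))
        adj[b][a] = max(amount, adj[b].get(a, 0.0))

    hops = {party: 0}
    queue = deque([party])
    score = 0.0
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr, amount in adj[node].items():
            if nbr in hops:
                continue
            hops[nbr] = hops[node] + 1
            if nbr in flagged:
                # Assumed weighting: amount on the edge that first reached nbr.
                score += amount
            queue.append(nbr)
    return score


transactions = [("P1", "P2", 100.0), ("P2", "P3", 250.0),
                ("P3", "P4", 50.0), ("P4", "P5", 900.0)]
flagged = {"P3", "P5"}
print(flagged_neighbor_score(transactions, flagged, "P1"))
print(flagged_neighbor_score(transactions, flagged, "P2"))
```

P5 is four hops from P1, so it only contributes to the score for P2.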

Two methods of Graph Machine Learning. Pictured on the left is incorporating graph features with traditional ML models, while the right represents utilizing native graph ML models such as GNNs.

TigerGraph is a leader in graph feature generation due to its scalability (tens of terabytes of graph data) and its execution speed, which comes from its natively distributed, massively parallel processing (MPP) architecture. TigerGraph’s support for Python and several ML libraries makes it simple for data scientists to add graph features or graph machine learning to their existing ML workflows.

Graph Features And Traditional Machine Learning

Various graph features can be used to help traditional machine learning models perform in anti-money laundering and financial crime detection in general. Here, we are going to give an overview of 3 main classes of graph algorithms that can be used for graph features:

Shortest Path: Shortest path algorithms can answer questions like “how close is the party of interest to a party that has exhibited suspicious behavior in the past?”, which may provide a useful signal to the downstream machine learning model.

An example of shortest path algorithms determining how close a party is to a known suspicious party.
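The "how close" question can be answered for every party at once with a multi-source breadth-first search that starts from all known suspicious parties. This is a minimal sketch on an unweighted, undirected toy graph; the function name `hops_to_nearest_suspicious` and the party labels are illustrative only.

```python
from collections import defaultdict, deque


def hops_to_nearest_suspicious(edges, suspicious):
    """Map each reachable party to its hop distance from the nearest
    suspicious party. Unreachable parties are left out of the result."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    dist = {s: 0 for s in suspicious}
    queue = deque(suspicious)
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
dist = hops_to_nearest_suspicious(edges, {"A"})
print(dist)
```

A party missing from the result has no path to any suspicious party, which a feature pipeline might encode as a large sentinel distance.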

Centrality: Centrality algorithms, such as PageRank, can measure the influence of a vertex on others within the graph, which can be a useful feature in the ML model, especially when combined with whether the influential parties were known to be suspicious in the past.

An example of the output of a centrality algorithm, where larger scores are denoted by larger vertices. Parties that are more central have more transaction inflow and outflow, which may be a suspicious behavior.
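For readers who want to see what a PageRank score is made of, here is a small power-iteration sketch in plain Python. It is a textbook version on a toy graph, not TigerGraph's implementation; the damping factor of 0.85 and the 50 iterations are conventional defaults chosen here as assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank on a directed edge list by power iteration."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Toy money flows: three accounts pay into "Hub", which pays "A".
edges = [("A", "Hub"), ("B", "Hub"), ("C", "Hub"), ("Hub", "A")]
rank = pagerank(edges)
print(max(rank, key=rank.get))
```

The scores sum to 1, so each value can be read as the share of "influence" a vertex holds in the graph.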

Community Detection: Community algorithms, such as Weakly Connected Components (WCC) or Louvain, can be used to determine groups of vertices that share common characteristics or heavily interact with each other. Once the communities are determined, features such as the community size, the number/amount of transactions within the community, and more, can be calculated and passed to the ML model.

Example of communities detected within a graph. Parties that are within the same community as a suspicious party might be more likely to be suspicious as well.

Through these algorithms, we can then perform other manipulations or aggregations to the data. For example, how many parties within a community have had a SAR filed on them? How many unique addresses are within the community? What are the most influential parties within the community as measured by PageRank? All these different types of features usually result in top feature importance values when analyzed via SHAP or XGBoost feature importance measures.
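The aggregation questions above can be sketched in a few lines: find weakly connected components with a union-find structure, then count community size and SAR filings per community. The feature names `community_size` and `community_sar_count` and the sample parties are hypothetical.

```python
from collections import Counter


def weakly_connected_components(nodes, edges):
    """Label each node with a representative of its component (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return {n: find(n) for n in nodes}


def community_features(nodes, edges, sar_filed):
    """Per-party features derived from the party's community."""
    component = weakly_connected_components(nodes, edges)
    size = Counter(component.values())
    sar_count = Counter(component[n] for n in sar_filed)
    return {
        n: {"community_size": size[component[n]],
            "community_sar_count": sar_count[component[n]]}
        for n in nodes
    }


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("D", "E")]
features = community_features(nodes, edges, sar_filed={"C"})
print(features["A"], features["D"])
```

Each resulting row can be joined to the party's other attributes before training a tabular model.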

Graph Neural Networks

Graph Neural Networks (GNNs) are a flavor of neural network architecture that operates on graph data structures. These algorithms take into account both the numerical attributes of vertices in the graph and the edges between the vertices explicitly. This method can result in an additional accuracy lift over just incorporating graph features into traditional machine learning models. Additionally, they can provide an extra layer of explainability to investigators, as the model provides not only feature importance but also the importance of the edges between different vertices in the graph.

Example of how a Graph Neural Network makes its predictions.
Image Source: https://www.researchgate.net/figure/Given-an-input-graph-a-GNN-predicts-the-label-of-the-target-node-eg-the-blue-node_fig2_346143176
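The neighbor aggregation pictured above can be illustrated with a single message-passing layer in the style of GraphSAGE's mean aggregator: each vertex combines its own attributes with the average of its neighbors' attributes. This is a sketch with hand-picked, untrained weights, not a working fraud model; the weight matrices, feature values, and function names are assumptions made for illustration.

```python
def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sage_layer(features, neighbors, w_self, w_neigh):
    """One layer: ReLU(W_self * h_v + W_neigh * mean of neighbor h_u)."""
    updated = {}
    for node, h in features.items():
        nbrs = neighbors.get(node, [])
        agg = mean([features[n] for n in nbrs]) if nbrs else [0.0] * len(h)
        z = [a + b for a, b in zip(matvec(w_self, h), matvec(w_neigh, agg))]
        updated[node] = [max(0.0, v) for v in z]  # ReLU
    return updated


features = {"A": [1.0, 0.0], "B": [0.0, 2.0], "C": [4.0, 2.0]}
neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
w_self = [[1.0, 0.0], [0.0, 1.0]]    # illustrative, untrained
w_neigh = [[0.5, 0.0], [0.0, 0.5]]   # illustrative, untrained
hidden = sage_layer(features, neighbors, w_self, w_neigh)
print(hidden["A"])
```

Stacking two such layers lets information from two hops away reach a vertex, which is how a GNN uses relationships without explicit feature extraction.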

Conclusion

In the dynamic arena of Anti-Money Laundering (AML), the conventional rule-based approach to spotting potential financial crimes is yielding to the transformative potential of graph machine learning. The escalating volume of financial transactions and the prevalence of false positive alerts have catalyzed the adoption of innovative techniques that prioritize investigation efficiency. Graph machine learning, with its ability to decipher intricate relationships between entities in financial data, emerges as a pivotal solution. By leveraging graph features and advanced Graph Neural Networks, institutions can not only mitigate false positives but also enhance investigative accuracy. In this landscape, TigerGraph’s prowess shines – its scalability and performance in generating graph features have positioned it as a leader. As financial institutions navigate the complex terrain of AML, the convergence of graph machine learning and TigerGraph’s capabilities promises a more resilient defense against money laundering while optimizing resource allocation for investigations.

TigerGraph has demo examples of graph analytics for financial crime detection, which you can try for free on TigerGraph Cloud.

Download TigerGraph’s O’Reilly book Graph-Powered Analytics and Machine Learning. The book uses use-case examples, including financial crime detection, to teach graph analytics and graph machine learning.

by Parker Erickson

Fraud takes many forms across many industries. From account takeovers to transaction fraud, and from financial services to healthcare, fraud is both prevalent and costly for many companies. As the world becomes more interconnected and increasingly digital, businesses must adapt to better fight fraudulent activities. In this blog, we will apply graph machine learning techniques to improve fraud detection accuracy on the Ethereum blockchain by up to 20 percentage points. Follow along with this Jupyter notebook.

Structure of Fraud

Fraudulent activities come in many forms, but a core idea behind fraud detection is to find individuals who behave similarly to suspected fraudsters. This might mean using the same address or device ID, or conducting a similar series of transactions. Given these similarities, and known fraudulent behavior, one can find many fraudsters based on the knowledge of how one operated. The same logic applies to using machine learning for fraud detection. Data scientists need to extract features to feed to their ML pipelines that capture the similarity of the various entities they are trying to classify. In this blog, we use the structure of Ethereum blockchain transactions to detect account takeover fraud.

Why Graph Machine Learning?

As data scientists try to generate richer features that capture the similarity between different entities, they often look towards graph data structures. Graphs provide a very natural way to represent relationships between entities, such as “Person 1 used Device 36” or “Person 2 transacted with Person 723.” Making a profile of relationships and comparing profiles is the basis for many similarity determinations. If two persons use the same IP address, they will be connected in the graph: Person 1 → IP333 ← Person 2. Additionally, two accounts that transacted with one another would be connected as well. By analyzing these connections and relationships, data scientists can use graph algorithms to extract new features that they wouldn’t have been able to in a traditional tabular setting due to the compute cost of implementing the algorithms through complex JOIN operations. As TigerGraph is a very scalable and performant graph database, leading financial institutions use TigerGraph to compute these novel graph features.
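The Person 1 → IP333 ← Person 2 pattern above amounts to finding pairs of people who share an attribute vertex. A minimal sketch, assuming usage records arrive as hypothetical (person, attribute) pairs and using an invented function name `shared_attribute_pairs`:

```python
from collections import defaultdict
from itertools import combinations


def shared_attribute_pairs(usage):
    """Map each pair of people to the set of attributes they share."""
    users_by_attr = defaultdict(set)
    for person, attr in usage:
        users_by_attr[attr].add(person)

    pairs = defaultdict(set)
    for attr, people in users_by_attr.items():
        # Every two people attached to the same attribute become linked.
        for a, b in combinations(sorted(people), 2):
            pairs[(a, b)].add(attr)
    return dict(pairs)


usage = [("Person 1", "IP333"), ("Person 2", "IP333"),
         ("Person 3", "IP901"), ("Person 1", "Device 36")]
links = shared_attribute_pairs(usage)
print(links)
```

In a graph database this is a two-hop traversal; in a tabular store the same result needs a self-join on the usage table, which is where the cost mentioned above comes from.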

Two methods of Graph Machine Learning. Pictured on the left is incorporating graph features with traditional ML models, while the right represents utilizing native graph ML models such as GNNs.

Graph Machine Learning Applied to Ethereum Fraud Detection

While graph features are a good starting point for detecting more fraud, “native” graph models, such as Graph Neural Networks, incorporate the relationships between data points in a more direct and holistic manner, thereby reducing the need for complex feature engineering pipelines. This gives rise to two different approaches to incorporating graph machine learning techniques into data pipelines: one in which graph features are extracted and then passed to traditional, tabular ML models such as XGBoost, and one that uses graph neural networks to make predictions directly.

Ethereum Transaction Dataset

The dataset used in the demo comprises transactions on the Ethereum platform, forming a transaction graph for Ether, the second-largest cryptocurrency. Wallets (i.e., accounts) on the platform serve as vertices in the graph, while edges represent transactions between these accounts. With 32,168 vertices and 84,088 edges, the dataset is derived from the publicly available Ethereum dataset from XBlock. We will be predicting if an account is fraudulent or benign.

TigerGraph Schema for the Ethereum Dataset

As a baseline test, the notebook trains a model using features that a data scientist might normally calculate in a traditional tabular manner. This includes the amount an account has received, the amount that it has sent, the number of transactions that it has received and sent, and the minimum of the transactions that have been sent and received. Using these features, an XGBoost model achieves 77% accuracy on the fraud detection task. We will now add graph-derived features to the model to see how incorporating information from the graph structure of the dataset improves fraud detection accuracy.
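The baseline features listed above need no graph traversal, only per-account aggregation over the transaction list. The sketch below shows one way to compute them in plain Python; the column names and wallet labels are hypothetical and are not taken from the notebook.

```python
from collections import defaultdict


def baseline_features(transactions):
    """Aggregate sent/received totals, counts, and minimum amounts per account."""
    feats = defaultdict(lambda: {
        "sent_amount": 0.0, "received_amount": 0.0,
        "sent_count": 0, "received_count": 0,
        "min_sent": None, "min_received": None,
    })
    for src, dst, amount in transactions:
        sender, receiver = feats[src], feats[dst]
        sender["sent_amount"] += amount
        sender["sent_count"] += 1
        sender["min_sent"] = (amount if sender["min_sent"] is None
                              else min(sender["min_sent"], amount))
        receiver["received_amount"] += amount
        receiver["received_count"] += 1
        receiver["min_received"] = (amount if receiver["min_received"] is None
                                    else min(receiver["min_received"], amount))
    return dict(feats)


transactions = [("0xA", "0xB", 1.5), ("0xA", "0xC", 0.5), ("0xB", "0xA", 2.0)]
table = baseline_features(transactions)
print(table["0xA"])
```

Each row describes an account in isolation, which is exactly the limitation the graph-derived features are meant to address.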

Graph Features And Traditional Machine Learning

The next model in the notebook trains an XGBoost model that incorporates three different graph features: PageRank, Betweenness Centrality, and Weakly Connected Component size. These features are some of the more common algorithms to execute in fraud detection use cases. TigerGraph offers a library of over 50 built-in graph algorithms, organized into seven categories. The notebook’s three features fall into two of those categories described below.

Centrality: Centrality algorithms, such as PageRank and Betweenness, can measure the influence or closeness of a vertex to others within the graph. Many fraud patterns exhibit higher than average centralities, so these can be useful features for training an ML model.

An example of the output of a centrality algorithm, where larger scores are denoted by larger vertices. Entities that are more central have more transaction inflow and outflow, which may be fraudulent behavior.
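Betweenness centrality, one of the notebook's three graph features, counts how often a vertex sits on shortest paths between other vertices. The sketch below follows Brandes' algorithm for an unweighted, undirected toy graph; it is a plain-Python illustration rather than the routine TigerGraph runs, and the vertex names are made up.

```python
from collections import defaultdict, deque


def betweenness(nodes, edges):
    """Unnormalized betweenness centrality (Brandes' algorithm)."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    score = {n: 0.0 for n in nodes}
    for s in nodes:
        order = []                       # vertices in BFS visiting order
        preds = {n: [] for n in nodes}   # shortest-path predecessors
        sigma = {n: 0 for n in nodes}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest vertices inward.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph every pair is counted from both endpoints.
    return {n: c / 2 for n, c in score.items()}


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]
scores = betweenness(nodes, edges)
print(scores)
```

On this chain the two middle vertices carry every path between the ends, so they score highest.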

Community Detection: Community algorithms, such as Weakly Connected Components (WCC) or Louvain, can be used to determine groups of vertices that share common characteristics or heavily interact with each other. Once the communities are determined, features such as the community size, the number/amount of transactions within the community, and more can be calculated and passed to the ML model.

Example of communities detected within a graph. Parties that are within the same community as a suspicious party might be more likely to be suspicious as well.

Another important graph algorithm category is Shortest Path. While not used in this demo, shortest path algorithms can answer questions like “how close is the entity of interest to a known fraudulent entity?”, which may provide a useful signal to the downstream machine learning model.

An example of shortest path algorithms determining how close an entity is to a known fraudulent entity.

The graph features calculated on the Ethereum dataset increased the accuracy of the XGBoost model. When used with the traditional features we calculated beforehand, the model achieved 91% accuracy, a 14-percentage-point improvement! In addition, we can view the importance of the features in the model’s decision, and see that PageRank and the size of the Weakly Connected Component are quite important in the model’s classification.

Global feature importance scores from the XGBoost Model trained on graph-derived features.

Graph Embeddings

Graph embeddings are a way to capture a lot of information about the graph in an unsupervised manner, without the need for time-consuming feature engineering that some of the other approaches require. TigerGraph offers graph embedding algorithms in its graph data science library and in its pyTigerGraph Python library. The specific approach that we use here is called FastRP, or Fast Random Projection. The idea behind this algorithm is to perform operations on the graph’s adjacency matrix such that vertices that are highly connected in the graph are close to each other in the embedding space produced.

Graph embeddings map vertices that are close together (well connected) in the graph to areas within the embedding space such that they remain close to each other.
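The random-projection idea can be sketched in plain Python. This is a heavily simplified illustration of the approach, not the FastRP implementation shipped by TigerGraph: it omits the degree-based scaling and per-step normalization of the full algorithm, and the dimension, step weights, sparsity parameter `s`, and seed are arbitrary assumptions.

```python
import math
import random
from collections import defaultdict


def fastrp_sketch(edges, dim=8, weights=(1.0, 1.0), s=3, seed=7):
    """Simplified FastRP-style embedding on an undirected edge list."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    nodes = sorted(adj)

    def random_row():
        # Very sparse projection: +sqrt(s) or -sqrt(s) with probability
        # 1/(2s) each, and 0 otherwise.
        row = []
        for _ in range(dim):
            u = rng.random()
            if u < 1 / (2 * s):
                row.append(math.sqrt(s))
            elif u < 1 / s:
                row.append(-math.sqrt(s))
            else:
                row.append(0.0)
        return row

    current = {n: random_row() for n in nodes}
    embedding = {n: [0.0] * dim for n in nodes}
    for weight in weights:
        # One propagation step: each vertex averages its neighbors' vectors.
        current = {
            n: [sum(current[m][k] for m in adj[n]) / len(adj[n])
                for k in range(dim)]
            for n in nodes
        }
        for n in nodes:
            embedding[n] = [e + weight * c
                            for e, c in zip(embedding[n], current[n])]
    return embedding


# "A" and "B" have exactly the same neighbors, so they embed identically.
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("D", "E")]
emb = fastrp_sketch(edges)
print(emb["A"])
```

Vertices with similar neighborhoods end up with similar vectors, and those vectors can be appended as extra columns for a tabular model.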

Graph embeddings are one of the most effective ways to bring a large amount of graph information into traditional ML algorithms such as XGBoost. However, they come at a considerable memory and computational cost, so they are not always used. In the Ethereum notebook, they contributed an additional 3 percentage points of accuracy over the model incorporating graph features, for a 17-percentage-point increase over the baseline model. Additionally, as the figure below shows, the holistic nature of an embedding means that it contributes far more to the model than any other single feature.

Global feature importance scores from the XGBoost Model trained on graph-derived features and FastRP embedding.

Graph Neural Networks

Example of how a Graph Neural Network makes its predictions.

Image Source: https://www.researchgate.net/figure/Given-an-input-graph-a-GNN-predicts-the-label-of-the-target-node-eg-the-blue-node_fig2_346143176

By using the built-in GraphSAGE model in pyTigerGraph, we trained a GNN on the Ethereum dataset. This resulted in an accuracy of 97%, a 20-percentage-point improvement over the baseline model! Additionally, we can view the local feature importance and the subgraph used, to analyze why the model made the prediction it did for the exact account we are interested in.

Conclusion

While the Ethereum example used throughout this blog focused on one type of fraud and a relatively simple dataset, the ideas presented carry over to many industries and types of fraud. By incorporating information about the structure and relationships between data points, graph machine learning techniques provide a substantial accuracy improvement over traditional machine learning techniques. TigerGraph can compute graph-derived features at scale and with high performance, leading to more actionable insights and improved outcomes for businesses.

Compilation of results from the Ethereum demo. As more graph structure is incorporated into the ML model, accuracy improves.
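The reported accuracies can be compared in two ways: as percentage-point gains, which is how the figures in this post are stated, and as relative gains over the baseline. The short calculation below uses the numbers given in the text; the 94% for the embedding model is implied by the stated 17-point gain over the 77% baseline rather than quoted directly.

```python
# Accuracies reported for the Ethereum demo (FastRP value implied: 77 + 17).
accuracy = {
    "baseline": 0.77,
    "graph features": 0.91,
    "graph features + FastRP": 0.94,
    "GraphSAGE": 0.97,
}

base = accuracy["baseline"]
gains = {}
for name, acc in accuracy.items():
    gains[name] = {
        # Difference in accuracy, in percentage points.
        "points": round((acc - base) * 100, 1),
        # Improvement relative to the baseline accuracy, in percent.
        "relative_pct": round((acc - base) / base * 100, 1),
    }
    print(name, gains[name])
```

Stating which of the two measures is meant avoids ambiguity when results like these are compared across models.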

A note on TigerGraph

TigerGraph is the leading platform for analytics and machine learning on connected data.

TigerGraph Inc. was established in 2012 and is based in Redwood City, California. TigerGraph is successfully deployed and is adding value to Forbes 2000 customers all over the world. TigerGraph was included in the Gartner Magic Quadrant in 2022, and inducted into the JP Morgan Chase Hall of Innovation in 2021. Forrester Research calculated the average ROI from TigerGraph at over 600% with a payback of less than 6 months.

TigerGraph is the only graph database powerful enough to run graph algorithms at the scale and accuracy required for fraud detection in large enterprises like financial institutions. TigerGraph runs most graph algorithms 10x to 1000x faster than its nearest competitors, providing more answers, more efficiency, and more time to converge to correct answers. When it comes to scale, there’s no competition. TigerGraph is the only answer for multi-terabyte graphs in a single, non-sharded database.

There are many ways to use TigerGraph – on cloud or on premises. You can sign up for a free instance of TigerGraph Cloud at tgcloud.io or contact us at info@tigergraph.abstage.xyz to find out more.
