Our Commitment to Building Safe Autonomous Trucks:
How TuSimple Identifies and Resolves Safety Issues

By TuSimple's editorial team
Oct 13

As TuSimple develops industry-leading autonomous trucking technology, safety is our top priority. Autonomous vehicles have the potential to be one of the biggest improvements in transportation safety ever, and they require safe operations and safe technology development. Exhaustively finding and fixing potential safety issues is a critical part of building this technology, which has the potential to save thousands of lives every year. To do this, TuSimple deploys layers of verification and safety measures to identify, capture, and track issues through safe resolution.

The goal is to catch all safety-related items at the earliest point possible in the design, verification, and testing process. For those interested in learning what it takes to build safety into complex autonomous driving systems, this article gives a glimpse into our cutting-edge safety best practices.

Safety Analysis

Safety begins with identification of possible weaknesses that need analysis and resolution. This is a multifaceted process that involves the entire development team.

Our people are a critical line of defense

Identifying potential issues and bringing them to the forefront is the responsibility of every TuSimple employee. Specifically:

  • Our Product and Systems teams identify challenges in our intended areas of operation, and scaffold in the required functionalities.
  • Our Safety Engineers make sure we set and meet appropriate safety goals within our design.
  • Our Developers are responsible for unit testing their individual functions.
  • Our Verification & Validation team uses a combination of Test Driven Development and Structured Testing to methodically work through the tech stack and shake out latent and active failures.

Every employee is encouraged to report gaps or problems in our ticketing system, and is responsible for doing so.

Our detection methods are comprehensive, diverse and robust

We use a combination of tools to root out engineering problems, because each tool provides a different lens with its own level of coverage and detail. Combined, they give TuSimple a multifaceted view of the system’s current capabilities.

Detections occur at different points of development, verification, and operations, and all findings become issues tracked and prioritized in our ticketing system.

System Level Analysis

Our Systems & Safety Engineering (S&SE) experts use a suite of industry standard tools to methodically shake down the design and close gaps.  TuSimple uses the following system level detection methods, among others:

Model Based System Engineering (MBSE)

Provides top-down analysis of the interactions between subsystems, functions, and real-world actors. New findings are added as tickets, and when appropriate spawn their own follow-up projects.

Design Failure Modes & Effects Analysis (DFMEA)

Identifies high-risk, low-controllability failures. Items with a high Risk Priority Number (RPN) get enhanced focus in downstream testing, such as further testing in our Systems Integration Lab.
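
As an illustration of how RPN-based prioritization can work, the sketch below uses the classic FMEA convention, in which the RPN is the product of severity, occurrence, and detection ratings, each on a 1 to 10 scale. The failure modes, ratings, and the threshold of 100 are hypothetical values chosen for the example; they are not TuSimple's actual data or cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One DFMEA row. All ratings use the classic 1-10 FMEA scale."""
    name: str
    severity: int    # 10 = most severe effect
    occurrence: int  # 10 = most likely to occur
    detection: int   # 10 = hardest to detect before it causes harm

    def __post_init__(self):
        for rating in ("severity", "occurrence", "detection"):
            value = getattr(self, rating)
            if not 1 <= value <= 10:
                raise ValueError(f"{rating} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        # Classic Risk Priority Number: severity x occurrence x detection.
        return self.severity * self.occurrence * self.detection


def needs_enhanced_testing(modes, threshold=100):
    """Return failure modes at or above the (hypothetical) RPN threshold,
    highest RPN first."""
    flagged = [m for m in modes if m.rpn >= threshold]
    return sorted(flagged, key=lambda m: m.rpn, reverse=True)


# Hypothetical example rows, for illustration only.
modes = [
    FailureMode("sensor dropout", severity=9, occurrence=3, detection=4),
    FailureMode("log disk full", severity=3, occurrence=4, detection=2),
    FailureMode("brake actuator timeout", severity=10, occurrence=2, detection=6),
]

for mode in needs_enhanced_testing(modes):
    print(mode.name, mode.rpn)
```

In practice, teams often review high-severity items regardless of their RPN, since a product of three ratings can hide a single critical one.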

Hazard Analysis and Risk Assessment (HARA) Techniques

Proactively identifies potential hazards, assesses their impacts, and tracks, communicates, and drives mitigations to a safe resolution.

Engineering Design Reviews and Cross-Functional Deep Dives

Ensures requirements are understood and implemented correctly, and that design changes made for a focused area are well integrated into the whole.

Daily Findings Review

Used to review recent findings from Simulation, Regeneration, Systems Integration Lab, track testing, and road testing, and to determine the severity, criticality, and likelihood of discovered issues. The review is attended by experts in these disciplines, as well as managers, directors, and executives.

Testing

Pre-Release Testing

Our Independent Verification & Validation (IV&V) experts put new software and firmware through a gauntlet of testing before allowing road testing.

  • Developers ensure that the new feature or function operates correctly within design constraints using a combination of offline testing and unit tests.  Failures are tracked as new tickets.
  • Each build planned for the road goes through more than 10,000 simulations that exercise the Autonomous Driving System (ADS) logic under different circumstances.  Failures are tracked in aggregate against a build, and our Triage team identifies and creates failure tickets.
  • Each build planned for the road goes through regenerations, which feed recorded real-world data through new builds to ensure the ADS reacts as expected.  Failures are tracked in aggregate against a build, and the Triage team identifies and creates failure tickets.
  • Where appropriate, the code then moves to a controlled Hardware in Loop (HiL) test bench that allows the code and hardware to be proven out together.  Failures are tracked as new tickets as part of the testing process.
  • Building on the HiL work, significant software build changes and new features are tested in the Systems Integration Lab, a complete Truck in Loop (TiL) setup that is instrumented and automated wherever possible, to identify interactions that would be unsafe to test elsewhere (for example, testing loss of power or communications).
  • Adversarial Testing is an activity that springs from “what if” questions asked by people outside the original design team.  Unlike with other test approaches, adversarial test architects don’t have to understand the details of the design.  They make assumptions about what the system should reasonably do when presented with a challenge, and test to verify.  Adversarial testing is an effective layer because the third-party viewpoint often identifies gaps.  The IT Security equivalent would be “penetration testing”.
  • Track Testing is used to validate new functionality and safety features with a fully operational truck in a controlled, closed-course environment.  For example, we use the track environment to intentionally inject faults into the system and prove, end to end, that the vehicle correctly detects each fault and, where appropriate, performs safe countermeasures such as pulling over.
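
To make "tracked in aggregate against a build" more concrete, here is a minimal sketch of how simulation or regeneration results could be grouped so that a triage step sees one candidate ticket per recurring failure rather than one per run. The `SimResult` record, the `failure_signature` field, and the build and scenario names are all hypothetical; this is not a description of TuSimple's actual tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SimResult:
    """Hypothetical record for one simulation or regeneration run."""
    build: str
    scenario: str
    passed: bool
    failure_signature: str = ""  # short label for the failure, empty on pass


def aggregate_failures(results):
    """Group failing scenarios by (build, failure signature).

    Each key is a candidate for a single triage ticket.
    """
    grouped = defaultdict(list)
    for result in results:
        if not result.passed:
            grouped[(result.build, result.failure_signature)].append(result.scenario)
    return dict(grouped)


def pass_rate(results, build):
    """Fraction of runs for `build` that passed."""
    runs = [r for r in results if r.build == build]
    if not runs:
        raise ValueError(f"no runs recorded for build {build!r}")
    return sum(r.passed for r in runs) / len(runs)


# Hypothetical results, for illustration only.
results = [
    SimResult("build-1", "cut_in_01", False, "late_brake"),
    SimResult("build-1", "cut_in_02", False, "late_brake"),
    SimResult("build-1", "curve_07", False, "lane_drift"),
    SimResult("build-1", "merge_03", True),
]

for (build, signature), scenarios in aggregate_failures(results).items():
    print(build, signature, scenarios)
print(pass_rate(results, "build-1"))
```

Grouping by a failure signature is one possible design choice; it keeps two runs that hit the same underlying problem from producing duplicate tickets.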

Real-Time Monitoring

There are many ways that TuSimple provides ongoing, real-time monitoring of system health.

ADS Health Manager

The ADS Health Manager monitors hardware, compute functions, timing, rationality checks, warning and error thresholds. These errors are logged during every mission and the Triage team follows up by reviewing errors and ensuring they are associated with a new or existing ticket. The Health Manager is a critical continuous monitoring capability that can proactively trigger the truck to pull over when necessary. Health Manager warnings, detected errors, and outcomes are all logged and analyzed post-mission, and findings are added to tracked tickets.

Safety Drivers and Technicians

The Safety Drivers and Technicians monitor operation of the ADS before, during, and after an autonomous mission. The technician annotates the data each time they witness any problem, which automatically generates a new ticket for the Triage team to review.

Compute System

The Compute Systems (main compute units, vehicle control units) continuously monitor each other and associated hardware subsystems for sanity and proper operation, both triggering appropriate responses and logging issues for the Triage team to review.
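
As a rough illustration of threshold-based health monitoring, the sketch below classifies each reading against a warning and an error threshold, logs every result, and requests a pull-over if any reading reaches the error level. The signal names, threshold values, and the rule "any error means pull over" are assumptions made for this example; the real Health Manager's signals and decision logic are not described in this article.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    OK = 0
    WARNING = 1
    ERROR = 2


@dataclass(frozen=True)
class Threshold:
    """Hypothetical limits; readings at or above a limit trip that level."""
    warning: float
    error: float


def classify(value, threshold):
    """Map a single reading to OK, WARNING, or ERROR."""
    if value >= threshold.error:
        return Level.ERROR
    if value >= threshold.warning:
        return Level.WARNING
    return Level.OK


def evaluate(readings, thresholds):
    """Classify every reading and decide whether to request a pull-over.

    Returns (log, pull_over). The log keeps every entry, including OK ones,
    so it can be reviewed after the mission.
    """
    log = [
        (name, value, classify(value, thresholds[name]))
        for name, value in readings.items()
    ]
    pull_over = any(level is Level.ERROR for _, _, level in log)
    return log, pull_over


# Hypothetical signals and limits, for illustration only.
thresholds = {
    "planner_loop_ms": Threshold(warning=100.0, error=150.0),
    "gpu_temp_c": Threshold(warning=85.0, error=95.0),
}

log, pull_over = evaluate({"planner_loop_ms": 120.0, "gpu_temp_c": 70.0}, thresholds)
for entry in log:
    print(entry)
print("pull over:", pull_over)
```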

Post Mission

Whenever we have a finding during a mission, TuSimple may, as appropriate, perform the following types of analysis:

Triage Analysis

A team of experts reviews all available data to localize the problem and group it with known issues. Our Triage engineers are experts at reviewing mission data and helping the organization focus on critical or recurring issues.

Mission Data to Regeneration

We rerun the recorded mission data from an event through the simulation environment to see what would have happened under slightly different circumstances. For example, what if the Safety Driver hadn’t taken over? Would the ADS have reacted in a healthy way? This is also known in the industry as “counterfactual analysis”: what would have happened if the facts had been slightly different? Regeneration efforts often lead to new findings tickets.

Mission Data to Simulations

We create simulations similar to events seen on a mission. These simulations use purely simulated sensor data, and can be manipulated to make the simulated speeds, timing, and direction slightly worse or better to find the exact failure point. They often lead to new findings tickets. In many cases, the simulations we create to study a specific mission event become part of our ongoing regression-testing library for new versions of the software.

Tracking & Resolution

Our issue tracking methods are standard for safety critical industries

When potential safety risks are identified, it’s important to rigorously document and track them to make sure a solution is implemented and verified. Less formal processes could allow issues to fall through the cracks. We use Jira as our primary ticketing system, a tool used throughout safety-critical industries. Jira tickets may be associated with the formally recorded results of any engineering activity, such as design work, testing results, and other related issues.

Our process includes elevated management and oversight for tracking high-consequence issues

Failures of significant consequence are added to our Corrective Action Preventive Action (CAPA) tracking and resolution system.  This process is managed by our Quality Assurance team as one element of our Quality Management System (QMS).

As is typical for safety-critical industries, items loaded into the CAPA system are registered for tracking and then evaluated for one or more potential Corrective Actions that immediately correct the situation.  Any potential Preventive Actions, those that would have kept the problem from happening in the first place, are evaluated to improve our engineering and review processes.  For example, fixing a broken component would be a corrective action, and upgrading the part quality for future builds would be a possible preventive action.

Highest priority items get assigned as named, targeted projects

When issues are cross-functional, complex, and span multiple disciplines, we may assign an issue or a group of related issues to a named project.  Named projects are typically overseen by a Senior Manager or Director level leader, managed by a Program Manager or Technical Program Manager, and are formally managed into the workload as priority activities.

Saving Lives Is Our Mission

Like other safety critical industries including aerospace, medical, and power generation, autonomous technology is complex and requires layers of safety processes to identify and resolve safety risks. We deploy industry standard analysis, automated tools, and exhaustive simulation to detect and correct risks as early in the development process as possible. These activities are company-wide and we rely on the strength of our whole organization to solve for safety. 

Our goal is to improve transportation safety for all road users by delivering the safest driver on the road. Ultimately, through safe engineering, testing, and deployment, we have the opportunity to save thousands of lives every year.
