2

I got the following documents of a collection (let's name it myCollection):

{
    "_id": {
        "$oid": "601a75a0c9a338f09f238816"
    },
    "Sample": "lie50",
    "Chromosome": "chr10",
    "Position": {
        "$numberLong": "47663"
    },
    "Reference": "C",
    "Mutation": "T",
    "Run": "Run_test",
    "SYMBOL": "TUBB8"
},
{
    "_id": {
        "$oid": "601a75a0c9a338f09f238817"
    },
    "Sample": "lie50",
    "Chromosome": "chr10",
    "Position": {
        "$numberLong": "47876"
    },
    "Reference": "T",
    "Mutation": "C",
    "Run": "Run_test",
    "SYMBOL": "TUBB8"
},
{
    "_id": {
        "$oid": "601a75a0c9a338f09f238818"
    },
    "Sample": "lie50",
    "Chromosome": "chr10",
    "Position": {
        "$numberLong": "48005"
    },
    "Reference": "G",
    "Mutation": "A",
    "Run": "Run_test",
    "SYMBOL": "TUBB8"
},
{
    "_id": {
        "$oid": "601a75a0c9a338f09f238819"
    },
    "Sample": "lie12",
    "Chromosome": "chr10",
    "Position": {
        "$numberLong": "48005"
    },
    "Reference": "G",
    "Mutation": "A",
    "Run": "Run_test",
    "SYMBOL": "TUBB8"
}

I am interested in printing the distinct count of the values from the fields Chromosome, Position, Reference, and Mutation. This means to count the unique fields of the following entries:

"Chromosome": "chr10", "Position": 47663, "Reference": "C", "Mutation": "T"
"Chromosome": "chr10", "Position": 47876, "Reference": "T", "Mutation": "C"
"Chromosome": "chr10", "Position": 48005, "Reference": "G", "Mutation": "A"
"Chromosome": "chr10", "Position": 48005, "Reference": "G", "Mutation": "A"

which should be 3 distinct rows.

I have checked multiple questions like this one on how to print the distinct values for one field or using $unwind/$project.

For the latter, I thought why not concatenate the 4 fields and then print the number using length with $unwind/$project?

I managed to get that far:

db.myCollection.aggregate(
[
  {
    $group:
    {
      _id: null,
      newfield: {
        $addToSet:
        {
          $concat:
          [
            "$Chromosome",
            "_",
            {"$toString":"$Position"},
            "_",
            "$Reference",
            "_",
            "$Mutation"
          ]
        }
      }
    }
  },
  {
    $unwind: "$newfield"
  },
  { 
    $project: { _id: 0 }
  }
]).length

However, adding .length to this query does not return anything but without returns:

{ "newfield" : "chr10_47663_C_T" }
{ "newfield" : "chr10_47876_T_C" }
{ "newfield" : "chr10_48005_G_A" }

For information, my actual data contains 2 billion documents.

1 Answer 1

4

The fields should pass in _id in $group stage, and also use $count stage to get total elements instead of returning all documents,

db.myCollection.aggregate([
  {
    $group: {
      _id: {
        Chromosome: "$Chromosome",
        Position: "$Position",
        Reference: "$Reference",
        Mutation: "$Mutation"
      }
    }
  },
  { $count: "count" }
])

Playground

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.