I'm trying to figure out why the following happens.
We have an Amazon Keyspaces table with a few GBs of data. The table structure has no performance issues and works perfectly on our on-premises deployment of a Cassandra cluster.
However, when migrating to Amazon Keyspaces, we get PerConnectionRequestRateExceeded errors for this table in CloudWatch when we increase the load on the service to our production workload. The service that reads from and writes to the database is a standard Java 17 Spring application without any custom settings. We basically bombard the service with 2-3k HTTP GET requests per second that translate into reads from the database, and from this specific table in particular.
The issue is with the reads, and with the SELECT statements in particular. The error is reported to the client application as a read timeout, which is pretty standard.
What we tried so far to mitigate it:
- disabling hostname validation
- increasing connection heartbeat in the Cassandra driver
- increasing connection pool size
- increasing the number of client application instances (8 instances x 550 connections = 4,400 connections; also tried 3 instances x 1,500 connections = 4,500 connections)
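For reference, the pool and heartbeat settings mentioned above can be expressed in the DataStax Java driver 4.x `application.conf` roughly like this (the values shown are illustrative, not a recommendation):

```hocon
datastax-java-driver {
  # Connections opened per node (driver default is 1).
  advanced.connection.pool.local.size = 9
  # Heartbeat interval, raised from the 30-second default.
  advanced.heartbeat.interval = 60 seconds
  # Max in-flight requests multiplexed on one connection (default 1024).
  advanced.connection.max-requests-per-connection = 1024
}
```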
We are talking about a peak of 2-3k HTTP req/sec. The only thing that is perhaps special in this case is that most of those queries contain IN clauses. I read in the Keyspaces documentation that IN clauses are basically decomposed like this:
SELECT * FROM table_name WHERE field IN ('A', 'B');
is translated to
SELECT * FROM table_name WHERE field = 'A';
SELECT * FROM table_name WHERE field = 'B';
In some cases we have around 100 elements in the IN clause, which I assume means that each SELECT fans out into 100 queries against Keyspaces.
So for a load of 3,000 req/sec, we should see about 300k CQL queries/sec. For the number of connections we have to Keyspaces (pool of 4,400-4,500), we should be able to support a few million CQL queries per second. I'm trying to figure out what's missing here, or at least what else to try.
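To make the back-of-the-envelope math above explicit (assuming the documented Keyspaces quota of 3,000 CQL requests per second per connection, and assuming the load is spread perfectly evenly across connections, which in practice it is not):

```python
# Back-of-the-envelope check of the request-rate math in the question.
# Assumption (from the Keyspaces quotas docs): each connection can
# sustain up to 3,000 CQL requests per second.
PER_CONNECTION_LIMIT = 3000

http_rps = 3000          # peak HTTP requests per second
fanout = 100             # worst-case elements per IN clause
connections = 4400       # total connections in the pool

cql_rps = http_rps * fanout                # total CQL queries per second
per_connection = cql_rps / connections     # if load were spread evenly
capacity = connections * PER_CONNECTION_LIMIT

print(f"total CQL queries/sec: {cql_rps}")                    # 300000
print(f"per connection (even spread): {per_connection:.1f}")  # 68.2
print(f"theoretical capacity: {capacity}")                    # 13200000

# The averages look fine, so PerConnectionRequestRateExceeded suggests
# the load is NOT evenly spread -- e.g. a burst of 100 decomposed
# queries from one IN clause landing on the same connection at once.
```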
Thanks in advance.
As your data size grows I would also expect the query with the id to end up faster (or at least to scale better), especially if you have a cluster with more than "replication factor" nodes in it, as specifying the mall_id will target the query to a single node. But until your data set is pretty big, having the mall_id in the query will actually slow it down, because it means more comparisons need to occur to find the answer. You can use the TRACING feature in CQL to see that more work has to be done to filter out responses when extra parameters are added to the query, versus just returning the output of the ANN search.
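For example, tracing can be toggled in cqlsh and the two variants compared side by side (here `<query-vector>` stands in for a full 256-element vector literal, and the mall_id value is illustrative):

```cql
-- Turn tracing on, run both variants, and compare the traces
-- (partitions scanned, rows filtered, SSTables touched).
TRACING ON;

SELECT similarity_cosine(vc, <query-vector>) AS sim
FROM cycling.feature
WHERE mall_id = 12345
ORDER BY vc ANN OF <query-vector> LIMIT 1;

SELECT similarity_cosine(vc, <query-vector>) AS sim
FROM cycling.feature
ORDER BY vc ANN OF <query-vector> LIMIT 1;

TRACING OFF;
```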
Pull Request #272 on apache/cassandra-website merged by smiklosovic
#272 CASSANDRA-19670 add documentation about logging to code style
commented Pull Request #272 on apache/cassandra-website
@michaelsembwever I am looking for +1, thanks
opened Pull Request #272 on apache/cassandra-website
#272 CASSANDRA-19670 add documentation about logging to code style
Python 2 support was officially removed as of CASSANDRA-17242, which was merged into the alpha release of Cassandra 4.1. Therefore, the version of cqlsh bundled with Cassandra 4.1 can only be run on Python 3.x.
The docs do not explicitly state a "compatible" version other than "the latest version" because Python 2.7 is very old and has reached end of life; it was officially sunset in January 2020.
It is pointless to even bother testing Python 2.7.5 given that it is over 10 years old (released in 2013). Even Python 2.7.18, the last and latest 2.7 release, came out back in 2020.
In any case, Python is not required to run Cassandra as a server. It is only needed by certain clients, specifically cqlsh. Being a client, cqlsh has no requirement to be Python-compatible across Cassandra 3.11 and 4.1, since it is not part of the Cassandra server.
If Python versions are such an issue in your environment, consider using a Python virtual environment to run cqlsh. Cheers!
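A minimal sketch of that virtual-environment approach (the Cassandra install path is a hypothetical example; assumes python3 is already installed):

```shell
# Create an isolated Python 3 environment for running cqlsh,
# leaving the system Python (e.g. 2.7) untouched.
python3 -m venv ~/cqlsh-venv
. ~/cqlsh-venv/bin/activate

# Run the cqlsh bundled with the Cassandra 4.1 distribution;
# it picks up the venv's python3 from PATH.
~/apache-cassandra-4.1/bin/cqlsh localhost 9042
```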
I'm upgrading Cassandra from 3.11 to 4.1. One of the requirements is the latest Python 2.7 updates. I want to upgrade Cassandra with no downtime, so I need to know which Python 2.7 update is compatible with both Cassandra 3.11 and Cassandra 4.1. The DataStax doc on upgrading Apache Cassandra asks for the latest Python 2.7, but it doesn't explicitly say which version, or whether that Python version is compatible with both Cassandra 3.11 and 4.1.
I tried Python 2.7.5, which works for the cqlsh of Cassandra 3.11 but does not work for the cqlsh of Cassandra 4.1.
I'm observing substantial differences in query performance while executing vector similarity search queries in Cassandra. Here's the context and details:
CREATE TABLE cycling.feature (
    mall_id bigint,
    place_id bigint,
    hardware_id bigint,
    feature_desc_id bigint,
    occur_at timestamp,
    vc vector<float, 256>,
    PRIMARY KEY ((mall_id), place_id, hardware_id, occur_at, feature_desc_id)
) WITH CLUSTERING ORDER BY (place_id ASC, hardware_id ASC, occur_at DESC, feature_desc_id DESC);

CREATE INDEX IF NOT EXISTS feature_ann_index_cos
ON cycling.feature(vc) USING 'sai'
WITH OPTIONS = { 'similarity_function': 'cosine' };
With mall_id Filter:
SELECT similarity_cosine(vc, ?) AS sim
FROM cycling.feature
WHERE mall_id = ?
ORDER BY vc ANN OF ? LIMIT 1;
Without mall_id Filter:
SELECT similarity_cosine(vc, ?) AS sim
FROM cycling.feature
ORDER BY vc ANN OF ? LIMIT 1;
The query with the mall_id filter is significantly slower than the one without, even though both perform vector similarity searches.
I was expecting the query with the mall_id filter to be faster than the one without, since it narrows the search to a single partition. Why is the opposite happening?
commented Pull Request #3331 on apache/cassandra
Not a full review yet.
1) https://github.com/apache/cassandra/pull/3331/commits/1070c8a3c437260b5d49e4c6cbae60f31d9ab835 - Caleb suggested we verify that the merge of the bounds is okay for Accord (in regards to CASSANDRA-18100). You are in the chat with Ariel. I will look into that big commit once we confirm it.
2) The other two commits - the simplifications look great. Thank you!
3) Did you start fixing some of the null handling? Shouldn't we leave this for the other ticket?
4) Checked the CI results - all seem to be known failures.
Pull Request #91 on apache/cassandra-accord merged by aweisberg
#91 Accord barrier/inclusive sync point fixes
opened Pull Request #3340 on apache/cassandra
#3340 CASSANDRA-19669: Audit Log entries are missing identity for mTLS connections
This looks like a driver bug. Exactly such a bug existed in the Python CQL driver several years ago (see https://github.com/scylladb/scylladb/issues/8203, fixed in https://github.com/datastax/python-driver/commit/1d9077d3f4c937929acc14f45c7693e76dde39a9) and in the Cassandra Java driver (https://github.com/apache/cassandra-java-driver/pull/1544), and I suspect the gocql driver has the same bug too.
I opened a new issue in Scylla's fork of the gocql driver (which I hope you're using): https://github.com/scylladb/gocql/issues/180, and I hope they'll fix it.
Scylla's test suite has a test, test_filtering_contiguous_nonmatching_partition_range, to verify that Scylla itself does not have this bug - that you can scan with filtering a long partition with all but the last row being filtered out, and the iteration doesn't stop on the first empty page - it continues until producing the match at the end. Because of this test I strongly suspect that the bug is in gocql, not in Scylla.
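A minimal sketch of the paging pattern being described (pure Python with a stubbed server, not a real driver), showing why stopping at the first empty page loses data:

```python
# Simulates driver-side paging over a filtered scan. The server may
# return EMPTY pages (every row on that page was filtered out) while
# still having more data; a correct client keeps fetching until the
# paging state is exhausted, not until it sees an empty page.

# Stub "server": pages of already-filtered rows; None paging state = done.
PAGES = [[], [], [], ["match-at-the-end"]]

def fetch_page(paging_state):
    page = PAGES[paging_state]
    next_state = paging_state + 1 if paging_state + 1 < len(PAGES) else None
    return page, next_state

def scan_buggy():
    """Stops as soon as a page comes back empty -- loses the match."""
    rows, state = [], 0
    while state is not None:
        page, state = fetch_page(state)
        if not page:          # BUG: empty page mistaken for end-of-data
            break
        rows.extend(page)
    return rows

def scan_correct():
    """Iterates until the paging state is exhausted."""
    rows, state = [], 0
    while state is not None:
        page, state = fetch_page(state)
        rows.extend(page)
    return rows

print(scan_buggy())    # [] -- the bug drops the final match
print(scan_correct())  # ['match-at-the-end']
```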
I'm running Airflow in a Docker container on a Windows PC. I have some problems with the Apache Airflow SparkSubmitOperator. I want to write data to a remote Cassandra server.
When I was using df.write.save() I was getting "An error occurred while calling o41.save."
The strange thing is that I can read data and show the schema, but I can't save the data. Any opinions about it?
I also want to share my Spark configuration:
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("example") \
    .config("spark.cassandra.connection.host", "10.0.0.1") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "pc1") \
    .config("spark.cassandra.auth.password", "1234") \
    .config("spark.jars.packages", "/opt/spark/jars/spark-cassandra-connector_2.12-3.4.0.jars") \
    .getOrCreate()
I want to write CSV data from Docker to the remote Cassandra server with airflow and spark process.
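One aside on the configuration above: `spark.jars.packages` expects Maven coordinates, while a local jar path belongs in `spark.jars` (and the extension is normally `.jar`, not `.jars`). A sketch of both variants via spark-submit (coordinates and paths are illustrative):

```shell
# Variant 1: let Spark resolve the connector from Maven Central.
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.0 \
  --conf spark.cassandra.connection.host=10.0.0.1 \
  my_job.py

# Variant 2: point at a jar already present on disk.
spark-submit \
  --jars /opt/spark/jars/spark-cassandra-connector_2.12-3.4.0.jar \
  my_job.py
```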