
Re: Deprecating/removing PropertyFileSnitch?


Sorry, maybe my spam filter got them or something, but I have never seen a JIRA number mentioned in the thread before this one.  Just looked back through again to make sure, and this is the first email I have with one.

-Jeremiah

> On Oct 22, 2018, at 9:37 PM, sankalp kohli <kohlisankalp@xxxxxxxxx> wrote:
> 
> Here are some of the JIRAs that were marked fixed but did not actually fix
> the issue. We have tried fixing this with several patches. Maybe it will be
> fixed when Gossip is rewritten (CASSANDRA-12345). I should find or create a
> new JIRA, as this issue still exists.
> https://issues.apache.org/jira/browse/CASSANDRA-10366
> https://issues.apache.org/jira/browse/CASSANDRA-10089 (related to it)
> 
> Also, the quote you are using was from a follow-on email. I have already
> described the bug I was referring to:
> 
> "Say you restarted all instances in the cluster and status for some host
> goes missing. Now when you start a host replacement, the new host won’t
> learn about the host whose status is missing and the view of this host will
> be wrong."
> 
>   - CASSANDRA-10366
> 
> 
> On Mon, Oct 22, 2018 at 7:22 PM Sankalp Kohli <kohlisankalp@xxxxxxxxx>
> wrote:
> 
>> I will send the JIRAs for the bug that we thought we had fixed but that
>> still exists.
>> 
>> Have you done any correctness testing after all these tests? Have you
>> done the tests on 1000-instance clusters?
>> 
>> It is great that you have done these tests, and I am hoping the gossiping
>> snitch is good. Also, were any Gossip bugs fixed after 3.0? Maybe I am
>> seeing a bug that has already been fixed.
>> 
>>> On Oct 22, 2018, at 7:09 PM, J. D. Jordan <jeremiah.jordan@xxxxxxxxx>
>>> wrote:
>>> 
>>> Do you have a specific gossip bug that you have seen recently which
>>> caused a problem that would make this happen? Do you have a specific
>>> JIRA in mind? “We can’t remove this because what if there is a bug”
>>> doesn’t seem like a good enough reason to me. If that were a reason, we
>>> would never make any changes to anything.
>>> 
>>> I think many people have seen PFS actually cause real problems, whereas
>>> with GPFS the issue being talked about is predicated on some theoretical
>>> gossip bug happening.
>>> 
>>> In the past year at DataStax we have done a lot of testing on 3.0 and
>>> 3.11 around adding nodes, adding DCs, replacing nodes, replacing racks,
>>> and replacing DCs, all while using GPFS, and as far as I know we have
>>> not seen any “lost” rack/DC information during such testing.
>>> 
>>> -Jeremiah
>>> 
>>>> On Oct 22, 2018, at 5:46 PM, sankalp kohli <kohlisankalp@xxxxxxxxx>
>>>> wrote:
>>>> 
>>>> We will have similar issues with Gossip, but this will create more
>>>> issues as more things come to rely on Gossip.
>>>> 
>>>> I agree PFS should be removed, but I don't see how it can be with issues
>>>> like these, unless someone proves that it won't cause any issues.
>>>> 
>>>> On Mon, Oct 22, 2018 at 2:21 PM Paulo Motta <pauloricardomg@xxxxxxxxx>
>>>> wrote:
>>>> 
>>>>> I can understand keeping PFS for historical/compatibility reasons, but
>>>>> if gossip is broken I think you will have similar ring view problems
>>>>> during replace/bootstrap that would still occur with the use of PFS
>>>>> (such as missing tokens, since those are propagated via gossip), so
>>>>> that doesn't seem like a strong reason to keep it around.
>>>>> 
>>>>> With PFS it's pretty easy to shoot yourself in the foot if you're not
>>>>> careful enough to keep the files identical across nodes and to update
>>>>> them when adding nodes/DCs, so it seems to be less foolproof than other
>>>>> snitches. While the rejection of verbs to invalid replicas on trunk
>>>>> could address the concerns raised by Jeremy, this would only happen
>>>>> after the new node joins the ring, so you would need to re-bootstrap
>>>>> the node and lose all the work done in the original bootstrap.
>>>>> 
>>>>> Perhaps one good reason to use PFS is the ability to easily package it
>>>>> across multiple nodes, as pointed out by Sean Durity on CASSANDRA-10745
>>>>> (which is also its Achilles' heel). To keep this ability, we could make
>>>>> GPFS compatible with the cassandra-topology.properties file, but read
>>>>> only the dc/rack info about the local node.
>>>>> 
>>>>> On Mon, Oct 22, 2018 at 16:58, sankalp kohli <kohlisankalp@xxxxxxxxx>
>>>>> wrote:
>>>>> 
>>>>>> Yes, it will happen. I am worried that DC or rack info can go missing
>>>>>> in the same way.
>>>>>> 
>>>>>> On Mon, Oct 22, 2018 at 12:52 PM Paulo Motta <pauloricardomg@xxxxxxxxx>
>>>>>> wrote:
>>>>>> 
>>>>>>>> the new host won’t learn about the host whose status is missing and
>>>>>>>> the view of this host will be wrong.
>>>>>>> 
>>>>>>> Won't this happen even with PropertyFileSnitch, as the token(s) for
>>>>>>> this host will be missing from gossip/system.peers?
>>>>>>> 
>>>>>>> On Sat, Oct 20, 2018 at 00:34, Sankalp Kohli <kohlisankalp@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Say you restarted all instances in the cluster and status for some
>>>>>>>> host goes missing. Now when you start a host replacement, the new host
>>>>>>>> won’t learn about the host whose status is missing and the view of
>>>>>>>> this host will be wrong.
>>>>>>>> 
>>>>>>>> PS: I will be happy to be proved wrong, as then I can also start
>>>>>>>> using the Gossip snitch :)
>>>>>>>> 
>>>>>>>>> On Oct 19, 2018, at 2:41 PM, Jeremy Hanna <jeremy.hanna1234@xxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Do you mean to say that during host replacement there may be a time
>>>>>>>>> when the old->new host isn’t fully propagated and therefore wouldn’t
>>>>>>>>> yet be in all system tables?
>>>>>>>>> 
>>>>>>>>>> On Oct 17, 2018, at 4:20 PM, sankalp kohli <kohlisankalp@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> This is not the case during host replacement, correct?
>>>>>>>>>> 
>>>>>>>>>> On Tue, Oct 16, 2018 at 10:04 AM Jeremiah D Jordan
>>>>>>>>>> <jeremiah.jordan@xxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>>> As long as we are correctly storing such things in the system tables
>>>>>>>>>>> and reading them out of the system tables when we do not have the
>>>>>>>>>>> information from gossip yet, it should not be a problem. (As far as
>>>>>>>>>>> I know GPFS does this, but I have not done extensive code diving or
>>>>>>>>>>> testing to make sure all edge cases are covered there.)
>>>>>>>>>>> 
>>>>>>>>>>> -Jeremiah
>>>>>>>>>>> 
>>>>>>>>>>>> On Oct 16, 2018, at 11:56 AM, sankalp kohli <kohlisankalp@xxxxxxxxx>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Won't GossipingPropertyFileSnitch be vulnerable to Gossip bugs
>>>>>>>>>>>> where we lose hostId or some other fields when we restart C* for
>>>>>>>>>>>> large clusters (~1000 instances)?
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Oct 16, 2018 at 7:59 AM Jeff Jirsa <jjirsa@xxxxxxxxx> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We should, but the 4.0 features that log/reject verbs to invalid
>>>>>>>>>>>>> replicas solve a lot of the concerns here.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Oct 16, 2018, at 4:10 PM, Jeremy Hanna <jeremy.hanna1234@xxxxxxxxx>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We have had PropertyFileSnitch for a long time even though
>>>>>>>>>>>>>> GossipingPropertyFileSnitch is effectively a superset of what it
>>>>>>>>>>>>>> offers and is much less error-prone. There are some unexpected
>>>>>>>>>>>>>> behaviors when things aren’t configured correctly with PFS. For
>>>>>>>>>>>>>> example, if you replace nodes in one DC and add those nodes to that
>>>>>>>>>>>>>> DC's property files but not to the other DCs' property files, the
>>>>>>>>>>>>>> resulting problems aren’t very straightforward to troubleshoot.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We could try to improve the resilience, fail-fast error checking,
>>>>>>>>>>>>>> and error reporting of PFS, but honestly, why wouldn’t we deprecate
>>>>>>>>>>>>>> and remove PropertyFileSnitch? Are there reasons why GPFS wouldn’t
>>>>>>>>>>>>>> be sufficient to replace it?
>>>>>>>>>>>>>> 
>>>>>>>> 
>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@xxxxxxxxxxxxxxxxxxxx
>>>>>>>>>>>>>> For additional commands, e-mail:
>>>>> dev-help@xxxxxxxxxxxxxxxxxxxx

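The PFS failure mode Jeremy and Paulo describe, where copies of cassandra-topology.properties drift apart between nodes or DCs, can be checked mechanically. The following is a minimal Python sketch, not a Cassandra tool: it assumes IPv4 keys, one `address=DC:RACK` entry per line, an optional `default=` entry and `#` comments, and the function names `parse_topology` and `diff_topologies` are hypothetical.

```python
# Hypothetical consistency check for copies of cassandra-topology.properties.
# Assumes IPv4 keys, "address=DC:RACK" lines, an optional "default=" entry,
# and '#' comments. Not part of Cassandra.
from __future__ import annotations


def parse_topology(text: str) -> dict[str, tuple[str, str]]:
    """Map each address (or 'default') to its (dc, rack) pair."""
    topology: dict[str, tuple[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        dc, colon, rack = value.partition(":")
        if not sep or not colon:
            raise ValueError(f"expected address=DC:RACK, got {raw!r}")
        topology[key.strip()] = (dc.strip(), rack.strip())
    return topology


def diff_topologies(copies: dict[str, str]) -> dict[str, dict[str, tuple[str, str] | None]]:
    """Given {node: file contents}, return every key that is not identical on
    all nodes, mapped to {node: (dc, rack), or None if the entry is absent}."""
    parsed = {node: parse_topology(text) for node, text in copies.items()}
    all_keys = set().union(*(t.keys() for t in parsed.values()))
    mismatches = {}
    for key in sorted(all_keys):
        per_node = {node: t.get(key) for node, t in parsed.items()}
        if len(set(per_node.values())) > 1:
            mismatches[key] = per_node
    return mismatches


# Jeremy's scenario: a replacement node was added to DC1's copies only.
dc1_copy = """
10.0.0.1=DC1:RAC1
10.0.0.3=DC1:RAC2
10.0.1.1=DC2:RAC1
default=DC1:RAC1
"""
dc2_copy = """
10.0.0.1=DC1:RAC1
10.0.1.1=DC2:RAC1
default=DC1:RAC1
"""
print(diff_topologies({"dc1-node": dc1_copy, "dc2-node": dc2_copy}))
# -> {'10.0.0.3': {'dc1-node': ('DC1', 'RAC2'), 'dc2-node': None}}
```

In that scenario the DC2 nodes have no entry for 10.0.0.3; if the file defines `default=`, PFS would likely fall back to it and silently place the node in DC1:RAC1, which is the kind of hard-to-troubleshoot result Jeremy mentions.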

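Paulo's suggestion of letting GPFS accept the cassandra-topology.properties format while reading only the local node's dc/rack can be sketched as follows. This is a hypothetical Python illustration, not how GPFS is implemented: `local_dc_rack` is an invented name, it assumes IPv4 keys and `address=DC:RACK` lines, and it deliberately fails fast instead of falling back to a `default=` entry when the local entry is missing.

```python
# Hypothetical sketch: accept a shared cassandra-topology.properties file but
# use only the local node's own dc/rack; every other node's entry is ignored
# (with a gossip-based snitch, peers would learn this node's dc/rack via gossip).
# Assumes IPv4 keys and "address=DC:RACK" lines. Not a Cassandra API.
from __future__ import annotations


def local_dc_rack(text: str, local_address: str) -> tuple[str, str]:
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        if key.strip() != local_address:
            continue  # entries for other nodes are never consulted
        dc, colon, rack = value.partition(":")
        if not colon:
            raise ValueError(f"expected address=DC:RACK, got {raw!r}")
        return dc.strip(), rack.strip()
    # Fail fast rather than silently using a "default=" entry.
    raise LookupError(f"no dc/rack entry for local address {local_address}")


shared_file = """
10.0.0.1=DC1:RAC1
10.0.1.1=DC2:RAC1
default=DC1:RAC1
"""
print(local_dc_rack(shared_file, "10.0.1.1"))  # ('DC2', 'RAC1')
```

Because each node reads only its own line, a stale entry for some other node no longer feeds into that node's view of the ring, while the same file can still be packaged for every node.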
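Jeremiah's condition, that dc/rack must be stored in the system tables and read back when gossip has not delivered it yet, amounts to a simple lookup order. The Python sketch below is hypothetical: plain dicts stand in for live gossip state and for a persisted table such as system.peers, and `TopologyView`, `on_gossip` and `lookup` are invented names, not Cassandra APIs.

```python
# Hypothetical sketch of the lookup order: live gossip state first, then the
# locally persisted copy (a stand-in for system.peers), else unknown.
# Plain dicts replace both sources; none of these names are Cassandra APIs.
from __future__ import annotations

from typing import Dict, Optional, Tuple

DcRack = Tuple[str, str]


class TopologyView:
    def __init__(self, persisted: Dict[str, DcRack]) -> None:
        # Stand-in for locally persisted data that survives a restart.
        self.persisted: Dict[str, DcRack] = dict(persisted)
        # Starts empty after a restart and fills in as gossip arrives.
        self.gossip: Dict[str, DcRack] = {}

    def on_gossip(self, endpoint: str, dc_rack: DcRack) -> None:
        self.gossip[endpoint] = dc_rack
        # Persist what gossip reported so it is available after the next restart.
        self.persisted[endpoint] = dc_rack

    def lookup(self, endpoint: str) -> Optional[DcRack]:
        if endpoint in self.gossip:
            return self.gossip[endpoint]
        # Gossip has not delivered this endpoint yet: use the persisted value.
        # None means both sources lack it, the "lost" rack/DC case in this thread.
        return self.persisted.get(endpoint)


# Right after a restart, gossip is empty but the persisted copy still answers.
view = TopologyView({"10.0.0.1": ("DC1", "RAC1")})
print(view.lookup("10.0.0.1"))  # ('DC1', 'RAC1')
view.on_gossip("10.0.1.1", ("DC2", "RAC1"))
print(view.lookup("10.0.1.1"))  # ('DC2', 'RAC1')
print(view.lookup("10.0.2.1"))  # None
```

The sketch only shows the intended read path; whether the real code persists and reads these values correctly in every edge case is exactly what Jeremiah says he has not verified.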